# Development Guide
## Requirements
- maturin
- Python 3.11
- Java 8+ for PySpark
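
Maturin itself can be installed with pip. As a quick sanity check of the prerequisites (assuming all three tools are on your `PATH`), something like this should work:

```bash
# Install maturin if it is not already present
pip install maturin

# Verify the prerequisites are available
maturin --version
python3.11 --version
java -version
```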
## Installation (Linux)

```bash
# Build a wheel
maturin build --release

# Create and activate a virtual environment (python3.11 is required)
python3 -m venv .venv
source .venv/bin/activate

# Install the built wheel (choose the one that matches your system)
pip install target/wheels/data_generation-0.1.0-cp311-cp311-manylinux_2_34_x86_64.whl
```
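
The exact wheel filename depends on your OS, architecture, and Python build. If you are unsure which wheel maturin produced, listing the build output and installing via a glob works just as well:

```bash
# List the wheels maturin actually built, then install
ls target/wheels/
pip install target/wheels/*.whl
```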
## Generate datasets

Inside the venv from the previous step:

```bash
# Show all available options
generator --help

# Generate tiny data
generator --prefix test_data_tiny

# Generate small data
generator --prefix test_data_small --size small
```
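
As a quick sanity check, and assuming the generator writes its output files into the current working directory under the given prefix (an assumption, not documented behavior), you can list what was produced:

```bash
# Assumes output lands in the current directory, named after --prefix
ls -lh test_data_tiny* test_data_small*
```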
## Contributing
Contributions are very welcome. I created this benchmark not to prove that one framework is better than another, and I'm not affiliated with any company that develops one ETL tool or another. I have a preference for Apache Spark because I like it, but the results and the benchmark are quite fair: for example, I'm not trying to hide how much faster Pandas is compared to Spark on small datasets that fit into memory.
What would be cool:
- [ ] Implement the same task in DuckDB;
- [ ] Implement the same task in Polars;
- [ ] Implement the same task in Dask;
- [ ] Implement different approaches for Pandas;
- [ ] Implement different approaches for Spark;
- [ ] Set up CI to run benchmarks on GH Runners instead of my laptop;
- [ ] ???
There is a lack of documentation for now, but I'm working on it. You may open an issue, open a PR, or just contact me via email: ssinchenko@apache.org.