# Development Guide
## Requirements
- maturin
- Python 3.11
- Java 8+ for PySpark
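
Maturin itself can be installed with pip. As a quick sanity check of the prerequisites (assuming all three tools are on your `PATH`), something like this should work:

```bash
# Install maturin if it is not already present
pip install maturin

# Verify the prerequisites are available
maturin --version
python3.11 --version
java -version
```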
## Installation (Linux)

```bash
# Build a wheel
maturin build --release

# Create and activate a virtual environment (python3.11 is required)
python3 -m venv .venv
source .venv/bin/activate

# Install the built wheel (choose the one that matches your system)
pip install target/wheels/data_generation-0.1.0-cp311-cp311-manylinux_2_34_x86_64.whl
```
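
The exact wheel filename depends on your OS, architecture, and Python build. If you are unsure which wheel maturin produced, listing the build output and installing via a glob works just as well:

```bash
# List the wheels maturin actually built, then install
ls target/wheels/
pip install target/wheels/*.whl
```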
## Generate datasets

Inside the venv from the previous step:

```bash
# Show all available options
generator --help

# Generate tiny data
generator --prefix test_data_tiny

# Generate small data
generator --prefix test_data_small --size small
```
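
As a quick sanity check, and assuming the generator writes its output files into the current working directory under the given prefix (an assumption, not documented behavior), you can list what was produced:

```bash
# Assumes output lands in the current directory, named after --prefix
ls -lh test_data_tiny* test_data_small*
```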
## Contributing
Contributions are very welcome. I created this benchmark not to prove that one framework is better than another, and I'm not affiliated with any company that develops one ETL tool or another. I have a preference for Apache Spark because I like it, but the results and the benchmark are quite fair: for example, I'm not trying to hide how much faster Pandas is compared to Spark on small datasets that fit into memory.
What would be cool:
- [ ] Implement the same task in DuckDB;
- [ ] Implement the same task in Polars;
- [ ] Implement the same task in Dask;
- [ ] Implement different approaches for Pandas;
- [ ] Implement different approaches for Spark;
- [ ] Set up CI to run benchmarks on GH Runners instead of my laptop;
- [ ] ???
There is a lack of documentation for now, but I'm working on it. You may open an issue, open a PR, or just contact me via email: ssinchenko@apache.org.