Results of Benchmark

Results

Setup

EC2 m5.4xlarge

Key	Value
vCPUs	16
Memory (GiB)	64.0
Memory per vCPU (GiB)	4.0
Physical Processor	Intel Xeon Platinum 8175
Clock Speed (GHz)	3.1
CPU Architecture	x86_64
Disk space	256 Gb

Details about the instance type

Datasets

All the information provided for a default seed 42. Size on disk is a total size of compressed parquet files. An amount of rows depends of SEED that was used for generation of the data because an amount of transactions for each id (customer id) per day is sampled from binomial distribution.

See src/lib.rs for details of the implementation.

Tiny Dataset

Amount of rows: 17,299,455

Size on disk: 178 Mb

Unique IDs: 1,000

Tool	Time of processing in seconds
Pandas pivot	26.55
DuckDB pivot	1.10
PySpark pivot	36.14
PySpark Comet pivot	32.80
DuckDB pivot known values	0.81
Polars pivot lazy	3.68

Small Dataset

Amount of rows: 172,925,732

Size on disk: 1.8 Gb

Unique IDs: 10,000

Tool	Time of processing in seconds
Pandas pivot	233.18
DuckDB pivot	9.87
PySpark pivot	62.25
PySpark Comet pivot	52.20
Polars pivot lazy	35.06
DuckDB pivot known values	6.03

Medium Dataset

Amount of rows: 1,717,414,863

Size on disk: 18 Gb

Unique IDs: 100,000

Tool	Time of processing in seconds
DuckDB pivot	153.57
PySpark pivot	331.40
PySpark Comet pivot	250.47
DuckDB pivot known values	95.07

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search