Benchmarks
Performance comparison of Spark-native Catalyst expressions against equivalent UDF wrappers around the Java SecondString library. Throughput is reported in ops/s (higher is better). The diff column gives the UDF's throughput relative to the native expression, so a negative value means the native implementation is faster.
Summary
Algorithms compared: 5
Best relative delta (closest to parity): jaro_winkler (-34.00%)
| Algorithm | spark-native | UDF | diff |
|---|---|---|---|
| affine_gap | 11.00 +/- 1.71 ops/s | 5.62 +/- 0.88 ops/s | -48.94% |
| jaro_winkler | 20.03 +/- 4.57 ops/s | 13.22 +/- 2.64 ops/s | -34.00% |
| monge_elkan | 11.73 +/- 2.51 ops/s | 5.04 +/- 1.02 ops/s | -57.01% |
| needleman_wunsch | 15.42 +/- 3.40 ops/s | 7.27 +/- 1.48 ops/s | -52.81% |
| smith_waterman | 14.46 +/- 2.38 ops/s | 7.22 +/- 1.54 ops/s | -50.09% |
How to read the table
- spark-native: throughput of the Catalyst code-generated expression (ops/s with standard deviation).
- UDF: throughput of a Spark UDF calling the equivalent Java SecondString method.
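- diff: UDF throughput relative to spark-native. A negative value means the UDF is slower than the native path by that percentage of native throughput; for example, -50% means the native path is about 2x as fast.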
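As a worked check of the diff column, here is a minimal Python sketch. It assumes the column is computed as (UDF - native) / native; that formula is inferred from the published numbers and is not confirmed by the suite itself. The names `results`, `relative_diff`, and `native_speedup` are illustrative and not part of the benchmark code. Because the means in the table are rounded to two decimals, recomputed values can differ from the published diff by a few hundredths of a percentage point.

```python
# Mean throughput in ops/s, copied from the table: (spark-native, UDF).
results = {
    "affine_gap": (11.00, 5.62),
    "jaro_winkler": (20.03, 13.22),
    "monge_elkan": (11.73, 5.04),
    "needleman_wunsch": (15.42, 7.27),
    "smith_waterman": (14.46, 7.22),
}


def relative_diff(native, udf):
    """Percent change in throughput going from native to UDF (assumed formula)."""
    return (udf - native) / native * 100.0


def native_speedup(native, udf):
    """How many times faster the native path is than the UDF."""
    return native / udf


for name, (native, udf) in results.items():
    print(
        f"{name}: diff {relative_diff(native, udf):+.2f}%, "
        f"native speedup {native_speedup(native, udf):.2f}x"
    )

# "Closest to parity" is the diff nearest to zero, i.e. the largest
# (least negative) value when every diff is negative.
closest = max(results, key=lambda n: relative_diff(*results[n]))
print("closest to parity:", closest)
```

The speedup figure is often easier to reason about than the percentage: a diff of roughly -50% corresponds to the native path running at about twice the UDF's throughput.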
Reproducing
Run the benchmark suite and regenerate the comparison table:
./dev/benchmarks_suite.sh --mode compare-only
Artifact source: benchmarks/target/reports/suite/compare-table.txt
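For post-processing the regenerated artifact, the following Python sketch parses rows into numbers. It assumes compare-table.txt contains the same pipe-delimited rows shown in the table above; the actual file layout is not documented here, so treat the pattern as a starting point. `parse_compare_table` and `SAMPLE` are hypothetical names used only for illustration.

```python
import re

# Matches a throughput cell such as "11.00 +/- 1.71 ops/s".
CELL = re.compile(r"([\d.]+)\s*\+/-\s*([\d.]+)\s*ops/s")


def parse_compare_table(text):
    """Parse pipe-delimited rows into a dict keyed by algorithm name.

    Assumes four columns: algorithm, spark-native, UDF, diff.
    Header and separator rows are skipped because their cells do not
    match the throughput pattern.
    """
    rows = {}
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 4:
            continue
        native = CELL.fullmatch(cells[1])
        udf = CELL.fullmatch(cells[2])
        if not (native and udf):
            continue
        rows[cells[0]] = {
            "native_mean": float(native.group(1)),
            "native_std": float(native.group(2)),
            "udf_mean": float(udf.group(1)),
            "udf_std": float(udf.group(2)),
            "diff_pct": float(cells[3].rstrip("%")),
        }
    return rows


# Two rows copied from the table above, used here in place of the real file.
SAMPLE = """\
| Algorithm | spark-native | UDF | diff |
|---|---|---|---|
| affine_gap | 11.00 +/- 1.71 ops/s | 5.62 +/- 0.88 ops/s | -48.94% |
| jaro_winkler | 20.03 +/- 4.57 ops/s | 13.22 +/- 2.64 ops/s | -34.00% |
"""

parsed = parse_compare_table(SAMPLE)
for name, row in parsed.items():
    print(name, row["native_mean"], row["udf_mean"], row["diff_pct"])
```

To run it against the real artifact, read the file contents (for example with `pathlib.Path(...).read_text()`) and pass the text to `parse_compare_table`.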