Benchmarks
This page compares the throughput of Spark-native Catalyst expressions with equivalent UDF wrappers around the Java SecondString library. Higher ops/s is better. The diff column gives the change in throughput of the UDF relative to the spark-native expression, so a negative value means the native implementation is faster.
Summary
Algorithms compared: 5
Best relative delta (closest to parity): jaro_winkler (-33.86%)
| Algorithm | spark-native | UDF | diff |
|---|---|---|---|
| affine_gap | 11.30 +/- 2.20 ops/s | 5.80 +/- 1.01 ops/s | -48.68% |
| jaro_winkler | 19.59 +/- 4.77 ops/s | 12.95 +/- 2.73 ops/s | -33.86% |
| monge_elkan | 11.45 +/- 1.94 ops/s | 4.90 +/- 0.66 ops/s | -57.20% |
| needleman_wunsch | 14.76 +/- 3.00 ops/s | 6.68 +/- 0.98 ops/s | -54.78% |
| smith_waterman | 12.90 +/- 3.06 ops/s | 7.07 +/- 1.23 ops/s | -45.21% |
How to read the table
- spark-native: throughput of the Catalyst code-generated expression, reported as ops/s +/- standard deviation.
- UDF: throughput of a Spark UDF calling the equivalent Java SecondString method.
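- diff: relative change in throughput when going from spark-native to UDF, consistent with (UDF - spark-native) / spark-native. A negative value means the UDF is slower than the native path by that percentage of the native throughput. For example, affine_gap at -48.68% means the UDF reaches roughly half the native throughput (5.80 vs. 11.30 ops/s), which makes the native path about 1.95x faster, not 48.68% faster.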
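The relationship between the columns can be checked with a short Python sketch. The formula is inferred from the published numbers rather than taken from the benchmark scripts, and the names `RESULTS`, `relative_diff`, and `speedup` are hypothetical. Values recomputed from the rounded means can differ from the table by a few hundredths of a percentage point, presumably because the table was generated from unrounded measurements.

```python
# Rounded means copied from the table: (spark-native ops/s, UDF ops/s).
RESULTS = {
    "affine_gap": (11.30, 5.80),
    "jaro_winkler": (19.59, 12.95),
    "monge_elkan": (11.45, 4.90),
    "needleman_wunsch": (14.76, 6.68),
    "smith_waterman": (12.90, 7.07),
}


def relative_diff(native: float, udf: float) -> float:
    """Percentage change of UDF throughput relative to native throughput.

    Assumed formula, inferred from the table: (UDF - native) / native.
    """
    return (udf - native) / native * 100.0


def speedup(native: float, udf: float) -> float:
    """How many times faster the native path is than the UDF."""
    return native / udf


if __name__ == "__main__":
    for name, (native, udf) in RESULTS.items():
        print(
            f"{name}: diff={relative_diff(native, udf):+.2f}% "
            f"speedup={speedup(native, udf):.2f}x"
        )
```

Reading the diff column as a speedup factor is often easier when comparing algorithms: a diff near -50% corresponds to the native expression running at about twice the UDF throughput.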
Reproducing
Run the benchmark suite and regenerate the comparison table:
./dev/benchmarks_suite.sh --mode compare-only
Artifact source: benchmarks/target/reports/suite/compare-table.txt
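To post-process the results, a minimal Python sketch like the one below could parse the comparison table and pick the algorithm closest to parity. It assumes the artifact uses the same pipe-delimited layout as the table in this document, which is not confirmed here; `parse_table` and `closest_to_parity` are hypothetical helper names, and `SAMPLE` simply repeats the published rows.

```python
import re

# Rows copied from the table in this document; the real artifact file is
# assumed (not confirmed) to use the same pipe-delimited layout.
SAMPLE = """\
| Algorithm | spark-native | UDF | diff |
|---|---|---|---|
| affine_gap | 11.30 +/- 2.20 ops/s | 5.80 +/- 1.01 ops/s | -48.68% |
| jaro_winkler | 19.59 +/- 4.77 ops/s | 12.95 +/- 2.73 ops/s | -33.86% |
| monge_elkan | 11.45 +/- 1.94 ops/s | 4.90 +/- 0.66 ops/s | -57.20% |
| needleman_wunsch | 14.76 +/- 3.00 ops/s | 6.68 +/- 0.98 ops/s | -54.78% |
| smith_waterman | 12.90 +/- 3.06 ops/s | 7.07 +/- 1.23 ops/s | -45.21% |
"""

# Matches a throughput cell such as "11.30 +/- 2.20 ops/s".
CELL = re.compile(r"([\d.]+) \+/- ([\d.]+) ops/s")


def parse_table(text: str) -> list[dict]:
    """Return one dict per data row; header and separator rows are skipped."""
    rows = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 4:
            continue
        native = CELL.fullmatch(cells[1])
        udf = CELL.fullmatch(cells[2])
        if not (native and udf):
            continue  # header or separator row
        rows.append(
            {
                "algorithm": cells[0],
                "native_ops": float(native.group(1)),
                "native_std": float(native.group(2)),
                "udf_ops": float(udf.group(1)),
                "udf_std": float(udf.group(2)),
                "diff_pct": float(cells[3].rstrip("%")),
            }
        )
    return rows


def closest_to_parity(rows: list[dict]) -> dict:
    """Row whose diff has the smallest absolute value."""
    return min(rows, key=lambda r: abs(r["diff_pct"]))


if __name__ == "__main__":
    best = closest_to_parity(parse_table(SAMPLE))
    print(best["algorithm"], best["diff_pct"])
```

To run it against a real report, replace `SAMPLE` with the contents of the artifact file, after confirming that the file actually follows this layout.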