Benchmarks

Performance comparison of Spark-native Catalyst expressions and equivalent UDF wrappers around the Java SecondString library. Higher ops/s is better. The diff column shows the UDF throughput relative to the spark-native throughput, so a negative value means the UDF is slower and the native implementation is faster.

Summary

Algorithms compared: 5

Best relative delta (closest to parity): jaro_winkler (-33.86%)

Algorithm	spark-native	UDF	diff
affine_gap	11.30 +/- 2.20 ops/s	5.80 +/- 1.01 ops/s	-48.68%
jaro_winkler	19.59 +/- 4.77 ops/s	12.95 +/- 2.73 ops/s	-33.86%
monge_elkan	11.45 +/- 1.94 ops/s	4.90 +/- 0.66 ops/s	-57.20%
needleman_wunsch	14.76 +/- 3.00 ops/s	6.68 +/- 0.98 ops/s	-54.78%
smith_waterman	12.90 +/- 3.06 ops/s	7.07 +/- 1.23 ops/s	-45.21%

How to read the table

spark-native: throughput of the Catalyst code-generated expression (ops/s with standard deviation).
UDF: throughput of a Spark UDF calling the equivalent Java SecondString method.
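diff: relative throughput change of the UDF path with respect to the native path; the reported values are consistent with (UDF - native) / native. A negative value means the UDF path is slower by that percentage, so the native path is the faster one. For example, -48.68% for affine_gap means the UDF reaches roughly half of the native throughput.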
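
As a worked example of the diff column, the sketch below recomputes the affine_gap row. It assumes diff is (UDF - native) / native, which reproduces the table to within rounding but is inferred from the numbers rather than taken from the benchmark scripts; `relative_diff` and `native_speedup` are hypothetical helper names.

```python
def relative_diff(native_ops: float, udf_ops: float) -> float:
    """Throughput change of the UDF path relative to the native path, in percent.

    Assumed formula: (UDF - native) / native. Negative means the UDF is slower.
    """
    return (udf_ops - native_ops) / native_ops * 100.0


def native_speedup(native_ops: float, udf_ops: float) -> float:
    """How many times faster the native path is than the UDF path."""
    return native_ops / udf_ops


# affine_gap row from the table: 11.30 ops/s native, 5.80 ops/s UDF.
diff = relative_diff(11.30, 5.80)
speedup = native_speedup(11.30, 5.80)

print(f"diff = {diff:.2f}%")                # diff = -48.67%
print(f"native speedup = {speedup:.2f}x")   # native speedup = 1.95x
```

The recomputed -48.67% differs from the reported -48.68% in the last digit, presumably because the table was derived from unrounded throughput values. Note that a diff of about -49% corresponds to the native path being about 1.95x as fast, not 49% faster.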

Reproducing

Run the benchmark suite and regenerate the comparison table:

./dev/benchmarks_suite.sh --mode compare-only

Artifact source: benchmarks/target/reports/suite/compare-table.txt
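
To post-process a regenerated table, a sketch along these lines may help. It assumes the artifact uses the same tab-separated layout as the table in the Summary, which has not been checked against the real file; `SAMPLE`, `CELL`, and `parse_table` are hypothetical names, and the sample rows are copied from the table above.

```python
import re

# Sample rows copied from the Summary table. The layout of the real
# compare-table.txt is an assumption: one header row, tab-separated columns.
SAMPLE = (
    "Algorithm\tspark-native\tUDF\tdiff\n"
    "affine_gap\t11.30 +/- 2.20 ops/s\t5.80 +/- 1.01 ops/s\t-48.68%\n"
    "jaro_winkler\t19.59 +/- 4.77 ops/s\t12.95 +/- 2.73 ops/s\t-33.86%\n"
    "monge_elkan\t11.45 +/- 1.94 ops/s\t4.90 +/- 0.66 ops/s\t-57.20%\n"
    "needleman_wunsch\t14.76 +/- 3.00 ops/s\t6.68 +/- 0.98 ops/s\t-54.78%\n"
    "smith_waterman\t12.90 +/- 3.06 ops/s\t7.07 +/- 1.23 ops/s\t-45.21%\n"
)

# Matches a cell such as "11.30 +/- 2.20 ops/s".
CELL = re.compile(r"([\d.]+) \+/- ([\d.]+) ops/s")


def parse_cell(cell: str) -> tuple[float, float]:
    """Return (throughput, deviation) from one throughput cell."""
    match = CELL.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return float(match.group(1)), float(match.group(2))


def parse_table(text: str) -> list[dict]:
    """Parse the tab-separated comparison table, skipping the header row."""
    rows = []
    for line in text.strip().splitlines()[1:]:
        name, native, udf, diff = line.split("\t")
        native_ops, native_dev = parse_cell(native)
        udf_ops, udf_dev = parse_cell(udf)
        rows.append({
            "algorithm": name,
            "native_ops": native_ops,
            "native_dev": native_dev,
            "udf_ops": udf_ops,
            "udf_dev": udf_dev,
            "diff_pct": float(diff.strip().rstrip("%")),
        })
    return rows


rows = parse_table(SAMPLE)

# "Closest to parity" is the row whose diff has the smallest magnitude.
closest = min(rows, key=lambda r: abs(r["diff_pct"]))
print(closest["algorithm"], closest["diff_pct"])  # jaro_winkler -33.86
```

To run it against a real report, replace `SAMPLE` with the contents of the artifact file, and adjust `CELL` if the actual format differs.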