# Fuzzy Testing

This report compares the Spark-native string-similarity implementations against the reference Java SecondString library. Both implementations are run on the same randomly generated input pairs, and their output scores are compared row by row.
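The row-by-row comparison can be sketched as follows. This is a minimal Python illustration (the actual harness is Scala/Spark); `native_score` and `reference_score` are toy stand-ins for the two implementations under test, not the project's real API.

```python
import random

def native_score(a: str, b: str) -> float:
    # Toy stand-in for a Spark-native metric: normalized common-prefix length.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n / max(len(a), len(b), 1)

def reference_score(a: str, b: str) -> float:
    # Toy stand-in for the SecondString reference; identical here,
    # whereas the real libraries may diverge on some inputs.
    return native_score(a, b)

def compare_pairs(pairs, tol=0.05):
    """Row-by-row parity check: fraction of rows whose absolute
    score difference is within `tol`."""
    within = sum(
        1 for a, b in pairs
        if abs(native_score(a, b) - reference_score(a, b)) <= tol
    )
    return within / len(pairs)

# Seeded random input pairs, mirroring the report's reproducible setup.
random.seed(42)
alphabet = "abcde"
pairs = [
    ("".join(random.choices(alphabet, k=5)),
     "".join(random.choices(alphabet, k=5)))
    for _ in range(1000)
]
print(compare_pairs(pairs))  # 1.0 here, since both toy scorers agree
```

The real harness does the same thing per metric over Spark DataFrames, then aggregates the per-row differences into the statistics shown below.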
## Summary

- Total compared rows: 197509
- Overall rows within +-5% agreement: 74.71%
- Metric with the lowest >30% drift: smith_waterman (0.00%)
| Metric | Rows | Pearson | Spearman | +-5% | +-10% | +-30% | >30% |
|---|---|---|---|---|---|---|---|
| needleman_wunsch | 50000 | 0.999108 | 0.996912 | 83.96% | 7.13% | 7.34% | 1.57% |
| smith_waterman | 50000 | 1.000000 | 1.000000 | 100.00% | 0.00% | 0.00% | 0.00% |
| jaro_winkler | 50000 | 0.979174 | 0.991773 | 93.91% | 2.62% | 0.63% | 2.85% |
| monge_elkan | 47509 | 0.814753 | 0.856769 | 18.14% | 2.68% | 21.19% | 57.99% |
## How to read the table
- Pearson / Spearman: correlation coefficients between the native and reference scores. Values close to 1.0 indicate strong agreement.
- +-5% / +-10% / +-30%: percentage of rows whose absolute score difference falls in that bucket. The buckets are disjoint (within 5%; between 5% and 10%; between 10% and 30%), so the four percentage columns in each row sum to 100%.
- >30%: percentage of rows with more than 30% absolute difference, indicating significant divergence.
High >30% counts typically indicate a known algorithmic difference rather than a bug (e.g. Monge-Elkan uses a symmetric average while SecondString uses a one-directional score).
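The Monge-Elkan asymmetry can be seen in a small example. This sketch uses a toy exact-match inner similarity (the real libraries use a secondary metric such as Jaro-Winkler), so the numbers only illustrate the directional effect:

```python
def sim(a: str, b: str) -> float:
    # Toy inner similarity: 1 for an exact token match, else 0.
    return 1.0 if a == b else 0.0

def monge_elkan(src_tokens, tgt_tokens):
    """One-directional Monge-Elkan: the average, over source tokens,
    of the best inner similarity against any target token."""
    return sum(
        max(sim(s, t) for t in tgt_tokens) for s in src_tokens
    ) / len(src_tokens)

a = ["new", "york", "city"]
b = ["new", "york"]

one_way = monge_elkan(a, b)  # 2/3: "city" finds no match in b
symmetric = (monge_elkan(a, b) + monge_elkan(b, a)) / 2  # (2/3 + 1) / 2 = 5/6
print(one_way, symmetric)
```

Because the one-directional score depends on which string is the source, a symmetric average can differ substantially from it whenever the two token lists have different lengths, which is consistent with the large >30% share for monge_elkan above.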
## Reproducing

```shell
sbt "fuzzy-testing/runMain io.github.semyonsinchenko.sparkss.fuzzy.FuzzyTestingCli \
  --seed 42 --rows 100000 \
  --out target/reports/fuzzy-report.md \
  --save-output target/reports/fuzzy-csv"
```
Artifact source: `fuzzy-testing/target/reports/fuzzy-report.md`