# Quick Start

## Installation
Add the library artifact to your Spark application's build; a `build.sbt` sketch follows the table. The artifact-name suffix tracks your Spark minor line:
| Spark line | Scala binary | Artifact name |
|---|---|---|
| 3.5.x | 2.12 | spark-second-string-3.5 |
| 4.0.x | 2.13 | spark-second-string-4.0 |
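
A minimal `build.sbt` sketch. The organization ID and version below are assumptions; confirm the published coordinates before copying:

```scala
// build.sbt sketch: the organization and version are assumptions, not confirmed
// coordinates. "%%" appends the Scala binary suffix from the table above.
libraryDependencies += "io.github.semyonsinchenko" %% "spark-second-string-3.5" % "x.y.z"
```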
## Usage Flow A: Direct DataFrame API
```scala
import io.github.semyonsinchenko.sparkss.StringSimilarityFunctions
import org.apache.spark.sql.functions.col

// pairs: a DataFrame with string columns left_name and right_name
val scored = pairs
  .withColumn("jw", StringSimilarityFunctions.jaroWinkler(col("left_name"), col("right_name")))
  .withColumn("sw", StringSimilarityFunctions.smithWaterman(col("left_name"), col("right_name")))
```
## Usage Flow B: Spark SQL Extension Functions
```scala
import io.github.semyonsinchenko.sparkss.sql.StringSimilaritySparkSessionExtensions._

// Register the similarity functions on the active SparkSession.
spark.registerStringSimilarityFunctions()

// Assumes a temp view candidate_pairs with columns id, left_name, right_name.
val scored = spark.sql(
  """
    |SELECT id,
    |       jaro_winkler(left_name, right_name) AS jw,
    |       smith_waterman(left_name, right_name) AS sw,
    |       ss_soundex(left_name) AS left_soundex
    |FROM candidate_pairs
    |""".stripMargin
)
```
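
To sanity-check the output, print a few scored rows:

```scala
// Quick inspection of the computed similarity columns.
scored.show(truncate = false)
```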
Note: the SQL similarity functions are kept two-argument for compatibility; configurable metric parameters (for example `ngramSize`) are available through `StringSimilarityFunctions` DSL overloads.
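
For illustration only, one shape such an overload might take; the function name and named parameter below are assumptions based on the note above, not the library's confirmed API:

```scala
// Hypothetical call shape: ngramSimilarity and the ngramSize parameter are
// assumptions; check the StringSimilarityFunctions scaladoc for the real API.
val ngramScored = pairs.withColumn(
  "ngram_sim",
  StringSimilarityFunctions.ngramSimilarity(col("left_name"), col("right_name"), ngramSize = 3)
)
```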
## Docs Build

The documentation pages interpolate variables generated by the benchmark and fuzzy-testing runs, so both must be produced before building the site.
Required pre-runs:

```bash
./dev/benchmarks_suite.sh --mode compare-only
sbt "fuzzy-testing/runMain io.github.semyonsinchenko.sparkss.fuzzy.FuzzyTestingCli --seed 42 --rows 100000 --out target/reports/fuzzy-report.md --save-output target/reports/fuzzy-csv"
```
Build the docs site:

```bash
sbt docs/laikaSite
```

By default the Laika sbt plugin writes the generated site under the docs module's `target/docs/site` directory.