Quick Start
Installation
Add the library artifact to your Spark application build. The artifact suffix matches your Spark minor version line, which also determines the Scala binary version:
| Spark line | Scala binary | Artifact name |
|---|---|---|
| 3.5.x | 2.12 | spark-second-string-3.5 |
| 4.0.x | 2.13 | spark-second-string-4.0 |
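As a rough illustration of the table above, an sbt dependency for the Spark 3.5.x line might look like the sketch below. The group ID (inferred from the package name `io.github.semyonsinchenko`), the use of `%%` for the Scala binary suffix, and the version placeholder are all assumptions that this page does not confirm; check the published coordinates before copying.

```scala
// build.sbt (sketch). The group ID and version are assumptions, not confirmed by this page.
val sparkSecondStringVersion = "x.y.z" // placeholder: substitute a released version

// %% appends the Scala binary suffix (_2.12 or _2.13) taken from scalaVersion.
// Assumption: the artifact is published with that suffix.
libraryDependencies += "io.github.semyonsinchenko" %% "spark-second-string-3.5" % sparkSecondStringVersion
```

For the Spark 4.0.x line, the artifact name would change to `spark-second-string-4.0` and the build would use Scala 2.13, as listed in the table.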
Usage Flow A: Direct DataFrame API
import io.github.semyonsinchenko.sparkss.StringSimilarityFunctions
import org.apache.spark.sql.functions.col
val scored = pairs
  .withColumn("jw", StringSimilarityFunctions.jaroWinkler(col("left_name"), col("right_name")))
  .withColumn("sw", StringSimilarityFunctions.smithWaterman(col("left_name"), col("right_name")))
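The snippet above assumes a DataFrame named `pairs` with `left_name` and `right_name` columns already exists. A minimal sketch of building one for local experimentation follows; the sample rows, the `id` column, and the local session settings are hypothetical and use only plain Spark APIs.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Local session for experimentation only; a real job would reuse its existing session.
val spark: SparkSession = SparkSession.builder()
  .master("local[*]")
  .appName("sparkss-quickstart")
  .getOrCreate()

import spark.implicits._

// Hypothetical sample rows; the column names match those used with withColumn above.
val pairs: DataFrame = Seq(
  (1L, "martha", "marhta"),
  (2L, "dwayne", "duane"),
  (3L, "dixon", "dicksonx")
).toDF("id", "left_name", "right_name")

pairs.printSchema()
```

With `pairs` defined this way, the `withColumn` calls add one similarity column per metric next to the input columns.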
Usage Flow B: Spark SQL Extension Functions
Recommended for cluster/session bootstrap (Spark SQL and PySpark):
--conf spark.sql.extensions=io.github.semyonsinchenko.sparkss.sql.SparkSecondStringExtension
Alternatively, register the functions on an existing SparkSession from Scala:
import io.github.semyonsinchenko.sparkss.sql.StringSimilaritySparkSessionExtensions._
spark.registerStringSimilarityFunctions()
val scored = spark.sql(
"""
|SELECT id,
| ss_jaro_winkler(left_name, right_name) AS jw,
| ss_smith_waterman(left_name, right_name) AS sw,
| ss_soundex(left_name) AS left_soundex
|FROM candidate_pairs
|""".stripMargin
)
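The query above reads from a table or view named `candidate_pairs`, which must exist in the session catalog before the query runs. One way to provide it, sketched here with hypothetical sample rows and a local session, is to register a DataFrame as a temporary view using the standard `createOrReplaceTempView` method.

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimentation only; reuse your own session in a real job.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("sparkss-sql-quickstart")
  .getOrCreate()

import spark.implicits._

// Hypothetical candidate pairs with the columns the query selects.
val candidatePairs = Seq(
  (1L, "martha", "marhta"),
  (2L, "dwayne", "duane")
).toDF("id", "left_name", "right_name")

// Expose the DataFrame to Spark SQL under the name used in the FROM clause.
candidatePairs.createOrReplaceTempView("candidate_pairs")
```

A temporary view lives only for the lifetime of the session, so it suits quick checks; a persistent table would work with the same query.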
PySpark startup example:
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .config(
        "spark.sql.extensions",
        "io.github.semyonsinchenko.sparkss.sql.SparkSecondStringExtension",
    )
    .getOrCreate()
)
spark.sql("SELECT ss_jaro_winkler('martha', 'marhta') AS score").show()
Note: The SQL similarity functions take exactly two arguments for compatibility. Configurable metric parameters, including ngramSize, are available through the StringSimilarityFunctions DSL overloads.
Docs Build
The docs pages consume variables generated by the benchmark and fuzzy-testing runs, so both must run before the docs build.
Required pre-runs:
./dev/benchmarks_suite.sh --mode compare-only
sbt "fuzzy-testing/runMain io.github.semyonsinchenko.sparkss.fuzzy.FuzzyTestingCli --seed 42 --rows 100000 --out target/reports/fuzzy-report.md --save-output target/reports/fuzzy-csv"
Build docs:
sbt docs/laikaSite