Quick Start

Installation

Add the library artifact to your Spark application's build. The artifact name's suffix matches the Spark minor release line you build against (for example, 3.5 for any Spark 3.5.x release).

Spark line   Scala binary   Artifact name
3.5.x        2.12           spark-second-string-3.5
4.0.x        2.13           spark-second-string-4.0
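
The suffix rule can be sketched as a small lookup. This is an illustrative helper only: `artifact_for_spark` and `SUPPORTED_LINES` are hypothetical names, not part of the library, and the mapping simply mirrors the two rows listed above.

```python
# Mapping copied from the compatibility table above:
# Spark minor line -> Scala binary version. (Hypothetical helper, not a library API.)
SUPPORTED_LINES = {"3.5": "2.12", "4.0": "2.13"}


def artifact_for_spark(spark_version: str) -> tuple[str, str]:
    """Return (artifact name, Scala binary version) for a full Spark version string."""
    # Keep only "major.minor", e.g. "3.5.1" -> "3.5".
    minor_line = ".".join(spark_version.split(".")[:2])
    if minor_line not in SUPPORTED_LINES:
        raise ValueError(f"No artifact listed for Spark {spark_version}")
    return f"spark-second-string-{minor_line}", SUPPORTED_LINES[minor_line]


print(artifact_for_spark("3.5.1"))  # ('spark-second-string-3.5', '2.12')
```

In a PySpark environment the input could come from `pyspark.__version__`; Spark lines that are not in the table raise a `ValueError` rather than guessing an artifact name.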

Usage Flow A: Direct DataFrame API

import io.github.semyonsinchenko.sparkss.StringSimilarityFunctions
import org.apache.spark.sql.functions.col

val scored = pairs
  .withColumn("jw", StringSimilarityFunctions.jaroWinkler(col("left_name"), col("right_name")))
  .withColumn("sw", StringSimilarityFunctions.smithWaterman(col("left_name"), col("right_name")))

Usage Flow B: Spark SQL Extension Functions

Recommended for cluster/session bootstrap (Spark SQL and PySpark):

--conf spark.sql.extensions=io.github.semyonsinchenko.sparkss.sql.SparkSecondStringExtension

Scala sessions can also register the functions explicitly:

import io.github.semyonsinchenko.sparkss.sql.StringSimilaritySparkSessionExtensions._

spark.registerStringSimilarityFunctions()

val scored = spark.sql(
  """
    |SELECT id,
    |       ss_jaro_winkler(left_name, right_name) AS jw,
    |       ss_smith_waterman(left_name, right_name) AS sw,
    |       ss_soundex(left_name) AS left_soundex
    |FROM candidate_pairs
    |""".stripMargin
)

PySpark startup example:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config(
        "spark.sql.extensions",
        "io.github.semyonsinchenko.sparkss.sql.SparkSecondStringExtension",
    )
    .getOrCreate()
)

spark.sql("SELECT ss_jaro_winkler('martha', 'marhta') AS score").show()
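
For orientation, here is what a score for that pair looks like under the textbook Jaro-Winkler definition. This is a plain-Python reference sketch, not the library's implementation. It assumes the common defaults (prefix scale 0.1, prefix capped at 4 characters, no boost threshold), and `ss_jaro_winkler` may differ in such details.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the shared prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)


print(round(jaro_winkler("martha", "marhta"), 4))  # 0.9611
```

All six characters match with one transposition, giving a Jaro score of about 0.9444. The shared prefix "mar" then raises it to about 0.9611.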

Note: The SQL similarity functions take exactly two arguments for compatibility. Configurable metric parameters and ngramSize are available through the StringSimilarityFunctions DSL overloads.

Docs Build

The docs pages consume variables generated by the benchmark suite and the fuzzy-testing run.

Required pre-runs:

./dev/benchmarks_suite.sh --mode compare-only
sbt "fuzzy-testing/runMain io.github.semyonsinchenko.sparkss.fuzzy.FuzzyTestingCli --seed 42 --rows 100000 --out target/reports/fuzzy-report.md --save-output target/reports/fuzzy-csv"

Build docs:

sbt docs/laikaSite
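
The ordering above (two pre-runs, then the site build) can be scripted so that a failed pre-run stops the build. A minimal sketch, assuming it is run from the repository root: `DOCS_BUILD_STEPS` and `run_docs_build` are hypothetical names, and the commands are copied verbatim from this section.

```python
import shlex
import subprocess

# Commands copied from the steps above, in the order they must run.
DOCS_BUILD_STEPS = [
    "./dev/benchmarks_suite.sh --mode compare-only",
    'sbt "fuzzy-testing/runMain io.github.semyonsinchenko.sparkss.fuzzy.FuzzyTestingCli '
    '--seed 42 --rows 100000 --out target/reports/fuzzy-report.md '
    '--save-output target/reports/fuzzy-csv"',
    "sbt docs/laikaSite",
]


def run_docs_build(runner=subprocess.run):
    """Run each step in order; check=True aborts on the first non-zero exit."""
    for step in DOCS_BUILD_STEPS:
        # shlex.split keeps the quoted sbt task as a single argument.
        runner(shlex.split(step), check=True)
```

Calling `run_docs_build()` runs the three commands sequentially. The `runner` parameter exists only so the sequence can be exercised without invoking sbt.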