Overview
spark-second-string provides Spark-native similarity metrics for string-heavy workloads such as identity resolution, entity matching, and duplicate detection.
The core idea is to keep similarity scoring inside Spark execution, so teams can use broad and explainable string heuristics as an inexpensive stage before expensive model-based matching.
Where it fits?
Most large-scale entity-resolution / identity-resolution pipelines have the same four stages. The hard scale work is in stages 1–3; stage 4 stops being a scale problem because each cluster is independent and small.
1. Blocking — narrow the candidate space
You don't want all-vs-all comparison. Use cheap, exact keys to bucket records and only score within the bucket. Spark built-ins are enough:
-- Block by digits-only key (phone numbers, IDs with separators)
SELECT *, regexp_replace(phone, '[^0-9]', '') AS block_key FROM customers
-- Block by first 4 letters of normalized last name
SELECT *, substring(lower(last_name), 1, 4) AS block_key FROM customers
-- Block by leading digit (account numbers, postal codes)
SELECT *, substring(account, 1, 1) AS block_key
FROM accounts
WHERE account RLIKE '^[0-9]'
Then JOIN ... ON l.block_key = r.block_key produces candidate pairs without the quadratic blow-up.
2. Scoring — fuzzy similarity inside the block
This is where spark-second-string lives. For each candidate pair, compute one or more similarity scores and keep pairs above a threshold:
SELECT l.id, r.id, ss_jaro_winkler(l.name, r.name) AS name_sim
FROM pairs
WHERE ss_jaro_winkler(l.name, r.name) > 0.85
Multiple metrics are cheap — every metric is a single Catalyst expression that fuses into the same whole-stage codegen block as the surrounding filter.
3. Clustering — connect pairs into entities
The pairs above your threshold form a graph; connected components groups them into entities. GraphFrames connectedComponents implements WCC with algorithms tuned for the long transitive chains typical of ER (A↔B, B↔C, but no direct A↔C edge — you still want them in one cluster). Same author maintains GraphFrames and this library; the two are designed to compose.
4. Within-cluster pruning — do whatever
Once you have clusters, they're small and independent. Apply business rules, manual review, per-cluster ranking, an LLM call — whatever you want. No longer a scale problem.
Why a dedicated library for step 2?
| Approach | Pros | Cons |
|---|---|---|
| Custom Scala UDFs / wrapped Java libs (e.g., SecondString in a UDF) | Easy to write; reuse existing implementations | Crosses the Tungsten boundary every row; per-row scratch allocation; no codegen fusion with surrounding filters |
| Python NLP libs (rapidfuzz, jellyfish) in vectorized UDFs | Rich algorithm catalogue; familiar to DS teams | Off-heap memory pressure; row → Arrow columnar → row marshalling around every call; harder to size executors |
| Spark built-ins only | Native, fast, no dependencies | Only levenshtein (raw Int distance, no normalization) and soundex; everything else has to be assembled from array primitives — see below |
spark-second-string |
Native Catalyst expressions with codegen and ThreadLocal buffer reuse; 13 similarity metrics + phonetic codecs; fuzz-tested against SecondString as reference oracle | Slower than C-backed Python libs in pure micro-benchmarks; smaller algorithm catalogue than full NLP stacks |
What "built-ins only" actually looks like
In theory you can express Jaccard with array primitives. In practice, even character-trigram Jaccard — a one-line definition mathematically — becomes:
SELECT
size(array_intersect(
transform(sequence(1, length(a) - 2), i -> substring(a, i, 3)),
transform(sequence(1, length(b) - 2), i -> substring(b, i, 3))
))
/
size(array_union(
transform(sequence(1, length(a) - 2), i -> substring(a, i, 3)),
transform(sequence(1, length(b) - 2), i -> substring(b, i, 3))
)) AS jaccard
FROM pairs
…which:
- breaks when
length(a) < 3orlength(b) < 3(you get an emptysequence, then division by zero), - tokenizes each input twice because Catalyst can't always common-subexpression-eliminate
transform, - offers no parameter for n-gram size beyond rewriting the SQL,
- gives no control over case folding, whitespace normalization, or empty-input semantics.
Word-level Jaccard via split is slightly less ugly but still naive on whitespace and equally devoid of options.
The equivalent with this library:
-- SQL: whitespace-tokenized words, defaults
SELECT ss_jaccard(a, b) FROM pairs
// Scala DSL: character trigrams via the parametric overload
StringSimilarityFunctions.jaccard(col("a"), col("b"), 3)
For Jaro, Jaro-Winkler, Smith-Waterman, Needleman-Wunsch, Monge-Elkan, and the rest there is no built-in at all — a UDF is the only fallback, which puts you back in row 1 of the table.
Design principles
The library exposes two intentionally different surfaces, tuned to two different audiences.
Flow A — SQL / expr(...) for SQL and PySpark users
Two-argument SQL functions with sane defaults — ss_jaccard(a, b), ss_jaro_winkler(a, b), ss_levenshtein(a, b), and the rest of the ss_* family. No knobs. The defaults are the common-case choices: Jaccard tokenizes on whitespace, Jaro-Winkler uses the canonical prefix scale and cap, Monge-Elkan uses Jaro-Winkler as the inner metric, alignment metrics use standard match / mismatch / gap costs.
Two ways to register:
--conf spark.sql.extensions=io.github.semyonsinchenko.sparkss.sql.SparkSecondStringExtensionat cluster or session bootstrap. Works for SQL-only and PySpark setups with no driver-side Scala code.spark.registerStringSimilarityFunctions()from Scala, after importing the implicit class fromStringSimilaritySparkSessionExtensions.io.github.semyonsinchenko.sparkss.sql.StringSimilaritySparkSessionExtensions.registerAllFunctionsPy4j()from Py4j only if you cannot change spark'sconfby any reason. Keep in mind that this method is unsafe and has side effects. It assumes that theSparkSessionexists, that all the names are free and it modifies the existingSparkSession'sfunctionRegistryunder the hood. I added this endpoint only to siplify the life of the PySpark users on some vendor-serverless-bla-bla deployment where the option of config's modification does not exists at all but user wants for any reason to use SQL-expressions instead of the DSL.
spark._jvm.io.github.semyonsinchenko.sparkss.sql.StringSimilaritySparkSessionExtensions.registerAllFunctionsPy4j()
Once registered, PySpark users can call any function through F.expr("ss_jaccard(a, b)") like a native SQL function.
Flow B — Scala DSL StringSimilarityFunctions for library developers
For JVM consumers embedding this library inside larger ER systems (Zingg-style, Splink-style, or custom in-house pipelines): every metric has parametric overloads — ngramSize, prefixScale / prefixCap, matchScore / mismatchPenalty / gapPenalty, Monge-Elkan innerMetric, affine-gap open / extend penalties — reachable from Scala or Java.
StringSimilarityFunctions.jaccard(col("a"), col("b"), 3) // character trigrams
StringSimilarityFunctions.smithWaterman(col("a"), col("b"), 2, -1, -1) // tuned alignment
A note on py4j
The Scala DSL was designed with py4j ergonomics in mind, even though a published PySpark wrapper is not part of this release. Concretely:
- Every public method parameter is a Java primitive (
int,double),String, ororg.apache.spark.sql.Column. - No
Option,Seq,Map,Tuple, type parameters, or implicit parameter lists. - "Default" forms are explicit zero-arg overloads, not Scala default arguments — py4j picks the right overload by arity without
apply$default$Nsynthetic gymnastics.
A thin in-house Python wrapper is roughly ten lines per metric:
from pyspark.sql.column import Column, _to_java_column
ss = spark.sparkContext._jvm.io.github.semyonsinchenko.sparkss.StringSimilarityFunctions
score = Column(ss.jaccard(_to_java_column(left), _to_java_column(right), 3))
py4j compatibility is treated as a deliberate design constraint but is not part of the formal CI test matrix — treat as best-effort.