Supported Metrics
All similarity metrics return a Double in the range [0.0, 1.0] where 1.0 means identical and 0.0 means
completely different. Phonetic encoders return a String encoding. Every metric is available through both the DataFrame
DSL (StringSimilarityFunctions) and Spark SQL (StringSimilaritySparkSessionExtensions).
Token-based metrics
Token metrics split input strings into token sets and measure set overlap. By default tokens are whitespace-separated
words. All token metrics accept an optional ngramSize parameter; when set to a value greater than zero the input is
split into character n-grams instead.
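The two modes can be sketched in Python (an illustration of the splitting rule only, not the library's Scala implementation; note that n-grams slide over the raw string, so spaces appear inside n-grams):

```python
def tokenize(s: str, ngram_size: int = 0) -> set:
    """Split into whitespace tokens (ngram_size == 0) or character n-grams."""
    if ngram_size <= 0:
        return set(s.split())
    # n-grams are taken over the raw string, so whitespace ends up inside tokens
    return {s[i:i + ngram_size] for i in range(len(s) - ngram_size + 1)}
```

For example, `tokenize("hello world")` yields `{"hello", "world"}`, while `tokenize("hello world", 2)` yields the ten bigrams shown in the Tokenization modes section below, including `"o "` and `" w"`.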
Jaccard
Set intersection over set union.
Formula: |A ∩ B| / |A ∪ B|
| DSL | jaccard(left, right) / jaccard(left, right, ngramSize) |
| SQL | jaccard(left, right) |
| Parameters | ngramSize: Int (default 0 = whitespace tokens, >0 = character n-grams) |
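A minimal sketch of the formula over pre-tokenized sets (the empty-input behavior here is an assumption, not taken from the library):

```python
def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0  # assumption: two empty token sets count as identical
    return len(a & b) / len(a | b)

# {"hello", "world"} vs {"hello", "there"}: intersection 1, union 3 -> 1/3
```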
Sorensen-Dice
Doubled intersection over the sum of set sizes. Emphasizes overlap more than Jaccard.
Formula: 2 * |A ∩ B| / (|A| + |B|)
| DSL | sorensenDice(left, right) / sorensenDice(left, right, ngramSize) |
| SQL | sorensen_dice(left, right) |
| Parameters | ngramSize: Int (default 0) |
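The same example sets as above, under the Dice formula (empty-input handling is an assumption):

```python
def sorensen_dice(a: set, b: set) -> float:
    if not a and not b:
        return 1.0  # assumption for the degenerate empty/empty case
    return 2 * len(a & b) / (len(a) + len(b))

# {"hello", "world"} vs {"hello", "there"}: 2 * 1 / (2 + 2) = 0.5,
# higher than the Jaccard score of 1/3 for the same inputs
```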
Overlap Coefficient
Intersection relative to the smaller set. A value of 1.0 means one token set is a subset of the other.
Formula: |A ∩ B| / min(|A|, |B|)
| DSL | overlapCoefficient(left, right) / overlapCoefficient(left, right, ngramSize) |
| SQL | overlap_coefficient(left, right) |
| Parameters | ngramSize: Int (default 0) |
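A sketch of the formula, including the subset property mentioned above (empty-input handling is an assumption):

```python
def overlap_coefficient(a: set, b: set) -> float:
    if not a or not b:
        return 0.0  # assumption: an empty side yields 0.0
    return len(a & b) / min(len(a), len(b))

# {"b", "c"} is a subset of {"a", "b", "c"}, so the coefficient is 1.0
```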
Cosine
Token-set cosine similarity (binary term vectors).
Formula: |A ∩ B| / sqrt(|A| * |B|)
| DSL | cosine(left, right) / cosine(left, right, ngramSize) |
| SQL | cosine(left, right) |
| Parameters | ngramSize: Int (default 0) |
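A sketch of the binary-vector cosine formula (empty-input handling is an assumption):

```python
import math

def cosine(a: set, b: set) -> float:
    if not a or not b:
        return 0.0  # assumption for empty inputs
    return len(a & b) / math.sqrt(len(a) * len(b))

# {"a", "b", "c"} vs {"b", "c"}: 2 / sqrt(3 * 2)
```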
Braun-Blanquet
Intersection relative to the larger set. Stricter than Overlap Coefficient because it penalizes size differences.
Formula: |A ∩ B| / max(|A|, |B|)
| DSL | braunBlanquet(left, right) / braunBlanquet(left, right, ngramSize) |
| SQL | braun_blanquet(left, right) |
| Parameters | ngramSize: Int (default 0) |
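A sketch of the formula; on the subset example used for Overlap Coefficient it returns 2/3 rather than 1.0, showing the size penalty (empty-input handling is an assumption):

```python
def braun_blanquet(a: set, b: set) -> float:
    if not a or not b:
        return 0.0  # assumption for empty inputs
    return len(a & b) / max(len(a), len(b))

# {"a", "b", "c"} vs {"b", "c"}: 2 / 3, whereas Overlap Coefficient gives 1.0
```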
Monge-Elkan
A hybrid token metric. Each token in the left string is matched to its best-scoring counterpart in the right string using a character-level inner metric, and the scores are averaged symmetrically (left-to-right and right-to-left).
| DSL | mongeElkan(left, right) / mongeElkan(left, right, innerMetric) / mongeElkan(left, right, innerMetric, ngramSize) |
| SQL | monge_elkan(left, right) |
| Parameters | innerMetric: String (default "jaro_winkler", also accepts "jaro", "levenshtein", "needleman_wunsch", "smith_waterman"), ngramSize: Int (default 0) |
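A sketch of the symmetric best-match averaging. To stay short it substitutes a normalized-Levenshtein inner metric for the library's default `jaro_winkler`, and its empty-input handling is an assumption:

```python
def lev_sim(a: str, b: str) -> float:
    # normalized Levenshtein similarity, standing in for the default
    # jaro_winkler inner metric
    m, n = len(a), len(b)
    if max(m, n) == 0:
        return 1.0
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (a[i - 1] != b[j - 1]))
        prev = cur
    return 1 - prev[n] / max(m, n)

def _one_way(left, right, inner):
    # average, over left tokens, of the best score against any right token
    return sum(max(inner(t, u) for u in right) for t in left) / len(left)

def monge_elkan(left: str, right: str, inner=lev_sim) -> float:
    lt, rt = left.split(), right.split()
    if not lt or not rt:
        return 1.0 if lt == rt else 0.0  # assumption for empty inputs
    return (_one_way(lt, rt, inner) + _one_way(rt, lt, inner)) / 2
```

For "hello world" vs "hello": left-to-right averages 1.0 ("hello") and 0.2 ("world" vs "hello"), right-to-left is 1.0, and the symmetric mean is 0.8.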
Matrix / edit-distance metrics
These metrics operate at the character level using dynamic-programming alignment algorithms. All results are normalized to [0.0, 1.0].
Levenshtein
Minimum number of single-character insertions, deletions, or substitutions to transform one string into the other, normalized by the longer string length.
Formula: 1 - editDistance / max(|left|, |right|)
| DSL | levenshtein(left, right) |
| SQL | levenshtein(left, right) |
| Parameters | none |
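A minimal sketch of the normalized edit distance (single-row DP; the empty-input case is an assumption):

```python
def levenshtein_similarity(a: str, b: str) -> float:
    m, n = len(a), len(b)
    if max(m, n) == 0:
        return 1.0  # assumption: two empty strings are identical
    prev = list(range(n + 1))  # distances for the empty prefix of a
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,      # deletion
                         cur[j - 1] + 1,   # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    return 1 - prev[n] / max(m, n)

# "kitten" vs "sitting": edit distance 3, longer length 7 -> 1 - 3/7
```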
LCS Similarity
Longest Common Subsequence length (order-preserving, not necessarily contiguous) normalized by the longer string length.
Formula: lcsLength / max(|left|, |right|)
| DSL | lcsSimilarity(left, right) |
| SQL | lcs_similarity(left, right) |
| Parameters | none |
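A sketch of the classic LCS DP followed by the normalization (empty-input handling is an assumption):

```python
def lcs_similarity(a: str, b: str) -> float:
    m, n = len(a), len(b)
    if max(m, n) == 0:
        return 1.0  # assumption for empty inputs
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n] / max(m, n)

# "ABCBDAB" vs "BDCABA": LCS length 4 (e.g. "BCBA"), longer length 7 -> 4/7
```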
Jaro
Counts characters that match within a limited sliding window, with a penalty for transpositions (matching characters that appear in a different order).
Formula: (m/|s1| + m/|s2| + (m - t/2)/m) / 3 where m = matches, t = transpositions
| DSL | jaro(left, right) |
| SQL | jaro(left, right) |
| Parameters | none |
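A sketch of the standard algorithm. The formula's t counts transposed character pairs, which is why the raw mismatch count is halved below; the edge-case returns for equal and empty strings are assumptions:

```python
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):  # greedily match within the window
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    t, k = 0, 0
    for i in range(len(s1)):  # compare matched characters in order
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t /= 2  # transposed *pairs*
    return (matches / len(s1) + matches / len(s2)
            + (matches - t) / matches) / 3
```

The textbook example "MARTHA" vs "MARHTA" gives m = 6, t = 1, so (1 + 1 + 5/6) / 3 ≈ 0.944.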
Jaro-Winkler
Extends Jaro with a bonus for a common prefix, making it especially effective for strings that share their first characters (e.g. name typos).
Formula: jaro + prefixLength * prefixScale * (1 - jaro)
| DSL | jaroWinkler(left, right) / jaroWinkler(left, right, prefixScale, prefixCap) |
| SQL | jaro_winkler(left, right) |
| Parameters | prefixScale: Double (default 0.1, range (0, 0.25]), prefixCap: Int (default 4, range [1, 10]) |
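The Winkler boost itself is a one-liner on top of a Jaro score. This hypothetical helper takes the Jaro score as an argument rather than recomputing it:

```python
def winkler_boost(jaro_score: float, s1: str, s2: str,
                  prefix_scale: float = 0.1, prefix_cap: int = 4) -> float:
    prefix = 0
    for a, b in zip(s1, s2):  # length of the common prefix, capped
        if a != b or prefix == prefix_cap:
            break
        prefix += 1
    return jaro_score + prefix * prefix_scale * (1 - jaro_score)

# "MARTHA" vs "MARHTA" share the 3-character prefix "MAR", so the
# Jaro score of 17/18 is boosted by 3 * 0.1 * (1 - 17/18)
```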
Needleman-Wunsch
Global sequence alignment: aligns entire strings end-to-end, penalizing every gap and mismatch.
Normalization: (rawScore + maxLength) / (2 * maxLength)
| DSL | needlemanWunsch(left, right) / needlemanWunsch(left, right, matchScore, mismatchPenalty, gapPenalty) |
| SQL | needleman_wunsch(left, right) |
| Parameters | matchScore: Int (default 1, >0), mismatchPenalty: Int (default -1, <0), gapPenalty: Int (default -1, <0) |
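A sketch of the global alignment DP plus the normalization above (the empty-input return is an assumption):

```python
def needleman_wunsch(a: str, b: str,
                     match: int = 1, mismatch: int = -1,
                     gap: int = -1) -> float:
    m, n = len(a), len(b)
    if max(m, n) == 0:
        return 1.0  # assumption for empty inputs
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * gap  # aligning a prefix against nothing
    for j in range(1, n + 1):
        dp[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,
                           dp[i - 1][j] + gap,
                           dp[i][j - 1] + gap)
    max_len = max(m, n)
    return (dp[m][n] + max_len) / (2 * max_len)

# "abc" vs "abd": raw score 1 + 1 - 1 = 1, so (1 + 3) / 6 = 2/3
```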
Smith-Waterman
Local sequence alignment: finds the best-matching substring pair, ignoring unrelated regions at the ends.
Normalization: rawScore / (matchScore * min(|left|, |right|))
| DSL | smithWaterman(left, right) / smithWaterman(left, right, matchScore, mismatchPenalty, gapPenalty) |
| SQL | smith_waterman(left, right) |
| Parameters | matchScore: Int (default 2, >0), mismatchPenalty: Int (default -1, <=0), gapPenalty: Int (default -1, <=0) |
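A sketch of the local alignment DP (cells are clamped at zero so an alignment can restart anywhere; the empty-input return is an assumption):

```python
def smith_waterman(a: str, b: str,
                   match: int = 2, mismatch: int = -1,
                   gap: int = -1) -> float:
    m, n = len(a), len(b)
    if min(m, n) == 0:
        return 0.0  # assumption for empty inputs
    best = 0
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(0,  # restart the alignment here
                           dp[i - 1][j - 1] + sub,
                           dp[i - 1][j] + gap,
                           dp[i][j - 1] + gap)
            best = max(best, dp[i][j])
    return best / (match * min(m, n))

# "xxabcxx" vs "abc": the local alignment "abc" scores a perfect 1.0
# even though the surrounding "xx" regions do not match at all
```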
Affine Gap
Sequence alignment with affine gap penalties: opening a gap is more expensive than extending one, which better models real-world string variations where insertions and deletions tend to cluster.
Gap cost: gapOpenPenalty + gapLength * gapExtendPenalty
Normalization: 1 - distance / max(|left|, |right|)
| DSL | affineGap(left, right) / affineGap(left, right, mismatchPenalty, gapOpenPenalty, gapExtendPenalty) |
| SQL | affine_gap(left, right) |
| Parameters | mismatchPenalty: Int (default -1, <0), gapOpenPenalty: Int (default -2, <0), gapExtendPenalty: Int (default -1, <0) |
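The gap-cost term on its own (the full affine-gap alignment, a Gotoh-style DP, is omitted here). With the defaults, a gap of length 1 costs -3 while each additional gapped character adds only -1, which is what makes clustered insertions cheaper than scattered ones:

```python
def gap_cost(gap_length: int, gap_open: int = -2, gap_extend: int = -1) -> int:
    # cost of one contiguous gap of gap_length characters under the affine model
    return gap_open + gap_length * gap_extend
```

So one gap of length 3 costs -5, whereas three separate length-1 gaps would cost -9.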
Phonetic encoders
Phonetic encoders convert a string into a code that represents its pronunciation. Two strings that sound alike produce
the same (or similar) code. These are unary functions (single input column) and return a String.
Soundex
American Soundex algorithm. Produces a 4-character code: the first letter followed by three digits derived from consonant groups.
| DSL | soundex(input) |
| SQL | ss_soundex(input) |
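A sketch of the classic American Soundex encoding; the library's handling of edge cases (non-letters, the H/W rule) may differ:

```python
def soundex(s: str) -> str:
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    s = "".join(c for c in s.upper() if c.isalpha())
    if not s:
        return ""  # assumption for empty input
    out = s[0]
    prev = codes.get(s[0], "")
    for c in s[1:]:
        code = codes.get(c, "")
        if code and code != prev:  # collapse adjacent identical codes
            out += code
        if c not in "HW":  # H and W do not separate consonant groups
            prev = code
    return (out + "000")[:4]  # pad/truncate to letter + three digits
```

"Robert" and "Rupert" both encode to R163, which is the point: sound-alike names collide.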
Refined Soundex
NARA-variant Soundex with a finer consonant mapping and variable-length output. More discriminative than standard Soundex.
| DSL | refinedSoundex(input) |
| SQL | ss_refined_soundex(input) |
Double Metaphone
Generates a phonetic code that handles diverse language origins better than Soundex. Returns the primary code (up to 4 characters).
| DSL | doubleMetaphone(input) |
| SQL | ss_double_metaphone(input) |
Tokenization modes
All token-based metrics support two tokenization modes controlled by the ngramSize parameter:
| ngramSize | Mode | Example for "hello world" |
|---|---|---|
| 0 (default) | Whitespace | {"hello", "world"} |
| 2 | Character bigrams | {"he", "el", "ll", "lo", "o ", " w", "wo", "or", "rl", "ld"} |
| 3 | Character trigrams | {"hel", "ell", "llo", "lo ", "o w", " wo", "wor", "orl", "rld"} |
Character n-gram tokenization is useful when inputs are single tokens without natural word boundaries (e.g. company names, product codes).
Configurable parameters
SQL similarity functions remain two-argument for compatibility. To use configurable parameters (scoring weights, prefix
scale, inner metric, n-gram size), use the StringSimilarityFunctions DSL overloads.
| Metric | Configurable Parameters |
|---|---|
| Jaro-Winkler | prefixScale, prefixCap |
| Needleman-Wunsch | matchScore, mismatchPenalty, gapPenalty |
| Smith-Waterman | matchScore, mismatchPenalty, gapPenalty |
| Affine Gap | mismatchPenalty, gapOpenPenalty, gapExtendPenalty |
| Monge-Elkan | innerMetric, ngramSize |
| All token metrics | ngramSize |