object StringSimilarityFunctions
Value Members
- final def !=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def ##: Int
- Definition Classes
- AnyRef → Any
- final def ==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- def affineGap(left: String, right: String, mismatchPenalty: Int, gapOpenPenalty: Int, gapExtendPenalty: Int): Column
Affine-gap sequence alignment similarity (string column name variant).
Convenience overload that resolves the given column names and delegates to the Column variant.
Penalty parameters use the same sign convention as Needleman-Wunsch and Smith-Waterman: mismatch/open/extend penalties must be negative values.
Positive penalty values are rejected at analysis time with a fail-fast type-check error.
- left
left input string column name
- right
right input string column name
- mismatchPenalty
penalty applied to aligned non-matching characters (must be negative)
- gapOpenPenalty
penalty applied when opening a gap (must be negative)
- gapExtendPenalty
penalty applied when extending an existing gap (must be negative)
- returns
alignment-based similarity score where higher is more similar
- def affineGap(left: String, right: String): Column
Affine-gap sequence alignment similarity with default penalty values (string column name variant).
Convenience overload that resolves the given column names and delegates to the Column variant.
Penalty parameters use the same sign convention as Needleman-Wunsch and Smith-Waterman: mismatch/open/extend penalties must be negative values.
Positive penalty values are rejected at analysis time with a fail-fast type-check error.
- left
left input string column name
- right
right input string column name
- returns
alignment-based similarity score where higher is more similar
- def affineGap(left: Column, right: Column, mismatchPenalty: Int, gapOpenPenalty: Int, gapExtendPenalty: Int): Column
Affine-gap sequence alignment similarity.
Penalty parameters use the same sign convention as Needleman-Wunsch and Smith-Waterman: mismatch/open/extend penalties must be negative values.
Positive penalty values are rejected at analysis time with a fail-fast type-check error.
- left
left input string column
- right
right input string column
- mismatchPenalty
penalty applied to aligned non-matching characters (must be negative)
- gapOpenPenalty
penalty applied when opening a gap (must be negative)
- gapExtendPenalty
penalty applied when extending an existing gap (must be negative)
- returns
alignment-based similarity score where higher is more similar
- def affineGap(left: Column, right: Column): Column
Affine-gap sequence alignment similarity with default penalty values.
Convenience overload that delegates to the penalty-tuning Column variant using the default mismatchPenalty, gapOpenPenalty, and gapExtendPenalty settings.
Penalty parameters use the same sign convention as Needleman-Wunsch and Smith-Waterman: mismatch/open/extend penalties must be negative values.
Positive penalty values are rejected at analysis time with a fail-fast type-check error.
- left
left input string column
- right
right input string column
- returns
alignment-based similarity score where higher is more similar
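The sign convention described for affineGap can be illustrated with a small, library-independent sketch. The names AffineGapPenaltySketch, validate, and gapCost are hypothetical, and the cost model (the first gap position pays the open penalty, each further position pays the extend penalty) is an assumption, not documented behavior. The real rejection happens at analysis time; `require` only stands in for it here.

```scala
object AffineGapPenaltySketch {
  // Hypothetical check mirroring the documented rule that penalties must be negative.
  // The docs do not say how zero is treated; this sketch rejects it.
  def validate(mismatchPenalty: Int, gapOpenPenalty: Int, gapExtendPenalty: Int): Unit = {
    require(mismatchPenalty < 0, "mismatchPenalty must be negative")
    require(gapOpenPenalty < 0, "gapOpenPenalty must be negative")
    require(gapExtendPenalty < 0, "gapExtendPenalty must be negative")
  }

  // Assumed affine cost model: opening a gap costs gapOpenPenalty for the first
  // position, and every additional position costs gapExtendPenalty.
  def gapCost(length: Int, gapOpenPenalty: Int, gapExtendPenalty: Int): Int =
    if (length <= 0) 0 else gapOpenPenalty + (length - 1) * gapExtendPenalty
}
```

Under this model a gap of length 3 with an open penalty of -5 and an extend penalty of -1 contributes -7, so one long gap is cheaper than three separate gaps of length 1.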
- final def asInstanceOf[T0]: T0
- Definition Classes
- Any
- def braunBlanquet(left: String, right: String, ngramSize: Int): Column
Braun-Blanquet similarity between two strings using custom tokenization n-gram size (string column name variant).
Convenience overload that resolves the given column names and delegates to the Column variant.
- left
left input string column name
- right
right input string column name
- ngramSize
token n-gram size (0 keeps default tokenization)
- returns
similarity score in [0.0, 1.0]
- def braunBlanquet(left: String, right: String): Column
Braun-Blanquet similarity between two strings (string column name variant).
Convenience overload that resolves the given column names and delegates to the Column variant.
- left
left input string column name
- right
right input string column name
- returns
similarity score in [0.0, 1.0]
- def braunBlanquet(left: Column, right: Column, ngramSize: Int): Column
Braun-Blanquet similarity between two strings using custom tokenization n-gram size.
- left
left input string column
- right
right input string column
- ngramSize
token n-gram size (0 keeps default tokenization)
- returns
similarity score in [0.0, 1.0]
- def braunBlanquet(left: Column, right: Column): Column
Braun-Blanquet similarity between two strings.
Computes token intersection relative to the larger token set. The result is in [0.0, 1.0], where 1.0 means identical token sets.
- left
left input string column
- right
right input string column
- returns
similarity score in [0.0, 1.0]
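As a rough illustration of "token intersection relative to the larger token set", the following plain-Scala sketch computes the ratio on whitespace tokens. BraunBlanquetSketch and its methods are hypothetical names. The whitespace tokenizer and the treatment of two empty inputs are assumptions, since the library's default tokenization is not described here.

```scala
object BraunBlanquetSketch {
  // Assumption: split on whitespace; the library's default tokenizer may differ.
  def tokens(s: String): Set[String] =
    s.trim.split("\\s+").filter(_.nonEmpty).toSet

  // |A ∩ B| / max(|A|, |B|)
  def similarity(left: String, right: String): Double = {
    val a = tokens(left)
    val b = tokens(right)
    val larger = math.max(a.size, b.size)
    // Assumption: two empty token sets count as identical.
    if (larger == 0) 1.0 else a.intersect(b).size.toDouble / larger
  }
}
```

For example, "a b c" and "a b" share two tokens and the larger set has three, giving 2/3.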
- def clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.CloneNotSupportedException]) @IntrinsicCandidate() @native()
- def cosine(left: String, right: String, ngramSize: Int): Column
Cosine similarity between two strings using custom tokenization n-gram size (string column name variant).
Convenience overload that resolves the given column names and delegates to the Column variant.
- left
left input string column name
- right
right input string column name
- ngramSize
token n-gram size (0 keeps default tokenization)
- returns
similarity score in [0.0, 1.0]
- def cosine(left: String, right: String): Column
Cosine similarity between two strings (string column name variant).
Convenience overload that resolves the given column names and delegates to the Column variant.
- left
left input string column name
- right
right input string column name
- returns
similarity score in [0.0, 1.0]
- def cosine(left: Column, right: Column, ngramSize: Int): Column
Cosine similarity between two strings using custom tokenization n-gram size.
- left
left input string column
- right
right input string column
- ngramSize
token n-gram size (0 keeps default tokenization)
- returns
similarity score in [0.0, 1.0]
- def cosine(left: Column, right: Column): Column
Cosine similarity between two strings.
Compares token vectors by angle. The result is in [0.0, 1.0], where higher values indicate more similar token distributions.
- left
left input string column
- right
right input string column
- returns
similarity score in [0.0, 1.0]
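To make "compares token vectors by angle" concrete, here is a minimal plain-Scala sketch that builds token-count vectors and takes their cosine. CosineSketch is a hypothetical name, and whitespace tokenization and raw term counts are assumptions. The library may weight or tokenize differently.

```scala
object CosineSketch {
  // Assumption: whitespace tokens weighted by raw occurrence count.
  def counts(s: String): Map[String, Int] =
    s.trim.split("\\s+").filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (token, occurrences) => token -> occurrences.length }

  def similarity(left: String, right: String): Double = {
    val a = counts(left)
    val b = counts(right)
    // Dot product over the tokens of the left vector.
    val dot = a.iterator.map { case (t, n) => n.toDouble * b.getOrElse(t, 0) }.sum
    val normA = math.sqrt(a.values.map(n => n.toDouble * n).sum)
    val normB = math.sqrt(b.values.map(n => n.toDouble * n).sum)
    // Assumption: an empty input has no direction, so the score is 0.0.
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }
}
```

Because counts are never negative, the cosine stays in [0.0, 1.0], which matches the documented range.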
- def doubleMetaphone(inputColName: String): Column
Double Metaphone phonetic encoding (string column name variant).
Convenience overload that resolves the given column name and delegates to the Column variant.
- inputColName
input string column name
- returns
column expression producing the primary Double Metaphone code for the input string
- def doubleMetaphone(input: Column): Column
Double Metaphone phonetic encoding.
- input
input column
- returns
column expression producing the primary Double Metaphone code for the input string
- final def eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- def equals(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef → Any
- final def getClass(): Class[_ <: AnyRef]
- Definition Classes
- AnyRef → Any
- Annotations
- @IntrinsicCandidate() @native()
- def hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @IntrinsicCandidate() @native()
- final def isInstanceOf[T0]: Boolean
- Definition Classes
- Any
- def jaccard(left: String, right: String, ngramSize: Int): Column
Jaccard similarity between two strings using custom tokenization n-gram size (string column name variant).
Convenience overload that resolves the given column names and delegates to the Column variant.
- left
left input string column name
- right
right input string column name
- ngramSize
token n-gram size (0 keeps default tokenization)
- returns
similarity score in [0.0, 1.0]
- def jaccard(left: String, right: String): Column
Jaccard similarity between two strings (string column name variant).
Convenience overload that resolves the given column names and delegates to the Column variant.
- left
left input string column name
- right
right input string column name
- returns
similarity score in [0.0, 1.0]
- def jaccard(left: Column, right: Column, ngramSize: Int): Column
Jaccard similarity between two strings using custom tokenization n-gram size.
- left
left input string column
- right
right input string column
- ngramSize
token n-gram size (0 keeps default tokenization)
- returns
similarity score in [0.0, 1.0]
- def jaccard(left: Column, right: Column): Column
Jaccard similarity between two strings.
Compares token overlap divided by token union size. The result is in [0.0, 1.0], where 1.0 means identical token sets and 0.0 means no shared tokens.
- left
left input string column
- right
right input string column
- returns
similarity score in [0.0, 1.0]
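The intersection-over-union idea, together with the role of ngramSize, can be sketched in plain Scala as follows. JaccardSketch is a hypothetical name. Treating 0 as whitespace tokenization and positive values as character n-grams is an assumption about what "default tokenization" and "n-gram size" mean in this library.

```scala
object JaccardSketch {
  // Assumption: ngramSize <= 0 means whitespace tokens,
  // ngramSize > 0 means character n-grams of exactly that length.
  def tokens(s: String, ngramSize: Int): Set[String] =
    if (ngramSize <= 0) s.trim.split("\\s+").filter(_.nonEmpty).toSet
    else s.sliding(ngramSize).filter(_.length == ngramSize).toSet

  // |A ∩ B| / |A ∪ B|
  def similarity(left: String, right: String, ngramSize: Int = 0): Double = {
    val a = tokens(left, ngramSize)
    val b = tokens(right, ngramSize)
    val union = a.union(b).size
    // Assumption: two empty token sets count as identical.
    if (union == 0) 1.0 else a.intersect(b).size.toDouble / union
  }
}
```

With bigrams, "night" and "nacht" share only "ht" out of seven distinct bigrams, so the score is 1/7.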
- def jaro(left: String, right: String): Column
Jaro similarity between two strings (string column name variant).
Convenience overload that resolves the given column names and delegates to the Column variant.
- left
left input string column name
- right
right input string column name
- returns
similarity score in [0.0, 1.0]
- def jaro(left: Column, right: Column): Column
Jaro similarity between two strings.
Scores agreement in matching characters and transpositions. The result is in [0.0, 1.0], where 1.0 means an exact match.
- left
left input string column
- right
right input string column
- returns
similarity score in [0.0, 1.0]
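For readers unfamiliar with how matching characters and transpositions combine, the following is a plain-Scala sketch of the textbook Jaro formula. JaroSketch is a hypothetical name, and the library's implementation may differ in details such as Unicode handling.

```scala
object JaroSketch {
  def similarity(s1: String, s2: String): Double = {
    if (s1 == s2) 1.0
    else if (s1.isEmpty || s2.isEmpty) 0.0
    else {
      // Characters match only if they are equal and close enough in position.
      val window = math.max(0, math.max(s1.length, s2.length) / 2 - 1)
      val matched1 = new Array[Boolean](s1.length)
      val matched2 = new Array[Boolean](s2.length)
      var matches = 0
      for (i <- s1.indices) {
        val hi = math.min(s2.length - 1, i + window)
        var j = math.max(0, i - window)
        var found = false
        while (j <= hi && !found) {
          if (!matched2(j) && s1(i) == s2(j)) {
            matched1(i) = true; matched2(j) = true
            matches += 1; found = true
          }
          j += 1
        }
      }
      if (matches == 0) 0.0
      else {
        // Count matched characters that appear in a different order.
        var k = 0
        var outOfOrder = 0
        for (i <- s1.indices if matched1(i)) {
          while (!matched2(k)) k += 1
          if (s1(i) != s2(k)) outOfOrder += 1
          k += 1
        }
        val m = matches.toDouble
        (m / s1.length + m / s2.length + (m - outOfOrder / 2.0) / m) / 3.0
      }
    }
  }
}
```

For "MARTHA" and "MARHTA" all six characters match and one transposition is counted, which yields 17/18.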
- def jaroWinkler(left: String, right: String, prefixScale: Double, prefixCap: Int): Column
Jaro-Winkler similarity between two strings with custom prefix tuning (string column name variant).
Convenience overload that resolves the given column names and delegates to the Column variant.
- left
left input string column name
- right
right input string column name
- prefixScale
weight of the common-prefix bonus
- prefixCap
maximum prefix length eligible for the bonus
- returns
similarity score in [0.0, 1.0]
- def jaroWinkler(left: String, right: String): Column
Jaro-Winkler similarity between two strings (string column name variant).
Convenience overload that resolves the given column names and delegates to the Column variant.
- left
left input string column name
- right
right input string column name
- returns
similarity score in [0.0, 1.0]
- def jaroWinkler(left: Column, right: Column, prefixScale: Double, prefixCap: Int): Column
Jaro-Winkler similarity between two strings with custom prefix tuning.
- left
left input string column
- right
right input string column
- prefixScale
weight of the common-prefix bonus
- prefixCap
maximum prefix length eligible for the bonus
- returns
similarity score in [0.0, 1.0]
- def jaroWinkler(left: Column, right: Column): Column
Jaro-Winkler similarity between two strings.
Extends Jaro with a prefix bonus so early-character agreement increases similarity. The result is in [0.0, 1.0], where 1.0 means an exact match.
- left
left input string column
- right
right input string column
- returns
similarity score in [0.0, 1.0]
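The roles of prefixScale and prefixCap can be sketched as the usual Winkler adjustment applied to an already computed Jaro score. JaroWinklerSketch and withPrefixBonus are hypothetical names. The defaults of 0.1 and 4 are the conventional values from the literature and are only assumed here, since this page does not state the library's defaults.

```scala
object JaroWinklerSketch {
  // Applies the prefix bonus to a Jaro score that the caller has already computed.
  // Assumed defaults: prefixScale = 0.1, prefixCap = 4 (conventional, not confirmed).
  def withPrefixBonus(jaroScore: Double, left: String, right: String,
                      prefixScale: Double = 0.1, prefixCap: Int = 4): Double = {
    // Length of the shared prefix, limited to prefixCap characters.
    val shared = left.iterator.zip(right.iterator).takeWhile { case (a, b) => a == b }.size
    val prefixLen = math.min(shared, prefixCap)
    jaroScore + prefixLen * prefixScale * (1.0 - jaroScore)
  }
}
```

With a Jaro score of 17/18 for "MARTHA" and "MARHTA" and a shared prefix of "MAR", the adjusted score is roughly 0.9611. Keeping prefixScale * prefixCap at or below 1.0 keeps the result within [0.0, 1.0].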
- def lcsSimilarity(left: String, right: String): Column
Longest-common-subsequence (LCS) similarity between two strings (string column name variant).
Convenience overload that resolves the given column names and delegates to the Column variant.
- left
left input string column name
- right
right input string column name
- returns
similarity score in [0.0, 1.0]
- def lcsSimilarity(left: Column, right: Column): Column
Longest-common-subsequence (LCS) similarity between two strings.
Normalizes common subsequence length into a score in [0.0, 1.0], where 1.0 means both strings share all characters in order.
- left
left input string column
- right
right input string column
- returns
similarity score in [0.0, 1.0]
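One plausible reading of "normalizes common subsequence length" is sketched below in plain Scala. LcsSketch is a hypothetical name, and dividing by the longer string's length is an assumption. The library may normalize differently, for example by the combined length.

```scala
object LcsSketch {
  // Classic dynamic-programming length of the longest common subsequence.
  def lcsLength(a: String, b: String): Int = {
    val dp = Array.ofDim[Int](a.length + 1, b.length + 1)
    for (i <- 1 to a.length; j <- 1 to b.length) {
      dp(i)(j) =
        if (a(i - 1) == b(j - 1)) dp(i - 1)(j - 1) + 1
        else math.max(dp(i - 1)(j), dp(i)(j - 1))
    }
    dp(a.length)(b.length)
  }

  // Assumed normalization: LCS length divided by the longer input's length.
  def similarity(a: String, b: String): Double = {
    val longer = math.max(a.length, b.length)
    if (longer == 0) 1.0 else lcsLength(a, b).toDouble / longer
  }
}
```

For "AGGTAB" and "GXTXAYB" the longest common subsequence is "GTAB", so this sketch returns 4/7.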
- def levenshtein(left: String, right: String): Column
Levenshtein similarity between two strings (string column name variant).
Convenience overload that resolves the given column names and delegates to the Column variant.
- left
left input string column name
- right
right input string column name
- returns
similarity score in [0.0, 1.0]
- def levenshtein(left: Column, right: Column): Column
Levenshtein similarity between two strings.
Converts edit distance to a normalized similarity score in [0.0, 1.0], where 1.0 means exact match and lower values indicate more edits are required.
- left
left input string column
- right
right input string column
- returns
similarity score in [0.0, 1.0]
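The conversion from edit distance to a similarity score might look like the following plain-Scala sketch. LevenshteinSketch is a hypothetical name, and the normalization 1 - distance / max(length) is an assumption. It is a common choice, but this page does not confirm the exact formula.

```scala
object LevenshteinSketch {
  // Edit distance with unit cost for insertion, deletion, and substitution.
  def distance(a: String, b: String): Int = {
    var prev = Array.range(0, b.length + 1)
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i
      for (j <- 1 to b.length) {
        val substitution = prev(j - 1) + (if (a(i - 1) == b(j - 1)) 0 else 1)
        curr(j) = math.min(math.min(curr(j - 1) + 1, prev(j) + 1), substitution)
      }
      prev = curr
    }
    prev(b.length)
  }

  // Assumed normalization: 1 - distance / length of the longer input.
  def similarity(a: String, b: String): Double = {
    val longer = math.max(a.length, b.length)
    if (longer == 0) 1.0 else 1.0 - distance(a, b).toDouble / longer
  }
}
```

"kitten" and "sitting" are three edits apart and the longer string has seven characters, so the sketch returns 4/7.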
- def mongeElkan(left: String, right: String, innerMetric: String, ngramSize: Int): Column
Monge-Elkan similarity between two strings with custom inner metric and tokenization n-gram size (string column name variant).
Convenience overload that resolves the given column names and delegates to the Column variant.
- left
left input string column name
- right
right input string column name
- innerMetric
inner token-level similarity metric name used by Monge-Elkan
- ngramSize
token n-gram size (0 keeps default tokenization)
- returns
similarity score in [0.0, 1.0]
- def mongeElkan(left: String, right: String, innerMetric: String): Column
Monge-Elkan similarity between two strings with a custom inner metric (string column name variant).
Convenience overload that resolves the given column names and delegates to the Column variant.
- left
left input string column name
- right
right input string column name
- innerMetric
inner token-level similarity metric name used by Monge-Elkan
- returns
similarity score in [0.0, 1.0]
- def mongeElkan(left: String, right: String, ngramSize: Int): Column
Monge-Elkan similarity between two strings using custom tokenization n-gram size (string column name variant).
Convenience overload that resolves the given column names and delegates to the Column variant.
- left
left input string column name
- right
right input string column name
- ngramSize
token n-gram size (0 keeps default tokenization)
- returns
similarity score in [0.0, 1.0]
- def mongeElkan(left: String, right: String): Column
Monge-Elkan similarity between two strings (string column name variant).
Convenience overload that resolves the given column names and delegates to the Column variant.
- left
left input string column name
- right
right input string column name
- returns
similarity score in [0.0, 1.0]
- def mongeElkan(left: Column, right: Column, innerMetric: String, ngramSize: Int): Column
Monge-Elkan similarity between two strings with custom inner metric and tokenization n-gram size.
- left
left input string column
- right
right input string column
- innerMetric
inner token-level similarity metric name used by Monge-Elkan
- ngramSize
token n-gram size (0 keeps default tokenization)
- returns
similarity score in [0.0, 1.0]
- def mongeElkan(left: Column, right: Column, innerMetric: String): Column
Monge-Elkan similarity between two strings with a custom inner metric.
- left
left input string column
- right
right input string column
- innerMetric
inner token-level similarity metric name used by Monge-Elkan
- returns
similarity score in [0.0, 1.0]
- def mongeElkan(left: Column, right: Column, ngramSize: Int): Column
Monge-Elkan similarity between two strings using custom tokenization n-gram size.
- left
left input string column
- right
right input string column
- ngramSize
token n-gram size (0 keeps default tokenization)
- returns
similarity score in [0.0, 1.0]
- def mongeElkan(left: Column, right: Column): Column
Monge-Elkan similarity between two strings.
Tokenizes both inputs and compares tokens via an inner similarity metric, then aggregates the best token matches. The result is in [0.0, 1.0], where higher values indicate stronger similarity.
- left
left input string column
- right
right input string column
- returns
similarity score in [0.0, 1.0]
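The "best token match, then aggregate" step can be sketched in plain Scala with the inner metric passed as a function. MongeElkanSketch is a hypothetical name. Whitespace tokenization, averaging over the left tokens only, and the function-valued inner metric are assumptions, because the library selects the inner metric by name and this page does not list the accepted names.

```scala
object MongeElkanSketch {
  // Assumption: whitespace tokenization.
  def tokens(s: String): Array[String] = s.trim.split("\\s+").filter(_.nonEmpty)

  // For each left token take its best inner score against any right token,
  // then average those best scores. This form is not symmetric in its arguments.
  def similarity(left: String, right: String)(inner: (String, String) => Double): Double = {
    val a = tokens(left)
    val b = tokens(right)
    if (a.isEmpty || b.isEmpty) 0.0
    else a.map(t => b.map(u => inner(t, u)).max).sum / a.length
  }

  // Simplest possible inner metric, used only for illustration.
  val exact: (String, String) => Double = (x, y) => if (x == y) 1.0 else 0.0
}
```

With the exact-match inner metric, "a b" against "b c" scores 0.5 because only one of the two left tokens finds a perfect partner.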
- final def ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- def needlemanWunsch(left: String, right: String, matchScore: Int, mismatchPenalty: Int, gapPenalty: Int): Column
Needleman-Wunsch global alignment similarity between two strings with custom scoring (string column name variant).
Convenience overload that resolves the given column names and delegates to the Column variant.
- left
left input string column name
- right
right input string column name
- matchScore
score added for aligned matching characters
- mismatchPenalty
penalty applied to aligned non-matching characters
- gapPenalty
penalty applied to insertion/deletion gaps
- returns
alignment-based similarity score where higher is more similar
- def needlemanWunsch(left: String, right: String): Column
Needleman-Wunsch global alignment similarity between two strings (string column name variant).
Convenience overload that resolves the given column names and delegates to the Column variant.
- left
left input string column name
- right
right input string column name
- returns
alignment-based similarity score where higher is more similar
- def needlemanWunsch(left: Column, right: Column, matchScore: Int, mismatchPenalty: Int, gapPenalty: Int): Column
Needleman-Wunsch global alignment similarity between two strings with custom scoring.
- left
left input string column
- right
right input string column
- matchScore
score added for aligned matching characters
- mismatchPenalty
penalty applied to aligned non-matching characters
- gapPenalty
penalty applied to insertion/deletion gaps
- returns
alignment-based similarity score where higher is more similar
- def needlemanWunsch(left: Column, right: Column): Column
Needleman-Wunsch global alignment similarity between two strings.
Scores an end-to-end alignment across full strings. Higher values indicate better global alignment under the configured scoring scheme.
- left
left input string column
- right
right input string column
- returns
alignment-based similarity score where higher is more similar
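An end-to-end alignment under matchScore, mismatchPenalty, and gapPenalty can be sketched with the standard dynamic-programming recurrence. NeedlemanWunschSketch is a hypothetical name, and the sketch returns the raw alignment score. Whether the library returns the raw or a rescaled score is not stated on this page.

```scala
object NeedlemanWunschSketch {
  // Penalties are expected to be negative, matching the sign convention
  // referenced in the affineGap documentation.
  def score(a: String, b: String, matchScore: Int, mismatchPenalty: Int, gapPenalty: Int): Int = {
    val dp = Array.ofDim[Int](a.length + 1, b.length + 1)
    // Aligning a prefix against the empty string costs one gap per character.
    for (i <- 1 to a.length) dp(i)(0) = i * gapPenalty
    for (j <- 1 to b.length) dp(0)(j) = j * gapPenalty
    for (i <- 1 to a.length; j <- 1 to b.length) {
      val diagonal = dp(i - 1)(j - 1) + (if (a(i - 1) == b(j - 1)) matchScore else mismatchPenalty)
      val gap = math.max(dp(i - 1)(j), dp(i)(j - 1)) + gapPenalty
      dp(i)(j) = math.max(diagonal, gap)
    }
    dp(a.length)(b.length)
  }
}
```

Every character of both inputs must be accounted for, so an empty string aligned against "abc" with a gap penalty of -2 scores -6.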
- final def notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @IntrinsicCandidate() @native()
- final def notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @IntrinsicCandidate() @native()
- def overlapCoefficient(left: String, right: String, ngramSize: Int): Column
Overlap coefficient similarity between two strings using custom tokenization n-gram size (string column name variant).
Convenience overload that resolves the given column names and delegates to the Column variant.
- left
left input string column name
- right
right input string column name
- ngramSize
token n-gram size (0 keeps default tokenization)
- returns
similarity score in [0.0, 1.0]
- def overlapCoefficient(left: String, right: String): Column
Overlap coefficient similarity between two strings (string column name variant).
Convenience overload that resolves the given column names and delegates to the Column variant.
- left
left input string column name
- right
right input string column name
- returns
similarity score in [0.0, 1.0]
- def overlapCoefficient(left: Column, right: Column, ngramSize: Int): Column
Overlap coefficient similarity between two strings using custom tokenization n-gram size.
- left
left input string column
- right
right input string column
- ngramSize
token n-gram size (0 keeps default tokenization)
- returns
similarity score in [0.0, 1.0]
- def overlapCoefficient(left: Column, right: Column): Column
Overlap coefficient similarity between two strings.
Computes token intersection relative to the smaller token set. The result is in [0.0, 1.0], where 1.0 means one token set is fully contained in the other.
- left
left input string column
- right
right input string column
- returns
similarity score in [0.0, 1.0]
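The difference from Braun-Blanquet is only the denominator, as this plain-Scala sketch shows. OverlapCoefficientSketch is a hypothetical name. Whitespace tokenization and the handling of empty inputs are assumptions.

```scala
object OverlapCoefficientSketch {
  // Assumption: split on whitespace; the library's default tokenizer may differ.
  def tokens(s: String): Set[String] =
    s.trim.split("\\s+").filter(_.nonEmpty).toSet

  // |A ∩ B| / min(|A|, |B|)
  def similarity(left: String, right: String): Double = {
    val a = tokens(left)
    val b = tokens(right)
    val smaller = math.min(a.size, b.size)
    // Assumption: two empty inputs are identical, one empty input shares nothing.
    if (smaller == 0) { if (a.isEmpty && b.isEmpty) 1.0 else 0.0 }
    else a.intersect(b).size.toDouble / smaller
  }
}
```

"a b c" against "a b" scores 1.0 here because the smaller set is fully contained in the larger one.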
- def refinedSoundex(inputColName: String): Column
Refined Soundex phonetic encoding (string column name variant).
Convenience overload that resolves the given column name and delegates to the Column variant.
- inputColName
input string column name
- returns
column expression producing the Refined Soundex code for the input string
- def refinedSoundex(input: Column): Column
Refined Soundex phonetic encoding.
- input
input column
- returns
column expression producing the Refined Soundex code for the input string
- def smithWaterman(left: String, right: String, matchScore: Int, mismatchPenalty: Int, gapPenalty: Int): Column
Smith-Waterman local alignment similarity between two strings with custom scoring (string column name variant).
Convenience overload that resolves the given column names and delegates to the Column variant.
- left
left input string column name
- right
right input string column name
- matchScore
score added for aligned matching characters
- mismatchPenalty
penalty applied to aligned non-matching characters
- gapPenalty
penalty applied to insertion/deletion gaps
- returns
alignment-based similarity score where higher is more similar
- def smithWaterman(left: String, right: String): Column
Smith-Waterman local alignment similarity between two strings (string column name variant).
Convenience overload that resolves the given column names and delegates to the Column variant.
- left
left input string column name
- right
right input string column name
- returns
alignment-based similarity score where higher is more similar
- def smithWaterman(left: Column, right: Column, matchScore: Int, mismatchPenalty: Int, gapPenalty: Int): Column
Smith-Waterman local alignment similarity between two strings with custom scoring.
- left
left input string column
- right
right input string column
- matchScore
score added for aligned matching characters
- mismatchPenalty
penalty applied to aligned non-matching characters
- gapPenalty
penalty applied to insertion/deletion gaps
- returns
alignment-based similarity score where higher is more similar
- def smithWaterman(left: Column, right: Column): Column
Smith-Waterman local alignment similarity between two strings.
Scores the best matching local subsequences rather than full-string alignment. Higher values indicate stronger local similarity under the configured scoring scheme.
- left
left input string column
- right
right input string column
- returns
alignment-based similarity score where higher is more similar
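Local alignment differs from global alignment in that scores never drop below zero and the best cell anywhere in the table is reported. The plain-Scala sketch below shows that recurrence. SmithWatermanSketch is a hypothetical name, and it returns the raw score, which may not be how the library scales its result.

```scala
object SmithWatermanSketch {
  // Penalties are expected to be negative.
  def score(a: String, b: String, matchScore: Int, mismatchPenalty: Int, gapPenalty: Int): Int = {
    val dp = Array.ofDim[Int](a.length + 1, b.length + 1)
    var best = 0
    for (i <- 1 to a.length; j <- 1 to b.length) {
      val diagonal = dp(i - 1)(j - 1) + (if (a(i - 1) == b(j - 1)) matchScore else mismatchPenalty)
      val gap = math.max(dp(i - 1)(j), dp(i)(j - 1)) + gapPenalty
      // Flooring at zero lets a new local alignment start at any cell.
      val cell = math.max(0, math.max(diagonal, gap))
      dp(i)(j) = cell
      best = math.max(best, cell)
    }
    best
  }
}
```

For "xxabcxx" and "yyabcyy" with a match score of 2, only the shared "abc" contributes, giving a score of 6 regardless of the unrelated flanking characters.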
- def sorensenDice(left: String, right: String, ngramSize: Int): Column
Sorensen-Dice similarity between two strings using custom tokenization n-gram size (string column name variant).
Convenience overload that resolves the given column names and delegates to the Column variant.
- left
left input string column name
- right
right input string column name
- ngramSize
token n-gram size (0 keeps default tokenization)
- returns
similarity score in [0.0, 1.0]
- def sorensenDice(left: String, right: String): Column
Sorensen-Dice similarity between two strings (string column name variant).
Convenience overload that resolves the given column names and delegates to the Column variant.
- left
left input string column name
- right
right input string column name
- returns
similarity score in [0.0, 1.0]
- def sorensenDice(left: Column, right: Column, ngramSize: Int): Column
Sorensen-Dice similarity between two strings using custom tokenization n-gram size.
- left
left input string column
- right
right input string column
- ngramSize
token n-gram size (0 keeps default tokenization)
- returns
similarity score in [0.0, 1.0]
- def sorensenDice(left: Column, right: Column): Column
Sorensen-Dice similarity between two strings.
Measures doubled token intersection over total token counts. The result is in [0.0, 1.0], where 1.0 means perfect overlap.
- left
left input string column
- right
right input string column
- returns
similarity score in [0.0, 1.0]
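"Doubled token intersection over total token counts" can be written out as the plain-Scala sketch below. SorensenDiceSketch is a hypothetical name. Whitespace tokenization and the use of sets rather than multisets are assumptions.

```scala
object SorensenDiceSketch {
  // Assumption: split on whitespace and ignore repeated tokens.
  def tokens(s: String): Set[String] =
    s.trim.split("\\s+").filter(_.nonEmpty).toSet

  // 2 * |A ∩ B| / (|A| + |B|)
  def similarity(left: String, right: String): Double = {
    val a = tokens(left)
    val b = tokens(right)
    val total = a.size + b.size
    // Assumption: two empty token sets count as identical.
    if (total == 0) 1.0 else 2.0 * a.intersect(b).size / total
  }
}
```

"a b c" and "b c d" share two of their six tokens in total, so the score is 4/6, which is higher than the Jaccard score of 0.5 for the same pair.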
- def soundex(inputColName: String): Column
Soundex phonetic encoding (string column name variant).
Convenience overload that resolves the given column name and delegates to the Column variant.
- inputColName
input string column name
- returns
column expression producing the Soundex code for the input string
- def soundex(input: Column): Column
Soundex phonetic encoding.
- input
input column
- returns
column expression producing the Soundex code for the input string
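For orientation, the classic American Soundex rules can be sketched in plain Scala as follows. SoundexSketch is a hypothetical name, and the sketch assumes ASCII letters and the common variant in which H and W do not separate equal codes. The library's output for edge cases such as empty or non-alphabetic input is not specified on this page.

```scala
object SoundexSketch {
  // Digit classes of classic Soundex; '0' marks letters that are not coded.
  private def code(c: Char): Char = c match {
    case 'B' | 'F' | 'P' | 'V' => '1'
    case 'C' | 'G' | 'J' | 'K' | 'Q' | 'S' | 'X' | 'Z' => '2'
    case 'D' | 'T' => '3'
    case 'L' => '4'
    case 'M' | 'N' => '5'
    case 'R' => '6'
    case _ => '0' // vowels, H, W, Y
  }

  def encode(input: String): String = {
    // Assumption: ignore everything except ASCII letters.
    val letters = input.toUpperCase.filter(c => c >= 'A' && c <= 'Z')
    if (letters.isEmpty) ""
    else {
      val sb = new StringBuilder
      sb += letters.head
      var last = code(letters.head)
      for (c <- letters.tail if sb.length < 4) {
        val d = code(c)
        // Emit a digit only when it differs from the previous code.
        if (d != '0' && d != last) sb += d
        // H and W are transparent: they do not reset the previous code.
        if (c != 'H' && c != 'W') last = d
      }
      sb.toString.padTo(4, '0')
    }
  }
}
```

Names that sound alike map to the same four-character code, for example "Robert" and "Rupert" both encode to "R163".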
- final def synchronized[T0](arg0: => T0): T0
- Definition Classes
- AnyRef
- def toString(): String
- Definition Classes
- AnyRef → Any
- final def wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException]) @native()
- final def wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
Deprecated Value Members
- def finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.Throwable]) @Deprecated
- Deprecated
(Since version 9)