[AURON #2279] Implement native Levenshtein with threshold support for Spark 4.0+#2280
[AURON #2279] Implement native Levenshtein with threshold support for Spark 4.0+#2280lyne7-sc wants to merge 10 commits into
Conversation
…spark_levenshtein # Conflicts: # native-engine/auron-planner/src/planner.rs
| Ok(ColumnarValue::Array(Arc::new(splitted_builder.finish()))) | ||
| } | ||
|
|
||
| pub fn spark_levenshtein(args: &[ColumnarValue]) -> Result<ColumnarValue> { |
There was a problem hiding this comment.
Correct me if I'm wrong but, in spark_levenshtein, a null threshold is coerced to Some(0), which then returns -1 for any non-zero distance and 0 for equal strings:
Some(ColumnarValue::Scalar(scalar)) if scalar.is_null() => Some(0),
...
Some(array) if array.data_type() == &DataType::Null => Some(0),
Some(_) => thresholds.map(|array| if array.is_valid(i) { array.value(i) } else { 0 }),
I think that doesn't match Spark. In Spark, Levenshtein.eval only null-checks left/right..a null threshold at runtime causes an NPE during v.asInstanceOf[Int]. The defensible options are (a) propagate null when threshold is null, or (b) mirror Spark and error
There was a problem hiding this comment.
Thanks for catching this! You're correct. I double-confirmed this and have fixed it by propagating null. Updated tests accordingly.
| let distance = if left == right { | ||
| 0 | ||
| } else { | ||
| let left_chars = left.chars().collect::<Vec<_>>(); | ||
| let right_chars = right.chars().collect::<Vec<_>>(); | ||
| if left_chars.is_empty() { | ||
| right_chars.len() as i32 | ||
| } else if right_chars.is_empty() { | ||
| left_chars.len() as i32 | ||
| } else { | ||
| let mut previous = (0..=right_chars.len()).collect::<Vec<_>>(); | ||
| let mut current = vec![0; right_chars.len() + 1]; | ||
|
|
||
| for (i, left_char) in left_chars.iter().enumerate() { | ||
| current[0] = i + 1; | ||
| for (j, right_char) in right_chars.iter().enumerate() { | ||
| let substitution_cost = usize::from(left_char != right_char); | ||
| current[j + 1] = (current[j] + 1) | ||
| .min(previous[j + 1] + 1) | ||
| .min(previous[j] + substitution_cost); | ||
| } | ||
| std::mem::swap(&mut previous, &mut current); | ||
| } | ||
| previous[right_chars.len()] as i32 | ||
| } | ||
| }; | ||
| Some(match threshold { | ||
| Some(threshold) if distance > threshold => -1, | ||
| _ => distance, | ||
| }) |
There was a problem hiding this comment.
Makes sense. I’ve refactored this to align with Spark’s thresholded Levenshtein behavior:
Which issue does this PR close?
Closes #2279
Rationale for this change
Spark 4+ introduced an optional
thresholdparameter forLevenshtein. The previous native implementation delegated to DataFusion's built-inlevenshtein()which only supports 2 arguments.What changes are included in this PR?
spark_levenshteinnative function that supports both 2-arg and 3-arg (with threshold) formsNativeConverters.scalato routeLevenshteinthroughbuildExtScalarFunction("Spark_Levenshtein")instead of the oldbuildScalarFunctionpath.exclude("string Levenshtein distance")from Spark 4.0 and 4.1 test settingsAre there any user-facing changes?
Yes. The native
levenshtein()function now supports the optional thirdthresholdparameter.How was this patch tested?
AuronStringFunctionsSuiteandAuronFunctionSuiteincluding "test function Levenshtein"