title: "Edit Distance (Levenshtein Distance)" description: "Implement Levenshtein Distance for spell checking, DNA sequence alignment, and fuzzy search." order: 4
Edit Distance (Levenshtein Distance)
What is Edit Distance?
Levenshtein Distance calculates the minimum edits (insert, delete, substitute) to transform one string into another.
| Operation | Cost | Example | |-----------|------|---------| | Insert | 1 | "cat" -> "cats" | | Delete | 1 | "cats" -> "cat" | | Substitute | 1 | "cat" -> "cut" | | Match | 0 | "cat" -> "cat" |
Implementation
def edit_distance(s1, s2):
m, n = len(s1), len(s2)
dp = [[0] * (n + 1) for _ in range(m + 1)]
for i in range(m + 1): dp[i][0] = i
for j in range(n + 1): dp[0][j] = j
for i in range(1, m + 1):
for j in range(1, n + 1):
if s1[i-1] == s2[j-1]:
dp[i][j] = dp[i-1][j-1]
else:
dp[i][j] = 1 + min(dp[i-1][j], dp[i][j-1], dp[i-1][j-1])
return dp[m][n]
print(f"kitten -> sitting: {edit_distance('kitten', 'sitting')} steps")
print(f"Vibe -> Vibes: {edit_distance('Vibe', 'Vibes')} steps")
Applications
| Domain | Use Case | |--------|----------| | Spell Checker | Suggest correct spelling for typos | | DNA Sequencing | Calculate genetic mutation distance | | Plagiarism Detection | Detect modified plagiarism | | Speech Recognition | Match transcribed text to known phrases | | Fuzzy Search | Find matches even with typos |
Why Edit Distance Matters
| Scenario | Application | |----------|-------------| | Spell Check | "recieve" vs "receive" distance = 1 | | DNA Alignment | Gene sequence similarity | | Plagiarism Detection | Smaller distance = more similar | | Auto Grading | Error count from OCR results | | Google Search | "Did you mean: xxx" |
Space Optimization
def edit_distance_optimized(s1, s2):
m, n = len(s1), len(s2)
if m < n: s1, s2 = s2, s1; m, n = n, m
prev = list(range(n + 1))
curr = [0] * (n + 1)
for i in range(1, m + 1):
curr[0] = i
for j in range(1, n + 1):
cost = 0 if s1[i-1] == s2[j-1] else 1
curr[j] = min(prev[j] + 1, curr[j-1] + 1, prev[j-1] + cost)
prev, curr = curr, prev
return prev[n]
Next Chapter: Spell Check API
Edit distance wrapped as a FastAPI service - the practical application.
Applications of Edit Distance
| Domain | Application | |--------|-------------| | Spell Checking | Suggest corrections for misspelled words | | DNA Sequencing | Measure similarity between genetic sequences | | Plagiarism Detection | Compare documents for copied content | | Speech Recognition | Match recognized words to dictionary entries | | Version Control | Track differences between file versions (git diff) | | Autocomplete | Rank suggestion candidates by edit distance | | Machine Translation | Align source and target language sentences |
How Edit Distance Works
The algorithm uses dynamic programming to build an (m+1) x (n+1) table where dp[i][j] represents the minimum edit distance between the first i characters of s1 and the first j characters of s2.
Recurrence Relation
dp[i][j] = min(
dp[i-1][j] + 1, # delete s1[i]
dp[i][j-1] + 1, # insert s2[j]
dp[i-1][j-1] + cost # substitute (cost = 0 if same, 1 if different)
)
Example Table for "kitten" → "sitting"
| | ε | s | i | t | t | i | n | g | |---|---|---|---|---|---|---|---|---| | ε | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | | k | 1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | | i | 2 | 2 | 1 | 2 | 3 | 4 | 5 | 6 | | t | 3 | 3 | 2 | 1 | 2 | 3 | 4 | 5 | | t | 4 | 4 | 3 | 2 | 1 | 2 | 3 | 4 | | e | 5 | 5 | 4 | 3 | 2 | 2 | 3 | 4 | | n | 6 | 6 | 5 | 4 | 3 | 3 | 2 | 3 |
The final answer dp[6][7] = 3. The minimum edit distance from "kitten" to "sitting" is 3 operations:
- Substitute k → s
- Substitute e → i
- Insert g
Full DP Table Implementation
def edit_distance_full_table(s1: str, s2: str) -> int:
"""
Compute minimum edit distance between s1 and s2.
Operations: insert, delete, substitute (all cost 1).
"""
m, n = len(s1), len(s2)
# Create DP table
dp = [[0] * (n + 1) for _ in range(m + 1)]
# Initialize first row and column
for i in range(m + 1):
dp[i][0] = i # deleting all characters from s1
for j in range(n + 1):
dp[0][j] = j # inserting all characters of s2
# Fill the DP table
for i in range(1, m + 1):
for j in range(1, n + 1):
cost = 0 if s1[i - 1] == s2[j - 1] else 1
dp[i][j] = min(
dp[i - 1][j] + 1, # delete
dp[i][j - 1] + 1, # insert
dp[i - 1][j - 1] + cost # substitute
)
return dp[m][n]
# Test
print(edit_distance_full_table("kitten", "sitting")) # 3
print(edit_distance_full_table("horse", "ros")) # 3
print(edit_distance_full_table("", "abc")) # 3
print(edit_distance_full_table("same", "same")) # 0
Weighted Edit Distance
Different operations can have different costs:
def weighted_edit_distance(s1: str, s2: str,
insert_cost=1, delete_cost=1, substitute_cost=2):
"""
Edit distance with configurable operation costs.
Useful when substitution should cost more than insert/delete.
"""
m, n = len(s1), len(s2)
dp = [[0] * (n + 1) for _ in range(m + 1)]
for i in range(m + 1):
dp[i][0] = i * delete_cost
for j in range(n + 1):
dp[0][j] = j * insert_cost
for i in range(1, m + 1):
for j in range(1, n + 1):
cost = 0 if s1[i - 1] == s2[j - 1] else substitute_cost
dp[i][j] = min(
dp[i - 1][j] + delete_cost,
dp[i][j - 1] + insert_cost,
dp[i - 1][j - 1] + cost
)
return dp[m][n]
print(weighted_edit_distance("kitten", "sitting")) # 5 (2 changes x 2 + 1 insert = 5)
Damerau-Levenshtein Distance
Adds the transposition operation (swap adjacent characters):
def damerau_levenshtein(s1: str, s2: str) -> int:
"""
Edit distance with transpositions.
"cat" → "act" costs 1 (swap adjacent c and a) instead of 2.
"""
m, n = len(s1), len(s2)
dp = [[0] * (n + 1) for _ in range(m + 1)]
for i in range(m + 1):
dp[i][0] = i
for j in range(n + 1):
dp[0][j] = j
for i in range(1, m + 1):
for j in range(1, n + 1):
cost = 0 if s1[i - 1] == s2[j - 1] else 1
dp[i][j] = min(
dp[i - 1][j] + 1,
dp[i][j - 1] + 1,
dp[i - 1][j - 1] + cost
)
# Transposition check
if i > 1 and j > 1 and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]:
dp[i][j] = min(dp[i][j], dp[i - 2][j - 2] + cost)
return dp[m][n]
print(damerau_levenshtein("cat", "act")) # 1 (transposition)
print(damerau_levenshtein("kitten", "sitting")) # 3
Finding the Operations (Backtrace)
def edit_distance_with_operations(s1: str, s2: str):
"""
Compute edit distance and return the operations.
"""
m, n = len(s1), len(s2)
dp = [[0] * (n + 1) for _ in range(m + 1)]
for i in range(m + 1):
dp[i][0] = i
for j in range(n + 1):
dp[0][j] = j
for i in range(1, m + 1):
for j in range(1, n + 1):
cost = 0 if s1[i - 1] == s2[j - 1] else 1
dp[i][j] = min(
dp[i - 1][j] + 1,
dp[i][j - 1] + 1,
dp[i - 1][j - 1] + cost
)
# Backtrace to find operations
operations = []
i, j = m, n
while i > 0 or j > 0:
if i > 0 and dp[i][j] == dp[i - 1][j] + 1:
operations.append(f"Delete '{s1[i-1]}' at position {i-1}")
i -= 1
elif j > 0 and dp[i][j] == dp[i][j - 1] + 1:
operations.append(f"Insert '{s2[j-1]}' at position {j-1}")
j -= 1
else:
if s1[i - 1] != s2[j - 1]:
operations.append(f"Substitute '{s1[i-1]}' → '{s2[j-1]}' at position {i-1}")
i -= 1
j -= 1
operations.reverse()
return dp[m][n], operations
# Test
dist, ops = edit_distance_with_operations("kitten", "sitting")
print(f"Distance: {dist}")
print("Operations:")
for op in ops:
print(f" {op}")
Similarity Score
Convert edit distance to a similarity percentage:
def similarity(s1: str, s2: str) -> float:
"""
Compute string similarity as a percentage (0.0 to 1.0).
1.0 = identical, 0.0 = completely different.
"""
distance = edit_distance_full_table(s1, s2)
max_len = max(len(s1), len(s2))
if max_len == 0:
return 1.0
return 1.0 - (distance / max_len)
print(similarity("kitten", "sitting")) # ~0.57
print(similarity("same", "same")) # 1.0
print(similarity("abc", "xyz")) # 0.0
Summary
Edit distance (Levenshtein distance) measures how many single-character operations are needed to transform one string into another. Dynamic programming fills a table using the recurrence: min(delete, insert, substitute). Variants include weighted costs, Damerau-Levenshtein (adds transposition), and similarity scoring.
Key takeaways:
- DP table: dp[i][j] = min cost to convert first i chars of s1 to first j chars of s2
- Recurrence: min(delete, insert, substitute) with cost 0 if characters match
- Space optimization: only keep two rows (O(min(m,n)) instead of full O(mn))
- Weighted version: different costs for insert, delete, substitute
- Damerau-Levenshtein: adds transposition (swap adjacent characters)
- Backtrace: reconstruct the actual operations from the DP table
- Similarity: 1 - (distance / max_len) gives a 0-1 similarity score
- Applications: spell check, DNA analysis, plagiarism detection, autocomplete
What's Next: Spell Check API
The next chapter wraps edit distance into a FastAPI spell-check service — a practical API that suggests corrections for misspelled words.