Longest Common Subsequence — Comparing Strings with DP
Why LCS Matters
The Longest Common Subsequence (LCS) problem finds the longest sequence of characters that appears in the same order in two strings. Unlike substrings, subsequences do not need to be contiguous — they just need to appear in order.
Why this matters for your career:
- LCS powers
git diff— it finds the minimal changes between file versions - DNA sequencing uses LCS to align genetic sequences
- Plagiarism detectors compare document similarity using LCS
- Version control systems rely on LCS for merge operations
- LCS is a classic DP problem asked in technical interviews
LCS vs. Substring
| Feature | Subsequence | Substring | |---------|-------------|----------| | Contiguous? | No (can skip characters) | Yes (must be consecutive) | | Example in "abcdef" | "ace", "bf", "adf" | "abc", "cde", "def" | | LCS of "ABC" and "ACB" | "AB" or "AC" (length 2) | "A" or "B" or "C" (length 1) | | Algorithm | Dynamic Programming O(mn) | Sliding window O(m+n) | | Use Case | Diff, DNA alignment | Pattern matching |
DP Table Approach
LCS uses a 2D DP table where dp[i][j] = LCS length of first i chars of X and first j chars of Y.
Recurrence Relation
if X[i-1] == Y[j-1]:
dp[i][j] = dp[i-1][j-1] + 1
else:
dp[i][j] = max(dp[i-1][j], dp[i][j-1])
Example: LCS of "ABCBDAB" and "BDCABA"
ε B D C A B A
ε 0 0 0 0 0 0 0
A 0 0 0 0 1 1 1
B 0 1 1 1 1 2 2
C 0 1 1 2 2 2 2
B 0 1 1 2 2 3 3
D 0 1 2 2 2 3 3
A 0 1 2 2 3 3 4
B 0 1 2 2 3 4 4
LCS length = 4. One possible LCS: "BCAB" or "BDAB".
Implementation
def lcs_length(X: str, Y: str) -> int:
"""Return the length of the LCS of X and Y."""
m, n = len(X), len(Y)
dp = [[0] * (n + 1) for _ in range(m + 1)]
for i in range(1, m + 1):
for j in range(1, n + 1):
if X[i - 1] == Y[j - 1]:
dp[i][j] = dp[i - 1][j - 1] + 1
else:
dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
return dp[m][n]
print(lcs_length("ABCBDAB", "BDCABA")) # 4
print(lcs_length("ABCD", "ACBD")) # 3
print(lcs_length("ABC", "DEF")) # 0
Finding the Actual LCS String
def lcs_string(X: str, Y: str) -> str:
"""Return one LCS string between X and Y."""
m, n = len(X), len(Y)
dp = [[0] * (n + 1) for _ in range(m + 1)]
for i in range(1, m + 1):
for j in range(1, n + 1):
if X[i - 1] == Y[j - 1]:
dp[i][j] = dp[i - 1][j - 1] + 1
else:
dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
# Backtrace to find the actual LCS
i, j = m, n
result = []
while i > 0 and j > 0:
if X[i - 1] == Y[j - 1]:
result.append(X[i - 1])
i -= 1
j -= 1
elif dp[i - 1][j] > dp[i][j - 1]:
i -= 1
else:
j -= 1
return "".join(reversed(result))
print(lcs_string("ABCBDAB", "BDCABA")) # BCAB or BDAB
print(lcs_string("ABCD", "ACBD")) # ABD
Space-Optimized LCS Length
def lcs_length_optimized(X: str, Y: str) -> int:
"""LCS length using O(min(m,n)) space."""
if len(X) < len(Y):
X, Y = Y, X
m, n = len(X), len(Y)
prev = [0] * (n + 1)
curr = [0] * (n + 1)
for i in range(1, m + 1):
for j in range(1, n + 1):
if X[i - 1] == Y[j - 1]:
curr[j] = prev[j - 1] + 1
else:
curr[j] = max(prev[j], curr[j - 1])
prev, curr = curr, prev
return prev[n]
print(lcs_length_optimized("ABCBDAB", "BDCABA")) # 4
Applications
| Domain | Application |
|--------|-------------|
| Version Control | git diff finds minimal edits between file versions |
| Bioinformatics | DNA sequence alignment (human vs. chimpanzee) |
| Plagiarism Detection | Compare student submissions against a database |
| Text Comparison | Find common content between two documents |
| Speech Recognition | Align recognized words with expected phrases |
| Data Deduplication | Find duplicate records across databases |
| Code Review | Find copied code across repositories |
LCS in Git Diff
# Git uses LCS to find minimal edits
Original: "The quick brown fox"
Modified: "The fast brown fox"
LCS: "The brown fox"
Deletions: "quick"
Insertions: "fast"
# Result:
- The quick brown fox
+ The fast brown fox
Summary
LCS finds the longest common subsequence between two strings using dynamic programming. The DP table builds up the solution, and backtracing recovers the actual subsequence. Space optimization reduces memory from O(mn) to O(min(m,n)).
Key takeaways:
- LCS finds characters in the same order (not necessarily contiguous)
- DP recurrence: match → dp[i-1][j-1]+1, no match → max(dp[i-1][j], dp[i][j-1])
- Backtrace recovers the actual LCS string from the DP table
- Space-optimized: O(min(m,n)) memory, O(mn) time
- Applications: git diff, DNA alignment, plagiarism detection
- LCS vs substring: LCS allows gaps, substring does not
- LCS is a fundamental DP problem — master the recurrence and backtrace
What's Next: Edit Distance
The next chapter covers Edit Distance (Levenshtein Distance) — a variant of LCS that counts insertions, deletions, and substitutions to transform one string into another.