We will explore some methods to find which parts of a text are similar to a pattern. For instance, the text can be a genome, and the pattern can be a gene or a motif, but the same ideas apply to any text and any fixed pattern.
For this, we need a dissimilarity function \[ d: \mathcal S \times \mathcal S \to \mathbb R\] such that, for all \(x, y, z \in \mathcal S,\) we have
- \(d(x, y)\geq 0\) (non-negativity)
- \(d(x, x)=0\) (reflexivity)
- \(d(x, y)=d(y, x)\) (symmetry)
- \(d(x, y) + d(y, z)\geq d(x,z)\) (triangular inequality)
that is, the dissimilarity is never a negative number (i.e. it is always positive or zero), the dissimilarity if anything to itself is 0 (i.e everything is similar to itself), the dissimilarity is independent of the order, and the direct comparison of two elements is always better (or the same) than comparing to a third element.
In this case the set \(\mathcal S\) will be the set of character vectors, or strings. That is, we will compare words and parts of texts, such as genes, genomes, motifs and proteins.
Write a function (in any formal language), taking two strings of the same size —named
y—, and returning the Hamming distance of the two sequences. A good name for this function is
Remember that the Hamming distance is just the number of letters that are different between the two strings. It is the number of mismatches. If the strings represent DNA, the Hamming distance represents the number of mutations.
Write another function that takes two strings of different size —named
text—, and calculates the Hamming distance between
patternand the corresponding substring of
textat each location. For example in R language, if
m, then the value at location
ans[k] <- Hamming(pattern, text[k:(k+m-1)])
If you use Python or other similar language, the indices will be a little different. They key part of this question is to find the valid range for
Write a third function, based on the previous one, that takes two strings of different size —named
text— and a number named
thr, and returns a list of all locations in
textwhose distance to
patternis less or equal that the threshold