We will explore some methods to find which parts of a text are *similar* to a pattern. For instance, the *text* can be a genome, and the *pattern* can be a gene or a motif, but the same ideas apply to any text and any fixed pattern.

For this, we need a *dissimilarity* function \[ d: \mathcal S \times \mathcal S \to \mathbb R\] such that, for all \(x, y, z \in \mathcal S,\) we have

- \(d(x, y)\geq 0\) (non-negativity)
- \(d(x, x)=0\) (reflexivity)
- \(d(x, y)=d(y, x)\) (symmetry)
- \(d(x, y) + d(y, z)\geq d(x,z)\) (triangular inequality)

that is, the dissimilarity is never a negative number (i.e. it is always positive or zero), the dissimilarity if anything to itself is 0 (i.e everything is similar to itself), the dissimilarity is independent of the order, and the direct comparison of two elements is always better (or the same) than comparing to a third element.

In this case the set \(\mathcal S\) will be the set of *character vectors*, or *strings*. That is, we will compare *words* and parts of texts, such as genes, genomes, motifs and proteins.

Write a function (in any formal language), taking two strings of the same size —named

`x`

and`y`

—, and returning the*Hamming distance*of the two sequences. A good name for this function is`Hamming`

.Remember that the Hamming distance is just the number of letters that are different between the two strings. It is the number of mismatches. If the strings represent DNA, the Hamming distance represents the number of mutations.

Write another function that takes two strings of different size —named

`pattern`

and`text`

—, and calculates the Hamming distance between`pattern`

and the corresponding substring of`text`

at each location. For example in R language, if`pattern`

length is`m`

, then the value at location`k`

will be`ans[k] <- Hamming(pattern, text[k:(k+m-1)])`

If you use Python or other similar language, the indices will be a little different. They key part of this question is to find the valid range for

`k`

.Write a third function, based on the previous one, that takes two strings of different size —named

`pattern`

and`text`

— and a number named`thr`

, and returns a list of all locations in`text`

whose distance to`pattern`

is less or equal that the threshold`thr`

.