The midterm exam has three mandatory questions and one optional. All questions point to evaluate *Computational Thinking* skills: decomposition, pattern recognition, abstraction and algorithm design

# 1. Turtle draws Snow

This question asks us to make a **recursive** function. Have we seen them before?

Yes! We saw examples of recursive functions, such as the *factorial*, the *Fibonacci* and the one drawing trees. We made tree drawings on Quiz 2 and also in classes.

We have seen that *recursive* functions are functions that call themselves. The factorial of `N`

is `N`

times the factorial of `N-1`

. A tree (`N`

levels) has trunk and branches, and each branch is a smaller tree (with `N-1`

levels). We also saw that all recursive functions always have an `if()`

to handle the *exit condition*, typically when `N==1`

.

So we can start with a copy of the `tree()`

function. Here it is:

```
tree <- function(n, length, angle) {
turtle_lwd(n)
turtle_forward(length)
if (n > 1) {
turtle_left(angle)
tree(n-1,length*0.8, angle)
turtle_right(2*angle)
tree(n-1, length*0.8, angle)
turtle_left(angle)
tree(n-1, length*0.8, angle)
}
turtle_backward(length)
}
```

We have to change it to draw *snow* instead of *trees*. First we change the name from `tree`

to `snow`

. We also change the inputs. Instead of `n`

we use `N`

and instead of `length`

we use `L`

. There are no more inputs to `snow()`

.

```
snow <- function(N, L) {
turtle_lwd(N)
turtle_forward(L)
if (N > 1) {
turtle_left(angle)
snow(N-1,L*0.8, angle)
turtle_right(2*angle)
snow(N-1, L*0.8, angle)
turtle_left(angle)
snow(N-1, L*0.8, angle)
}
turtle_backward(L)
}
```

This version does not work, because it uses `angle`

which is not defined. Also, the question says that, instead of `L*0.8`

, each part must be of length `L/3`

. The important parts of the question are:

The turtle draws the first part, turns left 60 degrees, draws the second part, turns 120 degrees to the right, draws another part, turns 60 degrees left, and draws the last part.

[…]

In generalsnow levelof length`N`

`L`

is made of four parts ofsnow levelof length`N-1`

`L/3`

.

So we need to insert one *snow level N-1* and we know the angles:

```
snow <- function(N, L) {
turtle_lwd(N)
turtle_forward(L)
if (N > 1) {
snow(N-1, L/3)
turtle_left(60)
snow(N-1, L/3)
turtle_right(120)
snow(N-1, L/3)
turtle_left(60)
snow(N-1, L/3)
}
turtle_backward(L)
}
```

We can test this function and it works 😀 but the drawing is weird 🤪. The lines have different width, and the snow is outside the box. We prefer all lines of the same length, so we have to delete the `turtle_lwd(N)`

command.

The command `snow(1, 80)`

is ok, but the others are outside. So we need to do `turtle_forward(L)`

**only** when `N==1`

. The new version is

```
snow <- function(N, L) {
if(N==1) {
turtle_forward(L)
}
if (N > 1) {
snow(N-1, L/3)
turtle_left(60)
snow(N-1, L/3)
turtle_right(120)
snow(N-1, L/3)
turtle_left(60)
snow(N-1, L/3)
}
turtle_backward(L)
}
```

This one is also weird, but at least we get most of the snow inside the box. It looks like we are drawing and moving outside. What can be wrong?

The first `if()`

is ok, because the question says *“Snow level 1 is just a straight line of length L”*. The second

`if()`

is also right. It does exactly what the question asks.What about the `turtle_backward(L)`

? The question does not say anything about “going backwards” or “return to the initial condition”. In fact each *snow* has to start at the end of the previous snow. Let’s delete `turtle_backward(L)`

and see what happens.

```
snow <- function(N, L) {
if(N==1) {
turtle_forward(L)
}
if (N > 1) {
snow(N-1, L/3)
turtle_left(60)
snow(N-1, L/3)
turtle_right(120)
snow(N-1, L/3)
turtle_left(60)
snow(N-1, L/3)
}
}
```

Now it works. We can simplify it a last time using `if() {} else {}`

.

```
snow <- function(N, L) {
if(N==1) {
turtle_forward(L)
} else {
snow(N-1, L/3)
turtle_left(60)
snow(N-1, L/3)
turtle_right(120)
snow(N-1, L/3)
turtle_left(60)
snow(N-1, L/3)
}
}
```

Ready! Let’s go to the next question.

# 2. Algorithm design

We have to write a function, and we got this starting point:

Have we seen this before? Maybe is like finding *minimum* or *maximum*. We have never programed our own version of `min()`

, but it should not be so hard, isn’t?

Wait. There is `min()`

and `which.min()`

. This question ask for an *index*, so it is more like `which.min()`

. It is not about the *value* of the half value, it is about the *location* of the half value.

The *half value* is the average of the minimum and the maximum. Let’s re-write this phrase in R language using the starting point:

Is this enough? Let’s test:

`## [1] 5`

Good!

`## [1] 25`

Mmm…🤔 It does not look right. We got the *half value*, not the * location of the half value*. Is like doing

`min()`

instead of `which.min()`

It cannot be that easy. The teacher does not asks *easy*questions. And we are not

*lazy*students. .

Let’s read the question again. The question says:

The

location of the half valueof the vector`x`

is theindex of the first valuethat is equal or bigger than thehalf valueof`x`

.

Therefore we will need an index. Let’s say that the index is `i`

. And we need to check each one of the components of `x`

from the first to the last. So we need a `for(){}`

loop using `i`

as a counter.

```
locate_half <- function(x) {
half_value <- (min(x) + max(x))/2
for(i in 1:length(x)) {
...
return(...)
}
}
```

For each element of `x`

pointed by `i`

we need to make a decision. We want to know if `x[i]`

is bigger or equal to `half_value`

. Every time we make a decision, there must be an `if(){}`

. Let’s write that.

```
locate_half <- function(x) {
half_value <- (min(x) + max(x))/2
for(i in 1:length(x)) {
if(x[i] >= half_value) {
return(...)
}
}
}
```

Now, what shall we do if we find that `x[i] >= half_value`

? In that case we have found the *location of the half value*. It is the *index* of `x[i]`

, in other words, it is `i`

. We have to return that number.

```
locate_half <- function(x) {
half_value <- (min(x) + max(x))/2
for(i in 1:length(x)) {
if(x[i] >= half_value) {
return(i)
}
}
}
```

And we test again

`## [1] 5`

`## [1] 5`

`## [1] 7`

`## [1] 4`

Good! we got the expected result. Can we make it better?

Well, the question says that `x`

is growing, so `x[1]=min(x)`

and `x[length(x)]=max(x)`

. Using indices is faster than using functionsSpeed is not very important now, but if one day we do *big data* then speed is essential. The program has to finish before we get old. so we can rewrite our function as:

```
locate_half <- function(x) {
half_value <- (x[1] + x[length(x)])/2
for(i in 1:length(x)) {
if(x[i] >= half_value) {
return(i)
}
}
}
```

Let’s move to the next question.

# 3. Systems: Polymerase Chain Reaction

The PCR system is shown in the question as this:

Have we seen a similar system before?

## Simulate several PCR cycles

Yes! It is the same plot we got for cell growth in Class 11 and in the blog document. **This is pattern recognition again**. A system depends only on how many circles connect to how many boxes with how many arrows. It does

*not*depend on the names in the circles/boxes, or the way we draw them.

**This is**.

*abstraction*againSo we can start with the code for `cell_culture()`

from the class.

```
cell_culture <- function(N, replication_rate, cells_init, food_ini) {
# create empty vectors
cells <- d_cells <- rep(NA, N)
food <- d_food <- rep(NA, N)
# initialize values
cells[1] <- cells_init
food[1] <- food_ini
d_cells[1] <- d_food[1] <- 0
for(i in 2:N) {
# update `d_cells` and `d_food`
d_cells[i] <- +replication_rate * cells[i-1] * food[i-1]
d_food[i] <- -replication_rate * cells[i-1] * food[i-1]
# update `cells` and `food` as a cumulative sum
cells[i] <- cells[i-1] + d_cells[i]
food[i] <- food[i-1] + d_food[i]
}
return(data.frame(cells, food))
}
```

We have to change the name of the function. We also change the names of the variables, from `cells`

to `DNA`

, from `food`

to `primers`

and from `eating`

to `cycle`

. We change also the names for the variables with names like `d_cells`

. This is easy to do with the *Search & Replace* option in the editor. Also, `replication_rate`

becomes `cycle_rate`

, or just `rate`

, as the question indicates. We got this:

```
pcr <- function(N, rate, dna_ini, primers_ini) {
# create empty vectors
dna <- d_dna <- rep(NA, N)
primers <- d_primers <- rep(NA, N)
# initialize values
dna[1] <- dna_ini
primers[1] <- primers_ini
d_dna[1] <- d_primers[1] <- 0
for(i in 2:N) {
# update `d_dna` and `d_primers`
d_dna[i] <- +rate * dna[i-1] * primers[i-1]
d_primers[i] <- -rate * dna[i-1] * primers[i-1]
# update `dna` and `primers` as a cumulative sum
dna[i] <- dna[i-1] + d_dna[i]
primers[i] <- primers[i-1] + d_primers[i]
}
return(data.frame(dna, primers))
}
```

Now we need to include the *default values* for `rate`

and `primer_ini`

.

```
pcr <- function(N, dna_ini, primers_ini=1e8, rate=1e-8) {
# create empty vectors
dna <- d_dna <- rep(NA, N)
primers <- d_primers <- rep(NA, N)
# initialize values
dna[1] <- dna_ini
primers[1] <- primers_ini
d_dna[1] <- d_primers[1] <- 0
for(i in 2:N) {
# update `d_dna` and `d_primers`
d_dna[i] <- +rate * dna[i-1] * primers[i-1]
d_primers[i] <- -rate * dna[i-1] * primers[i-1]
# update `dna` and `primers` as a cumulative sum
dna[i] <- dna[i-1] + d_dna[i]
primers[i] <- primers[i-1] + d_primers[i]
}
return(data.frame(dna, primers))
}
```

Now we test and see if we did it right.

```
## dna primers
## 1 1000000 100000000
## 2 2000000 99000000
## 3 3980000 97020000
## 4 7841396 93158604
## 5 15146331 85853669
## 6 28150012 72849988
```

Yup! It’s done. And we only used *Search & Replace* and pattern recognition.

## PCR depends on initial concentration

This question asks us to write code directly, not any function. It gives us some initial data:

`## [1] 1e+00 1e+01 1e+02 1e+03 1e+04 1e+05 1e+06`

For each of these values we need to simulate 30 PCR cycles and get the `dna`

value. Let’s see if we can do just one first.

```
## dna primers
## 1 1.000000e+00 100000000.0
## 2 2.000000e+00 99999999.0
## 3 4.000000e+00 99999997.0
## 4 8.000000e+00 99999993.0
## 5 1.600000e+01 99999985.0
## 6 3.200000e+01 99999969.0
## 7 6.399998e+01 99999937.0
## 8 1.279999e+02 99999873.0
## 9 2.559997e+02 99999745.0
## 10 5.119987e+02 99999489.0
## 11 1.023995e+03 99998977.0
## 12 2.047979e+03 99997953.0
## 13 4.095916e+03 99995905.1
## 14 8.191665e+03 99991809.3
## 15 1.638266e+04 99983618.3
## 16 3.276263e+04 99967238.4
## 17 6.551454e+04 99934486.5
## 18 1.309861e+05 99869014.9
## 19 2.618007e+05 99738200.3
## 20 5.229161e+05 99477084.9
## 21 1.043098e+06 98956903.3
## 22 2.075315e+06 97924686.1
## 23 4.107561e+06 95892440.5
## 24 8.046401e+06 91953600.4
## 25 1.544536e+07 84554645.4
## 26 2.850512e+07 71494879.8
## 27 4.888482e+07 51115177.6
## 28 7.387239e+07 26127613.3
## 29 9.317348e+07 6826521.5
## 30 9.953399e+07 466013.9
```

Ok, we got something. Maybe too much, since we got `dna`

and `primers`

. We only want `dna`

. Since the result is a data frame, we can use any kind of index to get the first column, such as `$dna`

or `[[1]]`

or `[,"dna"]`

. Data frames can be used in many different ways, as we learned last semesterDid we learn that? At least it was in the lectures. .

Let’s try again

```
## [1] 1.000000e+00 2.000000e+00 4.000000e+00 8.000000e+00 1.600000e+01
## [6] 3.200000e+01 6.399998e+01 1.279999e+02 2.559997e+02 5.119987e+02
## [11] 1.023995e+03 2.047979e+03 4.095916e+03 8.191665e+03 1.638266e+04
## [16] 3.276263e+04 6.551454e+04 1.309861e+05 2.618007e+05 5.229161e+05
## [21] 1.043098e+06 2.075315e+06 4.107561e+06 8.046401e+06 1.544536e+07
## [26] 2.850512e+07 4.888482e+07 7.387239e+07 9.317348e+07 9.953399e+07
```

Now we got a vector. We need to do it seven times. We can do it one by one.

```
conc <- data.frame(V1=pcr(N=30, dna_ini=initial_dna[1])$dna,
V2=pcr(N=30, dna_ini=initial_dna[2])$dna,
V3=pcr(N=30, dna_ini=initial_dna[3])$dna,
V4=pcr(N=30, dna_ini=initial_dna[4])$dna,
V5=pcr(N=30, dna_ini=initial_dna[5])$dna,
V6=pcr(N=30, dna_ini=initial_dna[6])$dna,
V7=pcr(N=30, dna_ini=initial_dna[7])$dna)
```

It is not very elegant, but it works. Let’s test it.

```
plot(x=c(1,30), y=c(0, max(conc)), type="n", xlab="PCR cycle",
ylab="DNA Concentration", main="Effect of initial DNA on PCR")
for(i in 1:ncol(conc)) {
points(conc[[i]], pch=i, type="b")
}
```

We are ready. We can move forward.

But we can do better. If we still have time, we can improve this code to be more generic. We would like to have a program that works for any number of initial conditions. We have go step by step.

First we need an “empty” data frame. We cannot make a completely empty data frame, but we can start with a single column with an empty vector. Each simulation has 30 steps, so we need a vector of size 30.

Now we need a loop for each element of `initial_dna`

:

Inside the loop we have to fill each of the columns of `conc`

. Can we do that, even if we have only one column? Yes, since data frames can grow. Last semester we saw that it was quite easy to add new columns to a data frame. We just use an index and assign a vector. If the indexed column does not exist on the data frame, it is created *automagically*. Let’s do it.

Let’s see if this works:

```
plot(x=c(1,30), y=c(0, max(conc)), type="n", xlab="PCR cycle",
ylab="DNA Concentration", main="Effect of initial DNA on PCR")
for(i in 1:ncol(conc)) {
points(conc[[i]], pch=i, type="b")
}
```

This curves can be measured in real PCR experiments. In some cases we can measure the *optical density (OD)* through the PCR cycle. In other cases we can use fluorescence. Both methods give us a signal that is proportional to DNA concentration. If the real curves match our model, then our model is useful.

Now we can go home.

# 4. Bonus: Quantitative PCR

Bonus! This is optional, but it gives more points and more knowledge. Also, it makes us prouder of our accomplishments. So we do not go home yet.

Here again we have to write code, not a function. We have make a vector called `CT`

, which contains the *half value* of each column of the data frame `conc`

. We find the *half value* with function `locate_half()`

which we already created. And we have to do it for each column of `conc`

. Like in the previous question we can do it one by one:

```
CT <- c(locate_half(conc[[1]]),
locate_half(conc[[2]]),
locate_half(conc[[3]]),
locate_half(conc[[4]]),
locate_half(conc[[5]]),
locate_half(conc[[6]]),
locate_half(conc[[7]]))
```

Hey, doing it one by one is **boring** 😒 and it makes me feel like I’m working for the computer 😖, like in a dystopian movie. Let’s make the computer work for us. If we have to do the same thing more than two times, we write a `for(){}`

loop to do it for us. Life is short, let’s leave the boring stuff for the computers.

We use the same pattern we have done many times. We create an empty vector of the correct sizeWe can even be wrong about the size of the vector. If we make a small vector it will grow as needed later. But is more efficient to do it right from the beginning. and then we use a loop.

Then we draw the plot and we see the points in a “straight” line.

This result is important because it means that there is a relationship between the initial DNA concentration and the number of cycles required to cross a fixed threshold. The name `CT`

means *cycle threshold*.

If we know this relationship, we can determine with high precision what was the initial DNA concentration. That is the idea of *Quantitative PCR*.

Let’s find that relationship We remember that straight lines can be modeled with *linear models*So *linear models* are part of *molecular biology*? Yes, very much. Good that we learned them last semester. . The name says it all. The plot uses *log* in the vertical coordinate, so the linear model also needs logarithmic scale. We can use any logarithm, since all do the trick. In this case we can use `log10`

since it is easier to understand.

```
##
## Call:
## lm(formula = log10(initial_dna) ~ CT)
##
## Coefficients:
## (Intercept) CT
## 8.3241 -0.3006
```

Since *Intercept* is 8.3241 then the concentration corresponding to CT=0 is \(10^{8.3241}\)=2.1091137 × 10^{8}. With so much initial DNA the threshold is crossed before we start the experiment.

The slope constant for CT is -0.3, meaning that when CT increases by one, the initial DNA concentration decreases \(10^{-0.3}\)=0.5011872 times. That is like half. In other words, double DNA means one less cycle. This means that the PCR reaction *duplicates* the number of DNA molecules on each cycle. That happens only when we have 100% efficiency.

Let’s make a table:

```
ans <- data.frame(CT = seq(from=0, to=40, by=2))
ans$initial_DNA <- 10^predict(model, newdata = ans)
knitr::kable(ans, format.args=list(digits=2))
```

CT | initial_DNA |
---|---|

0 | 2.1e+08 |

2 | 5.3e+07 |

4 | 1.3e+07 |

6 | 3.3e+06 |

8 | 8.3e+05 |

10 | 2.1e+05 |

12 | 5.2e+04 |

14 | 1.3e+04 |

16 | 3.3e+03 |

18 | 8.2e+02 |

20 | 2.1e+02 |

22 | 5.2e+01 |

24 | 1.3e+01 |

26 | 3.2e+00 |

28 | 8.1e-01 |

30 | 2.0e-01 |

32 | 5.1e-02 |

34 | 1.3e-02 |

36 | 3.2e-03 |

38 | 8.0e-04 |

40 | 2.0e-04 |

This means that, if we can do 40 PCR cycles, we would be able to detect DNA in low concentrations, as low as 0.0002, which is 2 molecules in 10 liters. Amazing.

Now, what is the *margin of error* for this measurement? That is matter for another day.