Each question will be answered in a separate file. Please download the answer files for Question 1 and Question 2. Write your name and student number in the correct place at the beginning of each answer file.

Question 0 should be answered immediately, right now. Stop reading now and answer the question. Only continue reading after you delivered the picture. If you do not deliver this right now, your exam will not be graded, and you will get 0.

If you have any issues or questions, write to me at my official email andres.aravena@istanbul.edu.tr (only for questions, not for answers). When you finish each answer send it to andres.aravena+cmb@istanbul.edu.tr (only for answers, not for questions). You can send each file one-by-one when you finish it.

# 0. Ethical Commitment

Copy the following text in a blank paper. Write it with your own calligraphy. Sign it, take a picture and send it immediately to .

“Şerefim üzerine söz veririm ki, bu sınav sırasında etik kurallari çiǧnemedim”

Full Name:

Student Number:

Signature:

Date:

All answers are strictly personal. Any unethical behavior will be penalized.

# 1. Breeding rabbits

Nine hundred years ago the Italian businessman Leonardo Bonacci (known as Fibonacci) created a famous model to predict the number of rabbits to breed. We want to make an improved model, by assuming the following hypotheses:

• We will count pairs of rabbits.
• The population begins in the first month with one pair of baby rabbits.
• In one month, rabbits become adult
• One more later, rabbits become pregnant.
• Exactly one month after becoming pregnant, rabbits produce a new pair of rabbits.
• Rabbits never die or stop reproducing.
• Once rabbits become pregnant for the first time, they will be pregnant every month.

We want to know how many baby rabbits we will have every month. We start with a single pair of baby rabbits in the first month. In the second month, the rabbits become adults (so there are 0 baby rabbits). In the third month, rabbits are pregnant, so still 0 baby rabbits. On month 4 we get a new pair of baby rabbits. In other words, the number of baby rabbits on any month is equal to the number of pregnant rabbits the month before, which depends on the number of baby rabbits three months earlier.

Doing some abstraction, we can create a function baby_rabbits(n) to represent the number of pairs of baby rabbits on month n. This function must follow the following rules:

• On month 1 there is 1 pair of baby rabbits, so baby_rabbits(1) gives 1
• On months 2 and 3 there are 0 baby rabbits, so baby_rabbits(2) and baby_rabbits(3) must be 0.
• After the third month, on month n, the number of baby rabbits is baby_rabbits(n-1) + baby_rabbits(n-3)

Please notice that this is not the typical Fibonacci sequence. This is another sequence.

## 1.1 Recursive solution

Write a recursive function in R that has input n and output baby_rabbits(n). You can test your function by checking that at the end of the first year there are 144 pairs of rabbits.

baby_rabbits <- function(n) {
# Write your code here
}

You can test your function by calculating the number of baby rabbits for two years.

bunnies <- rep(NA, 24)
for(n in 1:length(bunnies)) {bunnies[n] <- baby_rabbits(n)}
bunnies
##  [1]    1    0    0    1    1    1    2    3    4    6    9   13   19   28   41
## [16]   60   88  129  189  277  406  595  872 1278

## 1.2 System-modeling solution

The recursive formula is fine but slow. It takes too much time to finish when n is big. To solve this we can model the rabbits as a system. We have three kinds of rabbits: baby, adult and pregnant rabbits. In each month baby rabbits grow, so the number of adult rabbits increases and the number of baby rabbits decreases. The same happens with adult and pregnant rabbits. On the other side, each of pregnant rabbit produces a new baby rabbit, so the number of baby rabbits increases by the number of pregnant rabbits. We can represent this system by the following figure:

The initial conditions are: baby_ini=1, adult_ini=0 and pregnant_ini=0. The rate constants for growth, mating and birth are 1. Please simulate the system for N=24 steps. To make the problem easier, you can use the following skeleton and complete it.

breeding <- function(N, baby_ini, adult_ini, pregnant_ini,
growth_rate, mating_rate, birth_rate) {
baby     <- d_baby     <- rep(NA, N)
adult    <- d_adult    <- rep(NA, N)
pregnant <- d_pregnant <- rep(NA, N)
baby[1]     <- baby_ini
adult[1]    <- adult_ini
pregnant[1] <- pregnant_ini
d_baby[1] <- d_adult[1] <- d_pregnant[1] <- 0
for(i in 2:N) {
# write your code here
baby[i]     <- baby[i-1]     + d_baby[i]
adult[i]    <- adult[i-1]    + d_adult[i]
pregnant[i] <- pregnant[i-1] + d_pregnant[i]
}
return(data.frame(Month=1:N, baby, adult, pregnant))
}

If everything is right, the following code should give the same result as the recursive function.

rabbits <- breeding(N=24, baby_ini=1, adult_ini=0, pregnant_ini=0,
growth_rate=1, mating_rate=1, birth_rate=1)
rabbits\$baby
##  [1]    1    0    0    1    1    1    2    3    4    6    9   13   19   28   41
## [16]   60   88  129  189  277  406  595  872 1278

## 1.3 Change parameters to find new_rabbits

What happens if the pregnant rabbits produce more baby rabbits every month, but not all of the babies survive to be adults? For example, what happens if birth_rate is 2 and growth_rate is 0.5?

Use the same breeding() function to simulate a new instance of the system under the new conditions. Store the results in a data frame called new_rabbits. Draw the following plot showing the column baby of data frame rabbits (from previous question) and the same column in data frame new_rabbits.

# write here

## 1.4 Find the growth rate of new_rabbits

As with most unbounded growth processes, this system seems to grow exponentially. What is the growth rate?

Write the code to fit a linear model for the column baby in new_rabbits and calculate the growth rate from the fitted coefficients. You may need to transform the data.

## (Intercept)       Month
##   0.1955867   1.5738819
# write here

# 2. Codon Adaptation Index

An important question in molecular biology is to identify which genes are easier to translate and which ones are harder. One of the most common ways to find genes that are “easy to translate” is to evaluate their Codon Adaptation Index (CAI). High CAI values suggests that the gene is important for cell survival, probably a housekeeping gene. A low CAI values suggests that the gene is seldom used, maybe acquired recently through horizontal transfer.

The CAI is based on the following idea. Most amino-acids are encoded by several codons. For example Valine can be encoded by “gta”, “gtc”, “gtg”, or “gtt”. Each organism has a different preference for which codon to use. The translation mechanism is more efficient for some codons. For example, tRNA genes are different in each organism.

The relative adaptiveness of each codon is the ratio of the usage of each codon, to that of the most abundant codon for the same amino acid. The CAI index is defined as the geometric mean of these relative adaptiveness values. This can be a little hard to calculate, so instead, we will solve a simpler problem, which is mathematically equivalent.

We will calculate the average of the logarithms of each codon’s relative adaptiveness. Non-synonymous codons and termination codons are not counted. For more details, you can read the original paper by Sharp PM et al. (Nucleic Acids Res. 1987;15(3)) or the Wikipedia entry.

In summary:

• First, we want to know which codons are the most used for each amino-acid
• That codon will have a relative adaptiveness of 1
• The rest of the codons for the same amino-acid have relative adaptiveness proportional to their frequency
• Then we can estimate how efficient is each gene’s translation
• We calculate a weighted average of each codon’s relative adaptiveness
• The weights are different on each gene
• Specifically, the weights are the counts of each codon in the gene.

Here you have the code of the function codon_count(). You can use it in your answer, you do not need to re-write it.

valid_codons <- c("aaa", "aac", "aag", "aat", "aca", "acc", "acg", "act",
"aga", "agc", "agg", "agt", "ata", "atc", "atg", "att", "caa", "cac",
"cag", "cat", "cca", "ccc", "ccg", "cct", "cga", "cgc", "cgg", "cgt",
"cta", "ctc", "ctg", "ctt", "gaa", "gac", "gag", "gat", "gca", "gcc",
"gcg", "gct", "gga", "ggc", "ggg", "ggt", "gta", "gtc", "gtg", "gtt",
"taa", "tac", "tag", "tat", "tca", "tcc", "tcg", "tct", "tga", "tgc",
"tgg", "tgt", "tta", "ttc", "ttg", "ttt")

codon_count <- function(gene) {
answer <- rep(0, 64)
names(answer) <- valid_codons
for(i in seq(from=1, to=length(gene), by=3)) {
codon <- tolower(paste0(gene[i], gene[i+1], gene[i+2]))
answer[codon] <- answer[codon] + 1
}
return(answer[valid_codons])
}

## 2.1 Download the FASTA file containing all the genes of your bacteria and read them into the list genes.

Use the same bacteria that was assigned to you in the midterm exam. This is just code, not a function.

# Write your code here

Now you can show the number of genes. This number will be different for each bacteria.

length(genes)
## [1] 4140

## 2.2 Create a list called genes_codon_count, containing one vector for each gene in the bacterial genome. Each vector is the result of using codon_count() on each gene.

This is just code, not a function.

genes_codon_count <- list()
# write your code here

The length of genes_codon_count should be equal to the number of genes. This number will be different for each bacteria.

length(genes_codon_count)
## [1] 4140

and the length of each element should be 64, always.

length(genes_codon_count[[1]])
## [1] 64

## 2.3 Create the vector total_codon_count containing the total number of times that each codon appears on all genes.

This is just code, not a function.

total_codon_count <- rep(0, 64)
names(total_codon_count) <- valid_codons
# write your code here

Now you can show the number of times that each codon is in all the genes. These numbers will be different for each bacteria.

total_codon_count
##   aaa   aac   aag   aat   aca   acc   acg   act   aga   agc   agg   agt   ata
## 44236 28319 13384 22756  8975 30972 18970 11577  2489 21131  1363 11322  5345
##   atc   atg   att   caa   cac   cag   cat   cca   ccc   ccg   cct   cga   cgc
## 33331 36700 40171 20208 12814 38152 16937 11058  7138 30969  9128  4523 29301
##   cgg   cgt   cta   ctc   ctg   ctt   gaa   gac   gag   gat   gca   gcc   gcg
##  6983 27843  5072 14702 70390 14403 52330 25214 23456 42135 26535 33898 44900
##   gct   gga   ggc   ggg   ggt   gta   gtc   gtg   gtt   taa   tac   tag   tat
## 19999 10216 39366 14464 32655 14325 20227 34796 24031  2678 16079   287 21055
##   tca   tcc   tcg   tct   tga   tgc   tgg   tgt   tta   ttc   ttg   ttt
##  9154 11321 11747 10986  1178  8482 20060  6706 18085 21827 17992 29304

## 2.4 Compute the relative adaptiveness of each codon

This is the critical part of this exam. We need to calculate the relative adaptiveness of each codon, using the values in the total_codon_count vector.

The adaptiveness is calculated amino-acid by amino-acid. For each amino-acid AA, the adaptiveness of codon C is the logarithm of the count for codon C divided by the maximum count among all the codons encoding for amino-acid AA.

In other words, you can calculate the adaptiveness of every codon, and store only these codons it in the vector codon_adaptness, by doing the following steps:

• For each amino-acid AA:
• Find the codons that encode AA it and store them in a vector called codons_for_aa.
• Find the maximum of total_codon_count among the codons in codons_for_aa. Look only at these codons, ignore the rest. Store this maximum in max_count_for_aa.
• For each codon in codons_for_aa:
• Find the value of total_codon_count for that particular codon.
• Divide it by max_count_for_aa.
• Calculate the logarithm of the result of the last division, and store it in codon_adaptness using the codon as the index.
• Repeat for each codon in codons_for_aa.
• Repeat for each amino acid.

The result will be the vector codon_adaptness that you will use in the following questions. This is just code, not a function. It is recommended to use the following list, that describes which codons encode each amino-acid.

genetic_code <- list(Ala = c("gca", "gcc", "gcg", "gct"),
Arg = c("aga", "agg", "cga", "cgc", "cgg", "cgt"),
Asn = c("aac", "aat"), Asp = c("gac", "gat"),
Cys = c("tgc", "tgt"), Gln = c("caa", "cag"),
Glu = c("gaa", "gag"), Gly = c("gga", "ggc", "ggg", "ggt"),
His = c("cac", "cat"), Ile = c("ata", "atc", "att"),
Leu = c("cta", "ctc", "ctg", "ctt", "tta", "ttg"),
Lys = c("aaa", "aag"), Phe = c("ttc", "ttt"),
Pro = c("cca", "ccc", "ccg", "cct"),
Ser = c("agc", "agt", "tca", "tcc", "tcg", "tct"),
Thr = c("aca", "acc", "acg", "act"),
Tyr = c("tac", "tat"), Val = c("gta", "gtc", "gtg", "gtt"))
codon_adaptness <- rep(0, 64)
names(codon_adaptness) <- valid_codons
# Write your code here

The result should look something like this.

codon_adaptness
##         aaa         aac         aag         aat         aca         acc
##  0.00000000  0.00000000 -1.19547897 -0.21870411 -1.23864064  0.00000000
##         acg         act         aga         agc         agg         agt
## -0.49022479 -0.98406320 -2.46574062  0.00000000 -3.06793349 -0.62399342
##         ata         atc         atg         att         caa         cac
## -2.01698380 -0.18665745  0.00000000  0.00000000 -0.63549962 -0.27896225
##         cag         cat         cca         ccc         ccg         cct
##  0.00000000  0.00000000 -1.02983256 -1.46755408  0.00000000 -1.22164009
##         cga         cgc         cgg         cgt         cta         ctc
## -1.86844615  0.00000000 -1.43414302 -0.05104006 -2.63031599 -1.56606767
##         ctg         ctt         gaa         gac         gag         gat
##  0.00000000 -1.58661469  0.00000000 -0.51347935 -0.80244349  0.00000000
##         gca         gcc         gcg         gct         gga         ggc
## -0.52597318 -0.28108178  0.00000000 -0.80875552 -1.34894738  0.00000000
##         ggg         ggt         gta         gtc         gtg         gtt
## -1.00123970 -0.18690452 -0.88749618 -0.54248409  0.00000000 -0.37015777
##         taa         tac         tag         tat         tca         tcc
##  0.00000000 -0.26962399  0.00000000  0.00000000 -0.83655021 -0.62408175
##         tcg         tct         tga         tgc         tgg         tgt
## -0.58714327 -0.65411942  0.00000000  0.00000000  0.00000000 -0.23494362
##         tta         ttc         ttg         ttt
## -1.35896834 -0.29457629 -1.36412399  0.00000000

## 2.5 Evaluate CAI for one gene

Write a function, called gene_cai(), that takes a vector number_of_codons and returns the weighted average of codon_adaptness for each codon.

In other words, you must multiply each component of number_of_codons with the corresponding component of codon_adaptness, sum all the results, and then divide by the sum of all number_of_codons. This answer must be a function.

If you could not calculate codon_adaptness in the previous question, then you can use the values for E.coli, that you can download here.

gene_cai <- function(number_of_codons, codon_adaptness) {
# write your code here
}

If everything is right, you should see something like this. The result may have name or not.

gene_cai(genes_codon_count[[1]], codon_adaptness)
##        aaa
## -0.0817776

## 2.6 Calculate CAI for all genes.

Calculate the Codon Adaptation Index for every gene and store these values in a vector called CAI. This is just code.

# write your code here

If everything is right, you should see this result:

hist(CAI, col="grey")

## 2.7 Find the gene with the smallest CAI

Write the code to find the index of the smallest value in the vector CAI. Assign that index in the variable index_of_smallest_CAI.

# write your code here

Now you can see the description of the less adapted gene with the following code

getAnnot(genes)[[index_of_smallest_CAI]]
## [1] ">lcl|NC_000913.3_cds_YP_003933616.1_4159 [gene=mgtL] [protein=regulatory leader peptide for mgtA] [protein_id=YP_003933616.1] [location=4467431..4467484]"