Each question will be answered in a separate file. Please download the answer files for Question 1 and Question 2. Write your name and student number in the correct place at the beginning of each answer file.
Question 0 should be answered immediately, right now. Stop reading now and answer the question. Only continue reading after you delivered the picture. If you do not deliver this right now, your exam will not be graded, and you will get 0.
If you have any issues or questions, write to me at my official email andres.aravena@istanbul.edu.tr
(only for questions, not for answers). When you finish each answer send it to andres.aravena+cmb@istanbul.edu.tr
(only for answers, not for questions). You can send each file one-by-one when you finish it.
Copy the following text in a blank paper. Write it with your own calligraphy. Sign it, take a picture and send it immediately to andres.aravena+cmb@istanbul.edu.tr.
“Şerefim üzerine söz veririm ki, bu sınav sırasında etik kurallari çiǧnemedim”
Full Name:
Student Number:
Signature:
Date:
All answers are strictly personal. Any unethical behavior will be penalized.
Nine hundred years ago the Italian businessman Leonardo Bonacci (known as Fibonacci) created a famous model to predict the number of rabbits to breed. We want to make an improved model, by assuming the following hypotheses:
We want to know how many baby rabbits we will have every month. We start with a single pair of baby rabbits in the first month. In the second month, the rabbits become adults (so there are 0 baby rabbits). In the third month, rabbits are pregnant, so still 0 baby rabbits. On month 4 we get a new pair of baby rabbits. In other words, the number of baby rabbits on any month is equal to the number of pregnant rabbits the month before, which depends on the number of baby rabbits three months earlier.
Doing some abstraction, we can create a function baby_rabbits(n)
to represent the number of pairs of baby rabbits on month n
. This function must follow the following rules:
baby_rabbits(1)
gives 1baby_rabbits(2)
and baby_rabbits(3)
must be 0.n
, the number of baby rabbits is baby_rabbits(n-1) + baby_rabbits(n-3)
Please notice that this is not the typical Fibonacci sequence. This is another sequence.
Write a recursive function in R that has input n
and output baby_rabbits(n)
. You can test your function by checking that at the end of the first year there are 144 pairs of rabbits.
baby_rabbits <- function(n) {
# Write your code here
}
You can test your function by calculating the number of baby rabbits for two years.
bunnies <- rep(NA, 24)
for(n in 1:length(bunnies)) {bunnies[n] <- baby_rabbits(n)}
bunnies
## [1] 1 0 0 1 1 1 2 3 4 6 9 13 19 28 41
## [16] 60 88 129 189 277 406 595 872 1278
The recursive formula is fine but slow. It takes too much time to finish when n
is big. To solve this we can model the rabbits as a system. We have three kinds of rabbits: baby, adult and pregnant rabbits. In each month baby rabbits grow, so the number of adult rabbits increases and the number of baby rabbits decreases. The same happens with adult and pregnant rabbits. On the other side, each of pregnant rabbit produces a new baby rabbit, so the number of baby rabbits increases by the number of pregnant rabbits. We can represent this system by the following figure:
The initial conditions are: baby_ini=1
, adult_ini=0
and pregnant_ini=0
. The rate constants for growth, mating and birth are 1. Please simulate the system for N=24
steps. To make the problem easier, you can use the following skeleton and complete it.
breeding <- function(N, baby_ini, adult_ini, pregnant_ini,
growth_rate, mating_rate, birth_rate) {
baby <- d_baby <- rep(NA, N)
adult <- d_adult <- rep(NA, N)
pregnant <- d_pregnant <- rep(NA, N)
baby[1] <- baby_ini
adult[1] <- adult_ini
pregnant[1] <- pregnant_ini
d_baby[1] <- d_adult[1] <- d_pregnant[1] <- 0
for(i in 2:N) {
# write your code here
baby[i] <- baby[i-1] + d_baby[i]
adult[i] <- adult[i-1] + d_adult[i]
pregnant[i] <- pregnant[i-1] + d_pregnant[i]
}
return(data.frame(Month=1:N, baby, adult, pregnant))
}
If everything is right, the following code should give the same result as the recursive function.
rabbits <- breeding(N=24, baby_ini=1, adult_ini=0, pregnant_ini=0,
growth_rate=1, mating_rate=1, birth_rate=1)
rabbits$baby
## [1] 1 0 0 1 1 1 2 3 4 6 9 13 19 28 41
## [16] 60 88 129 189 277 406 595 872 1278
new_rabbits
What happens if the pregnant rabbits produce more baby rabbits every month, but not all of the babies survive to be adults? For example, what happens if birth_rate
is 2 and growth_rate
is 0.5?
Use the same breeding()
function to simulate a new instance of the system under the new conditions. Store the results in a data frame called new_rabbits
. Draw the following plot showing the column baby
of data frame rabbits
(from previous question) and the same column in data frame new_rabbits
.
# write here
new_rabbits
As with most unbounded growth processes, this system seems to grow exponentially. What is the growth rate?
Write the code to fit a linear model for the column baby
in new_rabbits
and calculate the growth rate from the fitted coefficients. You may need to transform the data.
## (Intercept) Month
## 0.1955867 1.5738819
# write here
An important question in molecular biology is to identify which genes are easier to translate and which ones are harder. One of the most common ways to find genes that are “easy to translate” is to evaluate their Codon Adaptation Index (CAI). High CAI values suggests that the gene is important for cell survival, probably a housekeeping gene. A low CAI values suggests that the gene is seldom used, maybe acquired recently through horizontal transfer.
The CAI is based on the following idea. Most amino-acids are encoded by several codons. For example Valine can be encoded by “gta”, “gtc”, “gtg”, or “gtt”. Each organism has a different preference for which codon to use. The translation mechanism is more efficient for some codons. For example, tRNA genes are different in each organism.
The relative adaptiveness of each codon is the ratio of the usage of each codon, to that of the most abundant codon for the same amino acid. The CAI index is defined as the geometric mean of these relative adaptiveness values. This can be a little hard to calculate, so instead, we will solve a simpler problem, which is mathematically equivalent.
We will calculate the average of the logarithms of each codon’s relative adaptiveness. Non-synonymous codons and termination codons are not counted. For more details, you can read the original paper by Sharp PM et al. (Nucleic Acids Res. 1987;15(3)) or the Wikipedia entry.
In summary:
Here you have the code of the function codon_count()
. You can use it in your answer, you do not need to re-write it.
valid_codons <- c("aaa", "aac", "aag", "aat", "aca", "acc", "acg", "act",
"aga", "agc", "agg", "agt", "ata", "atc", "atg", "att", "caa", "cac",
"cag", "cat", "cca", "ccc", "ccg", "cct", "cga", "cgc", "cgg", "cgt",
"cta", "ctc", "ctg", "ctt", "gaa", "gac", "gag", "gat", "gca", "gcc",
"gcg", "gct", "gga", "ggc", "ggg", "ggt", "gta", "gtc", "gtg", "gtt",
"taa", "tac", "tag", "tat", "tca", "tcc", "tcg", "tct", "tga", "tgc",
"tgg", "tgt", "tta", "ttc", "ttg", "ttt")
codon_count <- function(gene) {
answer <- rep(0, 64)
names(answer) <- valid_codons
for(i in seq(from=1, to=length(gene), by=3)) {
codon <- tolower(paste0(gene[i], gene[i+1], gene[i+2]))
answer[codon] <- answer[codon] + 1
}
return(answer[valid_codons])
}
genes
.Use the same bacteria that was assigned to you in the midterm exam. This is just code, not a function.
# Write your code here
Now you can show the number of genes. This number will be different for each bacteria.
length(genes)
## [1] 4140
genes_codon_count
, containing one vector for each gene in the bacterial genome. Each vector is the result of using codon_count()
on each gene.This is just code, not a function.
genes_codon_count <- list()
# write your code here
The length of genes_codon_count
should be equal to the number of genes. This number will be different for each bacteria.
length(genes_codon_count)
## [1] 4140
and the length of each element should be 64, always.
length(genes_codon_count[[1]])
## [1] 64
total_codon_count
containing the total number of times that each codon appears on all genes.This is just code, not a function.
total_codon_count <- rep(0, 64)
names(total_codon_count) <- valid_codons
# write your code here
Now you can show the number of times that each codon is in all the genes. These numbers will be different for each bacteria.
total_codon_count
## aaa aac aag aat aca acc acg act aga agc agg agt ata
## 44236 28319 13384 22756 8975 30972 18970 11577 2489 21131 1363 11322 5345
## atc atg att caa cac cag cat cca ccc ccg cct cga cgc
## 33331 36700 40171 20208 12814 38152 16937 11058 7138 30969 9128 4523 29301
## cgg cgt cta ctc ctg ctt gaa gac gag gat gca gcc gcg
## 6983 27843 5072 14702 70390 14403 52330 25214 23456 42135 26535 33898 44900
## gct gga ggc ggg ggt gta gtc gtg gtt taa tac tag tat
## 19999 10216 39366 14464 32655 14325 20227 34796 24031 2678 16079 287 21055
## tca tcc tcg tct tga tgc tgg tgt tta ttc ttg ttt
## 9154 11321 11747 10986 1178 8482 20060 6706 18085 21827 17992 29304
This is the critical part of this exam. We need to calculate the relative adaptiveness of each codon, using the values in the total_codon_count
vector.
The adaptiveness is calculated amino-acid by amino-acid. For each amino-acid AA, the adaptiveness of codon C is the logarithm of the count for codon C divided by the maximum count among all the codons encoding for amino-acid AA.
In other words, you can calculate the adaptiveness of every codon, and store only these codons it in the vector codon_adaptness
, by doing the following steps:
codons_for_aa
.total_codon_count
among the codons in codons_for_aa
. Look only at these codons, ignore the rest. Store this maximum in max_count_for_aa
.codons_for_aa
:
total_codon_count
for that particular codon.max_count_for_aa
.codon_adaptness
using the codon as the index.codons_for_aa
.The result will be the vector codon_adaptness
that you will use in the following questions. This is just code, not a function. It is recommended to use the following list, that describes which codons encode each amino-acid.
genetic_code <- list(Ala = c("gca", "gcc", "gcg", "gct"),
Arg = c("aga", "agg", "cga", "cgc", "cgg", "cgt"),
Asn = c("aac", "aat"), Asp = c("gac", "gat"),
Cys = c("tgc", "tgt"), Gln = c("caa", "cag"),
Glu = c("gaa", "gag"), Gly = c("gga", "ggc", "ggg", "ggt"),
His = c("cac", "cat"), Ile = c("ata", "atc", "att"),
Leu = c("cta", "ctc", "ctg", "ctt", "tta", "ttg"),
Lys = c("aaa", "aag"), Phe = c("ttc", "ttt"),
Pro = c("cca", "ccc", "ccg", "cct"),
Ser = c("agc", "agt", "tca", "tcc", "tcg", "tct"),
Thr = c("aca", "acc", "acg", "act"),
Tyr = c("tac", "tat"), Val = c("gta", "gtc", "gtg", "gtt"))
codon_adaptness <- rep(0, 64)
names(codon_adaptness) <- valid_codons
# Write your code here
The result should look something like this.
codon_adaptness
## aaa aac aag aat aca acc
## 0.00000000 0.00000000 -1.19547897 -0.21870411 -1.23864064 0.00000000
## acg act aga agc agg agt
## -0.49022479 -0.98406320 -2.46574062 0.00000000 -3.06793349 -0.62399342
## ata atc atg att caa cac
## -2.01698380 -0.18665745 0.00000000 0.00000000 -0.63549962 -0.27896225
## cag cat cca ccc ccg cct
## 0.00000000 0.00000000 -1.02983256 -1.46755408 0.00000000 -1.22164009
## cga cgc cgg cgt cta ctc
## -1.86844615 0.00000000 -1.43414302 -0.05104006 -2.63031599 -1.56606767
## ctg ctt gaa gac gag gat
## 0.00000000 -1.58661469 0.00000000 -0.51347935 -0.80244349 0.00000000
## gca gcc gcg gct gga ggc
## -0.52597318 -0.28108178 0.00000000 -0.80875552 -1.34894738 0.00000000
## ggg ggt gta gtc gtg gtt
## -1.00123970 -0.18690452 -0.88749618 -0.54248409 0.00000000 -0.37015777
## taa tac tag tat tca tcc
## 0.00000000 -0.26962399 0.00000000 0.00000000 -0.83655021 -0.62408175
## tcg tct tga tgc tgg tgt
## -0.58714327 -0.65411942 0.00000000 0.00000000 0.00000000 -0.23494362
## tta ttc ttg ttt
## -1.35896834 -0.29457629 -1.36412399 0.00000000
Write a function, called gene_cai()
, that takes a vector number_of_codons
and returns the weighted average of codon_adaptness
for each codon.
In other words, you must multiply each component of number_of_codons
with the corresponding component of codon_adaptness
, sum all the results, and then divide by the sum of all number_of_codons
. This answer must be a function.
If you could not calculate codon_adaptness
in the previous question, then you can use the values for E.coli, that you can download here.
gene_cai <- function(number_of_codons, codon_adaptness) {
# write your code here
}
If everything is right, you should see something like this. The result may have name or not.
gene_cai(genes_codon_count[[1]], codon_adaptness)
## aaa
## -0.0817776
Calculate the Codon Adaptation Index for every gene and store these values in a vector called CAI
. This is just code.
# write your code here
If everything is right, you should see this result:
hist(CAI, col="grey")
Write the code to find the index of the smallest value in the vector CAI
. Assign that index in the variable index_of_smallest_CAI
.
# write your code here
Now you can see the description of the less adapted gene with the following code
getAnnot(genes)[[index_of_smallest_CAI]]
## [1] ">lcl|NC_000913.3_cds_YP_003933616.1_4159 [gene=mgtL] [protein=regulatory leader peptide for mgtA] [protein_id=YP_003933616.1] [location=4467431..4467484]"