Blog of Andrés Aravena
CMB2:

# Homework 11

12 May 2019. Deadline: Friday, 17 May, 9:00.

# 1. GC content

Now that we have more powerful tools, we will explore again the GC content of Carsonella rudii. First, we will see that GC content can be seen as the average of an event.

## 1.1 Is this nucleotide G or C?

Please write a function called is_G_or_C that determines if a nucleotide is either “G” or “C”. The function takes a character vector (called dna) and returns a logic vector called ans. If dna[i] is “G”, “g”, “C” or “c”, then ans[i] is TRUE, otherwise it is FALSE.

Test your function with the following code:

[1] FALSE  TRUE FALSE  TRUE
[1] FALSE  TRUE FALSE  TRUE
[1] 0.5

As you can see, the GC content can be seen as the mean of the event is_G_or_C(). It is natural to think that we can also calculate the variance and the standard deviation. How can we do that?

## 1.2 Simulating some genes

We will use this function over simulated data. We need 100 genes of different lengths. Please create a vector, called sim_gene_length, with 100 elements taken randomly (with replacement) from the values 100, 103, 106, …, 1201. Then create a list, called sim_genes, with “random DNA sequences”1. Each element sim_genes[[i]] must be a random character vector with length sim_gene_length[i]. Only the letters “A”,“C”,“G”, and “T” are valid, and all have the same probability.

After you have created sim_genes, calculate the mean of is_G_or_C() for each gene and store them in the vector sim_mean_GC. You also need to calculate the variance of is_G_or_C() for each gene and store the result in sim_var_GC.

Finally, calculate the standard error of the mean by taking the square root of sim_var_GC divided by sim_gene_length, and store it in sim_std_err_mean.

Test your results with this code:

## 1.3 Genes versus genome

Since all genes are taking from a random population, and all nucleotides have the same probability, then in this case we know that the genome GC content2 is 0.5.

But in many cases we do not have the full genome, just some genes. We want use the information of each gene to build a confidence interval for the genome GC content.

Fortunately we know that, if the genes are random DNA sequences, the GC content of the genes is not too different from the GC content of the genome3. Moreover, we know that the GC content of the genes is a random variable following the Normal distribution4.

Unfortunately, we do not know the standard deviation of the population (we are using each gene as a sample). Instead, we can approximate it with the standard error of each gene5, but we will have to use the Student’s t distribution.

We start by choosing the confidence level $$1-\alpha=95\%$$, therefore alpha is 0.05. You must use the function qt() to find the number of standard errors k that will give you the confidence interval. Remember that the number of degrees of freedom is sim_gene_length-1.

Then you must calculate the vectors sim_lower, with the lower limits of each confidence interval, and sim_upper, with the upper limits.

Now you can count how many of the confidence intervals really contained the real value. In other words, you can see how good was the prediction.

[1] 0.96

You can see your results with this code:

The previous plot is hard to see, because there is a lot of variability. It can be a little easier if we use the function order() to sort the values, as in the following code:

# 2. Apply to real data

Now we will use this function over the genes of Carsonella rudii.

## 2.1 Read the genes from a file

Using the library seqinr, please write the code to

• read the FASTA file and load all genes in a list called genes,
• calculate the mean of is_G_or_C() for each gene, and store each value in a vector called mean_is_G_or_C. The length of this vector must be the number of genes
• calculate the variance of is_G_or_C() for each gene, and store each value in a vector called var_is_G_or_C.
• calculate the length of each gene, and store each value in a vector called gene_length.
• calculate the standard error of the mean for each gene. That is, the square root of the variance divided by the gene length.

Test your results with this code:

[1] 182

## 2.2 Using genes to estimate genome GC content

We want use the information of the real genes to build a confidence interval for the genome GC content.

Since the values have much more variability than in the simulation, we choose a confidence level of $$1-\alpha=99.9\%$$, therefore alpha is 0.001. You must use the function qt() to find the number of standard errors k that will give you the confidence interval. Remember that the number of degrees of freedom is gene_length-1.

Then you must calculate the vectors lower, with the lower limits of each confidence interval, and upper, with the upper limits.

Test your results with this code:

The previous plot is hard to see, because there is a lot of variability. It can be a little easier if we use the function order() to sort the values, as in the following code:

1. Notice that sim_gene_length is a vector, since all elements are numbers. On the other side sim_genes has to be a list, since every element is itself a vector, and each one can have different length. We cannot handle this kind of data in a vector, it has to be a list.

2. do you agree?

3. We know this because GC content is an average so they obey the Law of Large Numbers.

4. This is true because of the Central Limit Theorem.

5. Standard error of the sample is the standard deviation of the sample divided by the square root of the sample size $\sqrt{\frac{\text{var}(x)}{\text{length}(x)}}$