Ethan Agranoff


Week 1: Link Discussion

Overview

With little knowledge of bioinformatics this early in the quarter, this week is meant more to start looking at links and information on COVID-19.


The GISAID Initiative

  • GISAID (Global Initiative on Sharing ALl Influenza Data) is a nonprofit organization whose goal is to provide public access to genetic sequence data of influenza viruses.
  • The database, named EpiFlu, is hosted by the German government and is mainly a German operation, but its services are intended for worldwide use.
  • They allow public uploading of genetic sequences and share data.
  • With the rise of COVID-19, they now have expanded to include data on the novel coronavirus.
  • The website includes multiple graphics analyzing the data submitted for various genome sequences of COVID-19, provided by Nextstrain.
  • The EpiFlu database might be a useful resource for acquiring real time data.

Folding@Home

  • The “Folding” refers to the folding of proteins.
  • By having people volunteer personal computing power, the project is about running simulations of protein dynamics to better understand various biological processes.
  • Viruses have proteins that are used to suppress our immune systems as well as reproducing themselves.
  • Folding@Home is trying to redirect its research towards COVID-19 and better understanding its protein structure.
  • This can lead to discovering therapeutic techniques related to coronavirus.

Nextstrain

  • Nextstrain is an open source project that is meant to help track pathogen evolution using public health data.
  • They have a lot of graphics on the evolution of COVID-19, using data from GISAID’s EpiFlu database.
  • Good visualization tool.

COVID-19 Analysis on usegalaxy

  • A helpful resource that has workflows to use for analysis of COVID-19.
  • There are 3 different workflows for different types of analysis: Genomics, Evolution, and Cheminformatics.
  • This can be helpful analyzing as a student since background knowledge is lacking/developing on bioinformatics.

Week 2

Week 2: Looking at Some Data


Overview

Some guidance on what to take a look at:

  • Consider what questions we want to ask from our evolutionary tree analysis. Think about what questions the book was trying to answer. Do we even have the data in this notebook to answer some of those questions? If not, spend time trying to find it now that you can know more about what to look for in terms of format. Do some literature searching and see what other work has been done for this virus and others.
  • Research and try different evolutionary tree programs/frameworks. What I’ve done below is not the only game in town by far. Biopython itself has different options.
  • Consider the alignment itself. Are there different ways to do this? Did we do it correctly?
  • What about the sequences themselves? Are they all of the same quality? Should we exclude some?
  • What about the virus alignment program? Did we use that correctly? Should we have done the entire sequence instead of using Spike as a reference? Should we try a different reference.
  • Do we have more data available about the sequences? Part of world, etc. Can we do some digging here to answer different questions.
  • And I’m sure you can think of more to attempt… Think about what you want to do. Spend time working towards a well thoughtout goal. Document things as you go. Talk to everyone on Slack. Together we can do this!

Getting some Initial Data

import wget

url = 'https://covid19.galaxyproject.org/genomics/4-Variation/current_complete_ncov_genomes.fasta'
file = 'data/current_complete_ncov_genomes.fasta'
#wget.download(url, file)

Getting an alignment (sequence MT350234.1 had an error):

#!virulign ../../virulign/references/SARS-CoV-2/S.xml data/current_complete_ncov_genomes.fasta --exportAlphabet Nucleotides --exportKind PositionTable > data/position_table.csv

Some Initial Work

Read the Data into a pandas DataFrame

import pandas as pd
position_table = pd.read_csv('data/position_table.csv')
results = position_table.describe()
display(position_table)
display(results)
seqid S_1_1 S_1_2 S_1_3 S_2_1 S_2_2 S_2_3 S_3_1 S_3_2 S_3_3 ... S_1270_3 S_1271_1 S_1271_2 S_1271_3 S_1272_1 S_1272_2 S_1272_3 S_1273_1 S_1273_2 S_1273_3
0 MT007544.1 A T G T T T G T T ... A C A T T A C A C A
1 MT019529.1 A T G T T T G T T ... A C A T T A C A C A
2 MT019530.1 A T G T T T G T T ... A C A T T A C A C A
3 MT019531.1 A T G T T T G T T ... A C A T T A C A C A
4 MT019532.1 A T G T T T G T T ... A C A T T A C A C A
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
815 MT350277.1 A T G T T T G T T ... A C A T T A C A C A
816 MT350278.1 A T G T T T G T T ... A C A T T A C A C A
817 MT350279.1 A T G T T T G T T ... A C A T T A C A C A
818 MT350280.1 A T G T T T G T T ... A C A T T A C A C A
819 MT350282.1 A T G T T T G T T ... A C A T T A C A C A

820 rows × 3820 columns

seqid S_1_1 S_1_2 S_1_3 S_2_1 S_2_2 S_2_3 S_3_1 S_3_2 S_3_3 ... S_1270_3 S_1271_1 S_1271_2 S_1271_3 S_1272_1 S_1272_2 S_1272_3 S_1273_1 S_1273_2 S_1273_3
count 820 820 820 820 820 820 820 820 820 820 ... 820 820 820 820 820 820 820 820 820 820
unique 820 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
top MT293188.1 A T G T T T G T T ... A C A T T A C A C A
freq 1 820 820 820 820 820 820 820 820 820 ... 820 820 820 820 820 820 820 820 820 820

4 rows × 3820 columns

Pull Out the Consensus Sequence

Consensus Sequence
a sequence of DNA having similar structure and function in different organisms.
consensus_seq = position_table.drop('seqid',axis=1).mode(axis=0).T[0]
display(consensus_seq)
position_table.set_index('seqid',inplace=True)
S_1_1       A
S_1_2       T
S_1_3       G
S_2_1       T
S_2_2       T
           ..
S_1272_2    A
S_1272_3    C
S_1273_1    A
S_1273_2    C
S_1273_3    A
Name: 0, Length: 3819, dtype: object

Utilizing Biopython for Initial Analysis

Determine which samples are farthest from the concensus sequence

distance_from_consensus_seq = position_table.apply(lambda row: sum(row != consensus_seq),axis=1)
distance_from_consensus_seq_sorted = distance_from_consensus_seq.sort_values(ascending=False)
display(distance_from_consensus_seq_sorted)
seqid
MT345850.1    263
MT345852.1    141
MT345843.1    122
MT345884.1     97
MT233522.1     82
             ... 
MT263449.1      0
MT263451.1      0
MT263452.1      0
MT263454.1      0
MT263422.1      0
Length: 820, dtype: int64

Select 10 sequences to do our first analysis

subset_seqs = distance_from_consensus_seq_sorted[:10].index
display(subset_seqs)
Index(['MT345850.1', 'MT345852.1', 'MT345843.1', 'MT345884.1', 'MT233522.1',
       'MT308696.1', 'MT308694.1', 'MT263453.1', 'MT259284.1', 'MT293180.1'],
      dtype='object', name='seqid')

Construct a distance matrix for the sequences

distances = {}
for i,seqid1 in enumerate(subset_seqs):
    distances[seqid1,seqid1] = 0
    for j in range(i+1,len(subset_seqs)):
        seqid2 = subset_seqs[j]
        distances[seqid1,seqid2] = sum(position_table.loc[seqid1] != position_table.loc[seqid2])
        distances[seqid2,seqid1] = distances[seqid1,seqid2]
distances = pd.Series(distances).unstack()
display(distances)
MT233522.1 MT259284.1 MT263453.1 MT293180.1 MT308694.1 MT308696.1 MT345843.1 MT345850.1 MT345852.1 MT345884.1
MT233522.1 0 115 130 104 135 151 202 345 221 177
MT259284.1 115 0 81 47 86 104 107 230 108 94
MT263453.1 130 81 0 68 101 119 134 299 189 143
MT293180.1 104 47 68 0 77 93 130 273 153 89
MT308694.1 135 86 101 77 0 22 175 286 194 150
MT308696.1 151 104 119 93 22 0 191 288 210 166
MT345843.1 202 107 134 130 175 191 0 211 77 115
MT345850.1 345 230 299 273 286 288 211 0 164 276
MT345852.1 221 108 189 153 194 210 77 164 0 136
MT345884.1 177 94 143 89 150 166 115 276 136 0

Constructing the Distance Matrix

from Bio.Phylo.TreeConstruction import DistanceMatrix
import numpy as np
matrix = np.tril(distances.values).tolist()
for i in range(len(matrix)):
    matrix[i] = matrix[i][:i+1]
dm = DistanceMatrix(list(distances.index),matrix)

Now construct the tree

from Bio.Phylo.TreeConstruction import DistanceTreeConstructor
constructor = DistanceTreeConstructor()
tree = constructor.nj(dm)

Drawing the tree

%matplotlib inline

from Bio import Phylo
tree.ladderize() # Flip branches so deeper clades are displayed at top
Phylo.draw(tree)

svg


Trying Other DNA Alignment Tools

With knowledge that Biopython has built-in shortcuts to run some popular alignment tools, I thought it would be interesting to see how it differed from Virulign.

ClustalW, one of the DNA alignment tools that Biopython can use, takes way too long to run. It ran overnight and still didn’t finish, so it’s not a very viable option without a better understanding of parameters.

from Bio.Align.Applications import ClustalwCommandline

cmd = ClustalwCommandline('clustalw2',infile=cov_seq_file)
print(cmd)
#cmd()
clustalw2 -infile=./data/current_complete_ncov_genomes.fasta

MUSCLE is another one, but it runs out of memory when trying to run it on my PC.

from Bio.Align.Applications import MuscleCommandline

cmd = MuscleCommandline(input=cov_seq_file,out='.'+cov_seq_file.split('.')[1]+'_aligned.fasta')
print(cmd)
#cmd()
muscle -in ./data/current_complete_ncov_genomes.fasta -out ./data/current_complete_ncov_genomes_aligned.fasta

After searching the internet, it was hard to find any tool that could align the entire file (~24MB in size) online, which was pretty much the only other popular options available for Windows.



Week 3

How similar is COVID-19 to Other Coronaviruses?

Consider the following coronavirus genomes:

  1. SARS coronavirus

  2. Bat SARS-like coronavirus isolate

  3. COVID-19

import itertools
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from Bio import SeqIO,AlignIO,Seq

sns.set_style("darkgrid")
sns.set(rc={'figure.figsize':(40,20),'axes.titlesize':46, 'axes.labelsize':32, 'xtick.labelsize': 24, 'ytick.labelsize': 24})

Comparing using k-mer Composition

Find the normalized frequencies of each distinct 3-mer

def get_normalized_freq(seq,kmers):
    '''Find the normalized frequency of every distinct k-mer for a DNA sequence.'''
    total_kmers = len(seq) - (len(kmers[0]) + 1)
    normalized_freq = [(seq.count(kmer) / total_kmers) for kmer in kmers]
    return normalized_freq

Similar to Lab 3, get all the 3mers in the DNA sequence

trimers = [''.join(word) for word in itertools.product('ACGT',repeat=3)]
print(trimers) # All distinct 3-mers
['AAA', 'AAC', 'AAG', 'AAT', 'ACA', 'ACC', 'ACG', 'ACT', 'AGA', 'AGC', 'AGG', 'AGT', 'ATA', 'ATC', 'ATG', 'ATT', 'CAA', 'CAC', 'CAG', 'CAT', 'CCA', 'CCC', 'CCG', 'CCT', 'CGA', 'CGC', 'CGG', 'CGT', 'CTA', 'CTC', 'CTG', 'CTT', 'GAA', 'GAC', 'GAG', 'GAT', 'GCA', 'GCC', 'GCG', 'GCT', 'GGA', 'GGC', 'GGG', 'GGT', 'GTA', 'GTC', 'GTG', 'GTT', 'TAA', 'TAC', 'TAG', 'TAT', 'TCA', 'TCC', 'TCG', 'TCT', 'TGA', 'TGC', 'TGG', 'TGT', 'TTA', 'TTC', 'TTG', 'TTT']

SARS

Using a single complete genome (ZJ0301)

sars_seq_file = './data/SARSsequence.fasta'

sars_seq = list(SeqIO.parse(sars_seq_file,"fasta"))[0]
print(sars_seq)
ID: DQ182595
Name: DQ182595
Description: DQ182595 |SARS coronavirus ZJ0301 from China| complete genome
Number of features: 0
Seq('TACCCAGGAAAAGCCAACCAACCTCGATCTCTTGTAGATCTGTTCTCTAAACGA...AGA', SingleLetterAlphabet())
sars_freq = get_normalized_freq(str(sars_seq.seq),trimers)
print(sars_freq)
[0.018045922833479226, 0.01811325836643997, 0.01892128476196889, 0.0218840482122416, 0.024106120799946132, 0.013063093394384216, 0.005218503804457613, 0.02205238704464346, 0.016665544407783988, 0.011817386034610465, 0.013972123089354253, 0.015083159383206517, 0.012827419029021615, 0.011312369537404889, 0.02612618678876843, 0.024409130698269478, 0.024409130698269478, 0.014982156083765403, 0.01491482055080466, 0.018685610396606288, 0.01333243552622719, 0.0042421385765268336, 0.0030300989832334523, 0.011615379435728233, 0.004713487307252037, 0.0042421385765268336, 0.0027607568513904786, 0.0071375664938387985, 0.01871927816308666, 0.011615379435728233, 0.018954952528449264, 0.02407245303346576, 0.015285165982088749, 0.012322402531816039, 0.01252440913069827, 0.015588175880412093, 0.014477139586559828, 0.007844589589926605, 0.004679819540771665, 0.020941350750791194, 0.01175005050164972, 0.009864655578748906, 0.004309474109487577, 0.01390478755639351, 0.014746481718402801, 0.009999326644670393, 0.017406235270352163, 0.01969564339101744, 0.019224294660292237, 0.01989764998989967, 0.011783718268130093, 0.01828159719884183, 0.020200659888223015, 0.007070230960878055, 0.005858191367584674, 0.018315264965322202, 0.022018719278163085, 0.02208605481112383, 0.018786613696047404, 0.02471214059659282, 0.02309608780553498, 0.018954952528449264, 0.02612618678876843, 0.02023432765470339]

Graph normalized frequencies

sars_df = pd.DataFrame({'Trimer':trimers,'Normalized Frequency':sars_freq})
display(sars_df)

plt.xticks(rotation=60)
sars_plt = sns.lineplot(x='Trimer', y='Normalized Frequency', lw=5, data=sars_df)
sars_plt.axes.set_title('Trimer Composition of SARS coronavirus ZJ0301');
Trimer Normalized Frequency
0 AAA 0.018046
1 AAC 0.018113
2 AAG 0.018921
3 AAT 0.021884
4 ACA 0.024106
... ... ...
59 TGT 0.024712
60 TTA 0.023096
61 TTC 0.018955
62 TTG 0.026126
63 TTT 0.020234

64 rows × 2 columns

svg

Bat SARS-like Coronavirus Isolate (bat-SL-CoVZC45)

Using a single complete genome

bat_seq_file = './data/bat-SL-CoVZC45sequence.fasta'

bat_seq = list(SeqIO.parse(bat_seq_file,"fasta"))[0]
print(bat_seq)
ID: MG772933.1
Name: MG772933.1
Description: MG772933.1 Bat SARS-like coronavirus isolate bat-SL-CoVZC45, complete genome
Number of features: 0
Seq('ATATTAGGTTTTTACCTTCCCAGGTAACAAACCAACTAACTCTCGATCTCTTGT...AAA', SingleLetterAlphabet())
bat_freq = get_normalized_freq(str(bat_seq.seq),trimers)
print(bat_freq)
[0.020471172561916908, 0.019162359889925498, 0.019028122692798174, 0.024464729176454796, 0.023793543190818174, 0.012517618632122961, 0.00640982616282972, 0.022417611920263106, 0.018692529699979865, 0.0100677897845493, 0.012249144237868314, 0.01681320894019733, 0.014866769581851132, 0.010940331565876904, 0.024062017585072825, 0.02567286395060071, 0.02288744211020874, 0.014698973085441975, 0.014665413786160144, 0.016376938049533527, 0.012416940734277468, 0.0036244043224377474, 0.002986777636082959, 0.010738975770185918, 0.003926438015974226, 0.0036244043224377474, 0.0031545741324921135, 0.006141351768575072, 0.017618632122961272, 0.00943016309819451, 0.017283039130142962, 0.02527015235921874, 0.017283039130142962, 0.011980669843613666, 0.011041009463722398, 0.014698973085441975, 0.013356601114168736, 0.00640982616282972, 0.0031545741324921135, 0.018021343714343243, 0.010604738573058594, 0.008121350426203102, 0.004295590308074367, 0.01516880327538761, 0.015638633465333243, 0.009530840996040003, 0.017350157728706624, 0.02258540841667226, 0.0225182898181086, 0.0195986307805893, 0.013759312705550707, 0.019699308678434793, 0.018759648298543527, 0.007215249345593664, 0.004161353110947044, 0.01788710651721592, 0.021108799248271696, 0.01926303778777099, 0.01849117390428888, 0.02627693133767367, 0.027719981206792404, 0.018658970400698034, 0.027149473119001274, 0.022652527015235922]
bat_df = pd.DataFrame({'Trimer':trimers,'Normalized Frequency':bat_freq})
display(bat_df)

plt.xticks(rotation=60)
bat_plt = sns.lineplot(x='Trimer', y='Normalized Frequency', lw=5, color='green', data=bat_df)
bat_plt.axes.set_title('Trimer Composition of bat-SL-CoVZC45');
Trimer Normalized Frequency
0 AAA 0.020471
1 AAC 0.019162
2 AAG 0.019028
3 AAT 0.024465
4 ACA 0.023794
... ... ...
59 TGT 0.026277
60 TTA 0.027720
61 TTC 0.018659
62 TTG 0.027149
63 TTT 0.022653

64 rows × 2 columns

svg

COVID-19

Using complete genome (reference sequence)

cov_seq_file = './data/COVID19refsequence.fasta'

cov_seq = list(SeqIO.parse(cov_seq_file,"fasta"))[0]
print(cov_seq)
ID: NC_045512.2
Name: NC_045512.2
Description: NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
Number of features: 0
Seq('ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGT...AAA', SingleLetterAlphabet())
cov_freq = get_normalized_freq(str(cov_seq.seq),trimers)
print(cov_freq)
[0.02187364125890498, 0.020569249807685874, 0.019398642095053348, 0.025452356266095856, 0.025251680658215994, 0.012575671427138032, 0.0054851332820495665, 0.022542559951837855, 0.01906418274858691, 0.010067226328639755, 0.011003712498745778, 0.016957088865848357, 0.014983778721696378, 0.011338171845212214, 0.02424830261881668, 0.02585370748185558, 0.02351249205659052, 0.014247968159470216, 0.01464931937522994, 0.01618783236897555, 0.01183986086491187, 0.0034114853339576577, 0.002474999163851634, 0.011505401518445433, 0.0031773637914311514, 0.003143917856784508, 0.0025418910331449214, 0.005719254824576072, 0.018763169336767117, 0.008996956419947156, 0.016555737650088633, 0.02468309976922305, 0.01789357503595438, 0.011371617779858859, 0.009431753570353524, 0.014716211244523228, 0.012441887688551456, 0.006254389778922372, 0.0028429044449647146, 0.017425331950901367, 0.009431753570353524, 0.007458443426201545, 0.003980066222950601, 0.01518445432957624, 0.015686143349275896, 0.008996956419947156, 0.017324994146961436, 0.02341215425265059, 0.02404762701093682, 0.020368574199806012, 0.01428141409411686, 0.019900331114753003, 0.01836181812100739, 0.006990200341148533, 0.0037793906150707384, 0.017023980735141643, 0.021070938827385532, 0.018294926251714104, 0.01852904779424061, 0.026790193651961603, 0.02929863875045988, 0.017324994146961436, 0.027325328606307903, 0.023713167664470385]
cov_df = pd.DataFrame({'Trimer':trimers,'Normalized Frequency':cov_freq})
display(cov_df)

plt.xticks(rotation=60)
cov_plt = sns.lineplot(x='Trimer', y='Normalized Frequency', lw=5, color='red', data=cov_df)
cov_plt.axes.set_title('Trimer Composition of COVID-19 Ref Sequence NC_045512');
Trimer Normalized Frequency
0 AAA 0.021874
1 AAC 0.020569
2 AAG 0.019399
3 AAT 0.025452
4 ACA 0.025252
... ... ...
59 TGT 0.026790
60 TTA 0.029299
61 TTC 0.017325
62 TTG 0.027325
63 TTT 0.023713

64 rows × 2 columns

svg


Comparing/Overlapping All 3 Trimer Compositions

plt.xticks(rotation=60)
sns.lineplot(x='Trimer', y='Normalized Frequency', lw=2, label='SARS', data=sars_df)
sns.lineplot(x='Trimer', y='Normalized Frequency', lw=2, color='green', label='bat-SL-CoV', data=bat_df)
full_plt = sns.lineplot(x='Trimer', y='Normalized Frequency', lw=2, color='red', label='COVID-19', data=cov_df)
full_plt.axes.set_title('Trimer Composition of Different Coronaviruses')
full_plt.legend(fontsize=22);

svg

From looking at this final graph, you can see very similar genome patterns among the 3 viruses.

What’s even more noteworthy is how the bat-SL-CoV virus is much more similar to COVID-19 than the SARS strand is.

The graph above lines up with the findings from an article which discusses the similarities between bt-SL-CoVZC45 and COVID-19, and hypothesizes that this bat SARS-like coronavirus may be a link between the 2.


Comparing COVID-19 to non-coronavirus diseases

I also wanted to take a look at how the 3-mer composition of COVID-19 relates to other diseases. This includes chicken pox, a viral disease that is not a coronavirus, and tuberculosis, a bacterial disease.

tb_seq_file = './data/TBsequence.fasta'

tb_seq = list(SeqIO.parse(tb_seq_file,"fasta"))[0]
print(tb_seq)
ID: AL123456.3
Name: AL123456.3
Description: AL123456.3 Mycobacterium tuberculosis H37Rv complete genome
Number of features: 0
Seq('TTGACCGATGACCCCGGTTCAGGCTTCACCACAGTGTGGAACGCGGTCGTCTCC...TCG', SingleLetterAlphabet())
hhv_seq_file = './data/HHV3sequence.fasta'

hhv_seq = list(SeqIO.parse(hhv_seq_file,"fasta"))[0]
print(hhv_seq)
ID: NC_001348.1
Name: NC_001348.1
Description: NC_001348.1 Human herpesvirus 3, complete genome
Number of features: 0
Seq('AGGCCAGCCCTCTCGCGGCCCCCTCGAGAGAGAAAAAAAAAAGCGACCCCACCT...AGG', SingleLetterAlphabet())
tb_freq = get_normalized_freq(str(tb_seq.seq),trimers)
hhv_freq = get_normalized_freq(str(hhv_seq.seq),trimers)

tb_df = pd.DataFrame({'Trimer':trimers,'Normalized Frequency':tb_freq})
hhv_df = pd.DataFrame({'Trimer':trimers,'Normalized Frequency':hhv_freq})
plt.xticks(rotation=60)
sns.lineplot(x='Trimer', y='Normalized Frequency', lw=2, label='TB', data=tb_df)
sns.lineplot(x='Trimer', y='Normalized Frequency', lw=2, color='green', label='Chicken Pox', data=hhv_df)
full_plt = sns.lineplot(x='Trimer', y='Normalized Frequency', lw=2, color='red', label='COVID-19', data=cov_df)
full_plt.axes.set_title('Trimer Composition of COVID-19 vs Other Diseases')
full_plt.legend(fontsize=22);

svg

Now comparing this to the other graph, it’s clear that even another kind of virus is completely different when comparing nucleotide composition, further showing connection between COVID-19 and other coronaviruses, including SARS.


How similar is COVID-19 to Other Coronaviruses?

Consider the following coronavirus genomes:

  1. SARS coronavirus

  2. Bat SARS-like coronavirus isolate

  3. COVID-19

This notebook will continue comparing these 3 coronavirus strands as a follow up to the Trimer Composition.

import itertools
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from Bio import SeqIO,AlignIO,Seq

sns.set_style("darkgrid")
sns.set(rc={'figure.figsize':(40,20),'axes.titlesize':46, 'axes.labelsize':32, 'xtick.labelsize': 24, 'ytick.labelsize': 24})

Comparing GC content

GC content (guanine-cytosine content) is the proportion of nucleotides:

def get_gc_comp(seq):
    total_nucleotides = len(seq)
    G_count,C_count = seq.count('G'),seq.count('C')
    return (G_count + C_count) / total_nucleotides * 100

SARS

Using a single complete genome (ZJ0301)

sars_seq_file = './data/SARSsequence.fasta'

sars_seq = list(SeqIO.parse(sars_seq_file,"fasta"))[0]
print(sars_seq)
ID: DQ182595
Name: DQ182595
Description: DQ182595 |SARS coronavirus ZJ0301 from China| complete genome
Number of features: 0
Seq('TACCCAGGAAAAGCCAACCAACCTCGATCTCTTGTAGATCTGTTCTCTAAACGA...AGA', SingleLetterAlphabet())
sars_gc = get_gc_comp(sars_seq.seq)
print(sars_gc)
40.80320473978321

Bat SARS-like Coronavirus Isolate (bat-SL-CoVZC45)

Using a single complete genome

bat_seq_file = './data/bat-SL-CoVZC45sequence.fasta'

bat_seq = list(SeqIO.parse(bat_seq_file,"fasta"))[0]
print(bat_seq)
ID: MG772933.1
Name: MG772933.1
Description: MG772933.1 Bat SARS-like coronavirus isolate bat-SL-CoVZC45, complete genome
Number of features: 0
Seq('ATATTAGGTTTTTACCTTCCCAGGTAACAAACCAACTAACTCTCGATCTCTTGT...AAA', SingleLetterAlphabet())
bat_gc = get_gc_comp(bat_seq.seq)
print(bat_gc)
38.90342930004698

COVID-19

Using complete genome of reference sequence NC_045512

cov_seq_file = './data/COVID19refsequence.fasta'

cov_seq = list(SeqIO.parse(cov_seq_file,"fasta"))[0]
print(cov_seq)
ID: NC_045512.2
Name: NC_045512.2
Description: NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
Number of features: 0
Seq('ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGT...AAA', SingleLetterAlphabet())
cov_gc = get_gc_comp(cov_seq.seq)
print(cov_gc)
37.97277865097147

It looks like COVID-19 is very similar to the bat coronavirus (less than 1% difference), but comparing all 3 to other diseases might help solidify the similarities in GC content.

###

# TB
tb_seq_file = './data/TBsequence.fasta'

tb_seq = list(SeqIO.parse(tb_seq_file,"fasta"))[0]
print(tb_seq)
tb_gc = get_gc_comp(tb_seq.seq)
print(tb_gc,'\n')

# Chicken pox virus
hhv_seq_file = './data/HHV3sequence.fasta'

hhv_seq = list(SeqIO.parse(hhv_seq_file,"fasta"))[0]
print(hhv_seq)
hhv_gc = get_gc_comp(hhv_seq.seq)
print(hhv_gc)
ID: AL123456.3
Name: AL123456.3
Description: AL123456.3 Mycobacterium tuberculosis H37Rv complete genome
Number of features: 0
Seq('TTGACCGATGACCCCGGTTCAGGCTTCACCACAGTGTGGAACGCGGTCGTCTCC...TCG', SingleLetterAlphabet())
65.61466628826449 

ID: NC_001348.1
Name: NC_001348.1
Description: NC_001348.1 Human herpesvirus 3, complete genome
Number of features: 0
Seq('AGGCCAGCCCTCTCGCGGCCCCCTCGAGAGAGAAAAAAAAAAGCGACCCCACCT...AGG', SingleLetterAlphabet())
46.02030684475193

Now time to look at the GC content side by side via Bar Graph

df = pd.DataFrame([cov_gc,bat_gc,sars_gc,hhv_gc,tb_gc],index=['COVID-19','bat-SL-CoVZC45','SARS','Varicella (chicken pox)','Tuberculosis']).T
df
COVID-19 bat-SL-CoVZC45 SARS Varicella (chicken pox) Tuberculosis
0 37.972779 38.903429 40.803205 46.020307 65.614666
bar_plt = sns.barplot(data=df)
bar_plt.set(ylabel='GC Content',title='GC Content of Diseases');

svg

The progression of similarity between the different diseases is apparent. The viruses are all much closer in similarity than the bacterial disease, the coronavirus strands are even more similar than the other virus, and COVID-19/bat-SL-CoVZC45 are even closer related in terms of GC content.

This shows more evidence about the findings that COVID-19 could have evolved from this bat coronavirus.


Week 4

  • For this week, I tried to do some work based on this NYT article. In the hopes that I would be able to import and visualize the findings from the article, I unfortunately could not replicate the findings of the article. Maybe I’m missing important pieces of understanding of DNA/RNA sequences, but nonetheless, I spent a lot of time towards what was basically a deadend.