First, I want to download the various (complete) Coronavirus genomes. It might be useful later.

From the NCBI database I downloaded a .yaml file (available here) which contained the "accession" ID of every sequence in their database. Using these IDs I can download each sequence using Biopython. I have to do some cleanup of the file so I can get the IDs I want.

In [164]:
import Bio as bp
In [78]:
ids = []
with open("ncov-sequences.yaml","r") as f:
    lines = [line.strip() for line in f.readlines() if line not in 'genbank-sequences:\n'][1:]
In [133]:
# get indices of all lines with 'accession' so we know where to slice list
indices = [i for i, line in enumerate(lines) if '- accession' in line]
Out[133]:
614
In [145]:
# get all sequences
sequences = []
for i in range(len(indices)):
    try:
        sequences.append(lines[indices[i]:indices[i+1]])
    except IndexError:
        sequences.append(lines[indices[i]:indices[i]+6])
In [174]:
# filter for sequences that are complete
complete_sequences = []
for lst in sequences:
    for line in lst:
        if "gene-region: complete" in line:
            complete_sequences.append(lst)
In [171]:
import re

# use regex to grab the ids of the complete sequences
ids = []
for i in range(len(indices)):
    match = re.search('accession: [A-Z]{2}[\d]{6}',lines[indices[i]])
    if match:
        ids.append(match.group(0)[-8:])

The above code was probably pretty ugly, but now we have a list of IDs that we can query the database with.

In [ ]:
import os
from Bio import SeqIO
from Bio import Entrez

for _id in ids[:10]: # remove the slice notation to get all IDs. This will only grab first 10 sequences. 
    Entrez.email = "lepopal@calpoly.edu"  
    filename = str(_id)+".gbk"
    if not os.path.isfile("sequences/" + filename):
        # Downloading...
        net_handle = Entrez.efetch(
            db="nucleotide", id=_id, rettype="gb", retmode="text"
        )
        out_handle = open("sequences/" + filename, "w")
        out_handle.write(net_handle.read())
        out_handle.close()
        net_handle.close()
        print("Saved:", filename)
    else:
        print("Already exists:", filename)

Now we have all the files. We can load them all into Biopython.

In [198]:
files = [file for file in os.listdir("sequences/") if file.endswith(".gbk")]

for fname in files:
    for seq_record in SeqIO.parse("sequences/" + fname, "genbank"):
        print(seq_record.id) # prints the ID of the sequence
        print(repr(seq_record.seq)) # prints the nucleotide sequence
        print(len(seq_record)) # prints the length of the nucleotide sequence
MN938388.1
Seq('AATGTCTATGCAGATTCATTTGTAATTAGAGGTGATGAAGTCAGACAAATCGCT...TTT', IUPACAmbiguousDNA())
107
MN975263.1
Seq('TGAGTTATGAGGATCAAGATGCACTTTTCGCATATACAAAACGTAATGTCATCC...CTT', IUPACAmbiguousDNA())
287
MN975262.1
Seq('ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGT...AAA', IUPACAmbiguousDNA())
29891
MN938389.1
Seq('AATGTCTATGCAGATTCATTTGTAATTAGAGGTGATGAAGTCAGACAAATCGCT...TTT', IUPACAmbiguousDNA())
107
MN908947.3
Seq('ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGT...AAA', IUPACAmbiguousDNA())
29903
MN970003.1
Seq('TAAACACCTCATACCACTTATGTACAAAGGACTTCCTTGGAATGTAGTGCGTAT...TTG', IUPACAmbiguousDNA())
290
MN970004.1
Seq('TAAACACCTCATACCACTTATGTACAAAGGACTTCCTTGGAATGTAGTGCGTAT...TTG', IUPACAmbiguousDNA())
290
MN938390.1
Seq('AATGTCTATGCAGATTCATTTGTAATTAGAGGTGATGAAGTCAGACAAATCGCT...TTT', IUPACAmbiguousDNA())
107
MN938384.1
Seq('CAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTG...GAC', IUPACAmbiguousDNA())
29838
MN938385.1
Seq('TGAGTTATGAGGATCAAGATGCACTTTTCGCATATACAAAACGTAATGTCATCC...CTT', IUPACAmbiguousDNA())
287
MN938387.1
Seq('AATGTCTATGCAGATTCATTTGTAATTAGAGGTGATGAAGTCAGACAAATCGCT...TTT', IUPACAmbiguousDNA())
107
MN938386.1
Seq('TGAGTTATGAGGATCAAGATGCACTTTTCGCATATACAAAACGTAATGTCATCC...CTT', IUPACAmbiguousDNA())
287

At this point I think I have a good structure to start working with the data in these files. Using what we learned in Week 1, next week I want to try to find the ori of the virus. Also, I want to try and create a phylogenetic tree like on nextstrain.org/ncov/.