Write a Python program that takes as input a file containing DNA sequences in multi-FASTA format, and computes the answers to the following questions.

1 The shortest and the longest sequence

How many records are in the file? The header line is distinguished from the sequence data by a greater-than (“>”) symbol in the first column. The word following the “>” symbol is the identifier of the sequence, and the rest of the line is an optional description of the entry. There should be no space between the “>” and the first letter of the identifier.

What are the lengths of the sequences in the file? What is the longest sequence and what is the shortest sequence? Is there more than one longest or shortest sequence? What are their identifiers?

def fasta_summary(data, frame) :
    # Load in the file:
    try:
        f = open("/home/user/Desktop/dna.example.fasta")
    except IOError:
        print("File myfile.fa does not exist!!")

    seqs={}
    for line in f:
        # let's discard the newline at the end (if any)
        line = line.rstrip()
        # distinguish header from sequence
        if line.startswith('>'):
            words = line.split()
            name = words[0][1:]
            seqs[name] = ""
        else : # sequence, not header
            seqs[name] = seqs[name] + line
    
    close(f)

    # How many records are in the file? 
    print ("This file contains %1d records." % len(seqs))

    # What are the lengths of the sequences in the file?
    length = []
    for i in range(len(seqs.values())):
        length.append(len(seqs.values()[i]))
    length = sorted(length)
    print("The following are lengths of these records:")
    print(length)

    # What is the longest sequence and what is the shortest sequence? 
    m = length[0]
    for val in seqs.values():
        if len(val) == m:
            name_long = key
    print ("The longest sequence is %s and it contains %1d bases." % (name_long, length[0]))
    
    m = length[-1]
    for val in seqs.values():
        if len(val) == m:
            name_short = key
    print ("The shortest sequence is %s contains %1d bases." % (name_short, length[-1]))
    
    # Is there more than one longest or shortest sequence?
    print("The following are lengths of five longest and five shortest sequences.")
    print(length[0:5])
    print(length[-5:-1])

2 The longest ORF

Given an input reading frame on the forward strand (1, 2, or 3) your program should be able to identify all ORFs present in each sequence of the FASTA file, and answer the following questions: What is the length of the longest ORF in the file? What is the identifier of the sequence containing the longest ORF?

def forward_orfs_dictionary(data, frame) :
    """This function outputs the length of the longest ORF in a FASTA file,
    as well as name of the sequence that contains it."""
    
    #First open the file and load in the sequences as a dictionary"
    try:
        f = open(data)
    except IOError:
        print("File myfile.fa does not exist!!")

    seqs={}
    for line in f:
    # let's discard the newline at the end (if any)
        line = line.rstrip()
    # distinguish header from sequence
        if line.startswith('>'):
            words = line.split()
            name = words[0][1:]
            seqs[name] = ""
        else : # sequence, not header
            seqs[name] = seqs[name] + line

    f.close()
    
    # Now search for ORFs"
    longest = 0
    longest_name = 0
    for n in range(len(seqs)) :
        dna = seqs.values()[n]
        dna_name = seqs.keys()[n]
        stop_codons = ["tga", "tag", "taa"]
        start = 0
        for i in range(frame-1, len(dna), 3) :
            codon = dna[i:i+3].lower()
            if codon == "atg" :
                start = i+1
            if codon in stop_codons :
                if start != 0 :
                    stop = i+4
                    aa = stop - start
                    if aa > longest :
                        longest = aa
                        longest_name = dna_name
                    start = 0
    return (longest, longest_name)

For a given sequence identifier, what is the longest ORF contained in the sequence represented by that identifier? What is the starting position of the longest ORF in the sequence that contains it? The position should indicate the character number in the sequence.

def forward_orfs(dna, frame) :
    """This function outputs the length and start position of the longest ORF 
    in a selected forward reading frame of a single DNA sequence."""
    stop_codons = ["tga", "tag", "taa"]
    start = 0
    longest = 0
    longest_start = 0
    for i in range(frame-1, len(dna), 3) :
        codon = dna[i:i+3].lower()
        if codon == "atg" :
            start = i+1
        if codon in stop_codons :
            if start != 0 :
                stop = i+4
                aa = stop - start
                if aa > longest :
                    longest = aa
                    longest_start = start
                start = 0
    return (longest, longest_start)

3 Repeats

A repeat is a substring of a DNA sequence that occurs in multiple copies (more than one) somewhere in the sequence. Although repeats can occur on both the forward and reverse strands of the DNA sequence, we will only consider repeats on the forward strand here. Also we will allow repeats to overlap themselves. For example, the sequence ACACA contains two copies of the sequence ACA - once at position 1 (index 0 in Python), and once at position 3. Given a length n, your program should be able to identify all repeats of length n in all sequences in the FASTA file. Your program should also determine how many times each repeat occurs in the file, and which is the most frequent repeat of a given length.

def count_repeats(data, replength) :
    "This function counts DNA repeats of a length n in a FASTA file."

    #First open the file and load in the sequences as a dictionary"
    try:
        f = open(data)
    except IOError:
        print("File myfile.fa does not exist!!")

    seqs={}
    for line in f:
    # let's discard the newline at the end (if any)
        line = line.rstrip()
    # distinguish header from sequence
        if line.startswith('>'):
            words = line.split()
            name = words[0][1:]
            seqs[name] = ""
        else : # sequence, not header
            seqs[name] = seqs[name] + line

    f.close()
    
    from collections import Counter
    
    # put all substrings from a file in one list
    all_list = []
    for n in range(len(seqs)) :
        dna = seqs.values()[n]
        for i in range(len(dna)) :
            string = dna[i:i+replength].lower()
            if len(string) == replength :
                all_list.append(string)
    
    # count substrings  
    repeats = collections.Counter(all_list)
    # remove those substrings which are present only once
    for i in repeats :
        if repeats[i] == 1 :
            repeats[i] = 0
    repeats += Counter()
    repeats.most_common(1)
    print("The most frequent repeat of specified length is '%s' and is repeated %3d times." % repeats.most_common(1)[0])
    print("This is the list of all repeats:")
    return repeats.most_common()

Python for Genomic Data Science

Bojan Zunar

11 August 2018

1 The shortest and the longest sequence

2 The longest ORF

3 Repeats