Genome replication is one of the most important tasks carried out in the cell. Replication begins in a genomic region called the replication origin (denoted oriC) and is performed by molecular copy machines called DNA polymerases.
In the following problem, we assume that a genome has a single oriC and is represented as a DNA string, or a string of nucleotides from the four-letter alphabet {A, C, G, T}.
Finding Origin of Replication Problem:
Input: A DNA string Genome.
Output: The location of oriC in Genome.
STOP and Think: Does this biological problem represent a clearly stated computational problem?
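To make the input representation concrete, here is a minimal sketch of a check that a string really is a DNA string over {A, C, G, T}. The names `DNA_ALPHABET` and `is_dna_string` are hypothetical and do not come from the text, and treating the empty string as invalid is an assumption.

```python
# Assumed helper names; the alphabet itself is the one stated in the text.
DNA_ALPHABET = frozenset("ACGT")

def is_dna_string(text):
    """Return True if text is non-empty and contains only A, C, G, T."""
    # set(text) <= DNA_ALPHABET is a subset test on the distinct characters
    return len(text) > 0 and set(text) <= DNA_ALPHABET

print(is_dna_string("ATGATCAAG"))  # True
print(is_dna_string("ATGXTC"))     # False: 'X' is not a nucleotide
```

A check like this only validates the input format. It says nothing about where oriC is, which is the part of the problem the text asks the reader to think about.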
# Find the most frequent k-mers in a DNA string read from the file given on the command line.
import sys
import webbrowser

def mostFreq(text, k):
    # given a DNA string text and an integer k, find all most frequent k-mers in text
    # count every k-mer (substring of length k) in text
    kmerCounts = {}
    for i in range(len(text) - k + 1):
        kmer = text[i:i+k]
        kmerCounts[kmer] = kmerCounts.get(kmer, 0) + 1
    # no k-mers exist when k is larger than the text
    if not kmerCounts:
        return []
    # identify most frequent kmers
    maxCount = max(kmerCounts.values())
    return [kmer for kmer, val in kmerCounts.items() if val == maxCount]

# the input file holds the text followed by k, separated by whitespace
with open(sys.argv[1]) as fh:
    filedata = fh.read().split()
text = filedata[0]
k = int(filedata[1])
mostFreqKmers = mostFreq(text, k)

# print output to new file and open
fnew = 'ANS_' + sys.argv[1]
with open(fnew, 'w') as fh:
    fh.write(' '.join(mostFreqKmers))
webbrowser.open(fnew)
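The k-mer counting in the script above can also be written more compactly. The following is a sketch only: it assumes `collections.Counter` is an acceptable replacement for the hand-built dictionary, the name `most_frequent_kmers` is hypothetical, and the sample string is an illustrative input rather than data from the text. Results are sorted so the output order is predictable.

```python
from collections import Counter

def most_frequent_kmers(text, k):
    """Return all k-mers that occur most often in text, sorted alphabetically."""
    # Slide a window of length k across text and count each substring.
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    if not counts:
        # No window fits, for example when k is larger than len(text).
        return []
    max_count = max(counts.values())
    return sorted(kmer for kmer, n in counts.items() if n == max_count)

sample = "ACGTTGCATGTCGCATGATGCATGAGAGCT"  # illustrative input
print(most_frequent_kmers(sample, 4))  # ['CATG', 'GCAT']
```

Both versions make a single pass over the text and store one dictionary entry per distinct k-mer, so they differ in readability rather than in the amount of work done.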
# Compute the reverse complement of a DNA string read from the file given on the command line.
import sys
import webbrowser

def reverseComplement(sequence):
    # given a DNA string, find the reverse complement
    # DNA complement dict; values are lowercase so that a base that has already
    # been replaced is not replaced again by a later pass
    complements = {'A':'t','C':'g','G':'c','T':'a'}
    # reverse the uppercased sequence, then replace nuc's with their complements
    revCompSeq = sequence.upper()[::-1]
    for nuc, comp in complements.items():
        revCompSeq = revCompSeq.replace(nuc, comp)
    return revCompSeq.upper()

# join all whitespace-separated chunks so multi-line input is read as one sequence
with open(sys.argv[1]) as fh:
    DNAseq = ''.join(fh.read().split())
revCompSeq = reverseComplement(DNAseq)

# print output to new file and open
fnew = 'ANS_' + sys.argv[1]
with open(fnew, 'w') as fh:
    fh.write(revCompSeq)
webbrowser.open(fnew)
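As an alternative to the chained `replace` calls in the script above, the complement step can be done in one pass with a translation table. This is a sketch under assumptions: the name `reverse_complement` is hypothetical, lowercase input is treated as equivalent to uppercase, and characters outside A, C, G, T are passed through unchanged.

```python
# Translation table mapping each nucleotide to its complement.
COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")

def reverse_complement(sequence):
    """Return the reverse complement of a DNA string."""
    # Uppercase first, reverse with a slice, then complement every base at once.
    return sequence.upper()[::-1].translate(COMPLEMENT_TABLE)

print(reverse_complement("AAAACCCGGT"))  # ACCGGGTTTT
```

Because `str.translate` maps every character in a single pass, there is no risk of a base being complemented twice, which is the problem the lowercase intermediate values work around in the `replace` version.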