Learning to parse a FASTA file with Python
I am learning Python and want to parse a FASTA file without using Biopython. The text file looks like this:

>22567
cgtgtccaggtctatctcggaaatttgccgtcgttgcattactgtccagctccatgccca
acatttggcatcggagaatgactccgcgtgataaagtcagaataggcattgagactcagg
gtggtacctatta
>34454
aaaactgtgcagccggtaacaggccgcgatgctgtactatatgtgtttggtacatatccg
attcaggtatgtcagggagccagcaccggaggatccagaagtaagtcgggttgactactc
ctagcctcgtttcaccatccgccggataactctcccttccatcatcaactcctccctttc
gtgtccaatggggcggcgtgtctaagcactgccatatagctaccgaaaggcggcgacccc
tcgga
I would like to parse the file and save the header of each sequence (here >22567 and >34454) to a headers list (this part is working). Then, after each header, I want to read the sequence that follows it into a sequences list.
The output should look like:

headers = ['>22567', '>34454']
sequences = ['cgtgtccaggtctatctcggaaatt...', 'aaaactttgtgaaaa....']
The problem I have is with reading the sequences part: I can't figure out how to concatenate the lines into one sequence string before appending it to the list. Instead, each line gets appended to the sequences list separately.

The code I have so far is:
#!/usr/bin/python
import re

dna = []
sequences = []

def read_fasta(filename):
    global seq, header, dna, sequences

    # open the file
    with open(filename) as file:
        seq = ''
        # loop through the lines
        for line in file:
            header = re.search(r'^>\w+', line)
            # if the line contains the header '>', append it to the dna list
            if header:
                line = line.rstrip("\n")
                dna.append(line)
            # the else branch is where I have problems:
            # the lines that follow, up to the next '>', are the sequence for each header;
            # concatenate these lines into one string and append it to the sequences list
            else:
                seq = line.replace('\n', '')
                sequences.append(seq)

filename = 'gc.txt'
read_fasta(filename)
Note: I had this solution in one of my projects and pasted it directly here. The solution is not mine; it belongs to the poster here. Please upvote his/her answer. Thanks to @donkeykong for finding the original post.
Use a list to accumulate lines until you reach a new id. Then join the lines and store them with the id in a dictionary. The following function takes an open file and yields each pair of (id, sequence).
def read_fasta(fp):
    name, seq = None, []
    for line in fp:
        line = line.rstrip()
        if line.startswith(">"):
            # a new header: emit the previous record, if there is one
            if name:
                yield (name, ''.join(seq))
            name, seq = line, []
        else:
            seq.append(line)
    # emit the last record in the file
    if name:
        yield (name, ''.join(seq))

with open('ex.fasta') as fp:
    for name, seq in read_fasta(fp):
        print(name, seq)
Output:

('>22567', 'cgtgtccaggtctatctcggaaatttgccgtcgttgcattactgtccagctccatgcccaacatttggcatcggagaatgactccgcgtgataaagtcagaataggcattgagactcagggtggtacctatta')
('>34454', 'aaaactgtgcagccggtaacaggccgcgatgctgtactatatgtgtttggtacatatccgattcaggtatgtcagggagccagcaccggaggatccagaagtaagtcgggttgactactcctagcctcgtttcaccatccgccggataactctcccttccatcatcaactcctccctttcgtgtccaatggggcggcgtgtctaagcactgccatatagctaccgaaaggcggcgacccctcgga')
This answer is from SO. I'll try to find it and give the original poster credit.
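To get the two lists the question asks for (or the dictionary mentioned above), the pairs from the generator can simply be collected. This is only a minimal sketch: it repeats the read_fasta generator so that it runs on its own, the sample data is shortened and made up, and io.StringIO stands in for a real file. The names fasta_text, records and by_header are just for illustration.

```python
import io

def read_fasta(fp):
    """Yield (header, sequence) pairs from an open FASTA file (same logic as above)."""
    name, seq = None, []
    for line in fp:
        line = line.rstrip()
        if line.startswith(">"):
            if name:
                yield (name, ''.join(seq))
            name, seq = line, []
        else:
            seq.append(line)
    if name:
        yield (name, ''.join(seq))

# Shortened, made-up sample data; io.StringIO stands in for open('ex.fasta').
fasta_text = """>22567
cgtgtcc
aggtcta
>34454
aaaactg
tgcagcc
gg
"""

records = list(read_fasta(io.StringIO(fasta_text)))

# Split the pairs into the two parallel lists from the question.
headers = [name for name, _ in records]
sequences = [seq for _, seq in records]

# Or keep them together in a dictionary keyed by header.
by_header = dict(records)

print(headers)    # ['>22567', '>34454']
print(sequences)  # ['cgtgtccaggtcta', 'aaaactgtgcagccgg']
```

If the same header can appear more than once in a file, the dictionary keeps only the last sequence for it, so the two lists are probably the safer choice in that case.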