Learning to parse a FASTA file with Python

- July 15, 2015

This question already has an answer here:

Parsing a fasta file using a generator (Python) (4 answers)

I am learning Python and want to parse a FASTA file without using Biopython. The txt file looks like this:

    >22567
    cgtgtccaggtctatctcggaaatttgccgtcgttgcattactgtccagctccatgccca
    acatttggcatcggagaatgactccgcgtgataaagtcagaataggcattgagactcagg
    gtggtacctatta
    >34454
    aaaactgtgcagccggtaacaggccgcgatgctgtactatatgtgtttggtacatatccg
    attcaggtatgtcagggagccagcaccggaggatccagaagtaagtcgggttgactactc
    ctagcctcgtttcaccatccgccggataactctcccttccatcatcaactcctccctttc
    gtgtccaatggggcggcgtgtctaagcactgccatatagctaccgaaaggcggcgacccc
    tcgga

I want to parse the file and save the header of each sequence (here >22567 and >34454) to a headers list (this part is working). Then, after each header, I want to read the sequence that follows it into a sequences list.

The output should look like this:

    headers = ['>22567', '>34454']
    sequences = ['cgtgtccaggtctatctcggaaatt...', 'aaaactttgtgaaaa....']

The problem comes when I try to read the sequences: I can't figure out how to concatenate the lines of a record into one sequence string before appending it to the list. Instead, every line is appended to the sequences list separately.

The code I have so far is:

    #!/usr/bin/python
    import re

    dna = []
    sequences = []

    def read_fasta(filename):
        global seq, header, dna, sequences
        # open the file
        with open(filename) as file:
            seq = ''
            # loop through the lines
            for line in file:
                header = re.search(r'^>\w+', line)
                # if the line contains the header '>', append it to the dna list
                if header:
                    line = line.rstrip("\n")
                    dna.append(line)
                # The else branch is where I have problems:
                # the lines up to the next '>' are the sequence for each header,
                # and I want to concatenate them into one string
                # and append that string to the sequences list
                else:
                    seq = line.replace('\n', '')
                    sequences.append(seq)

    filename = 'gc.txt'
    read_fasta(filename)

Note: I had this solution in one of my projects and pasted it here directly. The solution is not mine; it belongs to the original poster here. Please upvote his/her answer. Thanks to @donkeykong for finding the original post.

Use a list to accumulate lines until you reach a new id. Then join the lines and store them with the id in a dictionary. The following function takes an open file and yields each pair of (id, sequence).

    def read_fasta(fp):
        name, seq = None, []
        for line in fp:
            line = line.rstrip()
            if line.startswith(">"):
                # a new header starts: emit the record collected so far
                if name: yield (name, ''.join(seq))
                name, seq = line, []
            else:
                seq.append(line)
        # emit the last record in the file
        if name: yield (name, ''.join(seq))

    with open('ex.fasta') as fp:
        for name, seq in read_fasta(fp):
            print(name, seq)

Output (as shown by Python 2, where print(name, seq) prints a tuple; Python 3 prints the header and the sequence separated by a space):

    ('>22567', 'cgtgtccaggtctatctcggaaatttgccgtcgttgcattactgtccagctccatgcccaacatttggcatcggagaatgactccgcgtgataaagtcagaataggcattgagactcagggtggtacctatta')
    ('>34454', 'aaaactgtgcagccggtaacaggccgcgatgctgtactatatgtgtttggtacatatccgattcaggtatgtcagggagccagcaccggaggatccagaagtaagtcgggttgactactcctagcctcgtttcaccatccgccggataactctcccttccatcatcaactcctccctttcgtgtccaatggggcggcgtgtctaagcactgccatatagctaccgaaaggcggcgacccctcgga')
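To get the two lists asked for in the question, the pairs coming out of the generator can be collected as they are yielded. The following is only a sketch: read_fasta is repeated so that the snippet runs on its own, and io.StringIO with shortened versions of the two sequences stands in for the real file (that stand-in is an assumption for illustration, not part of the original answer).

```python
import io

def read_fasta(fp):
    """Yield (header, sequence) pairs from an open FASTA file."""
    name, seq = None, []
    for line in fp:
        line = line.rstrip()
        if line.startswith(">"):
            if name:
                yield (name, ''.join(seq))
            name, seq = line, []
        else:
            seq.append(line)
    if name:
        yield (name, ''.join(seq))

# Stand-in for open('ex.fasta'): shortened sequences, split over several lines.
sample = io.StringIO(">22567\ncgtgtcc\naggtcta\n>34454\naaaactg\ntgcagcc\ngg\n")

headers, sequences = [], []
for name, seq in read_fasta(sample):
    headers.append(name)
    sequences.append(seq)

# The same pairs can also be stored in a dictionary keyed by header.
records = dict(zip(headers, sequences))

print(headers)    # ['>22567', '>34454']
print(sequences)  # ['cgtgtccaggtcta', 'aaaactgtgcagccgg']
```

Because the generator only joins a record once the next header (or the end of the file) is reached, each entry in sequences is one complete string rather than one entry per line.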

This answer is from SO. I'll try to find it and give the original poster credit.
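If a generator is not wanted, the same accumulate-then-join idea can be applied directly to the loop from the question. This is a minimal sketch under a few assumptions: the name read_fasta_lists is hypothetical, the function takes any iterable of lines (an open file works) and returns the two lists instead of using globals, and lines before the first header are ignored.

```python
def read_fasta_lists(lines):
    """Return (headers, sequences) built from an iterable of FASTA lines."""
    headers, sequences = [], []
    parts = []  # lines of the record currently being read
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if headers:
                sequences.append(''.join(parts))
            headers.append(line)
            parts = []
        elif headers:
            # Sequence line: keep it until the record is complete.
            parts.append(line)
    # The last record has no following header, so close it here.
    if headers:
        sequences.append(''.join(parts))
    return headers, sequences

# Small in-memory example; with a real file, pass the open file object instead.
example = [">22567\n", "cgtgtcc\n", "aggtcta\n", ">34454\n", "aaaactg\n"]
headers, sequences = read_fasta_lists(example)
print(headers)    # ['>22567', '>34454']
print(sequences)  # ['cgtgtccaggtcta', 'aaaactg']
```

The difference from the code in the question is that a line is never appended to sequences on its own: it is held in parts until the next header, or the end of the input, shows that the record is complete.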
