Learning to parse a FASTA file with Python



I am learning Python and want to parse a FASTA file without using Biopython. The text file looks like this:

    >22567
    cgtgtccaggtctatctcggaaatttgccgtcgttgcattactgtccagctccatgccca
    acatttggcatcggagaatgactccgcgtgataaagtcagaataggcattgagactcagg
    gtggtacctatta
    >34454
    aaaactgtgcagccggtaacaggccgcgatgctgtactatatgtgtttggtacatatccg
    attcaggtatgtcagggagccagcaccggaggatccagaagtaagtcgggttgactactc
    ctagcctcgtttcaccatccgccggataactctcccttccatcatcaactcctccctttc
    gtgtccaatggggcggcgtgtctaagcactgccatatagctaccgaaaggcggcgacccc
    tcgga

I parse and save the header of each sequence, i.e. >22567 and >34454, to the headers list (this is working). Then, after each header, I want to read the sequence that follows it into the sequences list.

The output should look like this:

    headers = ['>22567', '>34454']
    sequences = ['cgtgtccaggtctatctcggaaatt...', 'aaaactgtgcagccgg...']

The problem I have is with reading the sequences part: I can't figure out how to concatenate the lines into one sequence string before appending it to the list. Instead, each line is appended to the sequences list separately.

The code I have so far is:

    #!/usr/bin/python

    import re

    dna = []
    sequences = []

    def read_fasta(filename):
        global seq, header, dna, sequences

        # open the file
        with open(filename) as file:
            seq = ''

            # loop through the lines
            for line in file:
                header = re.search(r'^>\w+', line)
                # if the line contains the header '>', append it to the dna list
                if header:
                    line = line.rstrip("\n")
                    dna.append(line)
                # the else statement is where I have problems:
                # the lines that follow, up to the next '>', are the sequence for each header;
                # concatenate these lines into one string and append it to the sequences list
                else:
                    seq = line.replace('\n', '')
                    sequences.append(seq)

    filename = 'gc.txt'

    read_fasta(filename)

Note: I had this solution in one of my projects, so I pasted it here directly. The solution is not mine and belongs to the original poster. Please upvote his/her answer. Thanks to @donkeykong for finding the original post.

Use a list to accumulate lines until you reach a new id. Then join the lines and store them with the id in a dictionary. The following function takes an open file and yields each pair of (id, sequence).

    def read_fasta(fp):
        name, seq = None, []
        for line in fp:
            line = line.rstrip()
            if line.startswith(">"):
                # a new header starts: emit the record collected so far
                if name: yield (name, ''.join(seq))
                name, seq = line, []
            else:
                seq.append(line)
        # emit the last record in the file
        if name: yield (name, ''.join(seq))

    with open('ex.fasta') as fp:
        for name, seq in read_fasta(fp):
            print(name, seq)

Output (as printed under Python 2; Python 3 prints the two values separated by a space, without the tuple parentheses and quotes):

    ('>22567', 'cgtgtccaggtctatctcggaaatttgccgtcgttgcattactgtccagctccatgcccaacatttggcatcggagaatgactccgcgtgataaagtcagaataggcattgagactcagggtggtacctatta')
    ('>34454', 'aaaactgtgcagccggtaacaggccgcgatgctgtactatatgtgtttggtacatatccgattcaggtatgtcagggagccagcaccggaggatccagaagtaagtcgggttgactactcctagcctcgtttcaccatccgccggataactctcccttccatcatcaactcctccctttcgtgtccaatggggcggcgtgtctaagcactgccatatagctaccgaaaggcggcgacccctcgga')
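The question asks for separate headers and sequences lists, and the answer mentions a dictionary, so here is a minimal sketch of collecting the generator's pairs into both. The in-memory sample, its ids (>id1, >id2), and the records name are made up for illustration and are not part of the original post; a real run would pass an open file such as gc.txt instead.

```python
import io


def read_fasta(fp):
    """Yield (header, sequence) pairs from an open FASTA file object."""
    name, seq = None, []
    for line in fp:
        line = line.rstrip()
        if line.startswith(">"):
            # A new header starts: emit the record collected so far.
            if name:
                yield name, "".join(seq)
            name, seq = line, []
        else:
            seq.append(line)
    # Emit the last record in the file.
    if name:
        yield name, "".join(seq)


# Hypothetical sample data standing in for a real FASTA file.
sample = io.StringIO(">id1\nacgt\nttga\n>id2\nggcc\n")

headers, sequences = [], []
for name, seq in read_fasta(sample):
    headers.append(name)
    sequences.append(seq)

# Optional: the same data keyed by header (assumes headers are unique).
records = dict(zip(headers, sequences))

print(headers)
print(sequences)
```

Keeping the parsing in a generator and building the lists outside it avoids the global variables used in the question's code; the dictionary is only useful if no header appears twice, since a repeated header would overwrite the earlier sequence.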

This answer is from SO. I'll try to find it and give the original poster credit.

