grep - Retrieve lines of a text that contain a certain string only at the end and not somewhere in between


I have a text file of taxonomic assignments of bacteria that looks like this (the numbers identify different bacteria):

1130952 k__bacteria; p__acidobacteria; c__acidobacteriia; o__acidobacteriales; f__acidobacteriaceae; g__edaphobacter; s__modestum
555445  k__bacteria; p__firmicutes; c__clostridia; o__clostridiales; f__lachnospiraceae; g__clostridium; s__fimetarium
325910  k__bacteria; p__firmicutes; c__clostridia; o__clostridiales; f__ruminococcaceae; g__; s__
744205  k__bacteria; p__proteobacteria; c__deltaproteobacteria; o__; f__; g__; s__

Many of the bacteria don't have a classification down to species level, so they lack that information and end in a bare "s__". I want to see only the bacteria that do have species information (as in two of the bacteria above, one being "s__modestum", the other "s__fimetarium"). Using the Mac terminal (Mac OS X 10.9.5), I tried:

grep -v "s__" file 

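but since every assignment contains s__, it returns nothing (it excludes them all, I guess).

I have also tried using * at the end, as in s__*, but that doesn't work either.

What command can I apply to get each such line, together with a count, for the bacteria that have a species assignment? Something like this: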

1 1130952 k__bacteria; p__acidobacteria; c__acidobacteriia; o__acidobacteriales; f__acidobacteriaceae; g__edaphobacter; s__modestum
1 555445  k__bacteria; p__firmicutes; c__clostridia; o__clostridiales; f__lachnospiraceae; g__clostridium; s__fimetarium

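Just ask grep to match s__ followed by at least one more character. Note that inside a bracket expression $ is a literal dollar sign rather than the end-of-line anchor, so [^$] means "any character except $"; the pattern works because it requires something to follow s__: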

$ grep 's__[^$]' file
1130952 k__bacteria; p__acidobacteria; c__acidobacteriia; o__acidobacteriales; f__acidobacteriaceae; g__edaphobacter; s__modestum
555445  k__bacteria; p__firmicutes; c__clostridia; o__clostridiales; f__lachnospiraceae; g__clostridium; s__fimetarium

To count the lines matching this condition, you can use awk and store the counter values in an array:

$ awk '/s__[^$]/ {a[$0]++} END {for (i in a) print a[i], i}' file
1 555445  k__bacteria; p__firmicutes; c__clostridia; o__clostridiales; f__lachnospiraceae; g__clostridium; s__fimetarium
1 1130952 k__bacteria; p__acidobacteria; c__acidobacteriia; o__acidobacteriales; f__acidobacteriaceae; g__edaphobacter; s__modestum

To make sure the species name is what appears at the end of the line, you need these checks:

grep -E 's__[^ $]+$' file
awk '/s__[^ $]+$/ {a[$0]++} END {for (i in a) print a[i], i}' file

They check that s__ is followed by a run of at least one character that is neither a space nor a literal $, and that this run is then followed by the end of the line (the final, unbracketed $).
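The difference between the two patterns only shows up when something other than a species name follows s__, such as a trailing space. The following is a minimal sketch of that case; the sample file is a hypothetical, shortened stand-in for the real data (only the last two ranks are kept), and the variable names are made up for illustration.

```shell
# Hypothetical sample data: taxonomy strings shortened to the last two ranks.
# The last line ends in "s__ " (a trailing space, no species name).
tmp=$(mktemp)
printf '%s\n' \
  '1130952 g__edaphobacter; s__modestum' \
  '555445 g__clostridium; s__fimetarium' \
  '325910 g__; s__' \
  '744205 g__; s__ ' > "$tmp"

# Unanchored: any character other than a literal $ after s__ is enough,
# so the trailing space on the last line also counts as a match.
loose=$(grep -c 's__[^$]' "$tmp")

# Anchored: one or more characters that are neither space nor $, then end of line.
strict=$(grep -E -c 's__[^ $]+$' "$tmp")

echo "loose=$loose strict=$strict"   # prints: loose=3 strict=2
rm -f "$tmp"
```

Note that the anchored pattern would also reject a line where a real species name is followed by trailing whitespace, so it assumes the file has no stray spaces after the name.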


Update

Thank you, it worked great! Is there a way I can sum the lines, to know how many counts I have in total of non-"s__"? – isa

Sure, add print length(a) to see how many elements the array has:

$ awk '/s__[^ $]+$/ {a[$0]++} END {for (i in a) print a[i], i; print length(a)}' file
1 555445  k__bacteria; p__firmicutes; c__clostridia; o__clostridiales; f__lachnospiraceae; g__clostridium; s__fimetarium
1 1130952 k__bacteria; p__acidobacteria; c__acidobacteriia; o__acidobacteriales; f__acidobacteriaceae; g__edaphobacter; s__modestum
2
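One thing worth keeping in mind is that length(a) counts distinct lines, because the array is keyed by the whole line. If the file could contain the same line more than once and every occurrence should count, a plain counter is an alternative. Here is a small sketch of the difference, again using hypothetical shortened sample data with one deliberately duplicated line:

```shell
# Hypothetical sample data; the first record appears twice on purpose.
tmp=$(mktemp)
printf '%s\n' \
  '1130952 g__edaphobacter; s__modestum' \
  '1130952 g__edaphobacter; s__modestum' \
  '555445 g__clostridium; s__fimetarium' \
  '325910 g__; s__' > "$tmp"

# Number of distinct matching lines (array keyed by the full line).
distinct=$(awk '/s__[^ $]+$/ {a[$0]++} END {print length(a)}' "$tmp")

# Number of matching lines including repeats; n+0 prints 0 if nothing matched.
total=$(awk '/s__[^ $]+$/ {n++} END {print n+0}' "$tmp")

echo "distinct=$distinct total=$total"   # prints: distinct=2 total=3
rm -f "$tmp"
```

When only the total is needed and no per-line listing, grep -E -c 's__[^ $]+$' file should give the same number as the counter version.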
