grep - Retrieve lines of a text that contain a certain string only at the end and not somewhere in between
I have a text file of taxonomic assignments of bacteria that looks like this (the numbers identify different bacteria):
1130952 k__bacteria; p__acidobacteria; c__acidobacteriia; o__acidobacteriales; f__acidobacteriaceae; g__edaphobacter; s__modestum
555445 k__bacteria; p__firmicutes; c__clostridia; o__clostridiales; f__lachnospiraceae; g__clostridium; s__fimetarium
325910 k__bacteria; p__firmicutes; c__clostridia; o__clostridiales; f__ruminococcaceae; g__; s__
744205 k__bacteria; p__proteobacteria; c__deltaproteobacteria; o__; f__; g__; s__
Many of the bacteria don't have a classification down to species level, so they lack that information and end in a bare "s__". I want to see the bacteria that do have it (as in the first two bacteria above, one being "s__modestum", the other "s__fimetarium"). Using the Mac terminal (Mac OS X 10.9.5), I tried:
grep -v "s__" file
but since every assignment contains s__, it returns nothing (it excludes them all, I guess).
I have also tried using * at the end, as in s__*, but that doesn't work either.
What command can I apply to get the lines, and a count, of the bacteria that have a species assignment? For example:
1 1130952 k__bacteria; p__acidobacteria; c__acidobacteriia; o__acidobacteriales; f__acidobacteriaceae; g__edaphobacter; s__modestum
1 555445 k__bacteria; p__firmicutes; c__clostridia; o__clostridiales; f__lachnospiraceae; g__clostridium; s__fimetarium
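Just ask grep to match one more character after s__, which a line ending in a bare s__ cannot provide. Note that inside brackets $ is a literal dollar sign, not the end-of-line anchor, so [^$] means "any character except $":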
$ grep 's__[^$]' file
1130952 k__bacteria; p__acidobacteria; c__acidobacteriia; o__acidobacteriales; f__acidobacteriaceae; g__edaphobacter; s__modestum
555445 k__bacteria; p__firmicutes; c__clostridia; o__clostridiales; f__lachnospiraceae; g__clostridium; s__fimetarium
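One caveat with this first pattern is worth a quick sketch. The two-line input below is made up for the illustration and is not from the question's file: a line that ends in s__ followed by a trailing space still matches, because the space counts as "a character after s__". An end-anchored pattern that also rejects spaces does not have this problem.

```shell
# Hypothetical input: line 1 has a trailing space after s__, line 2 has a real species.
input=$(printf '%s\n' '1 g__; s__ ' '2 g__x; s__y')

# Loose pattern: any character other than a literal $ after s__.
loose=$(printf '%s\n' "$input" | grep -c 's__[^$]')

# Strict pattern: one or more non-space, non-$ characters, then end of line.
strict=$(printf '%s\n' "$input" | grep -cE 's__[^ $]+$')

echo "loose: $loose, strict: $strict"
# prints "loose: 2, strict: 1"
```

If the real file might contain trailing whitespace, the stricter form is probably the safer choice.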
To count the lines matching this condition, you need to use awk and store the counter values in an array (note that END must be upper case):
$ awk '/s__[^$]/ {a[$0]++} END {for (i in a) print a[i], i}' file
1 555445 k__bacteria; p__firmicutes; c__clostridia; o__clostridiales; f__lachnospiraceae; g__clostridium; s__fimetarium
1 1130952 k__bacteria; p__acidobacteria; c__acidobacteriia; o__acidobacteriales; f__acidobacteriaceae; g__edaphobacter; s__modestum
To make it match only when the species name runs to the end of the line, you need these checks:
grep -E 's__[^ $]+$' file
awk '/s__[^ $]+$/ {a[$0]++} END {for (i in a) print a[i], i}' file
They check that after s__ there is a run of at least one character that is neither a space nor a $, and then the end of the line.
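The same end-of-line condition can also be written without a regex, by looking at the last whitespace-separated field. This is an alternative to the commands above rather than part of the original answer. The sketch assumes the species entry is always the final field, and writes the question's four sample records to a temporary file whose name is invented for the example.

```shell
# Throwaway copy of the sample records (the temp file path is an assumption).
sample=$(mktemp "${TMPDIR:-/tmp}/taxa.XXXXXX")
cat > "$sample" <<'EOF'
1130952 k__bacteria; p__acidobacteria; c__acidobacteriia; o__acidobacteriales; f__acidobacteriaceae; g__edaphobacter; s__modestum
555445 k__bacteria; p__firmicutes; c__clostridia; o__clostridiales; f__lachnospiraceae; g__clostridium; s__fimetarium
325910 k__bacteria; p__firmicutes; c__clostridia; o__clostridiales; f__ruminococcaceae; g__; s__
744205 k__bacteria; p__proteobacteria; c__deltaproteobacteria; o__; f__; g__; s__
EOF

# $NF is the last field; skip blank lines (NF == 0) and lines whose last field is a bare "s__".
with_species=$(awk 'NF && $NF != "s__"' "$sample")
printf '%s\n' "$with_species"

rm -f "$sample"
```

On the sample data this keeps the s__modestum and s__fimetarium records and drops the other two.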
Update
Thank you, it worked great! Is there a way to sum the lines so I know how many I have in total that are not a bare "s__"? – isa
Sure, add print length(a) to see how many elements the array has:
$ awk '/s__[^ $]+$/ {a[$0]++} END {for (i in a) print a[i], i; print length(a)}' file
1 555445 k__bacteria; p__firmicutes; c__clostridia; o__clostridiales; f__lachnospiraceae; g__clostridium; s__fimetarium
1 1130952 k__bacteria; p__acidobacteria; c__acidobacteriia; o__acidobacteriales; f__acidobacteriaceae; g__edaphobacter; s__modestum
2
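If only the total is needed and not the lines themselves, grep's -c option prints the number of matching lines directly. The sketch below uses the same pattern on the question's four sample records; the temporary file is an assumption made for the example. Unlike length(a), which counts distinct lines, grep -c counts every matching line, so the two differ if the file contains duplicates.

```shell
# Throwaway copy of the sample records (the temp file path is an assumption).
sample=$(mktemp "${TMPDIR:-/tmp}/taxa.XXXXXX")
cat > "$sample" <<'EOF'
1130952 k__bacteria; p__acidobacteria; c__acidobacteriia; o__acidobacteriales; f__acidobacteriaceae; g__edaphobacter; s__modestum
555445 k__bacteria; p__firmicutes; c__clostridia; o__clostridiales; f__lachnospiraceae; g__clostridium; s__fimetarium
325910 k__bacteria; p__firmicutes; c__clostridia; o__clostridiales; f__ruminococcaceae; g__; s__
744205 k__bacteria; p__proteobacteria; c__deltaproteobacteria; o__; f__; g__; s__
EOF

# -c prints the count of matching lines; -E enables the + quantifier.
total=$(grep -cE 's__[^ $]+$' "$sample")
echo "$total"
# prints 2

rm -f "$sample"
```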