R - Select the first row of a matrix based on the 1st, 8th, and 9th column values with awk or sed


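I have a file in which many rows share the same values in the 1st, 8th, and 9th columns. The file has about 60k rows in total. I want to simplify it by keeping only the first row of each group of rows whose 1st, 8th, and 9th columns are the same.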

Input file:

chr    exon_start  exon_end  cnv  tumor_doc  control_doc  rationormalized_after_smoothing  cnv_start  cnv_end   seg_mean
chr1   762097      762270    3    821        717          1.456610215                      762097     6706109   1.297328502
chr1   861281      861490    3    101        117          1.29744744                       762097     6706109   1.297328502
chr1   7868860     7869039   2    78         119          1.123385189                      7796356    8921423   1.088752407
chr1   7869841     7870041   2    140        169          1.123385189                      7796356    8921423   1.088752407
chr1   7870411     7870596   2    83         163          1.123385189                      7796356    8921423   1.088752407
chr1   7879297     7879467   2    290        360          1.024742732                      7796356    8921423   1.088752407
chr1   21012415    21012609  3    89         135          1.230421209                      19536504   21054539  1.247494175
chr1   21013924    21014512  3    234        219          1.359224182                      19536504   21054539  1.247494175
chr1   21016588    21016803  3    172        179          1.230421209                      19536504   21054539  1.247494175
chr1   21024895    21025101  3    147        120          1.230421209                      19536504   21054539  1.247494175
chr14  20920169    20920704  3    211        214          1.254261327                      20840851   20923828  1.288877208
chr14  20922716    20922919  3    253        262          1.228396526                      20840851   20923828  1.288877208
chr14  20923634    20923828  3    188        201          1.206226522                      20840851   20923828  1.288877208
chr14  20924141    20924329  2    244        344          0.902299535                      20924141   21465086  1.088234038
chr14  20924787    20925701  2    314        306          1.305351797                      20924141   21465086  1.088234038
chr14  20926636    20926836  2    134        136          1.206226522                      20924141   21465086  1.088234038

Desired output:

chr    exon_start  exon_end  cnv  tumor_doc  control_doc  rationormalized_after_smoothing  cnv_start  cnv_end   seg_mean
chr1   762097      762270    3    821        717          1.456610215                      762097     6706109   1.297328502
chr1   7869841     7870041   2    140        169          1.123385189                      7796356    8921423   1.088752407
chr1   21024895    21025101  3    147        120          1.230421209                      19536504   21054539  1.247494175
chr14  20922716    20922919  3    253        262          1.228396526                      20840851   20923828  1.288877208
chr14  20924141    20924329  2    244        344          0.902299535                      20924141   21465086  1.088234038

I want to keep one row for each distinct group of rows that have the same values in column 1, column 8, and column 9. Ideally, that is the first row whenever those values change.

How can I achieve this in awk, sed, or R?

Import the data into R (you can specify a file instead):

df <- read.table(text = "chr exon_start exon_end cnv tumor_doc control_doc rationormalized_after_smoothing cnv_start cnv_end seg_mean
chr1 762097 762270 3 821 717 1.456610215 762097 6706109 1.297328502
chr1 861281 861490 3 101 117 1.29744744 762097 6706109 1.297328502
chr1 7868860 7869039 2 78 119 1.123385189 7796356 8921423 1.088752407
chr1 7869841 7870041 2 140 169 1.123385189 7796356 8921423 1.088752407
chr1 7870411 7870596 2 83 163 1.123385189 7796356 8921423 1.088752407
chr1 7879297 7879467 2 290 360 1.024742732 7796356 8921423 1.088752407
chr1 21012415 21012609 3 89 135 1.230421209 19536504 21054539 1.247494175
chr1 21013924 21014512 3 234 219 1.359224182 19536504 21054539 1.247494175
chr1 21016588 21016803 3 172 179 1.230421209 19536504 21054539 1.247494175
chr1 21024895 21025101 3 147 120 1.230421209 19536504 21054539 1.247494175
chr14 20920169 20920704 3 211 214 1.254261327 20840851 20923828 1.288877208
chr14 20922716 20922919 3 253 262 1.228396526 20840851 20923828 1.288877208
chr14 20923634 20923828 3 188 201 1.206226522 20840851 20923828 1.288877208
chr14 20924141 20924329 2 244 344 0.902299535 20924141 21465086 1.088234038
chr14 20924787 20925701 2 314 306 1.305351797 20924141 21465086 1.088234038
chr14 20926636 20926836 2 134 136 1.206226522 20924141 21465086 1.088234038", header = TRUE)

Extract the rows whose values in columns 1, 8, and 9 are not duplicates of earlier rows:

df[!duplicated(df[, c(1, 8, 9)]), ]
#     chr exon_start exon_end cnv tumor_doc control_doc rationormalized_after_smoothing cnv_start  cnv_end seg_mean
#1   chr1     762097   762270   3       821         717                       1.4566102    762097  6706109 1.297329
#3   chr1    7868860  7869039   2        78         119                       1.1233852   7796356  8921423 1.088752
#7   chr1   21012415 21012609   3        89         135                       1.2304212  19536504 21054539 1.247494
#11 chr14   20920169 20920704   3       211         214                       1.2542613  20840851 20923828 1.288877
#14 chr14   20924141 20924329   2       244         344                       0.9022995  20924141 21465086 1.088234
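
Since the question also asks about awk, the same idea can probably be sketched as an awk one-liner. This is only a minimal sketch, assuming the file is whitespace-separated; the file name cnv_sample.txt and the array name seen are made up for illustration, and the sample holds just the first few rows from the question.

```shell
# Hypothetical sample file holding the header and the first four data rows from the question.
cat > cnv_sample.txt <<'EOF'
chr exon_start exon_end cnv tumor_doc control_doc rationormalized_after_smoothing cnv_start cnv_end seg_mean
chr1 762097 762270 3 821 717 1.456610215 762097 6706109 1.297328502
chr1 861281 861490 3 101 117 1.29744744 762097 6706109 1.297328502
chr1 7868860 7869039 2 78 119 1.123385189 7796356 8921423 1.088752407
chr1 7869841 7870041 2 140 169 1.123385189 7796356 8921423 1.088752407
EOF

# seen[...] counts how often each (column 1, column 8, column 9) key has appeared.
# The post-increment returns the old count, so a line is printed only the first
# time its key shows up. The header has its own unique key and is kept as well.
result=$(awk '!seen[$1,$8,$9]++' cnv_sample.txt)
printf '%s\n' "$result"
```

Like duplicated() in R, this keeps the first row of each group and preserves the input order. It holds one array entry per distinct key in memory, which should be fine for a file of about 60k rows.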
