[Bioperl-l] removing redundant accession numbers

Thu Sep 7 22:41:57 UTC 2006

> acession_numbers.txt contains: (a '>' followed by two lowercase
> letters followed by ten digits).
> Some of the accession numbers may be repeated in the file; for
> example, >ci0100130090 is repeated 3 times, >ci0100130340 is repeated 3
> times, >ci0100130320 2 times, etc.
> I would like a program whose output file tells me that
>> ci0100130090 - 3 times
>> ci0100130320 - 2 times

If you are on a Unix system the easiest way is to use the power of 
piping between shell commands:

% cat acession_numbers.txt | sort | uniq -c

       3 >ci0100130090
       2 >ci0100130320
       2 >ci0100130340
       1 >ci0100130574
       2 >ci0100130804
       2 >ci0100130945
       1 >ci0100130986
       1 >ci0100131137
       1 >ci0100131140
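
If you would rather see the most frequently repeated accessions first, one more sort on the counts should do it. This is only a sketch: the function name rank_by_count and the sample data are made up for illustration, and the exact padding in front of the counts depends on your version of uniq.

```shell
# rank_by_count: read accessions on stdin and print "count accession",
# highest count first (hypothetical helper name, not a standard tool)
rank_by_count() {
    sort | uniq -c | sort -rn
}

# Made-up sample input; with the real file you would run:
#   rank_by_count < acession_numbers.txt
printf '>ci0100130320\n>ci0100130090\n>ci0100130574\n>ci0100130090\n>ci0100130320\n>ci0100130090\n' \
    | rank_by_count
```

The first sort groups identical lines so that uniq -c can count them, and the second sort orders the result numerically by that count, in reverse.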

If you want to strip the '>' symbol and put the count after the 
accession, followed by the word 'times', just add more parts to the pipe:

% cat acession_numbers.txt \
| sed -e 's/^>//' \
| sort \
| uniq -c \
| awk '{ print $2,"-",$1,"times" }'

ci0100130090 - 3 times
ci0100130320 - 2 times
ci0100130340 - 2 times
ci0100130574 - 1 times
ci0100130804 - 2 times
ci0100130945 - 2 times
ci0100130986 - 1 times
ci0100131137 - 1 times
ci0100131140 - 1 times
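
If you only care about the accessions that really are repeated, or just want each accession listed once, the same pipe can be adapted. The following is a rough sketch rather than a finished tool: the function names count_repeats and unique_accessions and the sample data are my own assumptions for illustration.

```shell
# count_repeats: read accessions on stdin and report only those seen
# more than once, as "accession - N times" (hypothetical helper name)
count_repeats() {
    sed -e 's/^>//' | sort | uniq -c \
        | awk '$1 > 1 { print $2, "-", $1, "times" }'
}

# unique_accessions: read accessions on stdin and print each one once,
# without the leading '>' (hypothetical helper name)
unique_accessions() {
    sed -e 's/^>//' | sort -u
}

# Made-up sample input; with the real file you would run:
#   count_repeats < acession_numbers.txt
sample='>ci0100130320
>ci0100130090
>ci0100130574
>ci0100130090
>ci0100130320
>ci0100130090'

printf '%s\n' "$sample" | count_repeats
# ci0100130090 - 3 times
# ci0100130320 - 2 times

printf '%s\n' "$sample" | unique_accessions
# ci0100130090
# ci0100130320
# ci0100130574
```

The awk pattern `$1 > 1` keeps only the lines where the count from uniq -c is greater than one, and sort -u does the sorting and the removal of duplicates in one step.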

Hope that helps,

-- 
Dr Torsten Seemann               http://www.vicbioinformatics.com
Victorian Bioinformatics Consortium, Monash University, Australia