[Bioperl-l] removing redundant accession numbers
Torsten Seemann
torsten.seemann at infotech.monash.edu.au
Thu Sep 7 22:41:57 UTC 2006
> acession_numbers.txt contains lines like this (a '>' followed by two
> lower-case letters followed by ten digits).
> Some of the accession numbers may be repeated in the file; for
> example, >ci0100130090 is repeated 3 times, >ci0100130340 3 times,
> and >ci0100130320 2 times.
> I would like a program whose output file tells me that
>> ci0100130090 - 3 times
>> ci0100130320 - 2 times
If you are on a Unix system the easiest way is to use the power of
piping between shell commands:
% cat acession_numbers.txt | sort | uniq -c
3 >ci0100130090
2 >ci0100130320
2 >ci0100130340
1 >ci0100130574
2 >ci0100130804
2 >ci0100130945
1 >ci0100130986
1 >ci0100131137
1 >ci0100131140
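If you only care about the accessions that really are repeated, one option is to filter on the count column that uniq -c prints. This is just a minimal sketch: sample_accessions.txt and repeated.txt are made-up file names, and the sample data is invented so the example can be run on its own.

```shell
# Hypothetical sample input, created here only so the sketch is self-contained.
printf '%s\n' '>ci0100130090' '>ci0100130320' '>ci0100130090' \
              '>ci0100130574' '>ci0100130320' '>ci0100130090' > sample_accessions.txt

# Count each accession, then keep only the lines whose count is greater than 1.
# uniq -c prints "count accession", so $1 is the count and $2 the accession.
sort sample_accessions.txt \
  | uniq -c \
  | awk '$1 > 1 { print $2, "-", $1, "times" }' > repeated.txt

cat repeated.txt
```

Accessions that occur only once (here >ci0100130574) are left out of repeated.txt.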
If you want to strip the '>' symbol and put the count after the
accession with the 'times', just add more parts to the pipe:
% cat acession_numbers.txt \
| sed -e 's/^>//' \
| sort \
| uniq -c \
| awk '{ print $2,"-",$1,"times" }'
ci0100130090 - 3 times
ci0100130320 - 2 times
ci0100130340 - 2 times
ci0100130574 - 1 times
ci0100130804 - 2 times
ci0100130945 - 2 times
ci0100130986 - 1 times
ci0100131137 - 1 times
ci0100131140 - 1 times
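Since the subject line is about removing the redundant entries, here is a rough sketch of two ways you might produce a non-redundant list rather than a count. The file names (sample_accessions.txt, unique_sorted.txt, unique_in_order.txt) and the sample data are assumptions of mine for illustration, not anything from the original file.

```shell
# Hypothetical sample input, created here only so the sketch is self-contained.
printf '%s\n' '>ci0100130574' '>ci0100130090' '>ci0100130320' \
              '>ci0100130090' '>ci0100130574' > sample_accessions.txt

# Option 1: sort and drop duplicates in one step (output is in sorted order).
sort -u sample_accessions.txt > unique_sorted.txt

# Option 2: keep the first occurrence of each line, preserving input order.
# seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++ is true
# only for that first occurrence.
awk '!seen[$0]++' sample_accessions.txt > unique_in_order.txt

cat unique_sorted.txt
cat unique_in_order.txt
```

The sort -u version is the shortest; the awk version is worth considering if the original order of the accessions matters to you.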
Hope that helps,
--
Dr Torsten Seemann http://www.vicbioinformatics.com
Victorian Bioinformatics Consortium, Monash University, Australia