[Bioperl-l] removing redundant accession numbers

Fri Sep 8 04:04:24 UTC 2006

Thanks a lot, that little set of commands did the job.

On 9/8/06, Torsten Seemann <torsten.seemann at infotech.monash.edu.au> wrote:
> > acession_numbers.txt contains entries of this form: a '>' followed by
> > two lowercase letters followed by ten digits.
> > Some of the accession numbers may be repeated in the file; for
> > example, >ci0100130090 is repeated 3 times, >ci0100130340 is repeated 3
> > times, >ci0100130320 2 times, etc.
> > I would like a program to write an output file telling me that:
> >> ci0100130090 - 3 times
> >> ci0100130320 - 2 times
>
> If you are on a Unix system the easiest way is to use the power of
> piping between shell commands:
>
> % cat acession_numbers.txt | sort | uniq -c
>
>        3 >ci0100130090
>        2 >ci0100130320
>        2 >ci0100130340
>        1 >ci0100130574
>        2 >ci0100130804
>        2 >ci0100130945
>        1 >ci0100130986
>        1 >ci0100131137
>        1 >ci0100131140
>
> If you want to strip the '>' symbol and put the count after the
> accession, followed by the word 'times', just add more stages to the pipe:
>
> % cat acession_numbers.txt |
>     sed -e 's/^>//' |
>     sort |
>     uniq -c |
>     awk '{ print $2, "-", $1, "times" }'
>
> ci0100130090 - 3 times
> ci0100130320 - 2 times
> ci0100130340 - 2 times
> ci0100130574 - 1 times
> ci0100130804 - 2 times
> ci0100130945 - 2 times
> ci0100130986 - 1 times
> ci0100131137 - 1 times
> ci0100131140 - 1 times
>
> Hope that helps,
>
> --
> Dr Torsten Seemann               http://www.vicbioinformatics.com
> Victorian Bioinformatics Consortium, Monash University, Australia
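
For anyone finding this thread later: the pipeline above can be wrapped in small shell functions so it is easy to reuse, and since the subject line asks about removing the redundant entries, a de-duplicating variant is included too. This is only a sketch. The function names (count_accessions, repeated_accessions, unique_accessions) are made up for illustration and are not from the original posts, and it assumes one accession per line with no embedded whitespace.

```shell
# Hypothetical helper names; assumes one accession per line, no spaces.

# Print "ACCESSION - N times" for every distinct accession in the file "$1".
count_accessions() {
    sed -e 's/^>//' "$1" |   # strip the leading '>'
        sort |               # uniq only collapses adjacent lines, so sort first
        uniq -c |            # prefix each distinct line with its count
        awk '{ print $2, "-", $1, "times" }'
}

# Same report, restricted to accessions that occur more than once.
repeated_accessions() {
    sed -e 's/^>//' "$1" |
        sort |
        uniq -c |
        awk '$1 > 1 { print $2, "-", $1, "times" }'
}

# Print each accession exactly once, keeping the '>' prefix.
unique_accessions() {
    sort -u "$1"
}
```

After pasting or sourcing these, "count_accessions acession_numbers.txt" should give the same report as the pipeline above, and "unique_accessions acession_numbers.txt > unique.txt" writes the list with duplicates removed. Note that the output is in sorted order, not the order of the input file.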