[Bioperl-l] How to remove redundancy ?

Jason Stajich jason@cgt.mc.duke.edu
Fri, 15 Nov 2002 13:22:07 -0500 (EST)


If indeed you want a nonredundant sequence db just use nrdb:
http://blast.wustl.edu/pub/nrdb/

This won't require you to rely on the sequences being in genbank.
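
For a small file where installing nrdb is not practical, the same idea (collapse records whose sequences are identical) can be sketched in plain Perl. This is only an illustration and not how nrdb itself is implemented. The sub name collapse_identical, the ' | ' header separator, and the case-insensitive comparison are all assumptions made for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read FASTA text from a filehandle and return one
# [header, sequence] pair per distinct sequence, in order of first appearance.
# Sequences are compared (and returned) in upper case.
sub collapse_identical {
    my ($fh) = @_;
    my (@order, %headers_for);
    my ($header, $seq) = (undef, '');

    # Store the record collected so far under its upper-cased sequence.
    my $flush = sub {
        return unless defined $header;
        my $key = uc $seq;
        push @order, $key unless exists $headers_for{$key};
        push @{ $headers_for{$key} }, $header;
    };

    while (my $line = <$fh>) {
        $line =~ s/\s+\z//;                 # strip newline and trailing blanks
        if ($line =~ /^>(.*)/) {            # a new record starts here
            my $new_header = $1;
            $flush->();
            ($header, $seq) = ($new_header, '');
        }
        elsif (defined $header) {
            $line =~ s/\s+//g;              # drop any whitespace inside the sequence
            $seq .= $line;
        }
    }
    $flush->();                             # store the last record too

    # Headers of merged records are joined with ' | ' (an arbitrary choice).
    return map { [ join(' | ', @{ $headers_for{$_} }), $_ ] } @order;
}

# Usage: perl collapse.pl in.fasta > out.fasta
if (!caller && @ARGV) {
    open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    for my $rec (collapse_identical($in)) {
        print ">$rec->[0]\n$rec->[1]\n";
    }
    close $in;
}
```

Everything is held in memory, so this only suits data sets that fit comfortably in RAM.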

-jason

On Fri, 15 Nov 2002, Marc Logghe wrote:

> I think he means a non-redundant data set at the sequence level, like
> GenBank's nr and nt.
> Actually, I was looking for exactly that a few days ago. In my search I bumped
> into 'CleanUP', but that appears to work only on nucleotide sequences; plus
> it is no longer available for download. You can only use the web
> interface at http://bighost.area.ba.cnr.it/BIG/CleanUP/. Too bad.
> What I did was obtain from NCBI all the gi numbers of the subset I needed and
> feed them to fastacmd like this:
> fastacmd -d nr -i gi_list | infoseq -filter -only -name | ./make_unique
> make_unique is a little Perl script that generates a unique identifier set:
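> #!/usr/bin/perl -w
> use strict;
> my %seen;
> while (<>)
> {
>   chomp;
>   next if $_ eq '';   # skip empty lines
>   # print each identifier the first time it appears (keeps input order)
>   print "$_\n" unless $seen{$_}++;
> }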
>
> That way you can get a non-redundant sequence dataset by feeding the
> non-redundant identifier list to fastacmd (to get the sequences themselves)
> or to NCBI BLAST directly (-l subset option).
> Hope the 'off-topic'-level is not too high with this answer ;-)
> Regards,
> Marc
>
>
> > -----Original Message-----
> > From: nkuipers [mailto:nkuipers@uvic.ca]
> > Sent: Friday, November 15, 2002 5:12 PM
> > To: Giuseppe Torelli
> > Cc: bioperl-l@bioperl.org
> > Subject: RE: [Bioperl-l] How to remove redundancy ?
> >
> >
> > Perhaps you could be more specific about what you mean by "redundancy",
> > and what format your data set is in? For example, assuming FASTA format
> > and redundancy meaning duplications in the data set, are you referring to
> > primary IDs, accession numbers, descriptions, or the sequences themselves?
> > If that is the case, you could roll a solution with Bio::SeqIO. Read in the
> > file, pull out the information of interest (what you are defining as
> > redundant) with one of the "get property" sorts of methods (like
> > $obj->desc), and test that information against a hash populated as you go.
> > If it already exists, move on to the next one; otherwise write the record
> > out to a new file.
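
The approach described above might look roughly like the following sketch. It assumes BioPerl is installed and the input is FASTA. The sub name write_unique and its key-function argument are made up for this example. The Bio::SeqIO calls used (new, next_seq, write_seq) and the accessors desc, display_id and seq are standard BioPerl methods.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

# Hypothetical helper: copy records from $in_file to $out_file, skipping any
# record whose key (as computed by the $key_of callback) was already seen.
# Returns the number of records written.
sub write_unique {
    my ($in_file, $out_file, $key_of) = @_;
    my $in  = Bio::SeqIO->new(-file => $in_file,     -format => 'fasta');
    my $out = Bio::SeqIO->new(-file => ">$out_file", -format => 'fasta');

    my %seen;
    my $kept = 0;
    while (my $seq = $in->next_seq) {
        my $key = $key_of->($seq);
        $key = '' unless defined $key;   # records without a key share ''
        next if $seen{$key}++;           # already written once, skip it
        $out->write_seq($seq);
        $kept++;
    }
    $out->close;
    return $kept;
}

# Usage: perl unique_fasta.pl in.fasta out.fasta
if (!caller && @ARGV == 2) {
    # Redundancy is defined here as "same description line"; swap in
    # sub { $_[0]->display_id } or sub { uc $_[0]->seq } as needed.
    my $n = write_unique(@ARGV, sub { $_[0]->desc });
    print STDERR "kept $n records\n";
}
```

Passing the key as a callback keeps the definition of "redundant" in one place, which matters because the original question left it open. Note that with the description as key, all records lacking a description collapse into one.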
> >
> > Regards,
> >
> > Nathanael Kuipers
> > ---
> > Center for Biomedical Research,
> > Dept. of Biology,
> > University of Victoria
> >
> >
> > >===== Original Message From Giuseppe Torelli <torelli@alpha.szn.it> =====
> > >Hi,
> > >
> > >which software do you use to remove redundancy
> > >from a gene dataset ?
> > >
> > >Thank you,
> > >--
> > >Giuseppe Torelli
> > >
> > >Bioinformatic Programmer
> > >Laboratory of Molecular Evolution
> > >Stazione Zoologica A. Dohrn
> > >Villa Comunale
> > >80121 Naples - Italy
> > >Tel.  0039 81 5833311
> > >Fax: 0039 81 7641355
> > >_______________________________________________
> > >Bioperl-l mailing list
> > >Bioperl-l@bioperl.org
> > >http://bioperl.org/mailman/listinfo/bioperl-l

-- 
Jason Stajich
Duke University
jason at cgt.mc.duke.edu