[Bioperl-l] How to remove redundancy ?
Marc Logghe
Marc.Logghe@devgen.com
Fri, 15 Nov 2002 17:34:03 +0100
I think he means a non-redundant data set at the sequence level, like the nr
and nt databases of GenBank.
Actually, I was looking for exactly that a few days ago. In my search I came
across 'CleanUP', but it appears to work only on nucleotide sequences, and it
is no longer available for download. You can only use the web interface at
http://bighost.area.ba.cnr.it/BIG/CleanUP/. Too bad.
What I did was obtain from NCBI all gi numbers of the subset I needed and
feed them to fastacmd like this:
fastacmd -d nr -i gi_list | infoseq -filter -only -name | ./make_unique
make_unique is a little perl script that reduces its input to a set of unique
identifiers:
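#!/usr/bin/perl -w
use strict;

# Record every input line as a hash key; keys are unique by construction.
my %seen;
while (<>)
{
    chomp;
    $seen{$_}++;
}
# Print each identifier once, newline-terminated (hash order is arbitrary).
print "$_\n" for keys %seen;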
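If the order of the identifiers matters, a variant that keeps the first occurrence of each one could look roughly like the sketch below. The sub name unique_in_order and the sample identifiers are made up for illustration; they are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the arguments with duplicates dropped,
# keeping the order in which each item was first seen.
sub unique_in_order {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

# Made-up identifiers for demonstration. To use this as a filter like
# make_unique, read the list with:  chomp(my @ids = <STDIN>);
my @ids = qw(gi|3 gi|1 gi|3 gi|2 gi|1);
print "$_\n" for unique_in_order(@ids);
```

The grep keeps an item only when its counter in %seen is still zero, so later repeats are skipped without an explicit loop.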
That way you get a non-redundant sequence dataset by feeding the
non-redundant identifier list to fastacmd (to get the sequences themselves)
or to NCBI BLAST directly (-l subset option).
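If redundancy is instead meant as identical sequences under different identifiers, the hash test suggested in the quoted message below can be sketched in plain Perl. This is only a rough sketch under a few assumptions: the sub name dedupe_fasta is made up, only exact (case-insensitive) matches count as duplicates, the first header seen for a sequence is kept, and each sequence is written back on a single line. With BioPerl installed, Bio::SeqIO could replace the hand-rolled FASTA parsing.

```perl
use strict;
use warnings;

# Hypothetical helper: take FASTA text and return FASTA text in which each
# distinct sequence appears once. The first header seen for a sequence wins.
sub dedupe_fasta {
    my ($fasta) = @_;
    my (%seen, @out);
    # Split into records at every '>' that starts a line.
    for my $record (split /^>/m, $fasta) {
        next unless $record =~ /\S/;          # skip the empty leading chunk
        my ($header, @lines) = split /\n/, $record;
        my $seq = uc join '', @lines;         # join wrapped sequence lines
        $seq =~ s/\s+//g;                     # drop stray whitespace
        next if $seq eq '' || $seen{$seq}++;  # skip empty or repeated sequences
        push @out, ">$header\n$seq\n";
    }
    return join '', @out;
}

# Made-up input: records 'a' and 'b' carry the same sequence.
print dedupe_fasta(">a desc\nACGT\nAC\n>b\nacgtac\n>c\nTTTT\n");
```

For large files it would be cheaper to key the hash on a digest of the sequence rather than on the sequence itself, but the idea stays the same.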
Hope the 'off-topic'-level is not too high with this answer ;-)
Regards,
Marc
> -----Original Message-----
> From: nkuipers [mailto:nkuipers@uvic.ca]
> Sent: Friday, November 15, 2002 5:12 PM
> To: Giuseppe Torelli
> Cc: bioperl-l@bioperl.org
> Subject: RE: [Bioperl-l] How to remove redundancy ?
>
>
> Perhaps you could be more specific about what you mean by "redundancy"?
> And what format is your data set in? For example, assuming fasta format
> and redundancy meaning duplications in the data set, are you referring to
> primary IDs, accession numbers, descriptions, or the sequences themselves?
> If this is the case you could roll a solution with Bio::SeqIO. Read in the
> file, pull out the information of interest (what you are defining as
> redundant) with one of the "get property" sorts of methods (like
> $obj->desc) and test that information against a hash populated as you go.
> If it already exists, move to the next one, otherwise write it out to a
> new file.
>
> Regards,
>
> Nathanael Kuipers
> ---
> Center for Biomedical Research,
> Dept. of Biology,
> University of Victoria
>
>
> >===== Original Message From Giuseppe Torelli <torelli@alpha.szn.it> =====
> >Hi,
> >
> >which software do you use to remove redundancy
> >from a gene dataset ?
> >
> >Thank you,
> >--
> >Giuseppe Torelli
> >
> >Bioinformatic Programmer
> >Laboratory of Molecular Evolution
> >Stazione Zoologica A. Dohrn
> >Villa Comunale
> >80121 Naples - Italy
> >Tel. 0039 81 5833311
> >Fax: 0039 81 7641355
> >_______________________________________________
> >Bioperl-l mailing list
> >Bioperl-l@bioperl.org
> >http://bioperl.org/mailman/listinfo/bioperl-l