[Bioperl-l] How to remove redundancy ?

nkuipers nkuipers@uvic.ca
Fri, 15 Nov 2002 08:12:23 -0800

Could you be more specific about what you mean by "redundancy", and about what
format your data set is in?  For example, assuming fasta format and redundancy
meaning duplicate entries in the data set, are the duplicates in the primary
IDs, accession numbers, descriptions, or the sequences themselves?  If that is
the case, you could roll your own solution with Bio::SeqIO.  Read in the file,
pull out the information of interest (whatever you are defining as redundant)
with one of the accessor methods (like $obj->desc), and test that information
against a hash that you populate as you go.  If the value is already in the
hash, skip to the next record; otherwise add it to the hash and write the
record out to a new file.
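
As a rough sketch of the read / look up / write loop described above, here is
the same idea in plain Python rather than BioPerl, so that it runs without any
extra modules.  The names read_fasta and dedupe_fasta and the "key" parameter
are made up for illustration and are not part of any library; with Bio::SeqIO
the records would come from the parser instead of the hand-rolled reader.  By
default it assumes "redundant" means an identical sequence, ignoring case.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dedupe_fasta(in_handle, out_handle, key=lambda header, seq: seq.upper()):
    """Write the first record seen for each key; return how many were kept.

    `key` decides what counts as redundant (hypothetical parameter): the
    default compares sequences case-insensitively.
    """
    seen = set()  # plays the role of the hash populated as you go
    kept = 0
    for header, seq in read_fasta(in_handle):
        k = key(header, seq)
        if k in seen:
            continue  # already seen: move on to the next record
        seen.add(k)
        out_handle.write(f">{header}\n{seq}\n")
        kept += 1
    return kept
```

To call it on real files, open the input for reading and the output for
writing and pass both handles to dedupe_fasta.  To treat records with the same
ID as redundant instead, pass something like
key=lambda header, seq: header.split()[0].  Note that the sketch writes each
sequence on a single line and keeps only the first record for each key.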


Nathanael Kuipers
Center for Biomedical Research,
Dept. of Biology,
University of Victoria

>===== Original Message From Giuseppe Torelli <torelli@alpha.szn.it> =====
>which software do you use to remove redundancy
>from a gene dataset ?
>Thank you,
>Giuseppe Torelli
>Bioinformatic Programmer
>Laboratory of Molecular Evolution
>Stazione Zoologica A. Dohrn
>Villa Comunale
>80121 Naples - Italy
>Tel.  0039 81 5833311
>Fax: 0039 81 7641355