[Bioperl-l] Categorization of EST's by species/taxonomy/lineage

Mark Johnson mjohnson at watson.wustl.edu
Thu Apr 29 16:50:58 EDT 2004


     I've got a bunch of flat files containing EST sequences (GenBank
format) from the NCBI ftp site.  I'd like to sort through them,
categorize them, and build some blast databases.  It would be nice to
be able to sort them into a few different piles, such as vertebrate,
invertebrate, fungi, species1, species2, speciesN, etc.
     To this end, having the full 'lineage' available would be handy.
However, EST records from the EST database only have the organism,
unlike, say, mRNA records from the nucleotide database, which tend
to have the full lineage (Eukaryota; Metazoa; Chordata; Craniata;
Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini;
Hominidae; Homo).
     With mRNA records from the nucleotide database, this is an easy job:
just call $seq->species->classification() and sort through the list.
However, with these EST files from dbEST, that doesn't work: the
resulting list is empty.
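
To make the "sort through the list" step concrete, here is a minimal sketch in plain Perl. It assumes the classification arrives as a flat list of taxon names, as in the lineage quoted above. The helper name pile_for and the pile labels are hypothetical, chosen only for illustration; nothing here calls BioPerl.

```perl
use strict;
use warnings;

# Hypothetical helper: map a list of taxon names onto one coarse pile.
# The order of the names does not matter; only membership is tested.
sub pile_for {
    my @classification = @_;
    return 'unclassified' unless @classification;   # e.g. the empty dbEST case

    my %seen = map { $_ => 1 } @classification;
    return 'vertebrate'   if $seen{Vertebrata};
    return 'invertebrate' if $seen{Metazoa};        # animal, but not vertebrate
    return 'fungi'        if $seen{Fungi};
    return 'other';
}

my @human = qw(sapiens Homo Hominidae Catarrhini Primates Eutheria Mammalia
               Euteleostomi Vertebrata Craniata Chordata Metazoa Eukaryota);

print pile_for(@human), "\n";
print pile_for(), "\n";
```

An empty list falls into its own 'unclassified' pile, so records without a lineage are not silently mixed into 'other'.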
     I initially had high hopes after discovering Bio::DB::Taxonomy, but
there are some bugs in the 1.4 version, and even upgrading to the
latest in CVS, I can't seem to find a way to get the full lineage:

use Bio::DB::Taxonomy;

# Bio::DB::Taxonomy (well, really Bio::DB::Taxonomy::entrez)
my $db = Bio::DB::Taxonomy->new(-source => 'entrez');
my $taxaid = $db->get_taxonid('Homo sapiens');

# Bio::Taxonomy::Node
my $taxobj = $db->get_Taxonomy_Node(-taxonid => $taxaid);

# @classification contains only 'sapiens' and 'homo'.
my @classification = $taxobj->classification();

Looking at the code for the classification method, I came across this
comment:  # okay this won't really work - need to do proper recursion
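
For illustration, the "proper recursion" that comment asks for amounts to following parent links from a node up to the root. The sketch below does this over a hypothetical, abridged in-memory table (%parent_of) rather than through the Bio::DB::Taxonomy API. The function name lineage is also an assumption made for the example.

```perl
use strict;
use warnings;

# Hypothetical toy table: taxon name => parent name (abridged lineage).
# The root has an undefined parent.
my %parent_of = (
    'Homo sapiens' => 'Homo',
    'Homo'         => 'Hominidae',
    'Hominidae'    => 'Primates',
    'Primates'     => 'Mammalia',
    'Mammalia'     => 'Vertebrata',
    'Vertebrata'   => 'Metazoa',
    'Metazoa'      => 'Eukaryota',
    'Eukaryota'    => undef,
);

# Return the node followed by all of its ancestors, most specific first.
# Recursion stops at the root or at a name missing from the table.
sub lineage {
    my ($name) = @_;
    return () unless defined $name && exists $parent_of{$name};
    return ($name, lineage($parent_of{$name}));
}

print join('; ', reverse lineage('Homo sapiens')), "\n";
```

The sketch assumes the table has no cycles; a real taxonomy source would need the same parent lookup per node, however it is fetched.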

So... is there a way to get to where I want to be without hacking on the
module(s) in some terribly caveman-like fashion?



