[Bioperl-l] Bio::*Taxonomy* changes
Chris Fields
cjfields at uiuc.edu
Fri Jul 21 04:51:30 UTC 2006
> I didn't actually mean a stored file (but that would be possible with a
> tied hash or something: DB_File, just like flatfile), but an in-memory
> one for use during the course of program execution. Stored file would
> probably be dangerous because you wouldn't know if the data has become
> stale or not - and checking to see if it wasn't would defeat the
> point.
Okay, that wouldn't be a problem. I currently use in-memory caches
to hold NCBI history information and ELink information for
EUtilities. It would just be a matter of doing the same for
Bio::DB::Taxonomy.
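A minimal sketch of the kind of in-memory cache meant here, assuming a plain hash keyed by taxon ID. The names %node_cache, fetch_remote() and get_node() are hypothetical and are not part of Bio::DB::Taxonomy; fetch_remote() only stands in for a real remote query.

```perl
use strict;
use warnings;

# Hypothetical in-memory cache keyed by taxon ID. It lives only for the
# duration of the program, so stale data is not a concern.
my %node_cache;
my $lookups = 0;

# Stand-in for a real remote lookup (e.g. an Entrez query); hypothetical.
sub fetch_remote {
    my ($taxid) = @_;
    $lookups++;
    return { id => $taxid, name => "taxon_$taxid" };
}

# Return the cached record, fetching it only on the first request.
sub get_node {
    my ($taxid) = @_;
    $node_cache{$taxid} = fetch_remote($taxid)
        unless exists $node_cache{$taxid};
    return $node_cache{$taxid};
}

get_node(9606) for 1 .. 3;
print "remote lookups: $lookups\n";
```

Repeated requests for the same ID hit the hash rather than the remote source, which is the whole point of keeping the cache in memory for one run only.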
...
> entrez already parses through LineageEx to build the classification
> array. flatfile walks up all the parents to do the same. Having the
> information isn't the issue. We have the information. The methods
> genus() and species() need to work with the genbank fileformat, that is
> the problem.
The original purpose for Bio::Species was a simple object to hold
taxonomic information. This object was then used in an attempt to
hold the basic organism information (scientific name, common name,
lineage information, etc) contained in a RichSeq file, like GenBank,
EMBL, SwissProt, etc. The problem: trying to determine which term
in the lineage corresponds to which rank and what part of the
organism's scientific name is the genus, the species, and so on based
solely on the data in the file, which comes down to a best-guess
scenario for many organisms. It does work, but not equally well for
all RichSeq files, not for every organism, and definitely not all the
time. So, yes, genus(), species(), binomial(), and other methods are
present, but one must realize that parsing the file's data into the
appropriate object fields through the various get/sets is, with the
obvious exceptions, not the best way.
Unless... you incorporate information available only outside the
actual file itself (i.e. NCBI Taxonomy information). This is where
Bio::Taxonomy seems to come along, as it's not species-specific (it
can represent any rank) and is also DB-aware. Though Bio::Species
was originally going to delegate all its data to Bio::Taxonomy::Node,
I think the purpose was to eventually replace Bio::Species.
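To illustrate the best-guess problem with file-only data, here is a deliberately naive sketch that treats the first word of a scientific name as the genus and the rest as the species. guess_genus_species() is a hypothetical helper written for this example; it is not the parser Bio::Species or the SeqIO modules actually use.

```perl
use strict;
use warnings;

# Naive guess: first word is the genus, everything after it is the species.
sub guess_genus_species {
    my ($scientific_name) = @_;
    my ($genus, @rest) = split ' ', $scientific_name;
    return ($genus, join(' ', @rest));
}

for my $name ('Homo sapiens', 'Escherichia coli K-12', 'uncultured bacterium') {
    my ($genus, $species) = guess_genus_species($name);
    print "$name => genus='$genus' species='$species'\n";
}
```

The first name splits cleanly, the second drags a strain designation into the species, and the third has no genus at all. Without rank information from a taxonomy database, the name alone cannot tell these cases apart.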
So, my question is, why not use a Bio::Taxonomy::Node-like class
initially to contain the appropriate data for a GenBank file (just
for read/write purposes)? This object, since it implements
Bio::Taxonomy::NodeI, is also DB-aware and thus, if set up with a
database, could also get/set the appropriate object data correctly
using the lineage data. So, for instance, if I called
$species = $seq->species();
and wanted the classification, scientific_name(), common_name(), and
other information that is gleaned from the file, then there's no need
for a lookup. Once you cross into the bounds of:
print $species->species();
print $species->genus();
then there's trouble, since we're working straight from the file
(i.e. parsing is mainly correct, but still guesswork and sometimes
wrong). But what if you could do something like this:
use Bio::DB::Taxonomy;
my $db = Bio::DB::Taxonomy->new(-source => 'entrez');
# normally not needed as this is set by default internally, but as a
# demo here...
$species->db_handle($db);
# reset the appropriate data (genus, species, etc) based on Entrez
# tax data
$species->reset_data(); # this method, BTW, doesn't exist yet but
                        # should be easy to implement
print $species->species();
my $parent = $species->get_Parent_Node;
my @child = $species->get_Children_Nodes;
...and so on
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign