[Bioperl-l] Bio::*Taxonomy* changes
Chris Fields
cjfields at uiuc.edu
Wed Jul 26 02:16:36 UTC 2006
One last thing before I shut off bioperl for a week and concentrate
on Connecticut:
On Jul 25, 2006, at 12:49 PM, Sendu Bala wrote:
> Chris Fields wrote:
>> If I were to get an object back that was labeled Bio::Species, as a
>> biologist I would expect it to be part of a taxonomy, not the actual
>> Taxonomy itself.
>
> I think this is the most important sentence in the discussion. Ok, so
> it's clear to me that a better solution is needed than my
> Bio::Taxonomy-related proposal. Sorry for being so slow on the uptake. I
> also needed to start trying to code my Taxonomy proposal to see some
> issues with it.
... Again, thanks for noticing that.
> ---- summary:
>
> It seems like the main problem with Node right now is that it has
> classification() and things like genus(). I propose pure Node method
> solutions to answer the questions classification() and genus() were
> implemented to answer, but in a better, cruft-free way.
>
> Bio::DB::Taxonomy::genbank anyone?
Ach... You're compromising here; that's not like you. I think
you're making this too complicated by trying too many things at
once. Don't think in terms of sudden, dramatic changes to the API.
Sneak changes in so that they don't scare users away, then let them
get used to the new way of grabbing Tax data. Make your point that
it's more accurate to do it this way (you'll have defenders in Hilmar
and me, BTW).
Do this (start with genbank.pm):
1) Switch out Bio::Species with Node or Taxonomy; relocate other
information temporarily (Bio::Species, get/sets in Seq object,
SimpleValue). Leave Bio::Species in for the time being, but don't
bother making any additional changes to it.
2) Make sure next_seq() and write_seq() work and pass tests. Add
additional tests for the Tax/Node object (you could even use the tax
dump data you recently added for more complicated tests).
3) Add in additional stuff bit by bit until it is where you would
like it.
4) Make sure parsing is kosher with the latest release notes.
Probably should make sure write_seq follows what the release notes
state to some degree.
And, really, you won't break anything with genbank.pm organelle()
parsing. If you look at the module, the organelle isn't even touched
in next_seq() or _read_GenBank_Species(), so it was broken to begin
with!
My proposal, though extreme, was to remove genus() etc (which you
wanted as well with Node). You could leave this cruft for the time
being in Bio::Species, which could still act as a sequence tax info
holder object. It just won't be the >default< Seq tax information
object, which would be Bio::Taxonomy or Node.
Hence Hilmar's suggestion to use a $seq->taxon() method to return a
Node/Taxonomy, and a $seq->species() would still return a
Bio::Species object. It's redundant, but only for the time being,
and the redundant information wouldn't have a major memory footprint
anyway (unlike the feature table or the full sequence). Any
information that isn't stored in whatever Tax object you use (e.g.
lineage or organelle) could be stored temporarily in another fashion,
such as a get/set in Seq or SimpleValue object, to make next_seq/
write_seq work (such as $seq->organelle() or $seq->classification(),
instead of $seq->species->organelle and so on).
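A rough sketch of what that transitional layout could look like. The package names (Sketch::Seq, Sketch::TaxNode, Sketch::Species) are made up for illustration and are not the real BioPerl classes; taxon(), organelle() and classification() on the sequence object are the proposed accessors, not an existing API:

```perl
use strict;
use warnings;

# Hypothetical stand-in for a Node/Taxonomy-style object.
package Sketch::TaxNode;
sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}
sub scientific_name { return $_[0]{scientific_name} }
sub rank            { return $_[0]{rank} }

# Hypothetical stand-in for the legacy species holder, cruft included.
package Sketch::Species;
sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}
sub genus   { return $_[0]{genus} }
sub species { return $_[0]{species} }

# Hypothetical sequence object carrying both tax objects side by side,
# plus temporary get/sets for data the tax object does not hold.
package Sketch::Seq;
sub new { my $class = shift; return bless {}, $class }

sub _get_set {
    my ($self, $key, @val) = @_;
    $self->{$key} = $val[0] if @val;
    return $self->{$key};
}
sub taxon          { my $self = shift; return $self->_get_set('taxon', @_) }
sub species        { my $self = shift; return $self->_get_set('species', @_) }
sub organelle      { my $self = shift; return $self->_get_set('organelle', @_) }
sub classification { my $self = shift; return $self->_get_set('classification', @_) }

package main;

my $seq = Sketch::Seq->new;
$seq->taxon(Sketch::TaxNode->new(scientific_name => 'Homo sapiens', rank => 'species'));
$seq->species(Sketch::Species->new(genus => 'Homo', species => 'sapiens'));
$seq->organelle('mitochondrion');
$seq->classification([qw(Eukaryota Metazoa Chordata)]);

print $seq->taxon->scientific_name, "\n";   # new default route
print $seq->species->genus, "\n";           # legacy route still works
print $seq->organelle, "\n";                # relocated onto the Seq object
```

Old code calling $seq->species keeps working, while new code can move to $seq->taxon at its own pace.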
Hilmar then suggests that, around the 1.6-ish release, we note the
changes made to SeqIO towards Bio::Taxonomy-based objects and indicate
that Bio::Species, via species() and its associated methods, will be
deprecated around 1.7 (which gives everybody notice on API issues).
Then add warnings to Bio::Species in 1.7 noting the deprecation, and
remove it from core completely in 1.8 - 2.0.
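The warning stage of that schedule might look something like the sketch below. Sketch::Seq is a made-up class, and the warning is raised with plain Carp rather than whatever mechanism BioPerl itself would use, so treat it as an illustration of the idea only:

```perl
use strict;
use warnings;

# Hypothetical sequence class at the "warn but still work" stage.
package Sketch::Seq;
use Carp qw(carp);

sub new {
    my $class = shift;
    return bless { species => 'legacy species object', taxon => 'taxon node' }, $class;
}

# The new accessor: no warning.
sub taxon { return $_[0]{taxon} }

# The legacy accessor: still returns its data, but tells the caller to move on.
sub species {
    my $self = shift;
    carp "species() is deprecated; use taxon() instead";
    return $self->{species};
}

package main;

# Capture warnings so the sketch can show that exactly one call complained.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my $seq = Sketch::Seq->new;
my $old = $seq->species;   # warns, still returns the legacy object
my $new = $seq->taxon;     # silent

print "warnings seen: ", scalar(@warnings), "\n";
```

Using carp reports the warning from the caller's line, which points users at the code they need to change.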
One last thing, which is minor really: I remember seeing something
about having Nodes with 'no rank' ignored unless a flag is used.
That may be bad news for some organisms in sequence files where the
TaxID is for a 'no rank' rank, such as environmental samples. May
want to think about that here.
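To make the worry concrete, here is a small sketch with made-up lineage data and a hypothetical include_no_rank flag (neither is taken from the actual modules). With the flag off, the node the TaxID actually points at disappears from the result:

```perl
use strict;
use warnings;

# Made-up lineage, root first: each entry is [name, rank].
# The last entry is the node the sequence's TaxID refers to.
my @lineage = (
    [ 'root',                         'no rank'      ],
    [ 'Bacteria',                     'superkingdom' ],
    [ 'example environmental sample', 'no rank'      ],
);

# Return node names, skipping 'no rank' nodes unless the
# (hypothetical) include_no_rank flag is set.
sub ranked_names {
    my ($lineage, %opt) = @_;
    return map  { $_->[0] }
           grep { $opt{include_no_rank} || $_->[1] ne 'no rank' }
           @$lineage;
}

my @default = ranked_names(\@lineage);
my @all     = ranked_names(\@lineage, include_no_rank => 1);

print "default: ", join(' > ', @default), "\n";
print "flagged: ", join(' > ', @all), "\n";
```

One way around it might be to always keep the leaf node regardless of rank, but that is a design decision for whoever implements the filter.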
I'm hoping the releases will start popping out a bit more
regularly than they have been. There have been volunteers to
release periodic updates for bug fixes etc.
If I get a chance I'll try keeping up. Don't count on it though.
The conference is 7am-9pm most days, for five days straight!
Chris
>
> Then if you started with a Species/Node generated by a genbank parse,
> and wanted certain questions answered correctly, you only have to set a
> different db_handle(). The Node only stores the static and hopefully
> correct information about itself, whilst all other questions go via
> db_handle, so you can dynamically swap back and forth between databases
> depending on if you need speed or accuracy.
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign