[Bioperl-l] Bio::Taxonomy changes

Tue Jul 25 14:58:29 UTC 2006

Agreed.  I fully support the addition of an optional lookup; it gives much
more flexibility to SeqIO, re: your previous examples of screening sequence
streams for sequences that are primate, mitochondrial, etc.  The key word I
want to emphasize is 'optional', not 'enforced'.
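
A rough sketch of what 'optional' could look like, following the strict-regex
and dirty-bit idea in the quoted message below.  This is Python for brevity,
not Bioperl code; every name here (ParsedOrganism, parse_organism, resolve,
the lookup callback) is hypothetical and only illustrates the control flow:
parse with no lookup at all, flag anything that is not a plain binomial, and
consult a taxonomy source only when the caller supplied one.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Strict binomial: capitalized genus, lowercase epithet, nothing else.
STRICT_BINOMIAL = re.compile(r"^([A-Z][a-z]+) ([a-z]+)$")


@dataclass
class ParsedOrganism:
    raw: str
    genus: Optional[str]
    species: Optional[str]
    trusted: bool  # False plays the role of the "dirty bit"


def parse_organism(raw: str) -> ParsedOrganism:
    """Parse without any lookup; flag anything that is not a strict binomial."""
    text = raw.strip()
    match = STRICT_BINOMIAL.match(text)
    if match:
        return ParsedOrganism(text, match.group(1), match.group(2), trusted=True)
    # Best-effort guess from the first two tokens, marked as untrustworthy.
    parts = text.split()
    genus = parts[0] if parts else None
    species = parts[1] if len(parts) > 1 else None
    return ParsedOrganism(text, genus, species, trusted=False)


# Hypothetical lookup signature: raw organism text -> (genus, species) or None.
Lookup = Callable[[str], Optional[Tuple[str, str]]]


def resolve(parsed: ParsedOrganism, lookup: Optional[Lookup] = None) -> ParsedOrganism:
    """Use the taxonomy source only if one was supplied and the parse is untrusted."""
    if parsed.trusted or lookup is None:
        return parsed
    hit = lookup(parsed.raw)
    if hit is None:
        return parsed
    genus, species = hit
    return ParsedOrganism(parsed.raw, genus, species, trusted=True)
```

With no lookup supplied, resolve() hands back the parser's guess untouched, so
nobody pays for a database they did not ask for.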

I appreciate what Sendu is trying to do; I really do.  I think carrying over
an object named 'Bio::Species' into Taxonomy is too confusing (your
'contagion' analogy, as it were).  The 'species' concept (biologically
speaking here, not talking about the Bioperl class) is a taxonomic rank
(i.e. part of a taxonomy).  I'm trying to take a biologist's point of view
here.  What is a 'species'?  Or, if we were to stick strictly with using
NCBI definitions, what is a 'species'?

The NCBI definition of 'species' is simply a rank in a lineage, so it is (in
Bioperl terms) a Node.  If we were to follow that line of reasoning, why
have a Species object represent a Taxonomy as well?  It's way too
confusing.

Sendu's repeatedly stating "a Species is a Taxonomy" makes some sense in a
BioPerl world only, as we're speaking about a class that has been around for
a long time, one that acted as a container of sorts for sequence data.  And
I understand what he intends to do.  

Conceptually speaking here, though, the way it is laid out, a Bio::Species
object can hold a Node that represents a 'species' rank, as well as a
'genus' Node, and a 'family' node, and on and on.  That's not a 'species',
that's a taxonomy.  So just call it a Taxonomy.
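
To illustrate the distinction, here is a minimal sketch (again Python, not
the Bioperl classes; Node, Taxonomy and node_at are made-up names for this
example).  A 'species' is one ranked node; the thing that holds the species,
genus and family nodes together is the taxonomy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    """One ranked entry in a lineage, e.g. rank='species'."""
    rank: str
    name: str


@dataclass
class Taxonomy:
    """An ordered lineage of nodes, most general rank first."""
    nodes: List[Node]

    def node_at(self, rank: str) -> Optional[Node]:
        # Return the node with the requested rank, or None if absent.
        for node in self.nodes:
            if node.rank == rank:
                return node
        return None


lineage = Taxonomy([
    Node("family", "Hominidae"),
    Node("genus", "Homo"),
    Node("species", "Homo sapiens"),
])

# The species is a single Node inside the lineage, not the lineage itself.
species = lineage.node_at("species")
```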

The object itself (Bio::Species) was never guaranteed to represent a
'species', biologically speaking, when it held sequence data.  It could be a
subspecies, strain, plasmid, unknown, or an unclassified rank ('no rank') or
environmental sample.  What it really held was a fancier representation of a
node, based on the TaxID.

My final point is that saying "a species is a taxonomy" to the rest of the
biological world doesn't make sense.  Maybe it makes sense to you and me and
Sendu, in our little Bioperl world.  But to the thousands of users out there
who don't completely grok the Bioperl class structure, it's just confusing.

If I were to get an object back that was labeled Bio::Species, as a
biologist I would expect it to be part of a taxonomy, not the actual
Taxonomy itself.  So, why not cut to the chase: if we are to fundamentally
change the concept of what Bio::Species is by making it a Taxonomy/TaxonomyI
or whatever, why not just use a Taxonomy object altogether and not bother
with Bio::Species at all?  Deprecate it.

BTW, I'll be in Connecticut for five days at UConn.  So I hope to escape the
heat for a bit.  Thanks for listening to my side of things.  

Chris

> -----Original Message-----
> From: Hilmar Lapp [mailto:hlapp at gmx.net]
> Sent: Tuesday, July 25, 2006 8:54 AM
> To: Chris Fields
> Cc: Sendu Bala; bioperl-l at bioperl.org
> Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes
> 
> We intend on having everyone who wants correct taxonomy parsing
> results for the entire kingdom of life to define his/her
> authoritative taxonomy database, be it local or not, be it HTTP or
> SQL queried.
> 
> If you don't care about the correctness of the taxonomy parse, or if
> the taxonomy information in the flat file is trivially parseable
> because it conforms to standard binomial convention, then whatever is
> to be put in place needs to work fine regardless of whether a
> taxonomy database is defined or not.
> 
> 	-hilmar
> 
> On Jul 25, 2006, at 1:53 AM, Chris Fields wrote:
> 
> > So do we intend on having everyone who installs bioperl have a local
> > copy of the taxonomy dumpfile?  Or perform a remote lookup via
> > Entrez?  Seems a bit extreme.
> >
> > I would like the option of not having the lookup run; as I mentioned
> > to Sendu, one of the biggest complaints about bioperl is speed.
> > Additional lookups won't help on that end.
> >
> > Chris
> >
> > On Jul 24, 2006, at 10:31 PM, Hilmar Lapp wrote:
> >
> >>
> >> On Jul 24, 2006, at 10:29 PM, Chris Fields wrote:
> >>
> >>> [...]
> >>> We could go back and forth on what Jason really intended. [...] The
> >>> reality is he's not here and you're willing to do the job.
> >>
> >> Right. And, knowing Jason, I think he'd be perfectly fine with seeing
> >> his original idea develop in a possibly different direction, provided
> >> it will all work nicely in the end. I'm willing to take the beating
> >> on me if that doesn't turn out to be true ...
> >>
> >>>
> >>> There is one thing I will make perfectly clear here: there should
> >>> never, ever be enforced lookups for SeqIO (even using caches),
> >>
> >> You certainly don't want taxonomy lookups during the parsing stage,
> >> and also not for the client requesting properties of the species that
> >> have been parsed with high confidence, i.e.,  genus and species for a
> >> straightforward binomial like 'Homo sapiens'.
> >>
> >> Writing sequences, IMHO, doesn't have to be as fast. It may be better
> >> to emit strict format a bit slower rather than sloppy format a bit
> >> faster.
> >>
> >> Upon parsing, one idea could be for the flat file parser to set a
> >> dirty bit in the parsed out species if the parsed text didn't follow
> >> strict binomial conventions, hence the parser may have made a mistake
> > >> and if a client requests the information it is better to look up the
> >> correct values from a taxonomy database. I.e., you could try with a
> >> strict regex first that would imply a high-confidence result. If that
> >> fails you don't give up but mark the result as untrustworthy.
> >>
> >>
> >>> [...]
> >>> This would have been MUCH easier if all three of us could have gone
> >>> to the local bar for a beer and discussed it. We should just take
> >>> the time out to videoconference next time.
> >>
> >> You're not honestly suggesting that a videoconference is better than
> >> having beer together?
> >>
> >> Enjoy your trip, and thanks for hanging in there in the discussion, I
> >> appreciate it.
> >>
> >> 	-hilmar
> >> --
> >> ===========================================================
> >> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> >> ===========================================================
> >>
> >>
> >>
> >>
> >>
> >> _______________________________________________
> >> Bioperl-l mailing list
> >> Bioperl-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> > Christopher Fields
> > Postdoctoral Researcher
> > Lab of Dr. Robert Switzer
> > Dept of Biochemistry
> > University of Illinois Urbana-Champaign
> >
> >
> 