[BioSQL-l] alternative taxonomic hierarchies in BioSQL?

Hilmar Lapp hlapp at gmx.net
Thu Dec 4 22:07:39 UTC 2008


On Dec 4, 2008, at 12:06 PM, Peter wrote:

> On Thu, Dec 4, 2008 at 4:28 PM, Bánk Beszteri <Bank.Beszteri at awi.de> wrote:
>>
>> Dear BioSQLers,
>>
>> do I understand right that the current BioSQL schema allows for a single
>> taxonomy per database only?
>
> Not quite.  If you ignore the fact that the taxon table's external
> taxonomy ID is explicitly labelled as the ncbi_taxon_id, you could
> store any taxonomy in the taxon and taxon_name tables.  You could even
> have multiple independent taxonomies in these tables.

Right. Though it's certainly ugly to call something an ncbi_taxon_id
when really it is an ITIS ID, for example.

Aside from that, the load_ncbi_taxonomy.pl script that comes with
BioSQL can't really deal with other taxonomies being stored in the
taxon tables, too. First, it will consider all nodes that it can't
find in NCBI (by ID) as having been obsoleted and will delete them.
Second, even if it somehow failed to do that, it would fail to compute
the nested set enumeration for all other taxonomies.
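
For readers unfamiliar with the term, nested set enumeration can be
sketched roughly as follows. This is a generic illustration in Python,
not the logic of load_ncbi_taxonomy.pl; the function names and the
parent-map input format are assumptions made for the example. Each node
gets a (left, right) pair from a depth-first walk, so that a descendant's
pair lies strictly inside its ancestor's pair.

```python
from collections import defaultdict


def nested_set(parent_of):
    """Return {node: (left, right)} for a forest given as {node: parent}.

    A root is any node whose parent is None. Node ids must be sortable.
    """
    children = defaultdict(list)
    roots = []
    for node, parent in parent_of.items():
        if parent is None:
            roots.append(node)
        else:
            children[parent].append(node)

    bounds = {}
    counter = 0
    for root in sorted(roots):
        # Iterative depth-first walk, so deep trees cannot hit the
        # recursion limit. Each node is pushed twice: once to assign
        # its left value, once to assign its right value.
        stack = [(root, False)]
        while stack:
            node, leaving = stack.pop()
            counter += 1
            if leaving:
                bounds[node] = (bounds[node], counter)
            else:
                bounds[node] = counter
                stack.append((node, True))
                for child in sorted(children[node], reverse=True):
                    stack.append((child, False))
    return bounds


def is_descendant(bounds, node, ancestor):
    """True if node lies strictly inside ancestor's (left, right) interval."""
    n_left, n_right = bounds[node]
    a_left, a_right = bounds[ancestor]
    return a_left < n_left and n_right < a_right
```

Note that the sketch enumerates every root it is given. A walk that
starts from a single known root would leave the nodes of any other
tree in the same table without left/right values.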

Changing that would basically require namespacing taxon nodes. Though  
it's an option, it has increasingly struck me as a duplication of what  
the PhyloDB module provides already (see other comments below), so I  
am actually less and less in favor of it.
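
To make "namespacing taxon nodes" concrete, here is a minimal sketch
using Python's sqlite3 module. The table and column names
(namespaced_taxon, namespace, external_id) are hypothetical and are not
part of the BioSQL schema, and the IDs are made up. The point is only
that a cleanup of obsoleted nodes would have to be restricted to one
namespace so that it leaves the other taxonomies alone.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical namespaced taxon table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE namespaced_taxon (
            taxon_id    INTEGER PRIMARY KEY,
            namespace   TEXT NOT NULL,   -- hypothetical, e.g. 'NCBI' or 'ITIS'
            external_id TEXT NOT NULL,   -- identifier within that namespace
            UNIQUE (namespace, external_id)
        )
        """
    )
    return conn


def delete_obsolete(conn, namespace, current_ids):
    """Delete nodes of one namespace whose external_id is no longer current.

    Nodes in other namespaces are never touched. Returns the number deleted.
    """
    rows = conn.execute(
        "SELECT taxon_id, external_id FROM namespaced_taxon WHERE namespace = ?",
        (namespace,),
    ).fetchall()
    stale = [(taxon_id,) for taxon_id, ext in rows if ext not in current_ids]
    conn.executemany("DELETE FROM namespaced_taxon WHERE taxon_id = ?", stale)
    return len(stale)


conn = make_db()
conn.executemany(
    "INSERT INTO namespaced_taxon (namespace, external_id) VALUES (?, ?)",
    # Made-up IDs; the same ID may occur once per namespace.
    [("NCBI", "100"), ("NCBI", "200"), ("ITIS", "100")],
)
removed = delete_obsolete(conn, "NCBI", {"100"})
remaining = conn.execute(
    "SELECT namespace, external_id FROM namespaced_taxon "
    "ORDER BY namespace, external_id"
).fetchall()
```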

I think the appropriate way to look at the taxon tables is as the  
reference taxonomy for bioentries (and so calling the identifier  
ncbi_taxon_id is still bad as it prescribes the NCBI taxonomy as the  
reference). In this context:

> However, each bioentry can only point to one taxon entry (and thus
> belongs to only one taxonomy), which is a big limitation.

This is well motivated in biological applications and current object  
models. I'm not sure about the other Bio* toolkits, but BioPerl for  
example doesn't support multiple species objects for a sequence.

> It would be useful to have a bioentry point to multiple taxon entries
> (and thus multiple taxonomies, e.g. NCBI and ITIS), which might
> require some sort of link table between the taxon and bioentry tables.

Note that the PhyloDB module supports this. Nodes in a tree (or  
taxonomy) can be associated with one or more bioentries (and, in fact,  
reference taxon nodes).
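
The kind of many-to-many association discussed here can be sketched
with a plain link table. This is an illustration only, written with
Python's sqlite3 module: the table names (bioentry_sketch,
taxon_sketch, bioentry_taxon_sketch) and the sample rows are
hypothetical, and it shows neither the actual PhyloDB tables nor the
current BioSQL schema.

```python
import sqlite3


def build_demo():
    """Create an in-memory database with a hypothetical bioentry/taxon link table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE bioentry_sketch (
            bioentry_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL
        );
        CREATE TABLE taxon_sketch (
            taxon_id INTEGER PRIMARY KEY,
            taxonomy TEXT NOT NULL,   -- which taxonomy the node belongs to
            label    TEXT NOT NULL
        );
        -- Link table: one row per (bioentry, taxon node) association.
        CREATE TABLE bioentry_taxon_sketch (
            bioentry_id INTEGER NOT NULL REFERENCES bioentry_sketch (bioentry_id),
            taxon_id    INTEGER NOT NULL REFERENCES taxon_sketch (taxon_id),
            PRIMARY KEY (bioentry_id, taxon_id)
        );
        INSERT INTO bioentry_sketch VALUES (1, 'example_sequence');
        INSERT INTO taxon_sketch VALUES (1, 'NCBI', 'Homo sapiens');
        INSERT INTO taxon_sketch VALUES (2, 'ITIS', 'Homo sapiens');
        INSERT INTO bioentry_taxon_sketch VALUES (1, 1);
        INSERT INTO bioentry_taxon_sketch VALUES (1, 2);
        """
    )
    return conn


def taxa_for_bioentry(conn, bioentry_id):
    """Return (taxonomy, label) pairs linked to one bioentry, sorted by taxonomy."""
    return conn.execute(
        """
        SELECT t.taxonomy, t.label
        FROM bioentry_taxon_sketch AS bt
        JOIN taxon_sketch AS t ON t.taxon_id = bt.taxon_id
        WHERE bt.bioentry_id = ?
        ORDER BY t.taxonomy
        """,
        (bioentry_id,),
    ).fetchall()


demo = build_demo()
linked = taxa_for_bioentry(demo, 1)
```

With the sample rows above, the one bioentry is linked to a node in
each of the two taxonomies, which a single foreign key from bioentry
to taxon cannot express.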

> [...]
>> When looking into the tables taxon and taxon_name, it looks like neither
>> taxa nor their neighborhood relationships can belong to different
>> taxonomies.
>> Is this correct, or am I missing something?
>
> True - but why would you want to interlink taxon entries like that?

There may be use-cases for this. For example, to relate taxa that are
named differently in two taxonomies but really are synonymous. Or
one taxonomy containing a synonym that the other doesn't.

Not your molecular sequence database/analysis type of thing, sure. But  
still legitimate.

>
>
>> If this is so: are there any plans to add such a feature in the future? I
>> think (besides that I could use it) it could probably be useful for others
>> as well (to have the possibility to e.g. have an ITIS taxonomy

Note that the svn / main trunk version of BioSQL has a script
load_itis_taxonomy.pl. It loads ITIS into the PhyloDB module, though,
not into the taxon tables. ITIS isn't a single tree but actually 5;
there is no common root. So it ends up as 5 trees in the PhyloDB tables.

>> or just a user's own private taxonomy parallel to NCBI taxonomy in a single
>> BioSQL DB).

Yeah; I've been wanting to write a general taxonomy loader, or more
precisely a loader that utilizes Bio::TreeIO for reading. Just haven't
had the time to do that. (Need another hackathon :-)

> [...] I think the issue has been raised before on the mailing list, and IIRC
> it was agreed that there was room for improvement.  Maybe this is
> something for BioSQL v1.1.x?

Fixing the ncbi_taxon_id column name, definitely. As for letting the
taxon tables duplicate the capabilities of the PhyloDB tables,
I'm not sure that that's the best route to go.

	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================
