[Biojava-l] Issue with SimpleNCBITaxon class

Peter biopython at maubp.freeserve.co.uk
Thu Apr 15 17:54:56 UTC 2010


Hi,

I've CC'd this to the BioSQL mailing list for cross project
discussion.

On Mon, Apr 12, 2010 at 7:57 AM, Richard Holland wrote:
> Thanks Deepak.
>
> I've had a look at the code and I believe it's due to the
> different ways in which BioJava and BioPerl load the
> taxon table.
>
> BioJava sets the ncbi_taxon_id and parent_taxon_id
> columns based on the values from the NCBI taxonomy
> file. The taxon_id column in BioJava is a meaningless
> auto-generated value that is never used.
>
> BioPerl however is generating taxon_id values and
> linking them by setting parent_taxon_id to the
> generated value. The parent value from the NCBI
> taxonomy file is therefore replaced with the BioPerl
> generated parent ID, meaning that instead of linking
> from parent_taxon_id to ncbi_taxon_id as per BioJava,
> the link is to taxon_id instead. (I'm basing this
> comment on looking at load_ncbi_taxonomy.pl from
> the BioSQL archives.)

Note that old versions of load_ncbi_taxonomy.pl
(which is part of BioSQL, not part of BioPerl) would
set taxon_id equal to ncbi_taxon_id, see:
http://bugzilla.open-bio.org/show_bug.cgi?id=2470

This may help explain the confusion.

> I believe if you load the taxonomy table using BioJava,
> you should see BioJava giving correct behaviour.
> Likewise if you load it using BioPerl, BioPerl will
> behave correctly. But if you load with one then query
> with the other, you'll get incorrect results.
>
> This sounds like a case for discussion on both lists -
> a matter of standardisation between the two projects.
> Not quickly/easily solvable for now.

It's not just two projects (BioPerl & BioJava) (grin).
It's at least five: those two, plus BioSQL itself,
BioRuby and Biopython.

I'm not sure about BioRuby's implementation, but
currently I think BioJava is the odd one out - BioPerl,
Biopython, and BioSQL's load_ncbi_taxonomy.pl
all make entries in parent_taxon_id reference the
automatically generated taxon_id (please correct
me if I am wrong).
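
To make the mismatch concrete, here is a minimal sketch in
Python using an in-memory SQLite database. It is not code from
any of the Bio* projects: the taxon table is cut down to the
three columns discussed here, the helper names (make_taxon_db,
parent_ncbi_id) and the surrogate keys 101/102 are made up for
illustration, and the two rows are a toy taxonomy. It loads the
same two taxa under each convention and then looks up the
parent with each join.

```python
import sqlite3


def make_taxon_db(rows):
    """Build an in-memory taxon table from
    (taxon_id, ncbi_taxon_id, parent_taxon_id) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE taxon ("
        " taxon_id INTEGER PRIMARY KEY,"
        " ncbi_taxon_id INTEGER UNIQUE,"
        " parent_taxon_id INTEGER)"
    )
    conn.executemany("INSERT INTO taxon VALUES (?, ?, ?)", rows)
    return conn


def parent_ncbi_id(conn, ncbi_id, parent_key):
    """Return the parent's ncbi_taxon_id, or None if no parent row matches.

    parent_key names the column that parent_taxon_id is assumed to
    reference: "taxon_id" or "ncbi_taxon_id".
    """
    if parent_key not in ("taxon_id", "ncbi_taxon_id"):
        raise ValueError("unexpected parent_key: %r" % (parent_key,))
    sql = (
        "SELECT p.ncbi_taxon_id FROM taxon c"
        " JOIN taxon p ON c.parent_taxon_id = p." + parent_key +
        " WHERE c.ncbi_taxon_id = ?"
    )
    row = conn.execute(sql, (ncbi_id,)).fetchone()
    return row[0] if row else None


# Toy data: NCBI taxon 9606 with parent 9605.
# parent_taxon_id points at the generated taxon_id (101).
generated_keys = make_taxon_db([(101, 9605, None), (102, 9606, 101)])
# parent_taxon_id holds the parent's NCBI id taken from the taxonomy file.
ncbi_parents = make_taxon_db([(101, 9605, None), (102, 9606, 9605)])
# Old load_ncbi_taxonomy.pl behaviour: taxon_id equals ncbi_taxon_id.
old_style = make_taxon_db([(9605, 9605, None), (9606, 9606, 9605)])

for name, db in [("generated_keys", generated_keys),
                 ("ncbi_parents", ncbi_parents),
                 ("old_style", old_style)]:
    print(name,
          parent_ncbi_id(db, 9606, "taxon_id"),
          parent_ncbi_id(db, 9606, "ncbi_taxon_id"))
```

In this sketch each join only finds the parent in the database
loaded under the matching convention. When taxon_id happens to
equal ncbi_taxon_id, as with the old script, both joins agree,
which would hide the difference.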

My personal view is that bioperl-db is the reference
implementation and should be followed in the event
of any ambiguity within BioSQL. In this particular
case, there is actually a BioSQL script to check
against too (load_ncbi_taxonomy.pl).

Hopefully Hilmar can give us an official verdict...

Peter
