[Biojava-l] Issue with SimpleNCBITaxon class

Richard Holland holland at eaglegenomics.com
Mon Apr 12 07:07:55 UTC 2010


Incidentally, BioJava's approach matches the description in the BioSQL docs at:

 http://biosql.org/wiki/Schema_Overview#TAXON.2C_TAXON_NAME

(first example SQL statement - find the taxon id of the parent taxon for 'Homo sapiens' using a self-join)

The BioPerl/BioSQL load_ncbi_taxonomy.pl script however does not match this description.
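
To make the difference concrete, here is a minimal sketch in Python (it is not BioJava or BioPerl code) that loads the rows Deepak quotes further down into an in-memory SQLite table and walks the parent links both ways. The table layout is a simplification that folds the name from taxon_name into taxon, and the names build_db, lineage and link_column are made up for illustration.

```python
import sqlite3

# Rows copied from the taxon/taxon_name listing quoted later in this thread:
# (taxon_id, ncbi_taxon_id, parent_taxon_id, node_rank, name)
ROWS = [
    (2901, 3609, 276240, "genus", "Rhamnus"),
    (3610, 4403, 3609, "species", "Platanus occidentalis"),
    (29052, 48579, 4403, "species", "Suillus placidus"),
    (114412, 143975, 48579, "species", "Diadasia australis"),
    (143976, 176516, 143975, "species", "Arnicastrum guerrerense"),
    (30680, 50447, 176516, "family", "Labiduridae"),
    (254757, 301952, 50447, "varietas", "Oreostemma alpigenum var. haydenii"),
    (9394, 11632, 17394, "family", "Retroviridae"),
    (277861, 327045, 9394, "subfamily", "Orthoretrovirinae"),
    (122448, 153057, 277861, "genus", "Alpharetrovirus"),
    (301952, 353825, 122448, "no rank", "unclassified Alpharetrovirus"),
    (9584, 11876, 301952, "species", "Avian sarcoma virus"),
]


def build_db():
    """Create an in-memory table holding the quoted rows (simplified schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE taxon (taxon_id INTEGER, ncbi_taxon_id INTEGER, "
        "parent_taxon_id INTEGER, node_rank TEXT, name TEXT)"
    )
    conn.executemany("INSERT INTO taxon VALUES (?, ?, ?, ?, ?)", ROWS)
    return conn


def lineage(conn, ncbi_taxon_id, link_column):
    """Return ancestor names, root first, for the given NCBI taxon ID.

    link_column is the column that parent_taxon_id is matched against in the
    self-join: 'ncbi_taxon_id' (the lookup described for BioJava) or
    'taxon_id' (the lookup described for BioPerl).
    """
    if link_column not in ("ncbi_taxon_id", "taxon_id"):
        raise ValueError("unexpected link column: " + link_column)
    sql = (
        "SELECT parent.name, parent.ncbi_taxon_id "
        "FROM taxon AS child JOIN taxon AS parent "
        f"ON child.parent_taxon_id = parent.{link_column} "
        "WHERE child.ncbi_taxon_id = ?"
    )
    names = []
    seen = {ncbi_taxon_id}  # guards against cycles in inconsistent data
    current = ncbi_taxon_id
    while True:
        row = conn.execute(sql, (current,)).fetchone()
        if row is None or row[1] in seen:
            break
        names.append(row[0])
        seen.add(row[1])
        current = row[1]
    return list(reversed(names))


if __name__ == "__main__":
    conn = build_db()
    for column in ("ncbi_taxon_id", "taxon_id"):
        print(column, "->", "; ".join(lineage(conn, 11876, column)))
```

With the quoted rows, joining on ncbi_taxon_id reproduces the lineage reported for BioJava (Rhamnus through Oreostemma alpigenum var. haydenii), while joining on taxon_id reproduces the Retroviridae lineage reported for BioPerl. The data in that table were evidently loaded by the BioPerl script, which is why only the taxon_id join gives a sensible answer there.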

cheers,
Richard

On 12 Apr 2010, at 07:57, Richard Holland wrote:

> Thanks Deepak. 
> 
> I've had a look at the code and I believe it's due to the different ways in which BioJava and BioPerl load the taxon table.
> 
> BioJava sets the ncbi_taxon_id and parent_taxon_id columns based on the values from the NCBI taxonomy file. The taxon_id column in BioJava is a meaningless auto-generated value that is never used.
> 
> BioPerl however is generating taxon_id values and linking them by setting parent_taxon_id to the generated value. The parent value from the NCBI taxonomy file is therefore replaced with the BioPerl generated parent ID, meaning that instead of linking from parent_taxon_id to ncbi_taxon_id as per BioJava, the link is to taxon_id instead. (I'm basing this comment on looking at load_ncbi_taxonomy.pl from the BioSQL archives.)
> 
> I believe if you load the taxonomy table using BioJava, you should see BioJava giving correct behaviour. Likewise if you load it using BioPerl, BioPerl will behave correctly. But if you load with one then query with the other, you'll get incorrect results.
> 
> This sounds like a case for discussion on both lists - a matter of standardisation between the two projects. Not quickly/easily solvable for now.
> 
> cheers,
> Richard
> 
> On 11 Apr 2010, at 22:08, Deepak Sheoran wrote:
> 
>> I am using the same table with the BioJava and BioPerl taxon programs, and the output I get is below:
>> 
>> Biojava:
>> For example, for ncbi_taxon_id = 11876 (Avian sarcoma virus), the lineage I get is
>>            Rhamnus; Platanus occidentalis; Suillus placidus; Diadasia australis; Arnicastrum guerrerense; Labiduridae; Oreostemma alpigenum var. haydenii. 
>> 
>> BioJava process of finding names: 11876==>301952==>50447==>176516==>143975==>48579==>4403==>3609==>276240   (wrong way of doing things)
>> 
>> Bioperl:    
>> For example, for ncbi_taxon_id = 11876 (Avian sarcoma virus), the lineage I get is
>>          Retroviridae; Orthoretrovirinae; Alpharetrovirus; unclassified Alpharetrovirus.
>> 
>> Bioperl process of finding names: 11876==>353825==>153057==>327045==>11632   (Right way of doing things)
>> 
>> Hint: BioJava searches the ncbi_taxon_id column with the value from parent_taxon_id, whereas BioPerl searches the taxon_id column with the value from parent_taxon_id.
>> 
>> Content of the taxon and taxon_name tables that is relevant to this discussion:
>> 
>> taxon_id	ncbi_taxon_id	parent_taxon_id	node_rank	name	name_class
>> 2901	3609	276240	genus	Rhamnus	scientific name
>> 3610	4403	3609	species	Platanus occidentalis	scientific name
>> 29052	48579	4403	species	Suillus placidus	scientific name
>> 114412	143975	48579	species	Diadasia australis	scientific name
>> 143976	176516	143975	species	Arnicastrum guerrerense	scientific name
>> 30680	50447	176516	family	Labiduridae	scientific name
>> 254757	301952	50447	varietas	Oreostemma alpigenum var. haydenii	scientific name
>> 9394	11632	17394	family	Retroviridae	scientific name
>> 277861	327045	9394	subfamily	Orthoretrovirinae	scientific name
>> 122448	153057	277861	genus	Alpharetrovirus	scientific name
>> 301952	353825	122448	no rank	unclassified Alpharetrovirus	scientific name
>> 9584	11876	301952	species	Avian sarcoma virus	scientific name
>> 
>> Thanks
>> Deepak 
>> 
>> On 4/11/2010 2:53 PM, Richard Holland wrote:
>>> I'm sorry but I don't understand your example. Could you provide a real example of correct values for each column from a sample taxon entry in NCBI, plus an example of what BioJava is doing wrong? (i.e. give a sample record to use as reference, then point out the correct value of parent_taxon_id, and point out what value BioJava is using instead).
>>> 
>>> thanks,
>>> Richard
>>> 
>>> On 11 Apr 2010, at 20:16, Deepak Sheoran wrote:
>>> 
>>> 
>>> 
>>>> Hi,
>>>> 
>>>> There is a very fundamental issue in the SimpleNCBITaxon class, because of which it is producing the wrong taxonomy hierarchy. I am explaining what I have found; let me know what you think of it, and suggest how to fix it.
>>>> 
>>>> 1) The columns in the taxon table are (taxon_id, ncbi_taxon_id, parent_taxon_id, node_rank, genetic_code, mito_genetic_code, left_value, right_value).
>>>> 2) The SimpleNCBITaxon class assumes that "parent_taxon_id" holds the ncbi_taxon_id of the parent of the current entry, but that is not true. What "parent_taxon_id" actually holds is the "taxon_id" of the row whose ncbi_taxon_id is the parent's NCBI taxon ID.
>>>> 
>>>> <property name="NCBITaxID" column="ncbi_taxon_id" node="@NCBITaxId"/>
>>>> <property name="nodeRank" column="node_rank"/>
>>>> <property name="geneticCode" column="genetic_code"/>
>>>> <property name="mitoGeneticCode" column="mito_genetic_code"/>
>>>> <property name="leftValue" column="left_value"/>
>>>> <property name="rightValue" column="right_value"/>
>>>> <property name="parentNCBITaxID" column="parent_taxon_id"/>      ----- this is not correct: the parent_taxon_id column stores the taxon_id of the row that holds the parent's ncbi_taxon_id, not the parent's ncbi_taxon_id itself
>>>> 
>>>> Thanks
>>>> Deepak Sheoran
>>>> 
>>>> 
>>>> 
>>>> 
>>> --
>>> Richard Holland, BSc MBCS
>>> Operations and Delivery Director, Eagle Genomics Ltd
>>> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
>>> http://www.eaglegenomics.com/
>>> 
>>> 
>>> 
>>> 
>> 
> 

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/




