[Biojava-dev] [Biojava-l] Issue with SimpleNCBITaxon class
Richard Holland
holland at eaglegenomics.com
Mon Apr 12 07:07:55 UTC 2010
Incidentally, BioJava's approach matches the description in the BioSQL docs at:
http://biosql.org/wiki/Schema_Overview#TAXON.2C_TAXON_NAME
(first example SQL statement - find the taxon id of the parent taxon for 'Homo sapiens' using a self-join)
The BioPerl/BioSQL load_ncbi_taxonomy.pl script however does not match this description.
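To make the difference concrete, here is a minimal sketch of a one-step parent lookup written as a self-join. It is not the statement from the wiki page. The reduced two-table schema, the parent_name helper and the use of an in-memory SQLite database are assumptions for illustration only; the three rows come from Deepak's table quoted further down. The only thing that changes between the two conventions is the column that parent_taxon_id is joined against.

```python
import sqlite3

# Reduced schema (assumption): only the columns discussed in this thread.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE taxon (
    taxon_id        INTEGER PRIMARY KEY,
    ncbi_taxon_id   INTEGER,
    parent_taxon_id INTEGER
);
CREATE TABLE taxon_name (
    taxon_id   INTEGER,
    name       TEXT,
    name_class TEXT
);
""")

# (taxon_id, ncbi_taxon_id, parent_taxon_id, name), copied from the thread.
rows = [
    (254757, 301952, 50447, "Oreostemma alpigenum var. haydenii"),
    (301952, 353825, 122448, "unclassified Alpharetrovirus"),
    (9584, 11876, 301952, "Avian sarcoma virus"),
]
conn.executemany(
    "INSERT INTO taxon VALUES (?, ?, ?)",
    [(t, n, p) for t, n, p, _ in rows],
)
conn.executemany(
    "INSERT INTO taxon_name VALUES (?, ?, 'scientific name')",
    [(t, name) for t, _, _, name in rows],
)


def parent_name(child_name, parent_key):
    """Return the parent's scientific name, joining parent_taxon_id to parent_key.

    parent_key is "ncbi_taxon_id" (the convention described for BioJava)
    or "taxon_id" (the convention described for load_ncbi_taxonomy.pl).
    """
    # Only two fixed column names are accepted, so the f-string is safe here.
    if parent_key not in ("taxon_id", "ncbi_taxon_id"):
        raise ValueError(parent_key)
    sql = f"""
        SELECT pn.name
        FROM taxon_name cn
        JOIN taxon child   ON child.taxon_id = cn.taxon_id
        JOIN taxon parent  ON child.parent_taxon_id = parent.{parent_key}
        JOIN taxon_name pn ON pn.taxon_id = parent.taxon_id
        WHERE cn.name = ?
          AND cn.name_class = 'scientific name'
          AND pn.name_class = 'scientific name'
    """
    row = conn.execute(sql, (child_name,)).fetchone()
    return row[0] if row else None


print(parent_name("Avian sarcoma virus", "ncbi_taxon_id"))
print(parent_name("Avian sarcoma virus", "taxon_id"))
```

With these rows, joining on ncbi_taxon_id reports "Oreostemma alpigenum var. haydenii" as the parent of Avian sarcoma virus, while joining on taxon_id reports "unclassified Alpharetrovirus".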
cheers,
Richard
On 12 Apr 2010, at 07:57, Richard Holland wrote:
> Thanks Deepak.
>
> I've had a look at the code and I believe it's due to the different ways in which BioJava and BioPerl load the taxon table.
>
> BioJava sets the ncbi_taxon_id and parent_taxon_id columns based on the values from the NCBI taxonomy file. The taxon_id column in BioJava is a meaningless auto-generated value that is never used.
>
> BioPerl however is generating taxon_id values and linking them by setting parent_taxon_id to the generated value. The parent value from the NCBI taxonomy file is therefore replaced with the BioPerl generated parent ID, meaning that instead of linking from parent_taxon_id to ncbi_taxon_id as per BioJava, the link is to taxon_id instead. (I'm basing this comment on looking at load_ncbi_taxonomy.pl from the BioSQL archives.)
>
> I believe if you load the taxonomy table using BioJava, you should see BioJava giving correct behaviour. Likewise if you load it using BioPerl, BioPerl will behave correctly. But if you load with one then query with the other, you'll get incorrect results.
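The relationship between the two conventions can be sketched as a simple remapping. This assumes the table was loaded the way load_ncbi_taxonomy.pl is described above (parent_taxon_id pointing at the generated taxon_id); remap_parents_to_ncbi is a hypothetical helper name, not something either project ships, and the rows are the Retroviridae branch from Deepak's table.

```python
def remap_parents_to_ncbi(rows):
    """Rewrite parent pointers from taxon_id references to ncbi_taxon_id references.

    rows: iterable of (taxon_id, ncbi_taxon_id, parent_taxon_id) where
    parent_taxon_id refers to another row's taxon_id (assumed convention).
    A parent that is not present in rows becomes None.
    """
    rows = list(rows)
    ncbi_by_taxon_id = {taxon_id: ncbi_id for taxon_id, ncbi_id, _ in rows}
    return [
        (taxon_id, ncbi_id, ncbi_by_taxon_id.get(parent_id))
        for taxon_id, ncbi_id, parent_id in rows
    ]


# (taxon_id, ncbi_taxon_id, parent_taxon_id), copied from the thread.
sample = [
    (9394, 11632, 17394),     # Retroviridae (its parent is not in the sample)
    (277861, 327045, 9394),   # Orthoretrovirinae
    (122448, 153057, 277861), # Alpharetrovirus
    (301952, 353825, 122448), # unclassified Alpharetrovirus
    (9584, 11876, 301952),    # Avian sarcoma virus
]

remapped = remap_parents_to_ncbi(sample)
for row in remapped:
    print(row)
```

After remapping, each parent value is an NCBI ID, so a lookup against ncbi_taxon_id would follow the same path that a lookup against taxon_id follows on the original rows.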
>
> This sounds like a case for discussion on both lists - a matter of standardisation between the two projects. Not quickly/easily solvable for now.
>
> cheers,
> Richard
>
> On 11 Apr 2010, at 22:08, Deepak Sheoran wrote:
>
>> I am using the same table with the BioJava and BioPerl taxon programs, and the output I get is below:
>>
>> BioJava:
>> For example, for ncbi_taxon_id = 11876 (Avian sarcoma virus), the lineage I get is
>> Rhamnus; Platanus occidentalis; Suillus placidus; Diadasia australis; Arnicastrum guerrerense; Labiduridae; Oreostemma alpigenum var. haydenii.
>>
>> BioJava's process of finding names: 11876==>301952==>50447==>176516==>143975==>48579==>4403==>3609==>276240 (wrong way of doing things)
>>
>> BioPerl:
>> For example, for ncbi_taxon_id = 11876 (Avian sarcoma virus), the lineage I get is
>> Retroviridae; Orthoretrovirinae; Alpharetrovirus; unclassified Alpharetrovirus.
>>
>> BioPerl's process of finding names: 11876==>353825==>153057==>327045==>11632 (right way of doing things)
>>
>> Hint: BioJava searches the ncbi_taxon_id column with a value from parent_taxon_id, whereas BioPerl searches the taxon_id column with a value from parent_taxon_id.
>>
>> Taxon and taxon_name table content relevant to this discussion:
>>
>> taxon_id  ncbi_taxon_id  parent_taxon_id  node_rank  name                                name_class
>> 2901      3609           276240           genus      Rhamnus                             scientific name
>> 3610      4403           3609             species    Platanus occidentalis               scientific name
>> 29052     48579          4403             species    Suillus placidus                    scientific name
>> 114412    143975         48579            species    Diadasia australis                  scientific name
>> 143976    176516         143975           species    Arnicastrum guerrerense             scientific name
>> 30680     50447          176516           family     Labiduridae                         scientific name
>> 254757    301952         50447            varietas   Oreostemma alpigenum var. haydenii  scientific name
>> 9394      11632          17394            family     Retroviridae                        scientific name
>> 277861    327045         9394             subfamily  Orthoretrovirinae                   scientific name
>> 122448    153057         277861           genus      Alpharetrovirus                     scientific name
>> 301952    353825         122448           no rank    unclassified Alpharetrovirus        scientific name
>> 9584      11876          301952           species    Avian sarcoma virus                 scientific name
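The two traversals described above can be replayed on the rows in the table with a short sketch. The ROWS list is copied from the table; the lineage function and its key argument are hypothetical names for illustration and do not reflect how SimpleNCBITaxon or BioPerl are actually implemented.

```python
# (taxon_id, ncbi_taxon_id, parent_taxon_id, name), copied from the table.
ROWS = [
    (2901, 3609, 276240, "Rhamnus"),
    (3610, 4403, 3609, "Platanus occidentalis"),
    (29052, 48579, 4403, "Suillus placidus"),
    (114412, 143975, 48579, "Diadasia australis"),
    (143976, 176516, 143975, "Arnicastrum guerrerense"),
    (30680, 50447, 176516, "Labiduridae"),
    (254757, 301952, 50447, "Oreostemma alpigenum var. haydenii"),
    (9394, 11632, 17394, "Retroviridae"),
    (277861, 327045, 9394, "Orthoretrovirinae"),
    (122448, 153057, 277861, "Alpharetrovirus"),
    (301952, 353825, 122448, "unclassified Alpharetrovirus"),
    (9584, 11876, 301952, "Avian sarcoma virus"),
]

COLUMN = {"taxon_id": 0, "ncbi_taxon_id": 1}


def lineage(ncbi_id, key):
    """Return ancestor names, root first, for the taxon with the given NCBI ID.

    key names the column that parent_taxon_id is matched against:
    "ncbi_taxon_id" or "taxon_id".
    """
    by_key = {row[COLUMN[key]]: row for row in ROWS}
    by_ncbi = {row[1]: row for row in ROWS}

    names = []
    seen = set()  # guards against cycles in malformed data
    row = by_ncbi[ncbi_id]
    while True:
        parent = by_key.get(row[2])
        if parent is None or parent[0] in seen:
            break  # parent not in the sample (or a loop): stop climbing
        seen.add(parent[0])
        names.append(parent[3])
        row = parent
    return list(reversed(names))


print("; ".join(lineage(11876, "ncbi_taxon_id")))
print("; ".join(lineage(11876, "taxon_id")))
```

Matching against ncbi_taxon_id reproduces the first lineage listed above (Rhamnus through Oreostemma alpigenum var. haydenii), and matching against taxon_id reproduces the second (Retroviridae through unclassified Alpharetrovirus).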
>>
>> Thanks
>> Deepak
>>
>> On 4/11/2010 2:53 PM, Richard Holland wrote:
>>> I'm sorry but I don't understand your example. Could you provide a real example of correct values for each column from a sample taxon entry in NCBI, plus an example of what BioJava is doing wrong? (i.e. give a sample record to use as reference, then point out the correct value of parent_taxon_id, and point out what value BioJava is using instead).
>>>
>>> thanks,
>>> Richard
>>>
>>> On 11 Apr 2010, at 20:16, Deepak Sheoran wrote:
>>>
>>>
>>>
>>>> Hi,
>>>>
>>>> There is a very fundamental issue in the SimpleNCBITaxon class, because of which it produces the wrong taxonomy hierarchy. I am explaining what I have found; let me know what you think of it, and let me suggest how to fix it.
>>>>
>>>> 1) The columns in the taxon table are (taxon_id, ncbi_taxon_id, parent_taxon_id, nodeRank, geneticCode, mitoGeneticCode, leftValue, rightValue)
>>>> 2) In the SimpleNCBITaxon class we assume that "parent_taxon_id" holds the parent's ncbi_taxon_id for the current ncbi_taxon_id value, but that is not true. The value "parent_taxon_id" actually holds is the "taxon_id" of the row whose ncbi_taxon_id is the parent of the current entry.
>>>>
>>>> <property name="NCBITaxID" column="ncbi_taxon_id" node="@NCBITaxId"/>
>>>> <property name="nodeRank" column="node_rank"/>
>>>> <property name="geneticCode" column="genetic_code"/>
>>>> <property name="mitoGeneticCode" column="mito_genetic_code"/>
>>>> <property name="leftValue" column="left_value"/>
>>>> <property name="rightValue" column="right_value"/>
>>>> <property name="parentNCBITaxID" column="parent_taxon_id"/> ----- this is not correct: the parent_taxon_id column stores the taxon_id of the row that holds the parent's ncbi_taxon_id for the current entry
>>>>
>>>> Thanks
>>>> Deepak Sheoran
>>>>
>>>>
>>>>
>>>>
>>> --
>>> Richard Holland, BSc MBCS
>>> Operations and Delivery Director, Eagle Genomics Ltd
>>> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
>>> http://www.eaglegenomics.com/
>>>
>>>
>>>
>>>
>>
>