[BioSQL-l] Concerns the update of BioSQL.taxon table

Wed Mar 26 12:00:03 UTC 2008

Purely from a database perspective, the index is correct. There should
be no need for a duplicate entry in ncbi_taxon_id: the unique index
implies that taxon_id maps 1:1 to ncbi_taxon_id, so two separate local
taxon_id values should never refer to the same NCBI taxon.
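
To illustrate the constraint, here is a minimal sketch using an in-memory
SQLite database as a stand-in for the real BioSQL database. The table is
reduced to the four columns shown in the quoted message, and the index name
taxon_ncbi is hypothetical. It shows that a unique index accepts any number
of NULL values but rejects a second row with the same non-NULL ncbi_taxon_id:

```python
import sqlite3

# Simplified stand-in for the BioSQL taxon table (assumption: only the
# four columns from the quoted example; the real schema has more).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxon ("
    " taxon_id INTEGER PRIMARY KEY,"
    " ncbi_taxon_id INTEGER,"
    " parent_taxon_id INTEGER,"
    " node_rank TEXT)"
)
# Hypothetical index name; what matters is that the index is UNIQUE.
conn.execute("CREATE UNIQUE INDEX taxon_ncbi ON taxon (ncbi_taxon_id)")

# Several rows with a NULL ncbi_taxon_id are accepted...
conn.execute("INSERT INTO taxon VALUES (22, 6964, 21, 'family')")
conn.execute("INSERT INTO taxon VALUES (33, NULL, 32, NULL)")
conn.execute("INSERT INTO taxon VALUES (34, NULL, 33, NULL)")

# ...but filling in an NCBI id that another row already holds is rejected.
try:
    conn.execute("UPDATE taxon SET ncbi_taxon_id = 6964 WHERE taxon_id = 34")
    update_rejected = False
except sqlite3.IntegrityError:
    update_rejected = True
```

This is why the placeholder rows load without complaint and the failure only
shows up later, when the update script fills in the real NCBI ids.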

Ideally, for each taxon record your update script processes, it should
check for an existing entry with the same ncbi_taxon_id and get the
taxon_id of that existing entry. It should then remove the duplicate
entry and update the relevant parent_taxon_id values in other records
to refer to the existing taxon_id instead.
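
The merge step described above could look roughly like the following. This
is only a sketch under assumptions: it uses SQLite and a simplified taxon
table with the four columns from the quoted example, and the helper name
set_ncbi_taxon_id is hypothetical. A real BioSQL database also has other
tables that reference taxon_id (bioentry, for example), and those would need
the same repointing.

```python
import sqlite3


def set_ncbi_taxon_id(conn, taxon_id, ncbi_taxon_id, node_rank):
    """Fill in NCBI data for one taxon row (hypothetical helper).

    If another row already holds this ncbi_taxon_id, merge into it:
    repoint the children of the duplicate, then delete the duplicate.
    Returns the taxon_id that survives.
    """
    row = conn.execute(
        "SELECT taxon_id FROM taxon WHERE ncbi_taxon_id = ? AND taxon_id <> ?",
        (ncbi_taxon_id, taxon_id),
    ).fetchone()
    if row is None:
        # No existing entry: a plain update is safe.
        conn.execute(
            "UPDATE taxon SET ncbi_taxon_id = ?, node_rank = ? WHERE taxon_id = ?",
            (ncbi_taxon_id, node_rank, taxon_id),
        )
        return taxon_id
    existing_id = row[0]
    # Children of the duplicate now hang off the existing entry.
    conn.execute(
        "UPDATE taxon SET parent_taxon_id = ? WHERE parent_taxon_id = ?",
        (existing_id, taxon_id),
    )
    conn.execute("DELETE FROM taxon WHERE taxon_id = ?", (taxon_id,))
    return existing_id


# Tiny demo: record 22 is the known family, 34 is its NULL duplicate,
# and 35 is the genus row that currently points at 34.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY, ncbi_taxon_id INTEGER,"
    " parent_taxon_id INTEGER, node_rank TEXT)"
)
conn.execute("CREATE UNIQUE INDEX taxon_ncbi ON taxon (ncbi_taxon_id)")
conn.executemany(
    "INSERT INTO taxon VALUES (?, ?, ?, ?)",
    [(22, 6964, 21, "family"), (34, None, 33, None), (35, None, 34, "genus")],
)
survivor = set_ncbi_taxon_id(conn, 34, 6964, "family")
```

Processing a lineage from the root downwards with a helper like this would
collapse each duplicated ancestor in turn, so only the genuinely new taxa
keep their own rows.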

BioPython needs to make similar checks when it inserts new entries.
If it doesn't, then it needs to be fixed.
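
For the insert side, the check amounts to a "look up first, insert only if
missing" pattern. The sketch below is not BioPython's loader code; the
function get_or_create_taxon is hypothetical, it uses SQLite with a
simplified taxon table, and it assumes the NCBI taxon id is already known at
insert time.

```python
import sqlite3


def get_or_create_taxon(conn, ncbi_taxon_id, parent_taxon_id, node_rank):
    """Return the taxon_id for an NCBI taxon, inserting a row only if
    none exists yet (hypothetical helper, not part of BioPython)."""
    row = conn.execute(
        "SELECT taxon_id FROM taxon WHERE ncbi_taxon_id = ?",
        (ncbi_taxon_id,),
    ).fetchone()
    if row is not None:
        return row[0]
    cursor = conn.execute(
        "INSERT INTO taxon (ncbi_taxon_id, parent_taxon_id, node_rank)"
        " VALUES (?, ?, ?)",
        (ncbi_taxon_id, parent_taxon_id, node_rank),
    )
    return cursor.lastrowid


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY, ncbi_taxon_id INTEGER,"
    " parent_taxon_id INTEGER, node_rank TEXT)"
)
conn.execute("CREATE UNIQUE INDEX taxon_ncbi ON taxon (ncbi_taxon_id)")

# Loading two species of the same family asks for the family twice,
# but only one row is created and both calls return the same taxon_id.
first = get_or_create_taxon(conn, 6964, None, "family")
second = get_or_create_taxon(conn, 6964, None, "family")
```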

cheers,
Richard

Eric Gibert wrote:
> Thank you, Peter, for the correct email address of the BioSQL list.
> 
> No, it is not linked to the BioPython 1.45 upgrade: 1.44 behaves the same way. My problem comes from the fact that BioSQL schema version 1.0.0 defines a *unique* index on taxon.ncbi_taxon_id. I did not have this index before.
> 
> I have written a script that connects to the NCBI taxonomy database and gets the XML data for the species. It then updates the taxon table, replacing the NULL ncbi_taxon_id and node_rank with their actual values for the whole lineage. I call it after loading BioSeqs into the database.
> 
> Example:
> I load a BioSeq for Nannophya pygmaea, then I run my script to update the ncbi_taxon_id and rank:
> +----------+---------------+-----------------+--------------+
> | taxon_id | ncbi_taxon_id | parent_taxon_id | node_rank    |
> +----------+---------------+-----------------+--------------+
> |       13 |          2759 |            NULL | superkingdom |
> |       14 |         33208 |              13 | kingdom      |
> |       15 |          6656 |              14 | phylum       |
> |       16 |          6960 |              15 | superclass   |
> |       17 |         50557 |              16 | class        |
> |       18 |          7496 |              17 | no rank      |
> |       19 |         33339 |              18 | subclass     |
> |       20 |          6961 |              19 | order        |
> |       21 |          6962 |              20 | suborder     |
> |       22 |          6964 |              21 | family       |
> |       23 |        229390 |              22 | genus        |
> |       24 |        229391 |              23 | species      |
> +----------+---------------+-----------------+--------------+
> 
> No problem.
> 
> Now I load another Libellulidae (Orthetrum sabina): 'empty/NULL' taxon records are inserted by the BioPython db.load() function:
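> +----------+---------------+-----------------+--------------+
> | taxon_id | ncbi_taxon_id | parent_taxon_id | node_rank    |
> +----------+---------------+-----------------+--------------+
> |       25 |          NULL |            NULL | NULL         |
> |       26 |          NULL |              25 | NULL         |
> |       27 |          NULL |              26 | NULL         |
> |       28 |          NULL |              27 | NULL         |
> |       29 |          NULL |              28 | NULL         |
> |       30 |          NULL |              29 | NULL         |
> |       31 |          NULL |              30 | NULL         |
> |       32 |          NULL |              31 | NULL         |
> |       33 |          NULL |              32 | NULL         |
> |       34 |          NULL |              33 | NULL         |
> |       35 |          NULL |              34 | genus        |
> |       36 |        320892 |              35 | species      |
> +----------+---------------+-----------------+--------------+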
> 
> Then I try to run my script. This time the update fails because record 34 is the SAME family as record 22, and hence has the same ncbi_taxon_id: 'duplicate entry on key 2'.
> 
> Either this *unique* index is new and it is a BioSQL "issue" (as I said, this index did not exist in my previous BioSQL db, so I never encountered this before), OR the way BioPython "repeats" existing taxa is incorrect or incompatible. In that case, when inserting the second BioSeq, record 34 should not be created; instead, record 35 (the genus) should "point" to the already existing family at record 22 as its parent.
> 
> So I would like confirmation from the BioSQL team that the unique index is valid. If it is, we can have a separate BioPython discussion about how to improve the management of the taxon table.
> 
> 
> Best regards,
> 
> Eric

--
Richard Holland (BioMart)
EMBL EBI, Wellcome Trust Genome Campus,
Hinxton, Cambridgeshire CB10 1SD, UK
Tel. +44 (0)1223 494416

http://www.biomart.org/
http://www.biojava.org/