<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">Do you know what the "SELECT taxon_id FROM taxon_name WHERE name LIKE '%fungi%’” returns? If you run<div class=""><br class=""></div><div class="">SELECT taxon_id, name FROM taxon_name WHERE name LIKE '%fungi%’</div><div class=""><br class=""></div><div class="">What is the result?</div><div class=""><br class=""></div><div class="">One note to be kept in mind is that there are nodes in the NCBI taxonomy that do not appear in the NCBI taxonomy browser. (They are actually marked with a flag in the download.) So just because on the NCBI taxonomy browser the lineage up from Gorilla gorilla gorilla does not show a taxon with “fungi” somewhere in the name doesn’t mean there can’t be one. Notably, Opisthokonta <i class="">are</i> in the lineage, and they do include Fungi. So it’s not entirely out of the question that there isn’t a hidden node there that has “fungi” in the name and is an ancestor to Gorilla gorilla gorilla.</div><div class=""><br class=""></div><div class="">Perhaps you have checked that, but this would be the Occam’s razor explanation, so it should be disproven first.</div><div class=""><br class=""></div><div class=""> -hilmar</div><div class=""><br class=""><div class=""><div><blockquote type="cite" class=""><div class="">On Jul 1, 2019, at 6:09 PM, Pedro Almeida <<a href="mailto:p.almeida.mc@gmail.com" class="">p.almeida.mc@gmail.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><meta http-equiv="Content-Type" content="text/html charset=utf-8" class=""><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><div class="">Hi Peter,</div><div class=""><br class=""></div><div class="">apologies for taking so long to reply but was caught in the middle of other projects.</div><div class=""><br class=""></div><div class="">Many thanks for your suggestion. So I went back to the very beginning and found what seems to be something odd (?).</div><div class=""><br class=""></div><div class="">I’m using SQLite version `3.28.0 2019-04-16 19:49:53 884b4b7e502b4e991677b53971277adfaf0a04a284f8e483e2553d0f83156b50`.</div><div class=""><br class=""></div><div class="">Starting from the instructions to set up an initial database with BioSQL, I did the following steps without loading any actual records besides the taxonomy data.</div><div class=""><br class=""></div><div class="">```</div><div class=""><div class="">$ sqlite3 test.db < biosqldb-sqlite.sql</div><div class="">$ perl load_ncbi_taxonomy.pl --dbname test.db --driver SQLite —download</div></div><div class=""><div class="">Downloading NCBI taxon database to taxdata</div><div class="">Loading NCBI taxon database in taxdata:</div><div class=""><span class="Apple-tab-span" style="white-space:pre"> </span>... retrieving all taxon nodes in the database</div><div class=""><span class="Apple-tab-span" style="white-space:pre"> </span>... reading in taxon nodes from nodes.dmp</div><div class=""><span class="Apple-tab-span" style="white-space:pre"> </span>... insert / update / delete taxon nodes</div><div class=""><span class="Apple-tab-span" style="white-space:pre"> </span>... updating new parent IDs</div><div class=""><span class="Apple-tab-span" style="white-space:pre"> </span>... (committing nodes)</div><div class=""><span class="Apple-tab-span" style="white-space:pre"> </span>... rebuilding nested set left/right values</div><div class=""><span class="Apple-tab-span" style="white-space:pre"> </span>... reading in taxon names from names.dmp</div><div class=""><span class="Apple-tab-span" style="white-space:pre"> </span>... deleting old taxon names</div><div class=""><span class="Apple-tab-span" style="white-space:pre"> </span>... inserting new taxon names</div><div class=""><span class="Apple-tab-span" style="white-space:pre"> </span>... cleaning up</div><div class="">Done.</div></div><div class=""><div class="">$ echo $?</div><div class="">0</div></div><div class="">```</div><div class=""><br class=""></div><div class="">This finished in ~10 min without any erros.</div><div class=""><br class=""></div><div class="">I then used the exact command to "list all species(final nodes) of Fungi (so no all hierarchy of parent nodes but lists the downstream, child path instead):” fully copy/pasted as in <a href="https://biosql.org/wiki/Schema_Overview" class="">https://biosql.org/wiki/Schema_Overview</a>. I believe these are SQL commands and the syntax should be exactly the same in SQLite.</div><div class=""><br class=""></div><div class="">```</div><div class=""><div class="">SELECT DISTINCT include.ncbi_taxon_id FROM taxon</div><div class=""> INNER JOIN taxon AS include ON</div><div class=""> (include.left_value BETWEEN taxon.left_value</div><div class=""> AND taxon.right_value)</div><div class=""> WHERE taxon.taxon_id IN</div><div class=""> (SELECT taxon_id FROM taxon_name</div><div class=""> WHERE name LIKE '%fungi%')</div></div><div class="">```</div><div class=""><br class=""></div><div class="">I would expect these commands to return the `ncbi_taxon_id` which should correspond to the UID of the NCBI Taxonomy Browser; for fungi only because that is what we are searching here. If this is true then it seems that something is not working with either one of the three steps above. For example, we know that Gorilla gorilla gorilla is listed in the NCBI taxonomy with the UID 9595 (<a href="https://www.ncbi.nlm.nih.gov/taxonomy/?term=9595[uid]" class="">https://www.ncbi.nlm.nih.gov/taxonomy/?term=9595%5Buid%5D</a>) but running the above SQL commands from a file and grep'ing for 9595, gives a match which was not expected.</div><div class=""><br class=""></div><div class="">```</div><div class=""><div class="">$ cat cmds.sql</div><div class="">SELECT DISTINCT include.ncbi_taxon_id FROM taxon</div><div class=""> INNER JOIN taxon AS include ON</div><div class=""> (include.left_value BETWEEN taxon.left_value</div><div class=""> AND taxon.right_value)</div><div class=""> WHERE taxon.taxon_id IN</div><div class=""> (SELECT taxon_id FROM taxon_name</div><div class=""> WHERE name LIKE '%fungi%');</div></div><div class=""><div class="">$ sqlite3 test.db < cmds.sql | grep -w 9595</div></div><div class="">9595</div><div class="">```</div><div class=""><br class=""></div><div class="">Could this be a possible bug either in the biosql schema, the Perl script to load the taxonomy using the SQLite driver or in the example commands in the website? </div><div class=""><br class=""></div><div class="">Many thanks,</div><div class="">Pedro</div><div class=""><br class=""></div><div class=""><br class=""></div><br class=""><div class=""><blockquote type="cite" class=""><div class="">On 13 Jun 2019, at 11:52, Peter Cock <<a href="mailto:p.j.a.cock@googlemail.com" class="">p.j.a.cock@googlemail.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div class="">Hi Pedro,<br class=""><br class="">It might help to double check a specific example where you know the<br class="">answer you want,<br class="">and how it was represented in the original sequence file. I assume you<br class="">are using a<br class="">GenBank or EMLB plain text file to load into BioSQL? If so, there is<br class="">some taxonomy<br class="">information in the record header, but also in the source feature(s).<br class=""><br class="">(There is usually a single source feature, the first feature in the<br class="">feature table. However,<br class="">with a chimera you can see multiple source features with different<br class="">taxonomy info.)<br class=""><br class="">I forget now exactly how these two sources of taxonomy data are<br class="">recorded in the database,<br class="">but it may be for your intended search, you need to search via the<br class="">feature table qualifiers?<br class=""><br class="">Peter<br class=""><br class="">On Sat, Jun 8, 2019 at 7:34 PM Pedro Almeida <<a href="mailto:p.almeida.mc@gmail.com" class="">p.almeida.mc@gmail.com</a>> wrote:<br class=""><blockquote type="cite" class=""><br class="">Hi everyone,<br class=""><br class="">I’m having a bit of difficulty in querying a local BioSQL database for specific sequences. I guess this is more related to my lack of knowledge with the SQL syntax, but thought that someone could jump in and give me a hand on this.<br class=""><br class="">Building on the example of the [BioSQL schema overview](<a href="https://biosql.org/wiki/Schema_Overview" class="">https://biosql.org/wiki/Schema_Overview</a>) to list all species(final nodes) of Fungi, I'd like to extend this query to retrieve all sequences in the DB for all species of Fungi.<br class=""><br class="">The example provided by BioSQL is as follows:<br class=""><br class="">```<br class="">SELECT DISTINCT<br class=""> include.ncbi_taxon_id<br class="">FROM<br class=""> taxon<br class="">INNER JOIN<br class=""> taxon AS include<br class=""> ON (include.left_value BETWEEN taxon.left_value<br class=""> AND taxon.right_value)<br class=""> WHERE<br class=""> taxon.taxon_id<br class=""> IN (SELECT taxon_id FROM taxon_name<br class=""> WHERE name LIKE '%fungi%')<br class="">LIMIT 3;<br class="">```<br class=""><br class="">the first three results of this query in my local DB are the following:<br class=""><br class="">```<br class="">4751<br class="">57731<br class="">42900<br class="">```<br class=""><br class="">So far so good; 4751 is the `ncbi_taxon_id` for Fungi; 57731 for fungi environmental sequences and 42900 for an uncultured agricultural soil fungus 166r6.<br class=""><br class="">However, when I try:<br class=""><br class="">```<br class="">SELECT DISTINCT<br class=""> be.accession,<br class=""> be.description,<br class=""> bs.seq<br class="">FROM<br class=""> bioentry AS be<br class=""> INNER JOIN<br class=""> biosequence AS bs USING (bioentry_id) -- be and bs are linked by bioentry_id<br class=""> INNER JOIN<br class=""> taxon USING (taxon_id) -- taxon doesn't have a FK with bientry? (is this the problem?)<br class=""> INNER JOIN taxon AS include<br class=""> ON (include.left_value BETWEEN taxon.left_value<br class=""> AND taxon.right_value)<br class="">WHERE<br class=""> taxon.taxon_id<br class=""> IN (SELECT taxon_id FROM taxon_name<br class=""> WHERE name LIKE '%fungi%')<br class="">LIMIT 3;<br class="">```<br class=""><br class="">the result is empty…<br class=""><br class="">I think the problem might be related with the fact that the `taxon` table doesn’t have a FK shared with the `bioentry` table. Is this correct?<br class="">So my question is how can I workaround this and retrieve all fungal (or any other taxonomic group) sequences from a BioSQL DB?<br class=""><br class=""><br class="">Some background:<br class=""><br class="">- I'm using sqlite3<br class="">- loaded the taxonomy as per BioSQL instructions using the `load_ncbi_taxonomy.pl` script<br class="">- I'm just using a subset of the NCBI `nt` database, currently with a random set of 1000 identifiers (this is just for testing)<br class="">- the BioSQL database is filled with these identifiers using [BioPython](<a href="https://biopython.org/wiki/BioSQL" class="">https://biopython.org/wiki/BioSQL</a>) (basically using Entrez.efetch to fetch genbank records for the identifiers and then loading these into the database)<br class=""><br class="">Although only using a subset of the NCBI `nt` database, I do have fungal sequences in it, for example:<br class=""><br class="">```<br class="">SELECT<br class=""> be.accession,<br class=""> be.description,<br class=""> substr(cast(bs.seq as varchar), 0, 10) || '...' AS seq<br class="">FROM<br class=""> biosequence bs<br class=""> INNER JOIN<br class=""> bioentry be USING (bioentry_id)<br class=""> INNER JOIN<br class=""> biodatabase bd USING (biodatabase_id)<br class="">WHERE<br class=""> be.accession = 'XM_024821459';<br class="">```<br class=""><br class="">which returns a fungal record:<br class=""><br class="">```<br class="">XM_024821459|Aspergillus novofumigatus IBT 16806 hypothetical protein (P174DRAFT_250220), mRNA|GTTTCTTCG...<br class="">```<br class=""><br class="">Many thanks,<br class="">Pedro<br class=""><br class=""><br class="">_______________________________________________<br class="">BioSQL-l mailing list<br class=""><a href="mailto:BioSQL-l@mailman.open-bio.org" class="">BioSQL-l@mailman.open-bio.org</a><br class=""><a href="https://mailman.open-bio.org/mailman/listinfo/biosql-l" class="">https://mailman.open-bio.org/mailman/listinfo/biosql-l</a><br class=""></blockquote></div></div></blockquote></div><br class=""></div>_______________________________________________<br class="">BioSQL-l mailing list<br class=""><a href="mailto:BioSQL-l@mailman.open-bio.org" class="">BioSQL-l@mailman.open-bio.org</a><br class="">https://mailman.open-bio.org/mailman/listinfo/biosql-l</div></blockquote></div><br class=""><div class="">
<div class=""><div style="font-family: Monaco; font-size: 11px; orphans: 2; widows: 2;" class="">-- </div><div style="font-family: Monaco; font-size: 11px; orphans: 2; widows: 2;" class="">Hilmar Lapp -:- <a href="http://lappland.io" class="">lappland.io</a></div></div><div class=""><br class=""></div><br class="Apple-interchange-newline">
</div>
<br class=""></div></div></body></html>