[BioSQL-l] Select all sequences querying on taxon_name

Hilmar Lapp hlapp at drycafe.net
Tue Jul 2 23:14:31 UTC 2019


So it looks like the query is case-insensitive with SQLite. Near the top of your list you have "15464|Fungi/Metazoa group”. It least if the group includes what it says in the name it does (namely Fungi and Metazoa), then Gorilla would be a descendant of that group. You can test that in your query.

I agree that the example query is not a good one. It seems that it breaks down badly with case-insensitive search, and these days even if strictly case-sensitive still hits a lot of taxa that aren’t even fungi, like some beetles (https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=1881741). These examples were written at a time when there were *much* fewer organisms with a sequence in GenBank. Quite amazing what’s in there now.

  -hilmaer

> On Jul 2, 2019, at 7:01 PM, Pedro Almeida <p.almeida.mc at gmail.com> wrote:
> 
> OK, I think I might be starting to understand why we could be seeing other lineages besides Fungi with this type of query. Because we’re searching for a matching term in the taxon name for “fungi" this will include any lineages leading to the terminal branches which names do indeed include the token fungi, in one way or another. Querying for “SELECT taxon_id, name FROM taxon_name WHERE name LIKE '%fungi%’” did return a few Diptera, wheat, termites, corals and molluscs (I’m attaching the results in case they may be useful). Not vertebrates though, but I could see that some form of recursion could make the query end up in primates.
> 
> Maybe it could be relevant to consider rephrasing this particular snippet in the BioSQL Schema Overview page to reflect that this query will also include in the results other terminal branches outside the kingdom Fungi?
> 
> The best approach for these kind of cases could actually be to select all the terminal nodes from up a certain NCBI taxon id (e.g. 4751 for Fungi). Not sure if this is possible but that might be for another question. :)
> 
> Many thanks,
> Pedro
> 
> 
> <results.biosql_search.taxon_name.name-Like-fungi.txt>
> 
> 
> 
>> On 2 Jul 2019, at 22:58, Hilmar Lapp <hlapp at drycafe.net <mailto:hlapp at drycafe.net>> wrote:
>> 
>> Do you know what the "SELECT taxon_id FROM taxon_name WHERE name LIKE '%fungi%’” returns? If you run
>> 
>> SELECT taxon_id, name FROM taxon_name WHERE name LIKE '%fungi%’
>> 
>> What is the result?
>> 
>> One note to be kept in mind is that there are nodes in the NCBI taxonomy that do not appear in the NCBI taxonomy browser. (They are actually marked with a flag in the download.) So just because on the NCBI taxonomy browser the lineage up from Gorilla gorilla gorilla does not show a taxon with “fungi” somewhere in the name doesn’t mean there can’t be one. Notably, Opisthokonta are in the lineage, and they do include Fungi. So it’s not entirely out of the question that there isn’t a hidden node there that has “fungi” in the name and is an ancestor to Gorilla gorilla gorilla.
>> 
>> Perhaps you have checked that, but this would be the Occam’s razor explanation, so it should be disproven first.
>> 
>>   -hilmar
>> 
>>> On Jul 1, 2019, at 6:09 PM, Pedro Almeida <p.almeida.mc at gmail.com <mailto:p.almeida.mc at gmail.com>> wrote:
>>> 
>>> Hi Peter,
>>> 
>>> apologies for taking so long to reply but was caught in the middle of other projects.
>>> 
>>> Many thanks for your suggestion. So I went back to the very beginning and found what seems to be something odd (?).
>>> 
>>> I’m using SQLite version `3.28.0 2019-04-16 19:49:53 884b4b7e502b4e991677b53971277adfaf0a04a284f8e483e2553d0f83156b50`.
>>> 
>>> Starting from the instructions to set up an initial database with BioSQL, I did the following steps without loading any actual records besides the taxonomy data.
>>> 
>>> ```
>>> $ sqlite3 test.db < biosqldb-sqlite.sql
>>> $ perl load_ncbi_taxonomy.pl --dbname test.db --driver SQLite —download
>>> Downloading NCBI taxon database to taxdata
>>> Loading NCBI taxon database in taxdata:
>>> 	... retrieving all taxon nodes in the database
>>> 	... reading in taxon nodes from nodes.dmp
>>> 	... insert / update / delete taxon nodes
>>> 	... updating new parent IDs
>>> 	... (committing nodes)
>>> 	... rebuilding nested set left/right values
>>> 	... reading in taxon names from names.dmp
>>> 	... deleting old taxon names
>>> 	... inserting new taxon names
>>> 	... cleaning up
>>> Done.
>>> $ echo $?
>>> 0
>>> ```
>>> 
>>> This finished in ~10 min without any erros.
>>> 
>>> I then used the exact command to "list all species(final nodes) of Fungi (so no all hierarchy of parent nodes but lists the downstream, child path instead):” fully copy/pasted as in https://biosql.org/wiki/Schema_Overview <https://biosql.org/wiki/Schema_Overview>. I believe these are SQL commands and the syntax should be exactly the same in SQLite.
>>> 
>>> ```
>>> SELECT DISTINCT include.ncbi_taxon_id FROM taxon
>>>     INNER JOIN taxon AS include ON
>>>       (include.left_value BETWEEN taxon.left_value
>>>         AND taxon.right_value)
>>>  WHERE taxon.taxon_id IN
>>>    (SELECT taxon_id FROM taxon_name
>>>     WHERE name LIKE '%fungi%')
>>> ```
>>> 
>>> I would expect these commands to return the `ncbi_taxon_id` which should correspond to the UID of the NCBI Taxonomy Browser; for fungi only because that is what we are searching here. If this is true then it seems that something is not working with either one of the three steps above. For example, we know that Gorilla gorilla gorilla is listed in the NCBI taxonomy with the UID 9595 (https://www.ncbi.nlm.nih.gov/taxonomy/?term=9595%5Buid%5D <https://www.ncbi.nlm.nih.gov/taxonomy/?term=9595[uid]>) but running the above SQL commands from a file and grep'ing for 9595, gives a match which was not expected.
>>> 
>>> ```
>>> $ cat cmds.sql
>>> SELECT DISTINCT include.ncbi_taxon_id FROM taxon
>>>     INNER JOIN taxon AS include ON
>>>       (include.left_value BETWEEN taxon.left_value
>>>         AND taxon.right_value)
>>>  WHERE taxon.taxon_id IN
>>>    (SELECT taxon_id FROM taxon_name
>>>     WHERE name LIKE '%fungi%');
>>> $ sqlite3 test.db < cmds.sql | grep -w 9595
>>> 9595
>>> ```
>>> 
>>> Could this be a possible bug either in the biosql schema, the Perl script to load the taxonomy using the SQLite driver or in the example commands in the website? 
>>> 
>>> Many thanks,
>>> Pedro
>>> 
>>> 
>>> 
>>>> On 13 Jun 2019, at 11:52, Peter Cock <p.j.a.cock at googlemail.com <mailto:p.j.a.cock at googlemail.com>> wrote:
>>>> 
>>>> Hi Pedro,
>>>> 
>>>> It might help to double check a specific example where you know the
>>>> answer you want,
>>>> and how it was represented in the original sequence file. I assume you
>>>> are using a
>>>> GenBank or EMLB plain text file to load into BioSQL? If so, there is
>>>> some taxonomy
>>>> information in the record header, but also in the source feature(s).
>>>> 
>>>> (There is usually a single source feature, the first feature in the
>>>> feature table. However,
>>>> with a chimera you can see multiple source features with different
>>>> taxonomy info.)
>>>> 
>>>> I forget now exactly how these two sources of taxonomy data are
>>>> recorded in the database,
>>>> but it may be for your intended search, you need to search via the
>>>> feature table qualifiers?
>>>> 
>>>> Peter
>>>> 
>>>> On Sat, Jun 8, 2019 at 7:34 PM Pedro Almeida <p.almeida.mc at gmail.com <mailto:p.almeida.mc at gmail.com>> wrote:
>>>>> 
>>>>> Hi everyone,
>>>>> 
>>>>> I’m having a bit of difficulty in querying a local BioSQL database for specific sequences. I guess this is more related to my lack of knowledge with the SQL syntax, but thought that someone could jump in and give me a hand on this.
>>>>> 
>>>>> Building on the example of the [BioSQL schema overview](https://biosql.org/wiki/Schema_Overview <https://biosql.org/wiki/Schema_Overview>) to list all species(final nodes) of Fungi, I'd like to extend this query to retrieve all sequences in the DB for all species of Fungi.
>>>>> 
>>>>> The example provided by BioSQL is as follows:
>>>>> 
>>>>> ```
>>>>> SELECT DISTINCT
>>>>>        include.ncbi_taxon_id
>>>>> FROM
>>>>>    taxon
>>>>> INNER JOIN
>>>>>        taxon AS include
>>>>>        ON (include.left_value BETWEEN taxon.left_value
>>>>>                AND taxon.right_value)
>>>>> WHERE
>>>>>        taxon.taxon_id
>>>>>        IN (SELECT taxon_id FROM taxon_name
>>>>>                WHERE name LIKE '%fungi%')
>>>>> LIMIT 3;
>>>>> ```
>>>>> 
>>>>> the first three results of this query in my local DB are the following:
>>>>> 
>>>>> ```
>>>>> 4751
>>>>> 57731
>>>>> 42900
>>>>> ```
>>>>> 
>>>>> So far so good; 4751 is the `ncbi_taxon_id` for Fungi; 57731 for fungi environmental sequences and 42900 for an uncultured agricultural soil fungus 166r6.
>>>>> 
>>>>> However, when I try:
>>>>> 
>>>>> ```
>>>>> SELECT DISTINCT
>>>>>        be.accession,
>>>>>        be.description,
>>>>>        bs.seq
>>>>> FROM
>>>>>        bioentry AS be
>>>>>        INNER JOIN
>>>>>                biosequence AS bs USING (bioentry_id)  -- be and bs are linked by bioentry_id
>>>>>        INNER JOIN
>>>>>                taxon USING (taxon_id)  -- taxon doesn't have a FK with bientry? (is this the problem?)
>>>>>        INNER JOIN taxon AS include
>>>>>                ON (include.left_value BETWEEN taxon.left_value
>>>>>                        AND taxon.right_value)
>>>>> WHERE
>>>>>        taxon.taxon_id
>>>>>        IN (SELECT taxon_id FROM taxon_name
>>>>>                WHERE name LIKE '%fungi%')
>>>>> LIMIT 3;
>>>>> ```
>>>>> 
>>>>> the result is empty…
>>>>> 
>>>>> I think the problem might be related with the fact that the `taxon` table doesn’t have a FK shared with the `bioentry` table. Is this correct?
>>>>> So my question is how can I workaround this and retrieve all fungal (or any other taxonomic group) sequences from a BioSQL DB?
>>>>> 
>>>>> 
>>>>> Some background:
>>>>> 
>>>>> - I'm using sqlite3
>>>>> - loaded the taxonomy as per BioSQL instructions using the `load_ncbi_taxonomy.pl` script
>>>>> - I'm just using a subset of the NCBI `nt` database, currently with a random set of 1000 identifiers (this is just for testing)
>>>>> - the BioSQL database is filled with these identifiers using [BioPython](https://biopython.org/wiki/BioSQL <https://biopython.org/wiki/BioSQL>) (basically using Entrez.efetch to fetch genbank records for the identifiers and then loading these into the database)
>>>>> 
>>>>> Although only using a subset of the NCBI `nt` database, I do have fungal sequences in it, for example:
>>>>> 
>>>>> ```
>>>>> SELECT
>>>>>   be.accession,
>>>>>   be.description,
>>>>>   substr(cast(bs.seq as varchar), 0, 10) || '...' AS seq
>>>>> FROM
>>>>>   biosequence bs
>>>>>   INNER JOIN
>>>>>      bioentry be USING (bioentry_id)
>>>>>   INNER JOIN
>>>>>      biodatabase bd USING (biodatabase_id)
>>>>> WHERE
>>>>>   be.accession = 'XM_024821459';
>>>>> ```
>>>>> 
>>>>> which returns a fungal record:
>>>>> 
>>>>> ```
>>>>> XM_024821459|Aspergillus novofumigatus IBT 16806 hypothetical protein (P174DRAFT_250220), mRNA|GTTTCTTCG...
>>>>> ```
>>>>> 
>>>>> Many thanks,
>>>>> Pedro
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> BioSQL-l mailing list
>>>>> BioSQL-l at mailman.open-bio.org <mailto:BioSQL-l at mailman.open-bio.org>
>>>>> https://mailman.open-bio.org/mailman/listinfo/biosql-l <https://mailman.open-bio.org/mailman/listinfo/biosql-l>
>>> 
>>> _______________________________________________
>>> BioSQL-l mailing list
>>> BioSQL-l at mailman.open-bio.org <mailto:BioSQL-l at mailman.open-bio.org>
>>> https://mailman.open-bio.org/mailman/listinfo/biosql-l <https://mailman.open-bio.org/mailman/listinfo/biosql-l>
>> -- 
>> Hilmar Lapp -:- lappland.io <http://lappland.io/>
>> 
>> 
>> 
> 

-- 
Hilmar Lapp -:- lappland.io



-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.open-bio.org/pipermail/biosql-l/attachments/20190702/f9e35f11/attachment-0001.htm>


More information about the BioSQL-l mailing list