[BioSQL-l] Select all sequences querying on taxon_name

Mon Jul 1 22:09:35 UTC 2019

Hi Peter,

apologies for taking so long to reply but was caught in the middle of other projects.

Many thanks for your suggestion. So I went back to the very beginning and found what seems to be something odd (?).

I’m using SQLite version `3.28.0 2019-04-16 19:49:53 884b4b7e502b4e991677b53971277adfaf0a04a284f8e483e2553d0f83156b50`.

Starting from the instructions to set up an initial database with BioSQL, I did the following steps without loading any actual records besides the taxonomy data.

```
$ sqlite3 test.db < biosqldb-sqlite.sql
$ perl load_ncbi_taxonomy.pl --dbname test.db --driver SQLite —download
Downloading NCBI taxon database to taxdata
Loading NCBI taxon database in taxdata:
	... retrieving all taxon nodes in the database
	... reading in taxon nodes from nodes.dmp
	... insert / update / delete taxon nodes
	... updating new parent IDs
	... (committing nodes)
	... rebuilding nested set left/right values
	... reading in taxon names from names.dmp
	... deleting old taxon names
	... inserting new taxon names
	... cleaning up
Done.
$ echo $?
0
```

This finished in ~10 min without any erros.

I then used the exact command to "list all species(final nodes) of Fungi (so no all hierarchy of parent nodes but lists the downstream, child path instead):” fully copy/pasted as in https://biosql.org/wiki/Schema_Overview <https://biosql.org/wiki/Schema_Overview>. I believe these are SQL commands and the syntax should be exactly the same in SQLite.

```
SELECT DISTINCT include.ncbi_taxon_id FROM taxon
    INNER JOIN taxon AS include ON
      (include.left_value BETWEEN taxon.left_value
        AND taxon.right_value)
 WHERE taxon.taxon_id IN
   (SELECT taxon_id FROM taxon_name
    WHERE name LIKE '%fungi%')
```

I would expect these commands to return the `ncbi_taxon_id` which should correspond to the UID of the NCBI Taxonomy Browser; for fungi only because that is what we are searching here. If this is true then it seems that something is not working with either one of the three steps above. For example, we know that Gorilla gorilla gorilla is listed in the NCBI taxonomy with the UID 9595 (https://www.ncbi.nlm.nih.gov/taxonomy/?term=9595%5Buid%5D <https://www.ncbi.nlm.nih.gov/taxonomy/?term=9595[uid]>) but running the above SQL commands from a file and grep'ing for 9595, gives a match which was not expected.

```
$ cat cmds.sql
SELECT DISTINCT include.ncbi_taxon_id FROM taxon
    INNER JOIN taxon AS include ON
      (include.left_value BETWEEN taxon.left_value
        AND taxon.right_value)
 WHERE taxon.taxon_id IN
   (SELECT taxon_id FROM taxon_name
    WHERE name LIKE '%fungi%');
$ sqlite3 test.db < cmds.sql | grep -w 9595
9595
```

Could this be a possible bug either in the biosql schema, the Perl script to load the taxonomy using the SQLite driver or in the example commands in the website? 

Many thanks,
Pedro

> On 13 Jun 2019, at 11:52, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> 
> Hi Pedro,
> 
> It might help to double check a specific example where you know the
> answer you want,
> and how it was represented in the original sequence file. I assume you
> are using a
> GenBank or EMLB plain text file to load into BioSQL? If so, there is
> some taxonomy
> information in the record header, but also in the source feature(s).
> 
> (There is usually a single source feature, the first feature in the
> feature table. However,
> with a chimera you can see multiple source features with different
> taxonomy info.)
> 
> I forget now exactly how these two sources of taxonomy data are
> recorded in the database,
> but it may be for your intended search, you need to search via the
> feature table qualifiers?
> 
> Peter
> 
> On Sat, Jun 8, 2019 at 7:34 PM Pedro Almeida <p.almeida.mc at gmail.com> wrote:
>> 
>> Hi everyone,
>> 
>> I’m having a bit of difficulty in querying a local BioSQL database for specific sequences. I guess this is more related to my lack of knowledge with the SQL syntax, but thought that someone could jump in and give me a hand on this.
>> 
>> Building on the example of the [BioSQL schema overview](https://biosql.org/wiki/Schema_Overview) to list all species(final nodes) of Fungi, I'd like to extend this query to retrieve all sequences in the DB for all species of Fungi.
>> 
>> The example provided by BioSQL is as follows:
>> 
>> ```
>> SELECT DISTINCT
>>        include.ncbi_taxon_id
>> FROM
>>    taxon
>> INNER JOIN
>>        taxon AS include
>>        ON (include.left_value BETWEEN taxon.left_value
>>                AND taxon.right_value)
>> WHERE
>>        taxon.taxon_id
>>        IN (SELECT taxon_id FROM taxon_name
>>                WHERE name LIKE '%fungi%')
>> LIMIT 3;
>> ```
>> 
>> the first three results of this query in my local DB are the following:
>> 
>> ```
>> 4751
>> 57731
>> 42900
>> ```
>> 
>> So far so good; 4751 is the `ncbi_taxon_id` for Fungi; 57731 for fungi environmental sequences and 42900 for an uncultured agricultural soil fungus 166r6.
>> 
>> However, when I try:
>> 
>> ```
>> SELECT DISTINCT
>>        be.accession,
>>        be.description,
>>        bs.seq
>> FROM
>>        bioentry AS be
>>        INNER JOIN
>>                biosequence AS bs USING (bioentry_id)  -- be and bs are linked by bioentry_id
>>        INNER JOIN
>>                taxon USING (taxon_id)  -- taxon doesn't have a FK with bientry? (is this the problem?)
>>        INNER JOIN taxon AS include
>>                ON (include.left_value BETWEEN taxon.left_value
>>                        AND taxon.right_value)
>> WHERE
>>        taxon.taxon_id
>>        IN (SELECT taxon_id FROM taxon_name
>>                WHERE name LIKE '%fungi%')
>> LIMIT 3;
>> ```
>> 
>> the result is empty…
>> 
>> I think the problem might be related with the fact that the `taxon` table doesn’t have a FK shared with the `bioentry` table. Is this correct?
>> So my question is how can I workaround this and retrieve all fungal (or any other taxonomic group) sequences from a BioSQL DB?
>> 
>> 
>> Some background:
>> 
>> - I'm using sqlite3
>> - loaded the taxonomy as per BioSQL instructions using the `load_ncbi_taxonomy.pl` script
>> - I'm just using a subset of the NCBI `nt` database, currently with a random set of 1000 identifiers (this is just for testing)
>> - the BioSQL database is filled with these identifiers using [BioPython](https://biopython.org/wiki/BioSQL) (basically using Entrez.efetch to fetch genbank records for the identifiers and then loading these into the database)
>> 
>> Although only using a subset of the NCBI `nt` database, I do have fungal sequences in it, for example:
>> 
>> ```
>> SELECT
>>   be.accession,
>>   be.description,
>>   substr(cast(bs.seq as varchar), 0, 10) || '...' AS seq
>> FROM
>>   biosequence bs
>>   INNER JOIN
>>      bioentry be USING (bioentry_id)
>>   INNER JOIN
>>      biodatabase bd USING (biodatabase_id)
>> WHERE
>>   be.accession = 'XM_024821459';
>> ```
>> 
>> which returns a fungal record:
>> 
>> ```
>> XM_024821459|Aspergillus novofumigatus IBT 16806 hypothetical protein (P174DRAFT_250220), mRNA|GTTTCTTCG...
>> ```
>> 
>> Many thanks,
>> Pedro
>> 
>> 
>> _______________________________________________
>> BioSQL-l mailing list
>> BioSQL-l at mailman.open-bio.org
>> https://mailman.open-bio.org/mailman/listinfo/biosql-l

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.open-bio.org/pipermail/biosql-l/attachments/20190701/64cca57c/attachment.htm>