[BioSQL-l] Select all sequences querying on taxon_name

Pedro Almeida p.almeida.mc at gmail.com
Mon Jun 10 22:07:56 UTC 2019


Hi Peter,

many thanks for your reply.

OK, doing the query below for a fungal accession (e.g. XM_024821459) I do get taxon_left and taxon_right values, so I’d say these fields have been computed:

```
.mode column
.headers on
SELECT
  be.accession,
  be.description,
  tx.ncbi_taxon_id,
  tx.parent_taxon_id,
  tx.left_value,
  tx.right_value
FROM
  bioentry AS be
  INNER JOIN
  taxon AS tx USING (taxon_id)
WHERE
  be.accession = 'XM_024821459';
```

Produces the following result:
```
accession           description                                                                                                                               taxon_id   ncbi_taxon_id   parent_taxon_id   left_value   right_value
------------             ---------------------------------------------------------------------------------                                               ----------     -------------         ---------------           ----------       -----------
XM_024821459  Aspergillus novofumigatus IBT 16806 hypothetical protein (P174DRAFT_250220), mRNA  1102477   1392255           284954                 1949071     1949072
```

Trying to go backwards, from the parent_taxon_id

```
.mode column
.headers on
SELECT
  tx.ncbi_taxon_id,
  tx.parent_taxon_id,
  tx.left_value,
  tx.right_value
FROM
  taxon AS tx
WHERE
  tx.taxon_id = '284954';
```

This gets to the species level:

ncbi_taxon_id   parent_taxon_id  left_value  right_value
-------------         ---------------          ----------      -----------
340412            4125                    1949070    1949073


And repeating again for parent 4125, results in the ncbi_taxon_id 5052, which is the genus.

ncbi_taxon_id  parent_taxon_id  left_value   right_value
-------------        ---------------          ----------       -----------
5052               878125                1948391     1952734

>From these queries it seems the left and right values are being computed properly.

I used the BioSQL `load_ncbi_taxonomy.pl` script to load the taxonomy.


Pedro




> On 10 Jun 2019, at 12:07, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> 
> Hi Pedro,
> 
> Can you double check that the taxon_left and taxon_right values have
> been computed?
> 
> They are essentially an optional index to enable efficient looks like
> you are trying,
> but are computed from the simple parent/child relationships.
> 
> The answer to this probably depends on how you have loaded in the taxonomy,
> the BioSQL provided scripts will do it for you - although Biopython does not use
> them - see https://biopython.org/wiki/BioSQL
> 
> Peter
> 
> On Sat, Jun 8, 2019 at 7:34 PM Pedro Almeida <p.almeida.mc at gmail.com> wrote:
>> 
>> Hi everyone,
>> 
>> I’m having a bit of difficulty in querying a local BioSQL database for specific sequences. I guess this is more related to my lack of knowledge with the SQL syntax, but thought that someone could jump in and give me a hand on this.
>> 
>> Building on the example of the [BioSQL schema overview](https://biosql.org/wiki/Schema_Overview) to list all species(final nodes) of Fungi, I'd like to extend this query to retrieve all sequences in the DB for all species of Fungi.
>> 
>> The example provided by BioSQL is as follows:
>> 
>> ```
>> SELECT DISTINCT
>>        include.ncbi_taxon_id
>> FROM
>>    taxon
>> INNER JOIN
>>        taxon AS include
>>        ON (include.left_value BETWEEN taxon.left_value
>>                AND taxon.right_value)
>> WHERE
>>        taxon.taxon_id
>>        IN (SELECT taxon_id FROM taxon_name
>>                WHERE name LIKE '%fungi%')
>> LIMIT 3;
>> ```
>> 
>> the first three results of this query in my local DB are the following:
>> 
>> ```
>> 4751
>> 57731
>> 42900
>> ```
>> 
>> So far so good; 4751 is the `ncbi_taxon_id` for Fungi; 57731 for fungi environmental sequences and 42900 for an uncultured agricultural soil fungus 166r6.
>> 
>> However, when I try:
>> 
>> ```
>> SELECT DISTINCT
>>        be.accession,
>>        be.description,
>>        bs.seq
>> FROM
>>        bioentry AS be
>>        INNER JOIN
>>                biosequence AS bs USING (bioentry_id)  -- be and bs are linked by bioentry_id
>>        INNER JOIN
>>                taxon USING (taxon_id)  -- taxon doesn't have a FK with bientry? (is this the problem?)
>>        INNER JOIN taxon AS include
>>                ON (include.left_value BETWEEN taxon.left_value
>>                        AND taxon.right_value)
>> WHERE
>>        taxon.taxon_id
>>        IN (SELECT taxon_id FROM taxon_name
>>                WHERE name LIKE '%fungi%')
>> LIMIT 3;
>> ```
>> 
>> the result is empty…
>> 
>> I think the problem might be related with the fact that the `taxon` table doesn’t have a FK shared with the `bioentry` table. Is this correct?
>> So my question is how can I workaround this and retrieve all fungal (or any other taxonomic group) sequences from a BioSQL DB?
>> 
>> 
>> Some background:
>> 
>> - I'm using sqlite3
>> - loaded the taxonomy as per BioSQL instructions using the `load_ncbi_taxonomy.pl` script
>> - I'm just using a subset of the NCBI `nt` database, currently with a random set of 1000 identifiers (this is just for testing)
>> - the BioSQL database is filled with these identifiers using [BioPython](https://biopython.org/wiki/BioSQL) (basically using Entrez.efetch to fetch genbank records for the identifiers and then loading these into the database)
>> 
>> Although only using a subset of the NCBI `nt` database, I do have fungal sequences in it, for example:
>> 
>> ```
>> SELECT
>>   be.accession,
>>   be.description,
>>   substr(cast(bs.seq as varchar), 0, 10) || '...' AS seq
>> FROM
>>   biosequence bs
>>   INNER JOIN
>>      bioentry be USING (bioentry_id)
>>   INNER JOIN
>>      biodatabase bd USING (biodatabase_id)
>> WHERE
>>   be.accession = 'XM_024821459';
>> ```
>> 
>> which returns a fungal record:
>> 
>> ```
>> XM_024821459|Aspergillus novofumigatus IBT 16806 hypothetical protein (P174DRAFT_250220), mRNA|GTTTCTTCG...
>> ```
>> 
>> Many thanks,
>> Pedro
>> 
>> 
>> _______________________________________________
>> BioSQL-l mailing list
>> BioSQL-l at mailman.open-bio.org
>> https://mailman.open-bio.org/mailman/listinfo/biosql-l




More information about the BioSQL-l mailing list