[BioSQL-l] Select all sequences querying on taxon_name
Pedro Almeida
p.almeida.mc at gmail.com
Mon Jun 10 22:07:56 UTC 2019
Hi Peter,
many thanks for your reply.
OK, doing the query below for a fungal accession (e.g. XM_024821459) I do get taxon_left and taxon_right values, so I’d say these fields have been computed:
```
.mode column
.headers on
SELECT
be.accession,
be.description,
tx.ncbi_taxon_id,
tx.parent_taxon_id,
tx.left_value,
tx.right_value
FROM
bioentry AS be
INNER JOIN
taxon AS tx USING (taxon_id)
WHERE
be.accession = 'XM_024821459';
```
Produces the following result:
```
accession description taxon_id ncbi_taxon_id parent_taxon_id left_value right_value
------------ --------------------------------------------------------------------------------- ---------- ------------- --------------- ---------- -----------
XM_024821459 Aspergillus novofumigatus IBT 16806 hypothetical protein (P174DRAFT_250220), mRNA 1102477 1392255 284954 1949071 1949072
```
Trying to go backwards, from the parent_taxon_id
```
.mode column
.headers on
SELECT
tx.ncbi_taxon_id,
tx.parent_taxon_id,
tx.left_value,
tx.right_value
FROM
taxon AS tx
WHERE
tx.taxon_id = '284954';
```
This gets to the species level:
ncbi_taxon_id parent_taxon_id left_value right_value
------------- --------------- ---------- -----------
340412 4125 1949070 1949073
And repeating again for parent 4125, results in the ncbi_taxon_id 5052, which is the genus.
ncbi_taxon_id parent_taxon_id left_value right_value
------------- --------------- ---------- -----------
5052 878125 1948391 1952734
>From these queries it seems the left and right values are being computed properly.
I used the BioSQL `load_ncbi_taxonomy.pl` script to load the taxonomy.
Pedro
> On 10 Jun 2019, at 12:07, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
> Hi Pedro,
>
> Can you double check that the taxon_left and taxon_right values have
> been computed?
>
> They are essentially an optional index to enable efficient looks like
> you are trying,
> but are computed from the simple parent/child relationships.
>
> The answer to this probably depends on how you have loaded in the taxonomy,
> the BioSQL provided scripts will do it for you - although Biopython does not use
> them - see https://biopython.org/wiki/BioSQL
>
> Peter
>
> On Sat, Jun 8, 2019 at 7:34 PM Pedro Almeida <p.almeida.mc at gmail.com> wrote:
>>
>> Hi everyone,
>>
>> I’m having a bit of difficulty in querying a local BioSQL database for specific sequences. I guess this is more related to my lack of knowledge with the SQL syntax, but thought that someone could jump in and give me a hand on this.
>>
>> Building on the example of the [BioSQL schema overview](https://biosql.org/wiki/Schema_Overview) to list all species(final nodes) of Fungi, I'd like to extend this query to retrieve all sequences in the DB for all species of Fungi.
>>
>> The example provided by BioSQL is as follows:
>>
>> ```
>> SELECT DISTINCT
>> include.ncbi_taxon_id
>> FROM
>> taxon
>> INNER JOIN
>> taxon AS include
>> ON (include.left_value BETWEEN taxon.left_value
>> AND taxon.right_value)
>> WHERE
>> taxon.taxon_id
>> IN (SELECT taxon_id FROM taxon_name
>> WHERE name LIKE '%fungi%')
>> LIMIT 3;
>> ```
>>
>> the first three results of this query in my local DB are the following:
>>
>> ```
>> 4751
>> 57731
>> 42900
>> ```
>>
>> So far so good; 4751 is the `ncbi_taxon_id` for Fungi; 57731 for fungi environmental sequences and 42900 for an uncultured agricultural soil fungus 166r6.
>>
>> However, when I try:
>>
>> ```
>> SELECT DISTINCT
>> be.accession,
>> be.description,
>> bs.seq
>> FROM
>> bioentry AS be
>> INNER JOIN
>> biosequence AS bs USING (bioentry_id) -- be and bs are linked by bioentry_id
>> INNER JOIN
>> taxon USING (taxon_id) -- taxon doesn't have a FK with bientry? (is this the problem?)
>> INNER JOIN taxon AS include
>> ON (include.left_value BETWEEN taxon.left_value
>> AND taxon.right_value)
>> WHERE
>> taxon.taxon_id
>> IN (SELECT taxon_id FROM taxon_name
>> WHERE name LIKE '%fungi%')
>> LIMIT 3;
>> ```
>>
>> the result is empty…
>>
>> I think the problem might be related with the fact that the `taxon` table doesn’t have a FK shared with the `bioentry` table. Is this correct?
>> So my question is how can I workaround this and retrieve all fungal (or any other taxonomic group) sequences from a BioSQL DB?
>>
>>
>> Some background:
>>
>> - I'm using sqlite3
>> - loaded the taxonomy as per BioSQL instructions using the `load_ncbi_taxonomy.pl` script
>> - I'm just using a subset of the NCBI `nt` database, currently with a random set of 1000 identifiers (this is just for testing)
>> - the BioSQL database is filled with these identifiers using [BioPython](https://biopython.org/wiki/BioSQL) (basically using Entrez.efetch to fetch genbank records for the identifiers and then loading these into the database)
>>
>> Although only using a subset of the NCBI `nt` database, I do have fungal sequences in it, for example:
>>
>> ```
>> SELECT
>> be.accession,
>> be.description,
>> substr(cast(bs.seq as varchar), 0, 10) || '...' AS seq
>> FROM
>> biosequence bs
>> INNER JOIN
>> bioentry be USING (bioentry_id)
>> INNER JOIN
>> biodatabase bd USING (biodatabase_id)
>> WHERE
>> be.accession = 'XM_024821459';
>> ```
>>
>> which returns a fungal record:
>>
>> ```
>> XM_024821459|Aspergillus novofumigatus IBT 16806 hypothetical protein (P174DRAFT_250220), mRNA|GTTTCTTCG...
>> ```
>>
>> Many thanks,
>> Pedro
>>
>>
>> _______________________________________________
>> BioSQL-l mailing list
>> BioSQL-l at mailman.open-bio.org
>> https://mailman.open-bio.org/mailman/listinfo/biosql-l
More information about the BioSQL-l
mailing list