[BioSQL-l] Select all sequences querying on taxon_name
Pedro Almeida
p.almeida.mc at gmail.com
Sat Jun 8 18:32:31 UTC 2019
Hi everyone,
I’m having a bit of difficulty in querying a local BioSQL database for specific sequences. I guess this is more related to my lack of knowledge with the SQL syntax, but thought that someone could jump in and give me a hand on this.
Building on the example of the [BioSQL schema overview](https://biosql.org/wiki/Schema_Overview) to list all species(final nodes) of Fungi, I'd like to extend this query to retrieve all sequences in the DB for all species of Fungi.
The example provided by BioSQL is as follows:
```
SELECT DISTINCT
include.ncbi_taxon_id
FROM
taxon
INNER JOIN
taxon AS include
ON (include.left_value BETWEEN taxon.left_value
AND taxon.right_value)
WHERE
taxon.taxon_id
IN (SELECT taxon_id FROM taxon_name
WHERE name LIKE '%fungi%')
LIMIT 3;
```
the first three results of this query in my local DB are the following:
```
4751
57731
42900
```
So far so good; 4751 is the `ncbi_taxon_id` for Fungi; 57731 for fungi environmental sequences and 42900 for an uncultured agricultural soil fungus 166r6.
However, when I try:
```
SELECT DISTINCT
be.accession,
be.description,
bs.seq
FROM
bioentry AS be
INNER JOIN
biosequence AS bs USING (bioentry_id) -- be and bs are linked by bioentry_id
INNER JOIN
taxon USING (taxon_id) -- taxon doesn't have a FK with bientry? (is this the problem?)
INNER JOIN taxon AS include
ON (include.left_value BETWEEN taxon.left_value
AND taxon.right_value)
WHERE
taxon.taxon_id
IN (SELECT taxon_id FROM taxon_name
WHERE name LIKE '%fungi%')
LIMIT 3;
```
the result is empty…
I think the problem might be related with the fact that the `taxon` table doesn’t have a FK shared with the `bioentry` table. Is this correct?
So my question is how can I workaround this and retrieve all fungal (or any other taxonomic group) sequences from a BioSQL DB?
Some background:
- I'm using sqlite3
- loaded the taxonomy as per BioSQL instructions using the `load_ncbi_taxonomy.pl` script
- I'm just using a subset of the NCBI `nt` database, currently with a random set of 1000 identifiers (this is just for testing)
- the BioSQL database is filled with these identifiers using [BioPython](https://biopython.org/wiki/BioSQL) (basically using Entrez.efetch to fetch genbank records for the identifiers and then loading these into the database)
Although only using a subset of the NCBI `nt` database, I do have fungal sequences in it, for example:
```
SELECT
be.accession,
be.description,
substr(cast(bs.seq as varchar), 0, 10) || '...' AS seq
FROM
biosequence bs
INNER JOIN
bioentry be USING (bioentry_id)
INNER JOIN
biodatabase bd USING (biodatabase_id)
WHERE
be.accession = 'XM_024821459';
```
which returns a fungal record:
```
XM_024821459|Aspergillus novofumigatus IBT 16806 hypothetical protein (P174DRAFT_250220), mRNA|GTTTCTTCG...
```
Many thanks,
Pedro
More information about the BioSQL-l
mailing list