[BioSQL-l] Select all sequences querying on taxon_name

Peter Cock p.j.a.cock at googlemail.com
Mon Jun 10 11:07:11 UTC 2019


Hi Pedro,

Can you double check that the taxon_left and taxon_right values have
been computed?

They are essentially an optional index to enable efficient looks like
you are trying,
but are computed from the simple parent/child relationships.

The answer to this probably depends on how you have loaded in the taxonomy,
the BioSQL provided scripts will do it for you - although Biopython does not use
them - see https://biopython.org/wiki/BioSQL

Peter

On Sat, Jun 8, 2019 at 7:34 PM Pedro Almeida <p.almeida.mc at gmail.com> wrote:
>
> Hi everyone,
>
> I’m having a bit of difficulty in querying a local BioSQL database for specific sequences. I guess this is more related to my lack of knowledge with the SQL syntax, but thought that someone could jump in and give me a hand on this.
>
> Building on the example of the [BioSQL schema overview](https://biosql.org/wiki/Schema_Overview) to list all species(final nodes) of Fungi, I'd like to extend this query to retrieve all sequences in the DB for all species of Fungi.
>
> The example provided by BioSQL is as follows:
>
> ```
> SELECT DISTINCT
>         include.ncbi_taxon_id
> FROM
>     taxon
> INNER JOIN
>         taxon AS include
>         ON (include.left_value BETWEEN taxon.left_value
>                 AND taxon.right_value)
>  WHERE
>         taxon.taxon_id
>         IN (SELECT taxon_id FROM taxon_name
>                 WHERE name LIKE '%fungi%')
> LIMIT 3;
> ```
>
> the first three results of this query in my local DB are the following:
>
> ```
> 4751
> 57731
> 42900
> ```
>
> So far so good; 4751 is the `ncbi_taxon_id` for Fungi; 57731 for fungi environmental sequences and 42900 for an uncultured agricultural soil fungus 166r6.
>
> However, when I try:
>
> ```
> SELECT DISTINCT
>         be.accession,
>         be.description,
>         bs.seq
> FROM
>         bioentry AS be
>         INNER JOIN
>                 biosequence AS bs USING (bioentry_id)  -- be and bs are linked by bioentry_id
>         INNER JOIN
>                 taxon USING (taxon_id)  -- taxon doesn't have a FK with bientry? (is this the problem?)
>         INNER JOIN taxon AS include
>                 ON (include.left_value BETWEEN taxon.left_value
>                         AND taxon.right_value)
> WHERE
>         taxon.taxon_id
>         IN (SELECT taxon_id FROM taxon_name
>                 WHERE name LIKE '%fungi%')
> LIMIT 3;
> ```
>
> the result is empty…
>
> I think the problem might be related with the fact that the `taxon` table doesn’t have a FK shared with the `bioentry` table. Is this correct?
> So my question is how can I workaround this and retrieve all fungal (or any other taxonomic group) sequences from a BioSQL DB?
>
>
> Some background:
>
> - I'm using sqlite3
> - loaded the taxonomy as per BioSQL instructions using the `load_ncbi_taxonomy.pl` script
> - I'm just using a subset of the NCBI `nt` database, currently with a random set of 1000 identifiers (this is just for testing)
> - the BioSQL database is filled with these identifiers using [BioPython](https://biopython.org/wiki/BioSQL) (basically using Entrez.efetch to fetch genbank records for the identifiers and then loading these into the database)
>
> Although only using a subset of the NCBI `nt` database, I do have fungal sequences in it, for example:
>
> ```
> SELECT
>    be.accession,
>    be.description,
>    substr(cast(bs.seq as varchar), 0, 10) || '...' AS seq
> FROM
>    biosequence bs
>    INNER JOIN
>       bioentry be USING (bioentry_id)
>    INNER JOIN
>       biodatabase bd USING (biodatabase_id)
> WHERE
>    be.accession = 'XM_024821459';
> ```
>
> which returns a fungal record:
>
> ```
> XM_024821459|Aspergillus novofumigatus IBT 16806 hypothetical protein (P174DRAFT_250220), mRNA|GTTTCTTCG...
> ```
>
> Many thanks,
> Pedro
>
>
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at mailman.open-bio.org
> https://mailman.open-bio.org/mailman/listinfo/biosql-l



More information about the BioSQL-l mailing list