[BioSQL-l] Select all sequences querying on taxon_name
Peter Cock
p.j.a.cock at googlemail.com
Mon Jun 10 11:07:11 UTC 2019
Hi Pedro,
Can you double check that the taxon_left and taxon_right values have
been computed?
They are essentially an optional index to enable efficient looks like
you are trying,
but are computed from the simple parent/child relationships.
The answer to this probably depends on how you have loaded in the taxonomy,
the BioSQL provided scripts will do it for you - although Biopython does not use
them - see https://biopython.org/wiki/BioSQL
Peter
On Sat, Jun 8, 2019 at 7:34 PM Pedro Almeida <p.almeida.mc at gmail.com> wrote:
>
> Hi everyone,
>
> I’m having a bit of difficulty in querying a local BioSQL database for specific sequences. I guess this is more related to my lack of knowledge with the SQL syntax, but thought that someone could jump in and give me a hand on this.
>
> Building on the example of the [BioSQL schema overview](https://biosql.org/wiki/Schema_Overview) to list all species(final nodes) of Fungi, I'd like to extend this query to retrieve all sequences in the DB for all species of Fungi.
>
> The example provided by BioSQL is as follows:
>
> ```
> SELECT DISTINCT
> include.ncbi_taxon_id
> FROM
> taxon
> INNER JOIN
> taxon AS include
> ON (include.left_value BETWEEN taxon.left_value
> AND taxon.right_value)
> WHERE
> taxon.taxon_id
> IN (SELECT taxon_id FROM taxon_name
> WHERE name LIKE '%fungi%')
> LIMIT 3;
> ```
>
> the first three results of this query in my local DB are the following:
>
> ```
> 4751
> 57731
> 42900
> ```
>
> So far so good; 4751 is the `ncbi_taxon_id` for Fungi; 57731 for fungi environmental sequences and 42900 for an uncultured agricultural soil fungus 166r6.
>
> However, when I try:
>
> ```
> SELECT DISTINCT
> be.accession,
> be.description,
> bs.seq
> FROM
> bioentry AS be
> INNER JOIN
> biosequence AS bs USING (bioentry_id) -- be and bs are linked by bioentry_id
> INNER JOIN
> taxon USING (taxon_id) -- taxon doesn't have a FK with bientry? (is this the problem?)
> INNER JOIN taxon AS include
> ON (include.left_value BETWEEN taxon.left_value
> AND taxon.right_value)
> WHERE
> taxon.taxon_id
> IN (SELECT taxon_id FROM taxon_name
> WHERE name LIKE '%fungi%')
> LIMIT 3;
> ```
>
> the result is empty…
>
> I think the problem might be related with the fact that the `taxon` table doesn’t have a FK shared with the `bioentry` table. Is this correct?
> So my question is how can I workaround this and retrieve all fungal (or any other taxonomic group) sequences from a BioSQL DB?
>
>
> Some background:
>
> - I'm using sqlite3
> - loaded the taxonomy as per BioSQL instructions using the `load_ncbi_taxonomy.pl` script
> - I'm just using a subset of the NCBI `nt` database, currently with a random set of 1000 identifiers (this is just for testing)
> - the BioSQL database is filled with these identifiers using [BioPython](https://biopython.org/wiki/BioSQL) (basically using Entrez.efetch to fetch genbank records for the identifiers and then loading these into the database)
>
> Although only using a subset of the NCBI `nt` database, I do have fungal sequences in it, for example:
>
> ```
> SELECT
> be.accession,
> be.description,
> substr(cast(bs.seq as varchar), 0, 10) || '...' AS seq
> FROM
> biosequence bs
> INNER JOIN
> bioentry be USING (bioentry_id)
> INNER JOIN
> biodatabase bd USING (biodatabase_id)
> WHERE
> be.accession = 'XM_024821459';
> ```
>
> which returns a fungal record:
>
> ```
> XM_024821459|Aspergillus novofumigatus IBT 16806 hypothetical protein (P174DRAFT_250220), mRNA|GTTTCTTCG...
> ```
>
> Many thanks,
> Pedro
>
>
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at mailman.open-bio.org
> https://mailman.open-bio.org/mailman/listinfo/biosql-l
More information about the BioSQL-l
mailing list