<div><div dir="auto">Ok, good - as I recall the <a href="http://load_ncbi_taxonomy.pl">load_ncbi_taxonomy.pl</a> script does that properly.</div></div><div dir="auto"><br></div><div dir="auto">Well, at least that excludes one possible problem.</div><div dir="auto"><br></div><div dir="auto">Peter</div><div dir="auto"><br></div><div><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, 10 Jun 2019 at 23:07, Pedro Almeida <<a href="mailto:p.almeida.mc@gmail.com">p.almeida.mc@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Peter,<br>

<br>

many thanks for your reply.<br>

<br>

OK, doing the query below for a fungal accession (e.g. XM_024821459) I do get taxon_left and taxon_right values, so I’d say these fields have been computed:<br>

<br>

```<br>

.mode column<br>

.headers on<br>

SELECT<br>

  be.accession,<br>

  be.description,<br>

  tx.ncbi_taxon_id,<br>

  tx.parent_taxon_id,<br>

  tx.left_value,<br>

  tx.right_value<br>

FROM<br>

  bioentry AS be<br>

  INNER JOIN<br>

  taxon AS tx USING (taxon_id)<br>

WHERE<br>

  be.accession = 'XM_024821459';<br>

```<br>

<br>

Produces the following result:<br>

```<br>

accession           description                                                                                                                               taxon_id   ncbi_taxon_id   parent_taxon_id   left_value   right_value<br>

------------             ---------------------------------------------------------------------------------                                               ----------     -------------         ---------------           ----------       -----------<br>

XM_024821459  Aspergillus novofumigatus IBT 16806 hypothetical protein (P174DRAFT_250220), mRNA  1102477   1392255           284954                 1949071     1949072<br>

```<br>

<br>

Trying to go backwards, from the parent_taxon_id<br>

<br>

```<br>

.mode column<br>

.headers on<br>

SELECT<br>

  tx.ncbi_taxon_id,<br>

  tx.parent_taxon_id,<br>

  tx.left_value,<br>

  tx.right_value<br>

FROM<br>

  taxon AS tx<br>

WHERE<br>

  tx.taxon_id = '284954';<br>

```<br>

<br>

This gets to the species level:<br>

<br>

ncbi_taxon_id   parent_taxon_id  left_value  right_value<br>

-------------         ---------------          ----------      -----------<br>

340412            4125                    1949070    1949073<br>

<br>

<br>

And repeating again for parent 4125, results in the ncbi_taxon_id 5052, which is the genus.<br>

<br>

ncbi_taxon_id  parent_taxon_id  left_value   right_value<br>

-------------        ---------------          ----------       -----------<br>

5052               878125                1948391     1952734<br>

<br>

>From these queries it seems the left and right values are being computed properly.<br>

<br>

I used the BioSQL `<a href="http://load_ncbi_taxonomy.pl" rel="noreferrer" target="_blank">load_ncbi_taxonomy.pl</a>` script to load the taxonomy.<br>

<br>

<br>

Pedro<br>

<br>

<br>

<br>

<br>

> On 10 Jun 2019, at 12:07, Peter Cock <<a href="mailto:p.j.a.cock@googlemail.com" target="_blank">p.j.a.cock@googlemail.com</a>> wrote:<br>

> <br>

> Hi Pedro,<br>

> <br>

> Can you double check that the taxon_left and taxon_right values have<br>

> been computed?<br>

> <br>

> They are essentially an optional index to enable efficient looks like<br>

> you are trying,<br>

> but are computed from the simple parent/child relationships.<br>

> <br>

> The answer to this probably depends on how you have loaded in the taxonomy,<br>

> the BioSQL provided scripts will do it for you - although Biopython does not use<br>

> them - see <a href="https://biopython.org/wiki/BioSQL" rel="noreferrer" target="_blank">https://biopython.org/wiki/BioSQL</a><br>

> <br>

> Peter<br>

> <br>

> On Sat, Jun 8, 2019 at 7:34 PM Pedro Almeida <<a href="mailto:p.almeida.mc@gmail.com" target="_blank">p.almeida.mc@gmail.com</a>> wrote:<br>

>> <br>

>> Hi everyone,<br>

>> <br>

>> I’m having a bit of difficulty in querying a local BioSQL database for specific sequences. I guess this is more related to my lack of knowledge with the SQL syntax, but thought that someone could jump in and give me a hand on this.<br>

>> <br>

>> Building on the example of the [BioSQL schema overview](<a href="https://biosql.org/wiki/Schema_Overview" rel="noreferrer" target="_blank">https://biosql.org/wiki/Schema_Overview</a>) to list all species(final nodes) of Fungi, I'd like to extend this query to retrieve all sequences in the DB for all species of Fungi.<br>

>> <br>

>> The example provided by BioSQL is as follows:<br>

>> <br>

>> ```<br>

>> SELECT DISTINCT<br>

>>        include.ncbi_taxon_id<br>

>> FROM<br>

>>    taxon<br>

>> INNER JOIN<br>

>>        taxon AS include<br>

>>        ON (include.left_value BETWEEN taxon.left_value<br>

>>                AND taxon.right_value)<br>

>> WHERE<br>

>>        taxon.taxon_id<br>

>>        IN (SELECT taxon_id FROM taxon_name<br>

>>                WHERE name LIKE '%fungi%')<br>

>> LIMIT 3;<br>

>> ```<br>

>> <br>

>> the first three results of this query in my local DB are the following:<br>

>> <br>

>> ```<br>

>> 4751<br>

>> 57731<br>

>> 42900<br>

>> ```<br>

>> <br>

>> So far so good; 4751 is the `ncbi_taxon_id` for Fungi; 57731 for fungi environmental sequences and 42900 for an uncultured agricultural soil fungus 166r6.<br>

>> <br>

>> However, when I try:<br>

>> <br>

>> ```<br>

>> SELECT DISTINCT<br>

>>        be.accession,<br>

>>        be.description,<br>

>>        bs.seq<br>

>> FROM<br>

>>        bioentry AS be<br>

>>        INNER JOIN<br>

>>                biosequence AS bs USING (bioentry_id)  -- be and bs are linked by bioentry_id<br>

>>        INNER JOIN<br>

>>                taxon USING (taxon_id)  -- taxon doesn't have a FK with bientry? (is this the problem?)<br>

>>        INNER JOIN taxon AS include<br>

>>                ON (include.left_value BETWEEN taxon.left_value<br>

>>                        AND taxon.right_value)<br>

>> WHERE<br>

>>        taxon.taxon_id<br>

>>        IN (SELECT taxon_id FROM taxon_name<br>

>>                WHERE name LIKE '%fungi%')<br>

>> LIMIT 3;<br>

>> ```<br>

>> <br>

>> the result is empty…<br>

>> <br>

>> I think the problem might be related with the fact that the `taxon` table doesn’t have a FK shared with the `bioentry` table. Is this correct?<br>

>> So my question is how can I workaround this and retrieve all fungal (or any other taxonomic group) sequences from a BioSQL DB?<br>

>> <br>

>> <br>

>> Some background:<br>

>> <br>

>> - I'm using sqlite3<br>

>> - loaded the taxonomy as per BioSQL instructions using the `<a href="http://load_ncbi_taxonomy.pl" rel="noreferrer" target="_blank">load_ncbi_taxonomy.pl</a>` script<br>

>> - I'm just using a subset of the NCBI `nt` database, currently with a random set of 1000 identifiers (this is just for testing)<br>

>> - the BioSQL database is filled with these identifiers using [BioPython](<a href="https://biopython.org/wiki/BioSQL" rel="noreferrer" target="_blank">https://biopython.org/wiki/BioSQL</a>) (basically using Entrez.efetch to fetch genbank records for the identifiers and then loading these into the database)<br>

>> <br>

>> Although only using a subset of the NCBI `nt` database, I do have fungal sequences in it, for example:<br>

>> <br>

>> ```<br>

>> SELECT<br>

>>   be.accession,<br>

>>   be.description,<br>

>>   substr(cast(bs.seq as varchar), 0, 10) || '...' AS seq<br>

>> FROM<br>

>>   biosequence bs<br>

>>   INNER JOIN<br>

>>      bioentry be USING (bioentry_id)<br>

>>   INNER JOIN<br>

>>      biodatabase bd USING (biodatabase_id)<br>

>> WHERE<br>

>>   be.accession = 'XM_024821459';<br>

>> ```<br>

>> <br>

>> which returns a fungal record:<br>

>> <br>

>> ```<br>

>> XM_024821459|Aspergillus novofumigatus IBT 16806 hypothetical protein (P174DRAFT_250220), mRNA|GTTTCTTCG...<br>

>> ```<br>

>> <br>

>> Many thanks,<br>

>> Pedro<br>

>> <br>

>> <br>

>> _______________________________________________<br>

>> BioSQL-l mailing list<br>

>> <a href="mailto:BioSQL-l@mailman.open-bio.org" target="_blank">BioSQL-l@mailman.open-bio.org</a><br>

>> <a href="https://mailman.open-bio.org/mailman/listinfo/biosql-l" rel="noreferrer" target="_blank">https://mailman.open-bio.org/mailman/listinfo/biosql-l</a><br>

<br>

</blockquote></div></div>