<div><div dir="auto">Ok, good - as I recall the <a href="http://load_ncbi_taxonomy.pl">load_ncbi_taxonomy.pl</a> script does that properly.</div></div><div dir="auto"><br></div><div dir="auto">Well, at least that excludes one possible problem.</div><div dir="auto"><br></div><div dir="auto">Peter</div><div dir="auto"><br></div><div><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, 10 Jun 2019 at 23:07, Pedro Almeida <<a href="mailto:p.almeida.mc@gmail.com">p.almeida.mc@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Peter,<br>
<br>
many thanks for your reply.<br>
<br>
OK, doing the query below for a fungal accession (e.g. XM_024821459) I do get taxon_left and taxon_right values, so I’d say these fields have been computed:<br>
<br>
```<br>
.mode column<br>
.headers on<br>
SELECT<br>
be.accession,<br>
be.description,<br>
tx.ncbi_taxon_id,<br>
tx.parent_taxon_id,<br>
tx.left_value,<br>
tx.right_value<br>
FROM<br>
bioentry AS be<br>
INNER JOIN<br>
taxon AS tx USING (taxon_id)<br>
WHERE<br>
be.accession = 'XM_024821459';<br>
```<br>
<br>
Produces the following result:<br>
```<br>
accession description taxon_id ncbi_taxon_id parent_taxon_id left_value right_value<br>
------------ --------------------------------------------------------------------------------- ---------- ------------- --------------- ---------- -----------<br>
XM_024821459 Aspergillus novofumigatus IBT 16806 hypothetical protein (P174DRAFT_250220), mRNA 1102477 1392255 284954 1949071 1949072<br>
```<br>
<br>
Trying to go backwards, from the parent_taxon_id<br>
<br>
```<br>
.mode column<br>
.headers on<br>
SELECT<br>
tx.ncbi_taxon_id,<br>
tx.parent_taxon_id,<br>
tx.left_value,<br>
tx.right_value<br>
FROM<br>
taxon AS tx<br>
WHERE<br>
tx.taxon_id = '284954';<br>
```<br>
<br>
This gets to the species level:<br>
<br>
ncbi_taxon_id parent_taxon_id left_value right_value<br>
------------- --------------- ---------- -----------<br>
340412 4125 1949070 1949073<br>
<br>
<br>
And repeating again for parent 4125, results in the ncbi_taxon_id 5052, which is the genus.<br>
<br>
ncbi_taxon_id parent_taxon_id left_value right_value<br>
------------- --------------- ---------- -----------<br>
5052 878125 1948391 1952734<br>
<br>
>From these queries it seems the left and right values are being computed properly.<br>
<br>
I used the BioSQL `<a href="http://load_ncbi_taxonomy.pl" rel="noreferrer" target="_blank">load_ncbi_taxonomy.pl</a>` script to load the taxonomy.<br>
<br>
<br>
Pedro<br>
<br>
<br>
<br>
<br>
> On 10 Jun 2019, at 12:07, Peter Cock <<a href="mailto:p.j.a.cock@googlemail.com" target="_blank">p.j.a.cock@googlemail.com</a>> wrote:<br>
> <br>
> Hi Pedro,<br>
> <br>
> Can you double check that the taxon_left and taxon_right values have<br>
> been computed?<br>
> <br>
> They are essentially an optional index to enable efficient looks like<br>
> you are trying,<br>
> but are computed from the simple parent/child relationships.<br>
> <br>
> The answer to this probably depends on how you have loaded in the taxonomy,<br>
> the BioSQL provided scripts will do it for you - although Biopython does not use<br>
> them - see <a href="https://biopython.org/wiki/BioSQL" rel="noreferrer" target="_blank">https://biopython.org/wiki/BioSQL</a><br>
> <br>
> Peter<br>
> <br>
> On Sat, Jun 8, 2019 at 7:34 PM Pedro Almeida <<a href="mailto:p.almeida.mc@gmail.com" target="_blank">p.almeida.mc@gmail.com</a>> wrote:<br>
>> <br>
>> Hi everyone,<br>
>> <br>
>> I’m having a bit of difficulty in querying a local BioSQL database for specific sequences. I guess this is more related to my lack of knowledge with the SQL syntax, but thought that someone could jump in and give me a hand on this.<br>
>> <br>
>> Building on the example of the [BioSQL schema overview](<a href="https://biosql.org/wiki/Schema_Overview" rel="noreferrer" target="_blank">https://biosql.org/wiki/Schema_Overview</a>) to list all species(final nodes) of Fungi, I'd like to extend this query to retrieve all sequences in the DB for all species of Fungi.<br>
>> <br>
>> The example provided by BioSQL is as follows:<br>
>> <br>
>> ```<br>
>> SELECT DISTINCT<br>
>> include.ncbi_taxon_id<br>
>> FROM<br>
>> taxon<br>
>> INNER JOIN<br>
>> taxon AS include<br>
>> ON (include.left_value BETWEEN taxon.left_value<br>
>> AND taxon.right_value)<br>
>> WHERE<br>
>> taxon.taxon_id<br>
>> IN (SELECT taxon_id FROM taxon_name<br>
>> WHERE name LIKE '%fungi%')<br>
>> LIMIT 3;<br>
>> ```<br>
>> <br>
>> the first three results of this query in my local DB are the following:<br>
>> <br>
>> ```<br>
>> 4751<br>
>> 57731<br>
>> 42900<br>
>> ```<br>
>> <br>
>> So far so good; 4751 is the `ncbi_taxon_id` for Fungi; 57731 for fungi environmental sequences and 42900 for an uncultured agricultural soil fungus 166r6.<br>
>> <br>
>> However, when I try:<br>
>> <br>
>> ```<br>
>> SELECT DISTINCT<br>
>> be.accession,<br>
>> be.description,<br>
>> bs.seq<br>
>> FROM<br>
>> bioentry AS be<br>
>> INNER JOIN<br>
>> biosequence AS bs USING (bioentry_id) -- be and bs are linked by bioentry_id<br>
>> INNER JOIN<br>
>> taxon USING (taxon_id) -- taxon doesn't have a FK with bientry? (is this the problem?)<br>
>> INNER JOIN taxon AS include<br>
>> ON (include.left_value BETWEEN taxon.left_value<br>
>> AND taxon.right_value)<br>
>> WHERE<br>
>> taxon.taxon_id<br>
>> IN (SELECT taxon_id FROM taxon_name<br>
>> WHERE name LIKE '%fungi%')<br>
>> LIMIT 3;<br>
>> ```<br>
>> <br>
>> the result is empty…<br>
>> <br>
>> I think the problem might be related with the fact that the `taxon` table doesn’t have a FK shared with the `bioentry` table. Is this correct?<br>
>> So my question is how can I workaround this and retrieve all fungal (or any other taxonomic group) sequences from a BioSQL DB?<br>
>> <br>
>> <br>
>> Some background:<br>
>> <br>
>> - I'm using sqlite3<br>
>> - loaded the taxonomy as per BioSQL instructions using the `<a href="http://load_ncbi_taxonomy.pl" rel="noreferrer" target="_blank">load_ncbi_taxonomy.pl</a>` script<br>
>> - I'm just using a subset of the NCBI `nt` database, currently with a random set of 1000 identifiers (this is just for testing)<br>
>> - the BioSQL database is filled with these identifiers using [BioPython](<a href="https://biopython.org/wiki/BioSQL" rel="noreferrer" target="_blank">https://biopython.org/wiki/BioSQL</a>) (basically using Entrez.efetch to fetch genbank records for the identifiers and then loading these into the database)<br>
>> <br>
>> Although only using a subset of the NCBI `nt` database, I do have fungal sequences in it, for example:<br>
>> <br>
>> ```<br>
>> SELECT<br>
>> be.accession,<br>
>> be.description,<br>
>> substr(cast(bs.seq as varchar), 0, 10) || '...' AS seq<br>
>> FROM<br>
>> biosequence bs<br>
>> INNER JOIN<br>
>> bioentry be USING (bioentry_id)<br>
>> INNER JOIN<br>
>> biodatabase bd USING (biodatabase_id)<br>
>> WHERE<br>
>> be.accession = 'XM_024821459';<br>
>> ```<br>
>> <br>
>> which returns a fungal record:<br>
>> <br>
>> ```<br>
>> XM_024821459|Aspergillus novofumigatus IBT 16806 hypothetical protein (P174DRAFT_250220), mRNA|GTTTCTTCG...<br>
>> ```<br>
>> <br>
>> Many thanks,<br>
>> Pedro<br>
>> <br>
>> <br>
>> _______________________________________________<br>
>> BioSQL-l mailing list<br>
>> <a href="mailto:BioSQL-l@mailman.open-bio.org" target="_blank">BioSQL-l@mailman.open-bio.org</a><br>
>> <a href="https://mailman.open-bio.org/mailman/listinfo/biosql-l" rel="noreferrer" target="_blank">https://mailman.open-bio.org/mailman/listinfo/biosql-l</a><br>
<br>
</blockquote></div></div>