[BioSQL-l] Select all sequences querying on taxon_name
Peter Cock
p.j.a.cock at googlemail.com
Mon Jun 10 22:59:31 UTC 2019
Ok, good - as I recall the load_ncbi_taxonomy.pl script does that properly.
Well, at least that excludes one possible problem.
Peter
On Mon, 10 Jun 2019 at 23:07, Pedro Almeida <p.almeida.mc at gmail.com> wrote:
> Hi Peter,
>
> many thanks for your reply.
>
> OK, doing the query below for a fungal accession (e.g. XM_024821459) I do
> get taxon_left and taxon_right values, so I’d say these fields have been
> computed:
>
> ```
> .mode column
> .headers on
> SELECT
> be.accession,
> be.description,
> tx.ncbi_taxon_id,
> tx.parent_taxon_id,
> tx.left_value,
> tx.right_value
> FROM
> bioentry AS be
> INNER JOIN
> taxon AS tx USING (taxon_id)
> WHERE
> be.accession = 'XM_024821459';
> ```
>
> Produces the following result:
> ```
> accession description
>
> taxon_id ncbi_taxon_id parent_taxon_id left_value right_value
> ------------
> ---------------------------------------------------------------------------------
> ---------- -------------
> --------------- ---------- -----------
> XM_024821459 Aspergillus novofumigatus IBT 16806 hypothetical protein
> (P174DRAFT_250220), mRNA 1102477 1392255 284954
> 1949071 1949072
> ```
>
> Trying to go backwards, from the parent_taxon_id
>
> ```
> .mode column
> .headers on
> SELECT
> tx.ncbi_taxon_id,
> tx.parent_taxon_id,
> tx.left_value,
> tx.right_value
> FROM
> taxon AS tx
> WHERE
> tx.taxon_id = '284954';
> ```
>
> This gets to the species level:
>
> ncbi_taxon_id parent_taxon_id left_value right_value
> ------------- --------------- ---------- -----------
> 340412 4125 1949070 1949073
>
>
> And repeating again for parent 4125, results in the ncbi_taxon_id 5052,
> which is the genus.
>
> ncbi_taxon_id parent_taxon_id left_value right_value
> ------------- --------------- ---------- -----------
> 5052 878125 1948391 1952734
>
> From these queries it seems the left and right values are being computed
> properly.
>
> I used the BioSQL `load_ncbi_taxonomy.pl` script to load the taxonomy.
>
>
> Pedro
>
>
>
>
> > On 10 Jun 2019, at 12:07, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> >
> > Hi Pedro,
> >
> > Can you double check that the taxon_left and taxon_right values have
> > been computed?
> >
> > They are essentially an optional index to enable efficient looks like
> > you are trying,
> > but are computed from the simple parent/child relationships.
> >
> > The answer to this probably depends on how you have loaded in the
> taxonomy,
> > the BioSQL provided scripts will do it for you - although Biopython does
> not use
> > them - see https://biopython.org/wiki/BioSQL
> >
> > Peter
> >
> > On Sat, Jun 8, 2019 at 7:34 PM Pedro Almeida <p.almeida.mc at gmail.com>
> wrote:
> >>
> >> Hi everyone,
> >>
> >> I’m having a bit of difficulty in querying a local BioSQL database for
> specific sequences. I guess this is more related to my lack of knowledge
> with the SQL syntax, but thought that someone could jump in and give me a
> hand on this.
> >>
> >> Building on the example of the [BioSQL schema overview](
> https://biosql.org/wiki/Schema_Overview) to list all species(final nodes)
> of Fungi, I'd like to extend this query to retrieve all sequences in the DB
> for all species of Fungi.
> >>
> >> The example provided by BioSQL is as follows:
> >>
> >> ```
> >> SELECT DISTINCT
> >> include.ncbi_taxon_id
> >> FROM
> >> taxon
> >> INNER JOIN
> >> taxon AS include
> >> ON (include.left_value BETWEEN taxon.left_value
> >> AND taxon.right_value)
> >> WHERE
> >> taxon.taxon_id
> >> IN (SELECT taxon_id FROM taxon_name
> >> WHERE name LIKE '%fungi%')
> >> LIMIT 3;
> >> ```
> >>
> >> the first three results of this query in my local DB are the following:
> >>
> >> ```
> >> 4751
> >> 57731
> >> 42900
> >> ```
> >>
> >> So far so good; 4751 is the `ncbi_taxon_id` for Fungi; 57731 for fungi
> environmental sequences and 42900 for an uncultured agricultural soil
> fungus 166r6.
> >>
> >> However, when I try:
> >>
> >> ```
> >> SELECT DISTINCT
> >> be.accession,
> >> be.description,
> >> bs.seq
> >> FROM
> >> bioentry AS be
> >> INNER JOIN
> >> biosequence AS bs USING (bioentry_id) -- be and bs are
> linked by bioentry_id
> >> INNER JOIN
> >> taxon USING (taxon_id) -- taxon doesn't have a FK with
> bientry? (is this the problem?)
> >> INNER JOIN taxon AS include
> >> ON (include.left_value BETWEEN taxon.left_value
> >> AND taxon.right_value)
> >> WHERE
> >> taxon.taxon_id
> >> IN (SELECT taxon_id FROM taxon_name
> >> WHERE name LIKE '%fungi%')
> >> LIMIT 3;
> >> ```
> >>
> >> the result is empty…
> >>
> >> I think the problem might be related with the fact that the `taxon`
> table doesn’t have a FK shared with the `bioentry` table. Is this correct?
> >> So my question is how can I workaround this and retrieve all fungal (or
> any other taxonomic group) sequences from a BioSQL DB?
> >>
> >>
> >> Some background:
> >>
> >> - I'm using sqlite3
> >> - loaded the taxonomy as per BioSQL instructions using the `
> load_ncbi_taxonomy.pl` script
> >> - I'm just using a subset of the NCBI `nt` database, currently with a
> random set of 1000 identifiers (this is just for testing)
> >> - the BioSQL database is filled with these identifiers using
> [BioPython](https://biopython.org/wiki/BioSQL) (basically using
> Entrez.efetch to fetch genbank records for the identifiers and then loading
> these into the database)
> >>
> >> Although only using a subset of the NCBI `nt` database, I do have
> fungal sequences in it, for example:
> >>
> >> ```
> >> SELECT
> >> be.accession,
> >> be.description,
> >> substr(cast(bs.seq as varchar), 0, 10) || '...' AS seq
> >> FROM
> >> biosequence bs
> >> INNER JOIN
> >> bioentry be USING (bioentry_id)
> >> INNER JOIN
> >> biodatabase bd USING (biodatabase_id)
> >> WHERE
> >> be.accession = 'XM_024821459';
> >> ```
> >>
> >> which returns a fungal record:
> >>
> >> ```
> >> XM_024821459|Aspergillus novofumigatus IBT 16806 hypothetical protein
> (P174DRAFT_250220), mRNA|GTTTCTTCG...
> >> ```
> >>
> >> Many thanks,
> >> Pedro
> >>
> >>
> >> _______________________________________________
> >> BioSQL-l mailing list
> >> BioSQL-l at mailman.open-bio.org
> >> https://mailman.open-bio.org/mailman/listinfo/biosql-l
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.open-bio.org/pipermail/biosql-l/attachments/20190610/24af75b4/attachment-0001.htm>
More information about the BioSQL-l
mailing list