[Bioperl-l] Starting to use Bioperl

Peter Cock p.j.a.cock at googlemail.com
Fri May 11 09:12:04 UTC 2018


Hi Gordon,

A couple of bits of background reading for you.

First, there is a database schema called BioSQL which might be
of interest in that it includes taxon tables - based primarily on the
NCBI taxonomy tree but it could be used for another taxonomy.
There is an SQLite version of this (in use by Biopython) but that
has not, as far as I know, been integrated into BioPerl yet.

http://biosql.org
https://github.com/biosql/biosql

I think, given your taxonomy focus, you can ignore BioSQL, which
is more suited to working with NCBI/EMBL annotated sequences.

Now, I mentioned the NCBI taxonomy, which is a de facto world
standard but will not always reflect the latest expert opinion in
all branches of life. Nevertheless, I would start there.

You can query the NCBI taxonomy via Entrez (and by hand on
the website) to see how to walk up the tree, ignoring the boring
ranks, until you reach the root of the tree.
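
To make that walk concrete, here is a minimal Python sketch that is
independent of where the data comes from. It assumes you already have
two dictionaries, one mapping each taxon ID to its parent ID and one
mapping each taxon ID to its rank, that uninformative nodes carry the
rank "no rank", and that the root is recorded as its own parent (as is
the case for taxon ID 1 in the NCBI dump files). The function name and
the toy IDs are made up for illustration.

```python
def lineage(tax_id, parents, ranks, skip_ranks=("no rank",)):
    """Return [(tax_id, rank), ...] from tax_id up to the root,
    leaving out any node whose rank is listed in skip_ranks."""
    path = []
    seen = set()  # guards against cycles in malformed data
    while tax_id not in seen:
        seen.add(tax_id)
        rank = ranks[tax_id]
        if rank not in skip_ranks:
            path.append((tax_id, rank))
        parent = parents[tax_id]
        if parent == tax_id:  # assumption: the root is its own parent
            break
        tax_id = parent
    return path


# Toy data, NOT real NCBI taxon IDs (apart from 1 being the root).
parents = {1: 1, 10: 1, 20: 10, 30: 20, 40: 30, 50: 40, 60: 50}
ranks = {
    1: "no rank",
    10: "kingdom",
    20: "no rank",
    30: "order",
    40: "family",
    50: "genus",
    60: "species",
}

for node_id, rank in lineage(60, parents, ranks):
    print(node_id, rank)
```

Because every lineage ends at the same root, the paths for many
species can later be merged on their shared ancestor IDs.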

Or, you can download the NCBI taxonomy as a set of text files,
for which you should have no trouble finding example scripts
to load and work with:

https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/

This year the NCBI started offering this data in a slightly newer
format:

https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/

Most of these files are plain text tables using the rather
unusual field separator of "\t|\t" (tab, pipe, tab), but the
README files are very comprehensive.
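
As a rough illustration of that separator, here is a minimal Python
sketch for splitting such lines into fields. It assumes each record
also ends with a trailing tab and pipe before the newline, and that
the first three columns of nodes.dmp are taxon ID, parent taxon ID and
rank; please check both against the README. The function names and
the sample rows are made up for illustration and are not copied from
the real dump.

```python
def parse_dmp_line(line):
    """Split one taxdump record into a list of stripped fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):  # assumption: records end with tab, pipe
        line = line[:-2]
    return [field.strip() for field in line.split("\t|\t")]


def load_nodes(lines):
    """Build {tax_id: parent_id} and {tax_id: rank} from nodes.dmp-style lines."""
    parents = {}
    ranks = {}
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        fields = parse_dmp_line(line)
        tax_id = int(fields[0])
        parents[tax_id] = int(fields[1])
        ranks[tax_id] = fields[2]
    return parents, ranks


# Hand-written sample rows in the assumed layout (toy IDs).
sample = [
    "1\t|\t1\t|\tno rank\t|\n",
    "2\t|\t1\t|\tkingdom\t|\n",
    "3\t|\t2\t|\tfamily\t|\n",
]

parents, ranks = load_nodes(sample)
print(parents)
print(ranks)
```

For the real file you would pass an open file handle instead of the
sample list, e.g. `load_nodes(open("nodes.dmp"))`.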

This is in Python, but my most recent occasion to process
this data was to make a cut-down version of the NCBI
taxonomy as part of constructing a small test dataset:

https://github.com/abaizan/kodoja/blob/master/test/taxonomy/filter_taxonomy.py

Peter


On Fri, May 11, 2018 at 1:50 AM, Gordon Haverland <ghaverla at materialisations.com> wrote:

> On Wed, 9 May 2018 09:54:18 -0700
> Gordon Haverland <ghaverla at materialisations.com> wrote:
>
> >           ... I am researching a deer problem.
>
> There are BioPerl and Bio-LITE routines which can work with taxonomy
> information.  Finding something which can write a SQLite3 dbase took a
> little digging, but something does exist.
>
> I've never played with BioPerl before, and I am still trying to clean
> and expand my deer plant data, so I ran my latest effort with a call to
> BioPerl to look up a taxonid and then a taxon.  It just happened the
> first element in my list was a hybrid species (Abelia x grandiflora).
> Anyway, following some BioPerl documentation I connected to -entrez
> (excuse any spelling mistakes) and it came up with a hit.  A species
> hit, which is what I was hoping for.
>
> From that returned object, I can get an ancestor object (which is a
> genus), and from that I can get an ancestor object which is a family,
> and from that I can get an ancestor object which is an order and then
> further iterations on ancestor get non_ranked clade stuff which I am
> not sure how to handle.  I haven't tried iterating to the limit, I was
> hoping that at some point an attempt to return an ancestor would return
> undef.  But I really don't know what to do with this non_rank clade
> stuff.
>
> I suspect I need to iterate this ancestor stuff until I get to kingdom
> plantae?  This gives me a "root".  I now have a species (usually) with
> N ancestors up to a common root (kingdom plantae).  That constitutes a
> tree as I understand things, but it is all one sided.
>
> If I go to the next entry in my deer resistant plants data, I may have
> M ancestors up to kingdom plantae.   And do this for 1000 or so other
> entries.
>
> For each set of ancestor lookups, I need to make a tree.
>
> All of these trees have the same root (kingdom plantae).  So I should
> be able to add all these trees together.  And then I think I found the
> utilities to save this mess as SQLite.
>
> As I understand things, I probably want to be working with NCBI ID
> numbers on the species entered?  And what you call annotation, I would
> save in one or more separate SQLite3 dbases keyed on the NCBI ID number?
>
> Let's assume one of the fields of annotation is the USDA growing zone.
> A person thinks they want to do a query on USDA Zone 3, so the program
> changes this to a query for USDA Zones 2-4, which picks off all the
> NCBI ID numbers, and then a person can use BioPerl to make a picture of
> all the deer resistant taxonomy known.
>
> One of the sources of data into this, has colour of the flowers.  So
> someone could conceivably be looking for pink flowered, deer resistant
> plants.  That's why I suggested there might be more than 1 SQLite dbase
> of annotation to go with this stuff.
>
> I'll stop writing, and go back to reading code.  I downloaded the
> Bio-LITE modules (not at Debian/Devuan), and I think there were
> suggestions of other code to download.  And read.
>
> Have a great day!
> Gord