<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Sun, May 13, 2018 at 12:26 AM, Gordon Haverland <span dir="ltr"><<a href="mailto:ghaverla@materialisations.com" target="_blank">ghaverla@materialisations.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="gmail-">On Fri, 11 May 2018 10:12:04 +0100<br>
Peter Cock <<a href="mailto:p.j.a.cock@googlemail.com">p.j.a.cock@googlemail.com</a>> wrote:<br>
<br>
> This year the NCBI started offering this data in a slightly newer<br>
> format:<br>
> <br>
> <a href="https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/" rel="noreferrer" target="_blank">https://ftp.ncbi.nlm.nih.gov/<wbr>pub/taxonomy/new_taxdump/</a><br>
> <br>
> Most of these files are plain text tables using the rather<br>
> unusual field separator of "\t|\t" (tab, pipe, tab), but the<br>
> README files are very comprehensive.<br>
<br>
</span>I found this, and got the tarball version. I thought the README said<br>
it was \t|\n? Doesn't matter, it's an unusual separator.<br></blockquote><div><br></div><div>From memory, yes, the record separator is tab pipe newline,</div><div>but the field separator is tab pipe tab.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
There are Perl scripts in the tarball. I think I read there, that if<br>
the NCBI dump files are older than 180 days, it downloads newer<br>
versions? Or maybe I was reading something else.<br>
<br>
In any event, the BioSQL site at Github doesn't see much updating. It<br>
looks to me like all the activity is in biopython, so I downloaded that<br>
for my Devuan machine.<br></blockquote><div><br></div><div>BioSQL is a mature database schema, so we'd not expect much change.</div><div>The only substantial change in BioSQL in recent years was</div><div>extending the schema to work on SQLite.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="gmail-">
> This is in Python, but my most recent occasion to process<br>
> this data was to make a cut-down version of the NCBI<br>
> taxonomy as part of constructing a small test dataset:<br>
> <br>
> <a href="https://github.com/abaizan/kodoja/blob/master/test/taxonomy/filter_taxonomy.py" rel="noreferrer" target="_blank">https://github.com/abaizan/<wbr>kodoja/blob/master/test/<wbr>taxonomy/filter_taxonomy.py</a><br>
<br>
</span>I saw this at Google, you labelled something a bug.<br></blockquote><div><br></div><div>Possibly you meant this recent work - something I had been</div><div>meaning to fix, but this conversation prompted me to do it:</div><div><br></div><div><a href="https://github.com/abaizan/kodoja/pull/24">https://github.com/abaizan/kodoja/pull/24</a><br></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
In looking for the new_taxdump thing (via Google), another Perl script<br>
about findingSpeciesFromGenus (or something like that) popped up. So,<br>
I have a few things of source to look through.<br>
<br>
Thanks.<br>
<div class="gmail-HOEnZb"><div class="gmail-h5"><br>
Gord</div><div class="gmail-h5"><br></div></div></blockquote><div><br></div><div>Yes, the NCBI taxonomy has existed in this format for over</div><div>a decade, I think - there should be lots of scripts out there</div><div>for use/guidance.</div><div><br></div><div>Peter </div></div><br></div></div>
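<div><br></div><div>For the separators discussed above, here is a minimal Python sketch of splitting such lines. It assumes (from memory, as noted in the thread) that fields are separated by tab pipe tab and that each record ends with tab pipe newline. The function name split_record and the sample rows are made up for illustration; they are not taken from the NCBI files or README.</div>
<pre>
```python
FIELD_SEP = "\t|\t"   # assumed field separator: tab, pipe, tab
RECORD_END = "\t|\n"  # assumed record terminator: tab, pipe, newline


def split_record(line):
    """Split one taxdump-style line into a list of field strings."""
    if line.endswith(RECORD_END):
        # Drop the terminator before splitting on the field separator.
        line = line[: -len(RECORD_END)]
    else:
        # Tolerate a final line that lacks the usual terminator.
        line = line.rstrip("\n")
    return line.split(FIELD_SEP)


# Made-up sample rows for illustration, not copied from a real dump file.
sample = (
    "1\t|\troot\t|\tscientific name\t|\n"
    "2\t|\tBacteria\t|\tscientific name\t|\n"
)
records = [split_record(line) for line in sample.splitlines(keepends=True)]
for fields in records:
    print(fields)
```
</pre>
<div>For a real file the same function could be applied to each line read from disk; the README in the tarball remains the place to check the exact column layout.</div>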