[BioPython] Entrez.efetch large files

Wed Oct 8 20:57:03 UTC 2008

On Wed, Oct 8, 2008 at 9:37 PM, Stephan Schiffels <stephan80 at mac.com> wrote:
>
>  Hi Peter,
>
> OK, first of all... you were right of course, with
> out_handle.write(net_handle.read()) the download works properly and reading
> the file from disk also works. The tutorial is very clear on that point, I
> agree.

OK - hopefully I've just made it clearer still ;)

> To illustrate why I made the mistake even though I read the tutorial:
> I made some code like:
>
> try:
>        unpickling a file as SeqRecord...
> except IOError:
>        download file into SeqRecord AND pickle afterwards to disk
>
> So, as you can see, I already tried to make the download only once!

I see - interesting.

> The disk-saving step, I realized, was smarter to do via cPickle since then
> reading from it also goes faster than parsing the genbank file each time. So
> my goal was to either load a pickled SeqRecord, or download into SeqRecord
> and then pickle to disk. I hope you agree that concerning resources from
> NCBI this way is (at least in principle) already quite optimal.

Your approach is clever, and I agree, it shouldn't make any difference
to the number of downloads from the NCBI (once you have the script
debugged and working).

I'm curious - do you have any numbers for the relative times to load a
SeqRecord from a pickle, or re-parse it from the GenBank file?  I'm
aware of some "hot spots" in the GenBank parser which take more time
than they really need to (feature location parsing in particular).
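
If you want to measure it, something like this rough sketch would do.
The helper names (best_time, compare_load_times) and the file names are
made up for illustration, and I'm assuming Biopython is installed and
that the GenBank file is already on disk:

```python
import pickle
import time


def best_time(func, repeats=3):
    """Call func() `repeats` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare_load_times(genbank_file, pickle_file, repeats=3):
    """Return (parse_seconds, unpickle_seconds) for the same single record."""
    from Bio import SeqIO  # imported here so best_time works without Biopython

    # Parse once and write the pickle that the second timing will load.
    record = SeqIO.read(genbank_file, "genbank")
    with open(pickle_file, "wb") as handle:
        pickle.dump(record, handle, protocol=pickle.HIGHEST_PROTOCOL)

    def parse_genbank():
        return SeqIO.read(genbank_file, "genbank")

    def load_pickle():
        with open(pickle_file, "rb") as handle:
            return pickle.load(handle)

    return best_time(parse_genbank, repeats), best_time(load_pickle, repeats)
```

Calling compare_load_times("genome.gbk", "genome.pickle") would give the
two timings side by side. Taking the minimum of a few repeats reduces
noise from disk caching, but the numbers will still depend on the machine.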

However, even if using pickles is much faster, I would personally
still rather use this approach:

if file not present:
   download from NCBI and save it
parse file
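
In Python that might look roughly like the following. This is only a
sketch: fetch_if_missing and load_genbank are names I've invented here,
the email address is a placeholder, and the efetch arguments (db, rettype,
retmode) are assumptions you should check against the version of Biopython
you have installed:

```python
import os


def fetch_if_missing(filename, open_download):
    """Save the download to filename unless the file already exists.

    open_download is any zero-argument callable returning a handle with
    read() and close(). Returns True if a download took place.
    """
    if os.path.isfile(filename):
        return False
    net_handle = open_download()
    partial = filename + ".part"  # hypothetical temporary name
    try:
        with open(partial, "w") as out_handle:
            out_handle.write(net_handle.read())
    finally:
        net_handle.close()
    # Only a completed download is given the final file name.
    os.replace(partial, filename)
    return True


def load_genbank(accession, filename):
    """Download the record once, then always parse the local copy."""
    from Bio import Entrez, SeqIO  # needs Biopython installed

    Entrez.email = "you@example.com"  # placeholder; NCBI asks for a real address
    fetch_if_missing(
        filename,
        lambda: Entrez.efetch(
            db="nucleotide", id=accession, rettype="gb", retmode="text"
        ),
    )
    return SeqIO.read(filename, "genbank")
```

Writing to a temporary name first means an interrupted download never
leaves a truncated file that later looks like a valid cached copy.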

I think it is safer to keep the original data in the NCBI-provided
format, rather than as a Python pickle.  Some of my reasons include:

* you might want to parse the files with a different tool one day
(e.g. grep, or maybe BioPerl, or EMBOSS)
* different versions of Biopython will parse the file slightly
differently (e.g. once Bugs 2225 and 2578 are fixed the SeqRecord
should include slightly more information from a GenBank file) while
your pickle will be static
* if the SeqRecord or Seq objects themselves change slightly between
versions of Biopython, the pickle may not work
* more generally, is it safe to transfer the pickle files between
different computers (e.g. different versions of Python or Biopython,
different OS, different line endings)?

These issues may not be a problem in your setting.

More generally, you could consider using BioSQL, but this may be
overkill for your needs.

> However, as you pointed out, parsing from the internet makes problems.

If you do work out exactly what is going wrong, I would be interested
to hear about it.

> I think the advantages of not having to download each time were clear to me
> from the tutorial. It just didn't occur to me that downloading AND parsing
> at the same time causes problems. The additions to the tutorial seem to give
> some idea.

Your approach makes sense. Thanks for explaining your thoughts.  I
don't think I'd ever tried efetch on such a large GenBank file in the
first place - for genomes I have usually used FTP instead.

Peter