[Biopython-dev] [Biopython] SeqIO.index improvement suggestions

Sat Dec 19 09:57:25 UTC 2009

On Fri, Dec 18, 2009 at 11:39 PM, Renato Alves wrote:
> Sorry to take this to the discussion list, took a bit longer than I
> expected to get the approval.
>
> Bringing now the subject to the right place. Leaving full quote history
> to help the reading.

Thanks.

>> That was a deliberate choice in that the index code wants to "own"
>> the handle. If other code has access to the handle, there is a risk
>> of different bits of code moving the handle pointer etc. But, if you
>> are careful it could be done.
>
> The way I approached it was to reset the handle pointer to the first
> position, since we would like to index the full file. But I understand
> that if the user uses the same handle on different files weird results
> may happen.

OK

> Something that could be a simple workaround would be to copy the
> filehandle object in such a way that its properties are maintained
> (like being a gzip.open() filehandle) but its use doesn't affect the
> use of the original handle. However I don't know if this is possible.

That may work for some handles but not others. Worth trying.
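
To illustrate the risk of sharing a handle, here is a minimal sketch using
an in-memory handle as a stand-in for a sequence file. It is not Biopython
code; the variable names and the toy FASTA text are made up for the example.

```python
import io

# Hypothetical stand-in for a sequence file; any seekable handle behaves alike.
handle = io.StringIO(">seq1\nACGT\n>seq2\nGGCC\n")

# The "indexer" reads past the first record and notes where the second starts.
handle.readline()  # >seq1
handle.readline()  # ACGT
offset_seq2 = handle.tell()

# Some other code sharing the same handle rewinds it to the start...
handle.seek(0)

# ...so an indexer that assumes the pointer has not moved reads the wrong line.
unexpected = handle.readline()

# Seeking to the recorded offset before each read avoids the problem.
handle.seek(offset_seq2)
expected = handle.readline()
```

This is why code that owns the handle, or that always seeks before it
reads, is safer than code that trusts the current pointer position.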

>> There are also issues here in combination with saving the index.
>> With a filename, the code can easily reopen the file in the same
>> mode. With a handle, things are more tricky. You have non-file
>> handles to consider - such as the gzip example. There is also the
>> problem of recording the file mode (normal text, universal text,
>> or binary - which we will need for SFF files - code already written).
>
> I see. Only after your comment did I realize that handle.name and
> handle.mode are only available on normal filehandles. The gzip.open()
> example stores the filename in .filename, while .mode seems to have a
> different meaning.

That would make finding out the filename from a handle tricky.
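
A rough sketch of how one might probe a handle for its filename, assuming
only the attribute names mentioned above (.name and .filename). The helper
guess_filename is hypothetical, and which attribute a gzip handle exposes
depends on the Python version, so the code tries both and gives up
otherwise.

```python
import gzip
import io
import os
import tempfile


def guess_filename(handle):
    """Return the path behind a handle if it exposes one, else None.

    Hypothetical helper: it only inspects the .name and .filename attributes.
    """
    for attr in ("name", "filename"):
        value = getattr(handle, attr, None)
        # Handles opened from a file descriptor report an int, not a path.
        if isinstance(value, str) and value:
            return value
    return None


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.fasta.gz")
    with gzip.open(path, "wb") as out:
        out.write(b">seq1\nACGT\n")

    with gzip.open(path, "rb") as gz_handle:
        gz_name = guess_filename(gz_handle)
    with open(path, "rb") as plain_handle:
        plain_name = guess_filename(plain_handle)

# An in-memory handle has no filename at all, so nothing can be recovered.
memory_name = guess_filename(io.StringIO(">seq1\nACGT\n"))
```

Even when the filename can be recovered, the mode still has to be recorded
separately, and a handle with no file behind it cannot be reopened later.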

>> If we do change the code to allow handles, it would have to be
>> to allow handles OR filenames to be compatible with Biopython
>> 1.52 and 1.53 (which take just filenames). This could be handled
>> as in Bio.SeqIO.convert(), which also allows both (which was the
>> subject of some discussion!).
>
> I'll have to look more closely at the example, and consider the fact
> that my current implementation breaks compatibility with previous code
> and that not everything needed (filename, mode, ...) is accessible from
> filehandles.

OK.
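
As a sketch of the "handles OR filenames" idea, the following shows one way
a function could accept either. This is an illustration only, not how
Bio.SeqIO.convert() is implemented; open_source and count_records are
hypothetical names.

```python
import io
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def open_source(source, mode="r"):
    """Yield a handle for a filename, or pass an existing handle through.

    Hypothetical helper: files opened here are closed here; handles passed
    in by the caller are left open for the caller to manage.
    """
    if isinstance(source, (str, os.PathLike)):
        handle = open(source, mode)
        try:
            yield handle
        finally:
            handle.close()
    else:
        yield source


def count_records(source):
    """Count FASTA records given either a filename or a handle."""
    with open_source(source) as handle:
        return sum(1 for line in handle if line.startswith(">"))


text = ">seq1\nACGT\n>seq2\nGGCC\n"
memory_handle = io.StringIO(text)
count_from_handle = count_records(memory_handle)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.fasta")
    with open(path, "w") as out:
        out.write(text)
    count_from_filename = count_records(path)
```

Existing callers that pass a filename keep working unchanged, which is the
compatibility concern with Biopython 1.52 and 1.53.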

>> Yes, saving indexes is an obvious addition. I have explored
>> using pickle via shelve, and also SQLite - there are
>> implementations of this on my github repository, and I have
>> begun to look into the existing OBF Open Biological
>> Database Access (OBDA) specification for cross project
>> compatibility. Other potential benefits here are reduced
>> memory usage if we don't keep the dictionary
>> of offsets in RAM.
>
> I did try to use pickle directly on the dict-like object that is
> returned from SeqIO.index() but pickle was not happy with it. The SQLite
> approach also crossed my mind and also BioSQL or just some custom SQL
> database, but the RAM approach seemed good enough, at least for our
> current uses. I can see though that some file formats will require a lot
> more RAM depending on what is indexed and their size. In the end it came
> out as cPickled dictionaries for faster access.

I agree that an in-RAM dictionary works pretty well, even for
very large sequence files. In terms of speed, I would expect
a two-step approach (build the index in memory, then save it
to disk) to be faster than building the index database directly
on disk (which was a bit slow when I tried it).
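
The two-step approach can be sketched as follows, assuming a plain FASTA
file and a simple dictionary of byte offsets. This is not the
Bio.SeqIO.index() implementation; build_offset_index and the file names
are hypothetical.

```python
import os
import pickle
import tempfile


def build_offset_index(path):
    """Map each FASTA record id to the byte offset of its '>' line."""
    offsets = {}
    # Binary mode keeps tell() values as true byte offsets.
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                words = line[1:].split(None, 1)
                if words:
                    # The id is taken as the first word after '>'.
                    offsets[words[0].decode("ascii")] = offset
    return offsets


with tempfile.TemporaryDirectory() as tmp:
    fasta_path = os.path.join(tmp, "example.fasta")
    index_path = os.path.join(tmp, "example.idx")
    with open(fasta_path, "wb") as out:
        out.write(b">alpha first record\nACGT\n>beta\nGGCCTT\n")

    # Step 1: build the index in RAM.
    offsets = build_offset_index(fasta_path)

    # Step 2: save the finished dictionary to disk in one go.
    with open(index_path, "wb") as out:
        pickle.dump(offsets, out, protocol=pickle.HIGHEST_PROTOCOL)

    # Later: reload the index and jump straight to a record.
    with open(index_path, "rb") as saved:
        reloaded = pickle.load(saved)
    with open(fasta_path, "rb") as handle:
        handle.seek(reloaded["beta"])
        beta_header = handle.readline()
```

A plain dict of offsets pickles without trouble; the saved index is only
valid while the sequence file itself is unchanged.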

>> There is a potential complication with index sub-classes
>> which do more specialised indexing (e.g. GenBank files,
>> and for a more extreme case, SFF files). See:
>> http://github.com/peterjc/biopython/tree/sff-seqio
>
> For these I would have to work from the unit tests, as I'm not familiar
> with the formats. Also the implementation I did was based on
> the current master branch of biopython. I now realize a lot more
> has been done outside of it that I should look into.

I'm sorry if the discussion on the (dev) mailing list wasn't
clearer - but having a fresh set of eyes looking at the topic
is very useful.

>> Anyway - great to see you are finding the code useful,
>> and have some quite similar ideas for how to extend
>> it further.
>
> Thanks for all that info, I have a lot to dig into and see if I can
> actually contribute with something. You seem to have pretty much
> everything sorted ;)

Well, I hadn't been thinking about gzipped files (or any archives).
How does gzip behave with memory use? I assume it doesn't
load everything into RAM, but does allow you random access
(seek and tell).

This is a vague idea (which I haven't tried yet), but maybe the
Bio.SeqIO.index() function could take an optional argument
(gzip=True, or something more general like archive=...) which
would cause the file to be opened via the gzip module instead?
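
On the seek and tell question, a small sketch with the standard library
gzip module suggests the following behaviour. As far as I know, offsets
refer to positions in the uncompressed data, and a backwards seek is
emulated by rewinding and decompressing again, so it works but can be slow
on large files. The file name and contents below are made up.

```python
import gzip
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.fasta.gz")
    with gzip.open(path, "wb") as out:
        out.write(b">seq1\nACGT\n>seq2\nGGCC\n")

    with gzip.open(path, "rb") as handle:
        handle.readline()  # >seq1
        handle.readline()  # ACGT
        # tell() reports the position within the uncompressed data.
        offset_seq2 = handle.tell()
        handle.read()  # read on to the end of the file
        # Seek backwards to the recorded offset and read the record header.
        handle.seek(offset_seq2)
        header = handle.readline()
```

So an offset-based index could in principle be built on a gzip handle,
with the caveat that each random access may cost a partial decompression.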

Regards,

Peter