[Biopython-dev] Indexing (large) sequence files with Bio.SeqIO

Mon Aug 31 13:49:40 UTC 2009

On Mon, Aug 31, 2009 at 2:24 PM, Brad Chapman<chapmanb at 50mail.com> wrote:
>
> Hi Peter;
>
>> The Bio.SeqIO.indexed_dict() functionality is in CVS/github now
>> as I would like some wider testing. My earlier email explained the
>> implementation approach, and gave some example code:
>> http://lists.open-bio.org/pipermail/biopython-dev/2009-August/006654.html
>
> Sweet. I pulled this from your branch earlier for something I was
> doing at work and it's great stuff.

Thanks :)

What file formats where you working on, and how many records?

> My only suggestion would be to
> change the function name to make it clear it's an in memory index.
> This will clear us up for similar file based index functions.

True. Have got any bright ideas for a better name? While the
index is in memory, the SeqRecord objects are not (unlike the
original Bio.SeqIO.to_dict() function).

Or we have one function Bio.SeqIO.indexed_dict() which can
either use an in memory index, OR an on disk index, offering
the same functionality.

>> Another option (like the shelve idea we talked about last month)
>> is to parse the sequence file with SeqIO, and serialise all the
>> SeqRecord objects to disk, e.g. with pickle or some key/value
>> database. This is potentially very complex (e.g. arbitrary Python
>> objects in the annotation), and could lead to a very large "index"
>> file on disk. On the other hand, some possible back ends would
>> allow editing the database... which could be very useful.
>
> My thought here was to use BioSQL and the SQLite mappings for
> serializing. We build off a tested and existing serialization, and
> also guide people into using BioSQL for larger projects.
> Essentially, we would build an API on top of existing BioSQL
> functionality that creates the index by loading the SQL and then
> pushes the parsed records into it.

Using BioSQL in this way is a much more general tool than
simply "indexing a sequence file". It feels like a sledgehammer
to crack a nut. Also, do you expect it to scale well for 10 million
plus short reads? It may do, but on the other hand it may not.

You will also face the (file format specific but potentially significant)
up front cost of parsing the full file in order to get the SeqRecord
objects which are then mapped into the database. My new
Bio.SeqIO.indexed_dict() code (whatever we call it) avoids this
and the speed up is very nice (file format specific of course).

Also while the current BioSQL mappings are "tried and tested",
they don't cover everything, in particular per-letter-annotation
such as a set of quality scores (something that needs addressing
anyway, probably with JSON or XML serialisation).

All the above make me lean towards a less ambitious target
(read only dictionary access to a sequence file), which just
requires having an (on disk) index of file offsets (which could
be done with SQLite or anything else suitable). This choice
could even be done on the fly at run time (e.g. we look at the
size of the file to decide if we should use an in memory index
or on disk - or start out in memory and if the number of records
gets too big, switch to on disk).

Peter