[Biopython-dev] Bio.SeqIO.index extension, Bio.SeqIO.index_many

Peter biopython at maubp.freeserve.co.uk
Tue Dec 7 15:11:56 UTC 2010


On Tue, Dec 7, 2010 at 1:59 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Peter;
>
>> You may recall some previous discussion about extending the
>> Bio.SeqIO.index functionality. I'm particularly interested in
>> keeping the index on disk to reduce the memory overhead
>> and thus support NGS files with many millions of reads. e.g.
> [...]
>> I've been working on the following idea on branches in github,
>> and have something workable using SQLite3 to store a
>> table of record identifiers, file offset, and file number
>> (for where we have multiple files indexed together).
> [...]
>> https://github.com/peterjc/biopython/tree/index-many
>
> This is great and definitely needed. The implementation
> looks nice and fits with the current index functionality,
> and SQLite definitely seems like the right choice.
> So a big +1 on all of this.
>
> My only suggestion would be the naming: index_file makes it a little
> clearer about the intentions, instead of index_many (the best
> naming would be 'index' for this functionality and 'index_memory' for
> the in-memory indexing, but the ship has probably sailed on that).

Yes, we've already used "index" for the in-memory index, and
its API doesn't lend itself to being extended in this way. So
it's too late to change that now.
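
For anyone skimming the thread, here is a rough sketch of the
idea quoted above: an SQLite table of record identifier, file
number, and file offset covering several files at once. It is
not the code on the branch. The table name offset_data, the
function names build_offset_index and get_raw, and the
FASTA-only scanning are all assumptions made for illustration.

```python
import sqlite3


def build_offset_index(filenames, db_path):
    """Scan FASTA files and store (key, file_number, offset) rows in SQLite.

    Hypothetical helper: schema and names are illustrative only.
    """
    con = sqlite3.connect(db_path)
    # PRIMARY KEY on key means a duplicate identifier raises IntegrityError.
    con.execute(
        "CREATE TABLE offset_data ("
        "key TEXT PRIMARY KEY, file_number INTEGER, offset INTEGER)"
    )
    for file_number, filename in enumerate(filenames):
        rows = []
        with open(filename, "rb") as handle:
            while True:
                offset = handle.tell()  # byte position of the line about to be read
                line = handle.readline()
                if not line:
                    break
                if line.startswith(b">"):
                    parts = line[1:].split(None, 1)
                    if parts:  # skip headers with no identifier
                        rows.append((parts[0].decode("ascii"), file_number, offset))
        con.executemany("INSERT INTO offset_data VALUES (?, ?, ?)", rows)
    con.commit()
    return con


def get_raw(con, filenames, key):
    """Return the raw bytes of one record, found via the on-disk index."""
    row = con.execute(
        "SELECT file_number, offset FROM offset_data WHERE key = ?", (key,)
    ).fetchone()
    if row is None:
        raise KeyError(key)
    file_number, offset = row
    with open(filenames[file_number], "rb") as handle:
        handle.seek(offset)
        lines = [handle.readline()]  # the ">" header line
        for line in handle:
            if line.startswith(b">"):  # next record starts here
                break
            lines.append(line)
    return b"".join(lines)
```

Only the SQLite file grows with the number of reads; a lookup
is one query plus a seek, so memory use stays flat however many
records are indexed.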

What do you think of index_files (plural) rather than index_file?

Peter


