[Biopython-dev] Indexing (large) sequence files with Bio.SeqIO

Tue Sep 1 13:25:22 UTC 2009

On Tue, Sep 1, 2009 at 2:06 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Hi Peter;
>
> [indexed dict usage]
>> What file formats were you working on, and how many records?
>
> It was a 100Mb fasta file with about 41,000 records. Nothing too
> heavy but it worked great.

Yeah, with just 41,000 keys and offsets the in-memory dict would
be pretty small too. This is within the range of file sizes I expect
the Bio.SeqIO.indexed_dict() functionality to be used on. Cool.
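
To illustrate the keys-and-offsets idea, here is a minimal pure-Python
sketch for FASTA files. It is not the Bio.SeqIO.indexed_dict() code; the
function names build_fasta_offset_index and fetch_raw_record are
hypothetical, and it assumes an ASCII file whose key is the first word
of each title line.

```python
def build_fasta_offset_index(path):
    """Map the first word of each FASTA title line to its byte offset."""
    index = {}
    # Binary mode, so tell() gives true byte offsets usable with seek().
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break  # end of file
            if line.startswith(b">"):
                words = line[1:].split(None, 1)
                if words:  # skip title lines with no identifier
                    index[words[0].decode("ascii")] = offset
    return index


def fetch_raw_record(path, offset):
    """Return the raw text of the single record starting at offset."""
    lines = []
    with open(path, "rb") as handle:
        handle.seek(offset)
        lines.append(handle.readline())  # the '>' title line itself
        for line in handle:
            if line.startswith(b">"):
                break  # start of the next record
            lines.append(line)
    return b"".join(lines).decode("ascii")
```

Only one short string and one integer are held per record, which is why
the dict stays small even when the file itself is large.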

> The only change I made was to generalize the record building line:
>
> self._record_key(line[marker_offset:].strip().split(None,1)[0], offset)
>
> to allow an arbitrary function to be passed to define the
> identifier, instead of defaulting to the first part of the line.
> This is helpful for those fun NCBI ids
> (gi|83029091|ref|XM_357633.3|) where other parts of the program only
> have the accession number.

Did your callback function get given the "title string" and return
the desired key?
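
For illustration, a title-string callback of this kind might look like
the sketch below. It is not Brad's actual code: the name
accession_from_title is hypothetical, and the gi|number|db|accession|
layout is assumed from the example identifier quoted above.

```python
def accession_from_title(title):
    """Return the accession from an NCBI-style title, else the first word."""
    words = title.split(None, 1)
    if not words:
        raise ValueError("empty title line")
    first_word = words[0]
    fields = first_word.split("|")
    # Assumed layout: gi|<number>|<database>|<accession>|
    if len(fields) >= 4 and fields[0] == "gi":
        return fields[3]
    # Anything else falls back to the default key, the first word.
    return first_word
```

Such a function only needs the text after the ">" marker, so it can be
applied during indexing without parsing the whole record.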

I had wondered about this, but the only way for this to be general
(to work on all file formats) is for the callback function to be given
a SeqRecord object - which means having to fully parse the file
during the indexing, which ends up being *much* slower. We can
do this if you think it adds a lot of utility, i.e. mimic the key_function
argument we already have on Bio.SeqIO.to_dict().

Peter