[Biopython-dev] Indexing (large) sequence files with Bio.SeqIO

Peter biopython at maubp.freeserve.co.uk
Tue Sep 8 13:22:35 UTC 2009


On Tue, Sep 8, 2009 at 1:14 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Hi Peter;
>
> [... callback function for specifying an ID ...]
>
>> A less flexible option is a callback function which maps the default
>> record.id to a new key. This would solve your NCBI FASTA issue,
>> and might be handy in other settings (e.g. removing the version
>> suffix in GenBank identifiers). However, it would not allow, for
>> example, switching to a completely different identifier (e.g. the GI
>> number) which is present elsewhere in the file.
>>
>> The point is we can support this kind of limited key_function
>> without suffering the severe speed penalty which doing a full
>> parse to give SeqRecord objects would impose.
>
> This is a great compromise. You're right, parsing the SeqRecord is too
> much, and allowing manipulation of the default identifier would work fine.

Cool - done in CVS, including the docstring and the tutorial.
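
To illustrate the idea (this is a stand-alone sketch, not the code
committed to CVS), here is how a key function applied to the default
identifier can work without building any SeqRecord objects. The names
index_fasta_offsets, ncbi_accession and strip_version are hypothetical
and exist only for this example; only the FASTA title lines are examined.

```python
import io


def strip_version(identifier):
    """Drop a trailing version suffix, e.g. 'AF191665.1' -> 'AF191665'."""
    return identifier.rsplit(".", 1)[0]


def ncbi_accession(identifier):
    """Return the fourth '|' separated field of an NCBI style FASTA id.

    For 'gi|6273291|gb|AF191665.1|AF191665' this gives 'AF191665.1'.
    Identifiers with fewer fields are returned unchanged.
    """
    fields = identifier.split("|")
    return fields[3] if len(fields) > 3 else identifier


def index_fasta_offsets(handle, key_function=None):
    """Map each record key to the byte offset of its '>' title line.

    The handle must be opened in binary mode. The default key is the
    first word of the title line; key_function (if given) maps that
    default key to the key actually stored.
    """
    offsets = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            words = line[1:].split(None, 1)
            default_id = words[0].decode() if words else ""
            key = key_function(default_id) if key_function else default_id
            if key in offsets:
                raise ValueError("Duplicate key %r" % key)
            offsets[key] = offset
    return offsets


# Made-up two record example in NCBI style FASTA format.
example = io.BytesIO(
    b">gi|6273291|gb|AF191665.1|AF191665 Opuntia marenae\n"
    b"TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA\n"
    b">gi|6273290|gb|AF191664.1|AF191664 Opuntia clavata\n"
    b"TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA\n"
)
index = index_fasta_offsets(example, key_function=ncbi_accession)
```

Key functions compose in the obvious way, e.g.
lambda name: strip_version(ncbi_accession(name)) would key the records
as AF191665 and AF191664. Because the function only ever sees the
default identifier, it cannot reach anything stored elsewhere in the
record, which is the limitation discussed above.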

> If people need to do something much more complicated to get the ID
> they would probably be better off extending the existing classes and
> writing a custom indexer that pulls the IDs they need.

Certainly - we can't expect to cover every possible use case, and
trying to do so will result in an overly complicated API.
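
As a rough idea of what such a custom indexer might look like, here is
a stand-alone sketch that keys GenBank records by the GI number taken
from the VERSION line. It does not extend any Biopython class, the name
index_genbank_by_gi is hypothetical, and it assumes each record starts
with a LOCUS line and carries a "GI:" token on its VERSION line.

```python
import io


def index_genbank_by_gi(handle):
    """Map GI numbers (as strings) to the byte offset of the LOCUS line.

    The handle must be opened in binary mode. Records without a
    'GI:' token on their VERSION line are silently left out.
    """
    offsets = {}
    record_start = None
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"LOCUS "):
            # Remember where the current record begins.
            record_start = offset
        elif line.startswith(b"VERSION ") and record_start is not None:
            for word in line.split():
                if word.startswith(b"GI:"):
                    gi = word[3:].decode()
                    if gi in offsets:
                        raise ValueError("Duplicate GI %r" % gi)
                    offsets[gi] = record_start
    return offsets


# Made-up, heavily truncated GenBank style records.
genbank_example = io.BytesIO(
    b"LOCUS       AF191665   902 bp    DNA     linear   PLN 07-NOV-1999\n"
    b"VERSION     AF191665.1  GI:6273291\n"
    b"//\n"
    b"LOCUS       AF191664   899 bp    DNA     linear   PLN 07-NOV-1999\n"
    b"VERSION     AF191664.1  GI:6273290\n"
    b"//\n"
)
by_gi = index_genbank_by_gi(genbank_example)
```

The stored offset can then be passed to handle.seek() before parsing
just that one record.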

Did you have any ideas for a better name than Bio.SeqIO.indexed_dict()?

Peter


