[Biopython] SeqIO.index()

Sat Jan 30 08:46:56 UTC 2010

Dear community,

I am new to the mailing list and have a question regarding the
SeqIO.index() function. Up to now, I have usually used a home-brewed
FASTA file parser. This time, though, I had a look at the SeqIO
interface. I am especially interested in the index() function.

The FASTA files I use have non-standardized (if standardization is even
possible) headers. I found that index() uses the first word after the
marker, up to the first whitespace, as the key for the dictionary (I
will call this the ID in the text below). It is a great idea to have a
function argument "key_function" that allows adjusting the key values
via a self-implemented callback function. This is essential in my case
because the IDs in my FASTA file are not unique per entry.
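
To make the problem concrete, here is a minimal pure-Python sketch of
how I understand the default key to be derived. This is not Biopython
code: the helper name default_id and the example headers are made up
for illustration, and the slicing only mirrors the behaviour I describe
above.

```python
def default_id(header_line, marker=">"):
    """Return the first whitespace-delimited word after the marker."""
    return header_line[len(marker):].strip().split(None, 1)[0]


# Hypothetical headers: same first word, different descriptions.
headers = [
    ">contig1 sample=A length=120",
    ">contig1 sample=B length=98",
]

ids = [default_id(h) for h in headers]

# Both headers collapse to the same ID, so a callback that only
# receives the ID has no way to tell the two entries apart.
assert ids[0] == ids[1] == "contig1"
```

Whatever the callback does with "contig1", it returns the same key for
both entries, which is why I would need access to the rest of the
header.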

I had a look at the source code of SeqIO/_index.py and found that,
unfortunately, in the current implementation the "key_function" only
acts on the ID. I think it would make more sense to allow extracting a
key from the complete header line. Is this somehow possible with the
current implementation?

I refer here to the code in SeqIO/_index.py (around lines 188 to 203):

class _SequentialSeqFileDict(_IndexedSeqFileDict) :
.
.
.
            if marker_re.match(line) :
                #Here we can assume the record.id is the first word after the
                #marker. This is generally fine... but not for GenBank, EMBL, Swiss
                self._record_key(line[marker_offset:].strip().split(None,1)[0], offset)
                ##### here you define that the key_function only acts on the first split
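
What I have in mind is roughly the following. It is only a sketch in
plain Python, not a patch against Biopython: the function name
index_by_header, its arguments, and the example data are my own
assumptions. The idea is that the callback receives the complete header
line (without the marker) and the dictionary stores the byte offset of
each record, as the indexing code quoted above appears to do.

```python
import io


def index_by_header(handle, key_function=None, marker=b">"):
    """Map a key derived from the full header line to the record's offset.

    `handle` must be a seekable file-like object opened in binary mode,
    so that tell() returns plain byte offsets.
    """
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break  # end of file
        if line.startswith(marker):
            # Pass the whole header, not just the first word.
            header = line[len(marker):].strip().decode()
            key = key_function(header) if key_function else header
            if key in index:
                raise ValueError("Duplicate key {!r}".format(key))
            index[key] = offset
    return index


# Hypothetical data: the first word repeats, the full header does not.
fasta = io.BytesIO(
    b">contig1 sample=A\nACGT\nACGT\n"
    b">contig1 sample=B\nTTGA\n"
)
index = index_by_header(fasta, key_function=lambda h: h.replace(" ", "|"))
```

With such a callback, the two entries get distinct keys
("contig1|sample=A" and "contig1|sample=B") even though their IDs are
identical, and seeking to a stored offset lands on the matching header
line.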

Thanks,
Seb