[Biopython] SeqIO.index()
Sebastian Schmeier
s.schmeier at gmail.com
Sat Jan 30 08:46:56 UTC 2010
Dear community,
I am new to the mailing list and have a question regarding the
SeqIO.index() function. Until now I have usually used a home-brewed
FASTA file parser. This time, though, I had a look at the SeqIO
interface, and I am especially interested in index().
The FASTA files I use have non-standardized headers (if standardizing
them is even possible). I found that index() uses the first word after
the ">" marker, up to the first whitespace, as the key for the
dictionary (I will call this the ID in the text below). It is a great
idea to have a "key_function" argument that allows adjusting the key
values via a self-implemented callback function. This is essential in
my case because the IDs in my FASTA files are not unique per entry.
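
To make the problem concrete, here is a small sketch in plain Python
(no Biopython involved). The two header lines and the helper name
default_id are invented for illustration; the helper only mimics the
"first word after the marker" rule described above.

```python
# Hypothetical FASTA headers: both share the same first word after ">".
headers = [
    ">seq1 sample=A length=120",
    ">seq1 sample=B length=98",
]


def default_id(header_line):
    """Mimic the default rule: first whitespace-delimited word after the marker."""
    return header_line[1:].strip().split(None, 1)[0]


ids = [default_id(h) for h in headers]
print(ids)  # both entries reduce to the same ID, so they would collide as keys
```

A callback that only ever receives this ID has no way to tell the two
records apart, because the distinguishing part of the header has
already been cut off.
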
I had a look at the source code of SeqIO/_index.py and found that,
unfortunately, in the current implementation the "key_function" only
acts on the ID. I think it would make more sense to allow extracting a
key from the complete header line. Is this somehow possible with the
current implementation?
I refer here to the code in SeqIO/_index.py (file lines 188 to 203):

    class _SequentialSeqFileDict(_IndexedSeqFileDict) :
        .
        .
        .
            if marker_re.match(line) :
                #Here we can assume the record.id is the first word after the
                #marker. This is generally fine... but not for GenBank, EMBL, Swiss
                self._record_key(line[marker_offset:].strip().split(None,1)[0], offset)
                ##### here you define that the key_function only acts on the first split
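
As a rough sketch of the behaviour I have in mind (this is not
Biopython code), the stand-alone function below passes the complete
header text to the callback and records the byte offset of each
record. The names index_fasta_by_header and whole_header_key, and the
example data, are made up for illustration.

```python
import io


def index_fasta_by_header(handle, key_function):
    """Map key_function(full header text) to the byte offset of each record.

    The handle must be opened in binary mode so tell() returns byte offsets.
    """
    offsets = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            # Pass the whole header (minus the marker), not just the first word.
            header = line[1:].strip().decode("ascii")
            key = key_function(header)
            if key in offsets:
                raise ValueError("Duplicate key '%s'" % key)
            offsets[key] = offset
    return offsets


def whole_header_key(header):
    # Hypothetical callback: combine the ID with the value of the second field.
    ident, rest = header.split(None, 1)
    return ident + "|" + rest.split("=", 1)[1]


# Invented example: two records sharing the same ID.
data = b">seq1 sample=A\nACGT\nAC\n>seq1 sample=B\nGGCC\n"
index = index_fasta_by_header(io.BytesIO(data), whole_header_key)
print(index)
```

With the full header available, the callback can build a unique key
even though both records start with the same ID.
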
Thanks,
Seb