[Biopython-dev] SeqIO Abi Parser

Peter Cock p.j.a.cock at googlemail.com
Tue Aug 9 13:40:18 UTC 2011


On Sat, Aug 6, 2011 at 10:52 AM, Wibowo Arindrarto
<w.arindrarto at gmail.com> wrote:
> Hi Peter & everyone,
> I've been trying to improve the parser so it works with forward-only
> handles, but I'm drawing a blank for now.
> I realized the reason I use seek in the first place was because of the file
> structure. In an Abi file we've got three data blocks: the header that
> contains the file information, the sequencing data, and the directories
> which serve as indexes to the sequencing data. To unpack the sequencing data
> bytes, we need the information stored in the directories. Depending on its
> size, it could be stored outside the directories block, or in the directory
> itself. This is why .seek() helps, because it allows for jumping between the
> directories and the sequencing data as it is being parsed.

Yes - this design makes sense, especially given the computer
capabilities back when the format was designed.

> Now, I thought the three blocks were stored in this order: header -
> directory - sequencing data. I've thought of a way of parsing the file if
> the structure is like this. As it turns out, it's possible (and might even
> be the norm) that the order is: header - sequencing data - directory.
> So by the time I've finished parsing the information on how to retrieve the
> data from the directories, I've already gone past the data block. With
> forward-only handles, this makes the data irretrievable.

I see now, that is unfortunate. I presume the current order was chosen
to make writing the data easy (do the directory last). A simple
forward-only parser would be possible IF the data were reordered, but we
can't require that.
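
To make the ordering problem concrete, here is a minimal sketch using a
made-up toy layout (header - data - directory). This is NOT the real ABI
file specification; the magic string, field sizes, tag names and the helper
functions build_toy_file() and parse_toy_file() are all assumptions for
illustration only. The point is that the parser has to jump to the directory
at the end and then back to the data, which is exactly what .seek() gives us:

```python
import io
import struct

# Toy layout, invented for this example (not the actual ABI format):
#   header:    4-byte magic, directory offset, number of directory entries
#   data:      raw payload bytes, one run per record
#   directory: one entry per record = 4-byte tag, data offset, data size
HEADER = struct.Struct(">4sII")
ENTRY = struct.Struct(">4sII")


def build_toy_file(records):
    """Serialise a dict of 4-byte tag -> payload bytes, directory last."""
    data = b""
    entries = []
    offset = HEADER.size  # the data block starts right after the header
    for tag, payload in records.items():
        entries.append(ENTRY.pack(tag, offset, len(payload)))
        data += payload
        offset += len(payload)
    # After the loop, offset points at the start of the directory.
    header = HEADER.pack(b"TOY1", offset, len(entries))
    return header + data + b"".join(entries)


def parse_toy_file(handle):
    """Parse the toy layout; requires a handle that supports seek()."""
    magic, dir_offset, count = HEADER.unpack(handle.read(HEADER.size))
    if magic != b"TOY1":
        raise ValueError("not a toy file")
    # Jump forward over the data block to read the directory first...
    handle.seek(dir_offset)
    entries = [ENTRY.unpack(handle.read(ENTRY.size)) for _ in range(count)]
    # ...then jump back to fetch each payload the directory describes.
    result = {}
    for tag, offset, size in entries:
        handle.seek(offset)
        result[tag] = handle.read(size)
    return result


raw = build_toy_file({b"SEQ1": b"ACGTACGT", b"NAME": b"sample1"})
parsed = parse_toy_file(io.BytesIO(raw))
```

With a forward-only handle the second loop cannot work, because the payloads
have already been read past by the time the directory is known.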

> There should be other ways to retrieve the sequencing data in forward-only
> handles. I thought about reading the entire handle stream first and storing
> it in a variable. This way, we could replace seek() with slicing
> operations. The trade-off is that we store the entire handle stream in
> memory at once (ABI files are probably ~300-500 KB in size). I'm sure there
> are other ways, but I can't think of any right now.
> So what do you think? Or does anyone else have ideas that I could try?
> Regards & have a nice weekend all,

I think we have to accept that typical ABI files are not suitable for
forward-only parsing. Thanks for looking into this - I hope you found it
interesting.
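
For the record, the read-everything-into-memory idea quoted above could be
sketched roughly as follows. This is only an illustration, not what the
parser does: the ForwardOnly class and the as_seekable() helper are
hypothetical names made up here, and the byte string is dummy data rather
than a real trace file. It buffers the stream in an io.BytesIO so that
seek() works again, at the cost of holding the whole file in memory:

```python
import io


class ForwardOnly:
    """Hypothetical stand-in for a handle offering read() but no seek()."""

    def __init__(self, data):
        self._stream = io.BytesIO(data)

    def read(self, size=-1):
        return self._stream.read(size)


def as_seekable(handle):
    """Return a seekable handle, buffering the whole stream if needed."""
    seekable = getattr(handle, "seekable", None)
    if callable(seekable) and seekable():
        return handle  # already seekable, nothing to do
    # Trade-off: the entire remaining stream is held in memory at once.
    return io.BytesIO(handle.read())


# Dummy content laid out as header - data - directory.
stream = ForwardOnly(b"header" + b"data" + b"directory")
buffered = as_seekable(stream)

buffered.seek(10)          # jump ahead to the "directory" part
tail = buffered.read()
buffered.seek(6)           # then jump back to the "data" part
middle = buffered.read(4)
```

This keeps a seek-based parser usable on such handles, but it is a buffering
workaround rather than true forward-only parsing.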

Regards,

Peter



