[Biopython-dev] SeqIO Abi Parser

Wibowo Arindrarto w.arindrarto at gmail.com
Tue Aug 9 14:59:37 UTC 2011


Hi Peter,

You're welcome :)! Although a bit disappointing, it was nice when I
understood why my forward parser didn't work.

Regards,
---
Wibowo Arindrarto (bow)
http://bow.web.id


On Tue, Aug 9, 2011 at 15:40, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Sat, Aug 6, 2011 at 10:52 AM, Wibowo Arindrarto
> <w.arindrarto at gmail.com> wrote:
> > Hi Peter & everyone,
> > I've been trying to improve the parser so it works with forward-only
> > handles, but I'm drawing a blank for now.
> > I realized the reason I use seek in the first place was because of the
> > file structure. In an Abi file we've got three data blocks: the header
> > that contains the file information, the sequencing data, and the
> > directories which serve as indexes to the sequencing data. To unpack the
> > sequencing data bytes, we need the information stored in the directories.
> > Depending on its size, the data could be stored outside the directories
> > block, or in the directory itself. This is why .seek() helps, because it
> > allows for jumping between the directories and the sequencing data as it
> > is being parsed.
>
> Yes - this design makes sense, especially given the computer
> capabilities back when the format was designed.
>
> > Now, I thought the three blocks were stored in this order: header -
> > directory - sequencing data. I've thought of a way of parsing the file if
> > the structure is like this. As it turns out, it's possible (or this might
> > even be the norm) that the order is: header - sequencing data -
> > directory. So by the time I've finished parsing the information on how to
> > retrieve the data from the directories, I've already gone past the data
> > block. In forward-only handles, this makes the data irretrievable.
>
> I see now, that is unfortunate. I presume the current order was chosen
> to make writing the data easy (do the directory last). A simple
> forward-only parser would be possible IF the data was reordered, but we
> can't require that.
>
> > There should be other ways to retrieve the sequencing data in
> > forward-only handles. I thought about reading the entire handle stream
> > first and storing it into a variable. This way, we could replace seek()
> > with slicing operators. The trade-off is we store the entire handle
> > stream in memory at once (abi files are probably ~300-500 kB in size).
> > I'm sure there are other ways, but I couldn't think of any now.
> > So what do you think? Or maybe anyone else has ideas that I could try?
> > Regards & have a nice weekend all,
>
> I think we have to accept that typical ABI files are not suitable for
> forward-only parsing. Thanks for looking into this - I hope you found it
> interesting.
>
> Regards,
>
> Peter
>
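
For illustration, the "read everything into memory and slice" idea from the
thread above can be sketched as follows. This is only a rough sketch under
stated assumptions: the file layout here is a made-up toy format (magic
"TOYF", a header pointing at a directory stored after the data), not the
real ABI layout, and the names build_toy_file, ForwardOnly and
parse_buffered are hypothetical helpers, not part of Biopython.

```python
import io
import struct

# Toy layout (hypothetical, NOT the real ABI format):
#   header:    magic (4 bytes), directory offset (uint32), entry count (uint32)
#   data:      raw payloads, back to back
#   directory: one entry per payload: tag (4 bytes), offset (uint32), size (uint32)
HEADER = struct.Struct(">4sII")
ENTRY = struct.Struct(">4sII")


def build_toy_file(records):
    """Serialise records as header - data - directory (directory last)."""
    data = b""
    entries = []
    offset = HEADER.size  # the first payload starts right after the header
    for tag, payload in records:
        entries.append(ENTRY.pack(tag, offset, len(payload)))
        data += payload
        offset += len(payload)
    # After the loop, offset points at the start of the directory.
    header = HEADER.pack(b"TOYF", offset, len(entries))
    return header + data + b"".join(entries)


class ForwardOnly:
    """Wrap a handle so that only read() is available (no seek/tell)."""

    def __init__(self, handle):
        self._handle = handle

    def read(self, size=-1):
        return self._handle.read(size)


def parse_buffered(handle):
    """Read the whole stream once, then use slicing where seek() was used."""
    buf = handle.read()
    magic, dir_offset, count = HEADER.unpack_from(buf, 0)
    if magic != b"TOYF":
        raise ValueError("not a toy file")
    result = {}
    for i in range(count):
        # "Jump" to each directory entry by offset instead of seeking.
        tag, offset, size = ENTRY.unpack_from(buf, dir_offset + i * ENTRY.size)
        # "Jump" back to the data block by slicing the buffer.
        result[tag] = buf[offset:offset + size]
    return result


raw = build_toy_file([(b"PBAS", b"ACGTTGCA"), (b"SMPL", b"sample-1")])
parsed = parse_buffered(ForwardOnly(io.BytesIO(raw)))
```

The cost is the one mentioned in the thread: the whole stream is held in
memory at once, which is probably acceptable for files of a few hundred
kilobytes but is still not a true forward-only parse.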


