[Biopython-dev] SeqIO Abi Parser

Wibowo Arindrarto w.arindrarto at gmail.com
Sat Aug 6 09:52:13 UTC 2011


Hi Peter & everyone,

I've been trying to improve the parser so it works with forward-only
handles, but I'm drawing a blank for now.

I realized the reason I used seek in the first place was the file
structure. In an Abi file we've got three data blocks: the header that
contains the file information, the sequencing data, and the directories
which serve as indexes to the sequencing data. To unpack the sequencing data
bytes, we need the information stored in the directories. Depending on its
size, a data item could be stored outside the directories block, or in its
directory entry itself. This is why .seek() helps, because it allows for
jumping between the directories and the sequencing data as they are parsed.
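
To illustrate the jumping described above, here is a minimal sketch of
directory-driven reading with .seek(). The layout, the "TOY1" magic string,
the tag names and the helper functions are all hypothetical and are not the
real Abi format or the AbiIO code; they only mimic the
header / data / directory idea.

```python
import io
import struct

# Hypothetical toy layout (NOT the real Abi format):
#   header:    4-byte magic, directory offset, entry count
#   directory: entries of (4-byte tag, data offset, data size)
# All integers are big-endian unsigned 32-bit.
HEADER = struct.Struct(">4sII")
ENTRY = struct.Struct(">4sII")


def build_toy_file():
    """Build a toy file laid out as header - data - directory."""
    items = [(b"SEQ1", b"ACGTACGT"), (b"QUAL", bytes([30, 31, 32, 33]))]
    data = b""
    entries = b""
    for tag, payload in items:
        offset = HEADER.size + len(data)
        entries += ENTRY.pack(tag, offset, len(payload))
        data += payload
    header = HEADER.pack(b"TOY1", HEADER.size + len(data), len(items))
    return header + data + entries


def read_with_seek(handle):
    """Read every item by jumping between the directory and the data."""
    handle.seek(0)
    magic, dir_offset, count = HEADER.unpack(handle.read(HEADER.size))
    result = {}
    for i in range(count):
        # Jump to the i-th directory entry ...
        handle.seek(dir_offset + i * ENTRY.size)
        tag, offset, size = ENTRY.unpack(handle.read(ENTRY.size))
        # ... then jump to where the entry says its data lives.
        handle.seek(offset)
        result[tag] = handle.read(size)
    return result


records = read_with_seek(io.BytesIO(build_toy_file()))
```

Every item costs two seeks, one of them backwards when the directory sits
after the data, which is exactly what a forward-only handle cannot do.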

Now, I thought the three blocks were stored in this order: header -
directory - sequencing data. I've thought of a way of parsing the file if
the structure is like this. As it turns out, it's possible (or it might
even be the norm) that the order is: header - sequencing data - directory.
So by the time I have finished parsing the information on how to retrieve
the data from the directories, I have already gone past the data block. In
forward-only handles, this makes the data irretrievable.
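
The ordering problem can be sketched with a small wrapper that only ever
calls .read(). The ForwardReader class and the 16-byte toy stream are
assumptions made up for illustration; they are not part of Biopython.

```python
import io


class ForwardReader:
    """Wrap a handle using only .read(), tracking the current position."""

    def __init__(self, handle):
        self._handle = handle
        self.position = 0

    def read(self, size):
        chunk = self._handle.read(size)
        self.position += len(chunk)
        return chunk

    def skip_to(self, offset):
        """Advance to an absolute offset by reading and discarding bytes."""
        if offset < self.position:
            raise ValueError(
                "offset %d is behind position %d; cannot go back"
                % (offset, self.position)
            )
        # A real network handle may return short reads; this sketch
        # ignores that and assumes one read is enough.
        self.read(offset - self.position)


# Hypothetical toy stream: 4 bytes of "header", 8 bytes of data,
# then 4 bytes standing in for the directory.
stream = b"HEAD" + b"ACGTACGT" + b"DIR!"

reader = ForwardReader(io.BytesIO(stream))
reader.skip_to(12)            # move forward to the directory at the end
directory = reader.read(4)    # now we know where the data was ...
try:
    reader.skip_to(4)         # ... but the data block is already behind us
    went_back = True
except ValueError:
    went_back = False
```

With the order header - directory - data, every skip_to() call would move
forward and the same wrapper would be sufficient.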

There should be other ways to retrieve the sequencing data in forward-only
handles. I thought about reading the entire handle stream first and storing
it in a variable. This way, we could replace seek() with slicing
operations. The trade-off is that we store the entire handle stream in
memory at once (abi files are probably ~300-500 kB in size). I'm sure there
are other ways, but I couldn't think of any now.
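
A minimal sketch of the read-everything-then-slice idea, again on a made-up
toy layout rather than the real Abi format (the struct formats, the "TOY1"
magic string and read_with_slicing are hypothetical names):

```python
import io
import struct

# Hypothetical toy layout (NOT the real Abi format): a 12-byte header
# holding a magic string, the directory offset and the entry count, then
# the data, then directory entries of (tag, data offset, data size).
HEADER = struct.Struct(">4sII")
ENTRY = struct.Struct(">4sII")


def read_with_slicing(handle):
    """Parse the toy layout using only .read(), no .seek() or .tell()."""
    buffer = handle.read()  # the whole stream is held in memory at once
    magic, dir_offset, count = HEADER.unpack_from(buffer, 0)
    result = {}
    for i in range(count):
        entry_start = dir_offset + i * ENTRY.size
        tag, offset, size = ENTRY.unpack_from(buffer, entry_start)
        result[tag] = buffer[offset:offset + size]  # slice instead of seek
    return result


payload = b"ACGTACGT"
toy = (
    HEADER.pack(b"TOY1", HEADER.size + len(payload), 1)
    + payload
    + ENTRY.pack(b"SEQ1", HEADER.size, len(payload))
)
records = read_with_slicing(io.BytesIO(toy))
```

A variation that keeps seek-based code unchanged is to wrap the bytes in
io.BytesIO(handle.read()), which gives a seekable in-memory handle at the
same memory cost.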

So what do you think? Or does anyone else have ideas that I could try?

Regards & have a nice weekend all,
---
Wibowo Arindrarto (bow)
http://bow.web.id



On Thu, Aug 4, 2011 at 13:47, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Thu, Aug 4, 2011 at 12:30 PM, Wibowo Arindrarto
> <w.arindrarto at gmail.com> wrote:
> > Hi Peter,
> > Ah yes, I didn't know there could be handles without .seek() and .tell(),
> > and I thought those two are the proper way of traversing files, so I used
> > them. I also didn't realize you could use SeqIO with network handles, too.
> > This is really neat :).
>
> Yes - having a handle focused API makes some clever stuff possible :)
> Of course, parsing sequences directly from network handles isn't always
> a good idea, but it can be useful.
>
> > In any case, sure, I'd love to make some changes to the current AbiIO
> > code so it works without .seek() and .tell(). Are there any other input
> > types that do not use .seek() and .tell() other than network handles?
>
> I suspect some specialised handles for accessing compressed files might
> have similar limitations. In the case of gzip at least, I think it does
> support seek and tell.
>
> > Here's my new branch from the current master:
> > https://github.com/bow/biopython/tree/seqio-abi_handlefix
> > nothing different for now but I'll push my updates soon.
>
> Don't rush yourself - I'm away for a long weekend so won't be testing
> any updates till next week anyway.
>
> Thanks,
>
> Peter
>


