[Biopython-dev] Fwd: [Open-bio-l] Proposed BLAST XML Changes

Peter Cock p.j.a.cock at googlemail.com
Tue Mar 18 10:17:48 UTC 2014


On Tue, Mar 18, 2014 at 9:52 AM, Wibowo Arindrarto
<w.arindrarto at gmail.com> wrote:
> Hi Peter, everyone,
>
> Thanks for the heads up. If implemented as is, the updates will
> change our underlying SearchIO model (in addition to the blast-xml
> parser itself), by allowing Hit retrieval using multiple different keys.

Could you clarify what you mean by multiple keys here?

> I have a feeling it will be difficult to jam all the new changes into
> a backwards-compatible parser. One way to make it transparent to users
> is to use the underlying DTD to do validation before parsing (for the
> two BLAST DTDs, use the one which the file can be validated against).
> However, this comes at a price. Since the ElementTree bundled with the
> standard library doesn't seem to support validation, we would have to
> use another library (lxml is my choice). This means adding a third-party
> dependency which requires compiling (lxml is partly written in C).

We can probably tell by sniffing the first few lines... but doing
that without a handle seek to rewind may be tricky (we want to
keep supporting the parsing of streams, e.g. stdin).
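
As a rough sketch of sniffing without a seek: read the first few lines,
look for a marker, then chain those lines back in front of the rest of
the handle. The function name and the "BlastOutput2" marker are
assumptions for illustration only; the real marker would need checking
against the new DTD.

```python
import io
import itertools

# Assumption: the new-style output can be recognised by this string near
# the top of the file. This is a placeholder, not a confirmed marker.
NEW_FORMAT_MARKER = "BlastOutput2"


def sniff_blast_xml(handle, max_lines=10):
    """Guess the BLAST XML flavour without calling handle.seek().

    Returns (format_name, lines), where lines is an iterator yielding the
    lines already consumed followed by the remainder of the handle.
    """
    # Consume at most max_lines lines from the front of the stream.
    head = list(itertools.islice(handle, max_lines))
    is_new = any(NEW_FORMAT_MARKER in line for line in head)
    fmt = "blast-xml2" if is_new else "blast-xml"
    # Replay the consumed lines, then continue with the untouched rest.
    return fmt, itertools.chain(head, handle)


# Works on any line iterator, so sys.stdin could be used in place of StringIO.
example = io.StringIO('<?xml version="1.0"?>\n<BlastOutput>\n</BlastOutput>\n')
fmt, lines = sniff_blast_xml(example)
```

The catch is that the parser then has to accept an iterator of lines
rather than a real file handle.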

> The other option is to introduce a new format name (e.g.
> 'blast-xml2'), which makes the user responsible for knowing which
> BLAST XML he/she is parsing. It feels more explicit this way, so I am
> leaning towards this option, despite 'blast-xml2' not sounding very
> nice to me ;).
>
> Any other thoughts?
>
> Best,
> Bow

I agree that for the SearchIO interface, having two format names
makes sense - unless there is a neat way to auto-detect this on input.

Using "blast-xml2" would work, or maybe something like
"blast-xml-2014" (too long?).

We could even go for "blast-xml-old" and "blast-xml" perhaps?
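
To make the explicit option concrete, here is a minimal sketch of
dispatching on the format name. The parser functions and the registry
are hypothetical stand-ins, not the actual SearchIO internals, and the
names "blast-xml" and "blast-xml2" are just the candidates from this
thread.

```python
import io


def _parse_old(handle):
    # Hypothetical stand-in for the existing BLAST XML parser.
    return ("old BLAST XML", handle.read())


def _parse_new(handle):
    # Hypothetical stand-in for a parser of the proposed new XML.
    return ("new BLAST XML", handle.read())


# One entry per format name; the user states which flavour the file is.
_PARSERS = {
    "blast-xml": _parse_old,
    "blast-xml2": _parse_new,
}


def parse(handle, format):
    """Look up the parser for the given format name and run it."""
    try:
        parser = _PARSERS[format]
    except KeyError:
        raise ValueError("Unknown format %r" % format)
    return parser(handle)


result = parse(io.StringIO("<BlastOutput/>"), "blast-xml")
```

Renaming to "blast-xml-old" and "blast-xml" would only change the keys
of the mapping.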

Peter


