[Biopython-dev] Fwd: [Open-bio-l] Proposed BLAST XML Changes

Tue Mar 18 10:33:55 UTC 2014

On Tue, Mar 18, 2014 at 11:17 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Tue, Mar 18, 2014 at 9:52 AM, Wibowo Arindrarto
> <w.arindrarto at gmail.com> wrote:
>> Hi Peter, everyone,
>>
>> Thanks for the heads up. If implemented as is, the updates will
>> change our underlying SearchIO model (aside from the blast-xml parser
>> itself), by allowing a Hit to be retrieved using multiple different keys.
>
> Could you clarify what you mean by multiple keys here?

Currently, a hit can be retrieved from a QueryResult by its ID, in
addition to its numeric index. With the proposed changes to the Hit element
(ftp://ftp.ncbi.nlm.nih.gov/blast/documents/NEWXML/ProposedBLASTXMLChanges.pdf),
a given Hit can now be annotated with more than one ID.
Ideally, this should also be reflected in the QueryResult object: a
hit should be retrievable using any of the IDs it has.

This will also affect membership checking on the QueryResult object.
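
A rough sketch of the kind of lookup I have in mind. This is not the
actual QueryResult code; MultiKeyHits, its add() method and the IDs are
made-up names for illustration only. Each hit is stored once, and every
ID it carries points back to the same stored object, so both retrieval
and membership checks work with any of the IDs:

```python
class MultiKeyHits:
    """Toy container: each hit is stored once, reachable by any of its IDs."""

    def __init__(self):
        self._hits = []    # hits in insertion order
        self._index = {}   # any ID -> position in self._hits

    def add(self, hit, ids):
        # Refuse an ID that already points at another hit.
        for key in ids:
            if key in self._index:
                raise ValueError("duplicate hit ID: %r" % (key,))
        position = len(self._hits)
        self._hits.append(hit)
        for key in ids:
            self._index[key] = position

    def __getitem__(self, key):
        # Integers are numeric indices, anything else is treated as an ID.
        if isinstance(key, int):
            return self._hits[key]
        return self._hits[self._index[key]]

    def __contains__(self, key):
        return key in self._index

    def __len__(self):
        return len(self._hits)


# Hypothetical hits and IDs, only to exercise the container.
qresult = MultiKeyHits()
qresult.add({"description": "first hit"}, ids=["id1", "id1-alias"])
qresult.add({"description": "second hit"}, ids=["id2"])
```

With this, qresult["id1"] and qresult["id1-alias"] return the same
object, and "id1-alias" in qresult is true, while len(qresult) still
counts each hit once.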

>> I have a feeling it will be difficult to jam all the new changes into
>> a backwards-compatible parser. One way to make it transparent to users
>> is to use the underlying DTD to do validation before parsing (for the
>> two BLAST DTDs, use the one which the file can be validated against).
>> However, this comes at a price. Since the ElementTree bundled with the
>> standard library doesn't seem to support validation, we have to use another
>> library (lxml is my choice). This means adding a third-party dependency
>> which requires compiling (lxml is also partly written in C).
>
> We can probably tell by sniffing the first few lines... but how
> to do that without using a handle seek to rewind may be
> tricky (desirable to support parsing streams, e.g. stdin).

Ah yes. We have a rewindable handle object in Bio.File, don't we
:)? I'll have to play around with some real datasets first, I think.
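
To make the sniffing idea concrete, here is a minimal sketch that does
not rely on seek(). PeekableHandle and sniff_format are hypothetical
names (this is a stand-in, not the Bio.File class), and the DOCTYPE
marker is my assumption about how current BLAST XML files can be
recognised; the real distinguishing feature would have to be checked
against actual files in both formats:

```python
import io


class PeekableHandle:
    """Wrap a binary handle so its first bytes can be inspected and replayed."""

    def __init__(self, handle):
        self._handle = handle
        self._buffer = b""

    def peek(self, size):
        # Read just enough to hold `size` bytes, and keep them for later reads.
        missing = size - len(self._buffer)
        if missing > 0:
            self._buffer += self._handle.read(missing)
        return self._buffer[:size]

    def read(self, size=-1):
        # Serve buffered bytes first, then fall through to the real handle.
        if size is None or size < 0:
            data = self._buffer + self._handle.read()
            self._buffer = b""
            return data
        data = self._buffer[:size]
        self._buffer = self._buffer[size:]
        if len(data) < size:
            data += self._handle.read(size - len(data))
        return data


# Assumption: files in the current format declare this DOCTYPE near the top.
OLD_MARKER = b"<!DOCTYPE BlastOutput "


def sniff_format(peekable, size=256):
    """Guess the format name from the first `size` bytes without consuming them."""
    head = peekable.peek(size)
    return "blast-xml" if OLD_MARKER in head else "blast-xml2"


# Made-up, minimal input that only mimics the top of a file.
sample = (b'<?xml version="1.0"?>\n'
          b'<!DOCTYPE BlastOutput PUBLIC "x" "y">\n'
          b'<BlastOutput></BlastOutput>\n')
handle = PeekableHandle(io.BytesIO(sample))
detected = sniff_format(handle)
replayed = handle.read()
```

After sniffing, the wrapped handle still yields the complete input, so
it could be handed to whichever parser matches the detected format.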

The other thing we should take into account is the XInclude tag. Would
we want to make it possible to query *either* the single query XML
results or the master XInclude document (point 2 of the proposed
change)? Or should we restrict our parser only to the single query
files?
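
For what it's worth, the standard library can already resolve XInclude
references via xml.etree.ElementInclude, so supporting the master
document may not need lxml. A small sketch, where the element names,
file names and the in-memory loader are all invented for illustration
and do not reflect the proposed schema:

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical master document pointing at per-query result files.
MASTER = """<Results xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="query1.xml"/>
  <xi:include href="query2.xml"/>
</Results>"""

# Stand-in for files on disk, so the example is self-contained.
PER_QUERY = {
    "query1.xml": "<Query id='q1'/>",
    "query2.xml": "<Query id='q2'/>",
}


def loader(href, parse, encoding=None):
    """Return the parsed element for an included href (XML includes only)."""
    return ET.fromstring(PER_QUERY[href])


root = ET.fromstring(MASTER)
# Replaces each xi:include element in place with the loaded element.
ElementInclude.include(root, loader=loader)
query_ids = [query.get("id") for query in root.iter("Query")]
```

After the include step, the master tree contains both per-query
elements, so the same parsing code could in principle walk either a
single-query file or the expanded master document.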

>> The other option is to introduce a new format name (e.g.
>> 'blast-xml2'), which makes the user responsible for knowing which
>> BLAST XML he/she is parsing. It feels more explicit this way, so I am
>> leaning towards this option, despite 'blast-xml2' not sounding very
>> nice to me ;).
>>
>> Any other thoughts?
>>
>> Best,
>> Bow
>
> I agree that for the SearchIO interface, two format names make
> sense - unless there is a neat way to auto-detect this on input.
>
> Using "blast-xml2" would work, or maybe something like
> "blast-xml-2014" (too long?).
>
> We could even go for "blast-xml-old" and "blast-xml" perhaps?

Hmm, 'blast-xml-old' may make it difficult to adapt to future XML
schema changes. How about renaming the current parser to
'blast-xml-legacy', and the new one to just 'blast-xml'?

Cheers,
Bow