[Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
Wibowo Arindrarto
w.arindrarto at gmail.com
Fri Sep 21 23:03:10 UTC 2012
Hi guys,
On Fri, Sep 21, 2012 at 3:22 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Sat, Sep 15, 2012 at 2:22 PM, Wibowo Arindrarto
> <w.arindrarto at gmail.com> wrote:
>> Hi guys,
>>
>>> > 2) If we add a function to Biopython that generates Blast plain-text
>>> > output (or something close to it) from Blast XML output, then a user can
>>> > generate the Blast output in XML format, parse it with Biopython,
>>> > optionally
>>> > filter it, and then generate the corresponding plain-text output;
>>>
>>> The new 'SearchIO' results objects str/repr should be familiar to
>>> anyone who has looked at the plain text BLAST output - but
>>> not identical. We could apply some of these improvements
>>> to the current BLAST parsers, but I favour aiming to simply
>>> deprecate them in favour of 'SearchIO' (namespace to be
>>> decided).
>>>
>>> However, we certainly could try and offer a plain-text BLAST
>>> output format from 'SearchIO', although IIRC Bow has not tried
>>> that yet. It shouldn't be too complicated - unless you aim for
>>> 100% agreement with the latest BLAST output (moving target).
>>
>> Yes, this has not been attempted ~ mostly because I feel that the
>> BLAST plain text is indeed a moving target. But, if we are in favor of
>> choosing one format from one BLAST version and always stick to it, it
>> sounds more reasonable.
>>
>> There is one missing detail that is only present in the plain text
>> format, though: the hit-level e-values. If we do decide to write a
>> plain text writer, we either have to demand the user supply these
>> values, or we omit the entire hit-level e-value table, or we fill it
>> with something else.
>
> Bow and I have just been over the BLAST+ source code,
> and confirmed the 'hit level e-value' shown in the plain text
> description table before the alignments is in fact just the
> e-value of the best HSP. i.e. The minimum e-value.
>
> So that isn't a problem after all.
>
> Peter
Yes, I should've checked first how that e-value gets there. A little
peeking into the source code and it was apparent that it's the lowest
HSP-level e-value in the hit. So we don't have to worry about
calculating new values.
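To make that concrete, here is a rough sketch of how a writer could fill in the description table's e-value column. The `Hsp` and `Hit` classes and the `hit_level_evalue` function are hypothetical stand-ins for illustration only, not the actual SearchIO objects or any existing Biopython API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hsp:
    # Hypothetical stand-in for an HSP; only the e-value matters here.
    evalue: float


@dataclass
class Hit:
    # Hypothetical stand-in for a hit that holds one or more HSPs.
    hit_id: str
    hsps: List[Hsp]


def hit_level_evalue(hit: Hit) -> float:
    """Return the e-value to show for a hit in the description table.

    Per the discussion above, this is simply the lowest (best)
    HSP-level e-value within the hit, so nothing new is calculated.
    """
    if not hit.hsps:
        raise ValueError("hit %r has no HSPs" % hit.hit_id)
    return min(hsp.evalue for hsp in hit.hsps)


# Example: a hit with three HSPs of varying significance.
example_hit = Hit("example_hit", [Hsp(1e-5), Hsp(1e-20), Hsp(0.3)])
best_evalue = hit_level_evalue(example_hit)
```

Since the value is derived from data the XML already carries, a plain text writer would not need the user to supply anything extra.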
For the writing support, I agree with Eric ~ we could use the latest
BLAST legacy output as our target plain text format.
For parsing, I'm still not sure. Unless there's a massive speed-up, I
prefer to keep the current parser as the base given its versatility.
Perhaps I can do a bit more 'trimming' so that the parser directly
creates SearchIO objects. This won't be a major change to the logic,
though.
regards,
Bow