[Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
Peter Cock
p.j.a.cock at googlemail.com
Sat Sep 15 10:37:50 UTC 2012
On Sat, Sep 15, 2012 at 3:43 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> Last weekend I also talked with Peter during his visit to Tokyo about the
> Blast (human-readable) plain-text parser. We could see three scenarios in
> which the plain-text parser has an advantage over the XML parser (Peter
> please correct me if I am missing something from our discussion):
>
> 1) The file size of Blast plain-text output may be smaller than that of
> Blast XML output;
> 2) Users may want to look at the Blast output by eye in addition to
> parsing it with Biopython;
> 3) Users may have stacks of old Blast output files in plain-text format
> that they still want to use.
Maybe also (3a): the user may want plain-text BLAST output to
feed into another tool as well as Biopython?
>
> Each of these points can be addressed without a Blast plain-text parser:
> 1) After zipping, we expect little difference in file size between
> plain-text output and XML output;
However, there would be a speed penalty: compression, then
decompression, and perhaps slower parsing of XML than of text.
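
To illustrate the decompression step, here is a minimal sketch using only
the Python standard library (not the Biopython parser). It streams a
gzipped XML file without first unpacking it to disk. The function name
count_hits and the tiny synthetic XML are made up for illustration; only
the <Hit> tag name is borrowed from the BLAST XML format.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def count_hits(handle):
    """Count <Hit> elements from an open binary handle, streaming."""
    hits = 0
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "Hit":
            hits += 1
            elem.clear()  # drop the children of a finished hit to save memory
    return hits


# Tiny synthetic stand-in for BLAST XML (not real BLAST output).
xml_bytes = (
    b"<BlastOutput>"
    + b"<Hit><Hit_id>x</Hit_id></Hit>" * 3
    + b"</BlastOutput>"
)
compressed = gzip.compress(xml_bytes)

# gzip.open accepts a file name or a file object; here an in-memory buffer
# stands in for a hypothetical file such as "results.xml.gz".
with gzip.open(io.BytesIO(compressed), "rb") as handle:
    n_hits = count_hits(handle)
print(n_hits)
```

The decompression cost is still paid on every read, so this only avoids
the temporary file, not the speed penalty itself.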
> 2) If we add a function to Biopython that generates Blast plain-text
> output (or something close to it) from Blast XML output, then a user can
> generate the Blast output in XML format, parse it with Biopython, optionally
> filter it, and then generate the corresponding plain-text output;
The new 'SearchIO' results objects str/repr should be familiar to
anyone who has looked at the plain text BLAST output - but
not identical. We could apply some of these improvements
to the current BLAST parsers, but I favour aiming to simply
deprecate them in favour of 'SearchIO' (namespace to be
decided).
However, we certainly could try and offer a plain-text BLAST
output format from 'SearchIO', although IIRC Bow has not tried
that yet. It shouldn't be too complicated - unless you aim for
100% agreement with the latest BLAST output (moving target).
> 3) If this is really an issue, then we could create some standalone
> scripts (available from the Biopython website) that parse plain-text Blast
> output and generates the corresponding XML output. These scripts will be
> much easier than the current plain-text parser in Biopython, because we can
> create such a script for each version of Blast separately (of course this is
> only done if the need actually arises). The XML output can then be parsed by
> Biopython.
I was not convinced that this would actually save any effort over
continuing to tweak the current (complex but flexible) plain text
parser.
> Are there any other cases in which the plain-text parser is needed?
> Or where our proposed solutions to the three points above are not
> sufficient?
Benchmarking the parsing of (a) plain text, (b) XML, (c) gzipped XML,
and (d) column-rich tabular output might be worthwhile. There may
be a case for parsing plain text on the basis of speed.
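
A rough sketch of how such a benchmark could be set up with the standard
library's timeit follows. Everything here is an assumption for
illustration: the data is synthetic, the three parse_* functions are
trivial stand-ins rather than the Biopython parsers, and case (a) is left
out because it would need a real plain-text parser. For real numbers the
stand-ins would be swapped for the actual parsers and real BLAST files.

```python
import gzip
import timeit
import xml.etree.ElementTree as ET

N_HITS = 2000  # arbitrary size for the synthetic data

# Synthetic data only: the same records as a tab-separated table and as XML.
rows = [("query1", "subject%d" % i, "98.5") for i in range(N_HITS)]
tabular = "\n".join("\t".join(r) for r in rows).encode()
xml = (
    "<hits>"
    + "".join(
        "<hit><q>%s</q><s>%s</s><pident>%s</pident></hit>" % r for r in rows
    )
    + "</hits>"
).encode()
xml_gz = gzip.compress(xml)


def parse_tabular(data):
    """Stand-in tabular parser: split each line on tabs."""
    return [line.split(b"\t") for line in data.splitlines()]


def parse_xml(data):
    """Stand-in XML parser: collect the text of each hit's children."""
    return [[child.text for child in hit] for hit in ET.fromstring(data)]


def parse_xml_gz(data):
    """Decompress in memory, then parse as XML."""
    return parse_xml(gzip.decompress(data))


timings = {}
for name, func, data in [
    ("tabular", parse_tabular, tabular),
    ("xml", parse_xml, xml),
    ("gzipped xml", parse_xml_gz, xml_gz),
]:
    # Best of three runs of five calls each, to reduce timing noise.
    timings[name] = min(timeit.repeat(lambda: func(data), number=5, repeat=3))
    print("%-12s %.4f s  (%d bytes)" % (name, timings[name], len(data)))
```

Taking the minimum over repeats is the usual timeit advice, since the
slower runs mostly reflect other activity on the machine.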
> If not, then I suggest we implement the plain-text generator in (2),
>
I certainly think adding plain-text output to 'SearchIO' would be
useful.
> and upgrade the PendingDeprecationWarning in
> Bio.Blast.NCBIStandalone to a BiopythonDeprecationWarning.
Another idea we touched on was deprecating the current old,
complex but flexible plain text parser while adding a new simpler
plain text parser as part of 'SearchIO'. Here we could target only
the recent BLAST+ output (and perhaps, if it is not too different, the
final 'legacy' BLAST release), and not worry about all the variants
the NCBI have produced over the years. I would hope this would
also be faster [especially as currently 'SearchIO' supports parsing
plain text BLAST on top of the existing old parser].
This boils down to a key question: How many people still want
to use the plain-text output and why? I believe that for most
use cases the tabular or XML output is better (covering simple
needs, and full parsing of every detail respectively).
e.g. It sounds like for Martin's example, the tabular output would
be a perfect match.
[Although, as I noted above, parsing the XML, especially if
compressed, may not be as fast as parsing plain text?]
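
For the simple needs covered by the tabular output, very little code is
required even without Biopython. The sketch below assumes the default
twelve columns of BLAST+ tabular output (-outfmt 6, or -outfmt 7 with
'#' comment lines); the function name read_tabular and the example line
are made up for illustration.

```python
import csv
import io

# Default 12 columns of BLAST+ tabular output (-outfmt 6).
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def read_tabular(handle):
    """Yield one dict per hit line, skipping '#' comment lines."""
    lines = (line for line in handle if not line.startswith("#"))
    for row in csv.DictReader(lines, fieldnames=COLUMNS, delimiter="\t"):
        # Convert the numeric fields most often used for filtering.
        row["pident"] = float(row["pident"])
        row["evalue"] = float(row["evalue"])
        row["bitscore"] = float(row["bitscore"])
        yield row


# Made-up example line, not real BLAST results.
example = io.StringIO(
    "# hypothetical comment line\n"
    "query1\tsubject1\t98.50\t200\t3\t0\t1\t200\t11\t210\t1e-50\t350\n"
)
hits = list(read_tabular(example))
print(hits[0]["sseqid"], hits[0]["evalue"])
```

If a custom column list was requested when running BLAST, COLUMNS would
have to be changed to match.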
While writing this email I was trying to recall when I last used
the plain text output - and the only situation I could think of
in the last year or so was in order to have something human
readable to show a collaborator. Here XML to plain text BLAST
would have been fine.
Peter