[Bioperl-l] parse multi xml

Mon Nov 22 20:23:29 UTC 2010

On Mon, Nov 22, 2010 at 8:01 PM, Jordi Durban <jordi.durban at gmail.com> wrote:
> Hi all,
> I'm a newbie on the list although I've been using bioperl for 2 years.
> Now I have a problem with an XML file and I don't know how to parse it.
> That file has 795 XML top tags (that is, <?xml version="1.0"?>) because it
> resulted from using the Blast2go software, and I suppose the file is the
> outcome of concatenating multiple blast results.

Such a file is NOT a valid XML file (but see below); you can't just
concatenate XML files. I'm pretty sure people have posted scripts
to fix such files on the blast2go mailing list.

> Well, I would like to split all 795 different XML chunks into 795 different
> files in order to parse them looking for the best hit.
> The problem appears when using the blastxml parser
> (Bio::SearchIO::blastxml) because
> (and that's a personal opinion) there's another top tag that is not expected,
> and I get an error message once the first blast result has been parsed.
> How can I do that split?
> I hope I was clear
> Thanks

Historically the NCBI standalone BLAST used to create these
concatenated XML files when used on multiple queries. It has
since been fixed, but perhaps BioPerl has code still in it to
handle these legacy invalid XML files?

My suggestion (until a BioPerl guru speaks up) would be to
split the file into chunks (in memory) by looking for the XML
declaration <?xml version="1.0"?> and parsing each chunk
individually. Each chunk should be a valid XML file on its own.
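
As a rough illustration of that splitting idea, here is a minimal
sketch in Python (not BioPerl, and not tested on real Blast2go
output). It assumes every concatenated result starts with an XML
declaration beginning "<?xml"; the function name
split_concatenated_xml is made up for this example.

```python
import re

# Assumption: every concatenated document begins with an XML
# declaration such as <?xml version="1.0"?>.
XML_DECL = re.compile(r"<\?xml\s")


def split_concatenated_xml(text):
    """Return a list of strings, one per XML document found in text.

    Anything before the first declaration is dropped.
    """
    starts = [m.start() for m in XML_DECL.finditer(text)]
    # Each chunk runs from one declaration to the next (or to the end).
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]
```

Each returned chunk could then be written to its own file, or handed
to a parser one at a time, instead of feeding the parser the whole
concatenated file.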

Peter