[Bioperl-l] parse multi xml

Tue Nov 23 00:27:43 UTC 2010

On Nov 22, 2010, at 2:23 PM, Peter wrote:

> On Mon, Nov 22, 2010 at 8:01 PM, Jordi Durban <jordi.durban at gmail.com> wrote:
>> Hi all,
>> I'm new to the list, although I've been using BioPerl for 2 years.
>> Now I have a problem with an XML file and I don't know how to parse it.
>> The file has 795 XML declarations (that is, <?xml version="1.0"?>) because it
>> was produced by the Blast2GO software, and I suppose it is the
>> outcome of concatenating multiple BLAST results.
> 
> Such a file is NOT a valid XML file (but see below); you can't just
> concatenate XML files. I'm pretty sure people have posted scripts
> to fix such files on the blast2go mailing list.
> 
>> I would like to split the 795 XML chunks into 795 separate
>> files so that I can parse each one for its best hit.
>> The problem appears when using the blastxml parser
>> (Bio::SearchIO::blastxml) because
>> (and this is only my guess) it encounters another, unexpected XML declaration,
>> and I get an error message once the first BLAST result has been parsed.
>> How can I split the file?
>> I hope that was clear.
>> Thanks
> 
> Historically the NCBI standalone BLAST used to create these
> concatenated XML files when used on multiple queries. It has
> since been fixed, but perhaps BioPerl has code still in it to
> handle these legacy invalid XML files?

It does (last I looked).

> My suggestion (until a BioPerl guru speaks up) would be to
> split the file into chunks (in memory) by looking for the string
> <?xml version="1.0"?>, and parsing each chunk individually.
> Each chunk should be a valid XML file on its own.
> 
> Peter

Or, better yet, push the blast2go folks to create valid XML output or use an updated version of BLAST.  This bug was fixed about 3 years ago.
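
If regenerating the output is not an option, the in-memory split Peter describes could look roughly like the sketch below. It is written in Python purely for illustration (the same logic ports directly to Perl), and the function name split_concatenated_xml is hypothetical, not part of BioPerl or Blast2GO. It assumes every embedded report starts with an XML declaration beginning "<?xml " and that nothing meaningful precedes the first one.

```python
import re

# Matches the start of an XML declaration such as <?xml version="1.0"?>.
# The trailing \s avoids matching other processing instructions
# such as <?xml-stylesheet ...?>.
XML_DECL = re.compile(r"<\?xml\s")


def split_concatenated_xml(text):
    """Split a string holding several concatenated XML documents.

    Returns a list of strings, each starting at one XML declaration and
    running up to (but not including) the next one. Any text before the
    first declaration is discarded (an assumption of this sketch).
    """
    starts = [m.start() for m in XML_DECL.finditer(text)]
    if not starts:
        # No declaration found: treat the whole input as a single chunk.
        return [text.strip()] if text.strip() else []
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
```

Each returned chunk can then be written to its own file and parsed on its own, for example with Bio::SearchIO using -format => 'blastxml'.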

chris