[Biopython] About BLAST parser

Thu Oct 22 10:34:55 UTC 2009

With all BLAST hits included, the output file is around 1 gigabyte, so
just opening it and searching for the broken parts is challenging with
regular text editors. Furthermore, I'm not very familiar with XML
syntax, so I would probably not recognize the broken parts.
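
One way to look at the region around the line number given in the error
message, without opening the whole file in an editor, is to stream the
file and keep only a small window. This is just a sketch: the function
name lines_around is made up for illustration, and the line number has
to be taken from the actual error message.

```python
from itertools import islice


def lines_around(path, line_no, context=3):
    """Return (number, text) pairs for the lines surrounding line_no (1-based)."""
    first = max(1, line_no - context)
    last = line_no + context
    # errors="replace" keeps corrupted bytes from raising a decode error.
    with open(path, errors="replace") as handle:
        # islice skips the earlier lines without keeping them in memory.
        window = islice(handle, first - 1, last)
        return [(n, text.rstrip("\n")) for n, text in enumerate(window, start=first)]
```

Calling lines_around("results.xml", 123456) would return the seven lines
centred on line 123456 (the file name and line number here are
hypothetical).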

Breaking down the search into smaller parts sounds like a good idea.
However, I'm also considering writing a more robust script. Would it
be possible to make the script ignore the broken entries in the XML
file and skip to the next valid one?
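
A rough sketch of what such a script might look like is below. It does
not use the Biopython parser. Instead it assumes that the file is a
single BLAST XML document in which each query's results sit in their own
<Iteration> element, with the opening and closing tags on separate
lines, and it parses each block on its own with the standard library.
The function names are made up for illustration, and only the query
description and hit descriptions are extracted.

```python
import xml.etree.ElementTree as ET


def iteration_blocks(lines):
    """Yield the text of each <Iteration> block, complete or truncated."""
    block = None
    for line in lines:
        if "<Iteration>" in line:
            # A new block starts; an unfinished one is passed on as-is
            # so that the caller can reject it.
            if block is not None:
                yield "".join(block)
            block = [line]
        elif block is not None:
            block.append(line)
            if "</Iteration>" in line:
                yield "".join(block)
                block = None
    if block is not None:
        yield "".join(block)


def good_queries(lines):
    """Yield (query description, [hit descriptions]) for parseable blocks only."""
    for text in iteration_blocks(lines):
        try:
            elem = ET.fromstring(text)
        except ET.ParseError:
            continue  # skip the broken entry and move on to the next one
        hits = [hit.findtext("Hit_def") for hit in elem.iter("Hit")]
        yield elem.findtext("Iteration_query-def"), hits
```

An open file handle can be passed as lines, so the 1 gigabyte file is
read one line at a time. If the damage hits an <Iteration> tag itself,
two neighbouring entries would be merged and skipped together.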

On Oct 22, 2009, at 1:19 PM, Peter wrote:

> On Thu, Oct 22, 2009 at 11:06 AM, Manu Tamminen <mavata at gmail.com> wrote:
>>
>> Hi Peter! Thanks for your prompt reply! I've run the BLAST analysis on a
>> supercomputer cluster, saved the results into an XML file and then
>> transferred the output file to my computer. I then run the script on my
>> computer to parse the results into a tab-separated file. With the current
>> dataset I have 1115 sequences of around 500 bp each.
>> Manu
>
> Based on the Biopython error message, I suspect your XML file is
> broken. How big is the XML file (in MB)? There are online tools for
> validating XML, but uploading a large file is out of the question. You
> could also open the file in a suitable editor, go to the line number
> given in the Biopython error message, and look at the file by eye to
> see if there is anything obvious.
>
> It is possible that the XML file was corrupted when you copied it to
> your local machine (e.g. a network error). You could try zipping it
> up, and then copying it again. It is also possible that the XML file
> was corrupted on the disk on the cluster (rare, but this can happen).
> In this case you might be able to fix the XML by hand, or re-run the search.
>
> Alternatively, it is possible that the file is valid, and the Biopython
> parser (or the Python library we use internally) has a bug. As long as the
> XML file isn't too big (say 10MB), you could email it to me personally
> (NOT the mailing list) and I can try and have a look at it.
>
> Personally, I would break up the task into jobs (maybe six jobs of
> up to 200 sequences each - or even one sequence per job). On
> most clusters this is a good idea anyway, as they can then be
> handled by different cluster nodes. For the analysis, you just have
> to parse the separate XML files. Any corrupted XML file will then
> only affect a few sequences, and checking it or re-running it is
> going to be much quicker and easier.
>
> Peter
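
The suggestion above of splitting the run into jobs of up to 200
sequences each could be sketched roughly as follows, assuming the
queries are in a single FASTA file. The function names and the output
file naming scheme are made up for illustration.

```python
def fasta_chunks(lines, per_chunk=200):
    """Yield lists of FASTA lines, each holding up to per_chunk records."""
    chunk, count = [], 0
    for line in lines:
        if line.startswith(">"):
            # A header line starts a new record; close the chunk if it is full.
            if count == per_chunk:
                yield chunk
                chunk, count = [], 0
            count += 1
        chunk.append(line)
    if chunk:
        yield chunk


def split_fasta(path, prefix, per_chunk=200):
    """Write each chunk to prefix_001.fasta, prefix_002.fasta, and so on."""
    names = []
    with open(path) as handle:
        for i, chunk in enumerate(fasta_chunks(handle, per_chunk), start=1):
            name = "%s_%03d.fasta" % (prefix, i)
            with open(name, "w") as out:
                out.writelines(chunk)
            names.append(name)
    return names
```

With 1115 sequences and the default of 200 per chunk this gives six
files, the last one holding 115 sequences. Each file can then be
submitted as a separate BLAST job and its XML output parsed on its own.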

---
Manu Tamminen, M.Sc.
University of Helsinki
Department of Applied Chemistry and Microbiology, Division of Microbiology
P.O. Box 56
00014 HELSINKI
FINLAND

tel: +358 (0)9191 57585
fax:  +358 (0)9191 59322
e-mail: manu.tamminen at helsinki.fi
home: http://www.mm.helsinki.fi/~mvtammin/