[Biopython] About BLAST parser

Thu Oct 22 10:51:45 UTC 2009

On Thu, Oct 22, 2009 at 11:34 AM, Manu Tamminen <mavata at gmail.com> wrote:
>
> With all blast hits included, the output file is around 1 gigabyte.
> Therefore just opening and searching for the broken parts is challenging
> with regular text editors. Furthermore, I'm not very familiar with XML
> syntax and therefore would probably not recognize the broken parts.

There is probably a neat way to extract a chunk using Unix command
line tools. Or just try something like this in Python:

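error_line = 82921  # line number reported by the parser
input_handle = open("really_big.xml")
output_handle = open("fragment.txt", "w")
# Count lines from one, to match the line number in the error message
for line_number, line in enumerate(input_handle, 1):
    if error_line - 1000 < line_number < error_line + 1000:
        output_handle.write(line)
    elif line_number >= error_line + 1000:
        break  # past the region of interest, no need to read the rest
input_handle.close()
output_handle.close()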

I would still suggest you re-try copying it from the cluster to your
machine, in case it was just a network error corrupting the file.

> Breaking down the search into smaller parts sounds like a good idea.
> However, I'm also considering writing a more robust script. Would it be
> possible to make the script ignore the broken entries in the XML file and
> skip into next correct one?

I think that will be tricky. Part of the idea behind XML is that it is a
strictly defined file format, with standards saying how a parser must
interpret the data and that it should abort on bad data. Tolerant XML
parsers are considered to be a bad thing.
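
To illustrate the strictness, here is a minimal sketch using the standard
library's xml.etree.ElementTree rather than the Biopython BLAST parser. The
tiny XML strings and the try_parse helper are made up for the example. A
conforming parser refuses the damaged document outright instead of guessing,
but it does report where it gave up:

```python
import xml.etree.ElementTree as ET

# Two made-up documents: one well formed, one with an unclosed element
good = "<hits><hit>a</hit><hit>b</hit></hits>"
bad = "<hits><hit>a</hit><hit>b</hits>"  # the second <hit> is never closed


def try_parse(text):
    """Return (True, None) if text is well formed, else (False, (line, column))."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # err.position is a (line, column) tuple saying where parsing stopped
        return False, err.position
    return True, None


print(try_parse(good))
print(try_parse(bad))
```

The position in the exception is the kind of line number used as error_line
in the snippet earlier in this message.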

What should be possible is a simple script that removes the broken
section of the file, giving a partial but valid XML file covering most
of the sequences. However, that might be more effort than just re-doing
the search (in parts this time).
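
As a rough sketch of such a script, assuming the file follows the usual
BLAST XML layout, where each query's results sit in an <Iteration> block
inside <BlastOutput_iterations> within <BlastOutput>, and assuming each of
those tags is on a line of its own. The function name and arguments are made
up for illustration. It keeps the header and every complete iteration before
the error line, drops the rest, and writes the two closing tags itself. It
also assumes the damage is not in the header. Bear in mind the reported line
is where the parser gave up, so the damage may start a little earlier, and
the output should be checked by parsing it again:

```python
def truncate_blast_xml(in_name, out_name, error_line):
    """Copy in_name to out_name, keeping only complete iterations before error_line.

    Returns the number of <Iteration> blocks kept.
    """
    closing_tags = ("</BlastOutput_iterations>", "</BlastOutput>")
    kept = 0
    pending = None  # None outside an iteration, else the lines of the current one
    with open(in_name) as input_handle, open(out_name, "w") as output_handle:
        for line_number, line in enumerate(input_handle, 1):
            if line_number >= error_line:
                break  # everything from the error onwards is discarded
            stripped = line.strip()
            if stripped == "<Iteration>":
                pending = [line]  # start buffering a new iteration
            elif pending is not None:
                pending.append(line)
                if stripped == "</Iteration>":
                    # The iteration is complete, so it is safe to write it out
                    output_handle.writelines(pending)
                    pending = None
                    kept += 1
            elif stripped not in closing_tags:
                output_handle.write(line)  # header lines before the iterations
        # Any half-read iteration left in pending is dropped; close the document
        output_handle.write("</BlastOutput_iterations>\n</BlastOutput>\n")
    return kept
```

For example, truncate_blast_xml("really_big.xml", "partial.xml", 82921) would
write partial.xml and return how many queries survived.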

Peter