[Biopython] About BLAST parser
Peter
biopython at maubp.freeserve.co.uk
Thu Oct 22 10:51:45 UTC 2009
On Thu, Oct 22, 2009 at 11:34 AM, Manu Tamminen <mavata at gmail.com> wrote:
>
> With all blast hits included, the output file is around 1 gigabyte.
> Therefore just opening and searching for the broken parts is challenging
> with regular text editors. Furthermore, I'm not very familiar with XML
> syntax and therefore would probably not recognize the broken parts.
There is probably a neat way to extract a chunk using Unix command
line tools. Or just try something like this in Python:
error_line = 82921  # line number where the error is
input_handle = open("really_big.xml")
output_handle = open("fragment.txt", "w")
for line_number, line in enumerate(input_handle):
    # Keep only the lines within 1000 lines of the error
    if error_line - 1000 < line_number < error_line + 1000:
        output_handle.write(line)
input_handle.close()
output_handle.close()
I would still suggest you re-try copying it from the cluster to your
machine, in case it was just a network error corrupting the file.
> Breaking down the search into smaller parts sounds like a good idea.
> However, I'm also considering writing a more robust script. Would it be
> possible to make the script ignore the broken entries in the XML file and
> skip into next correct one?
I think that will be tricky. Part of the idea of XML is that it is a
strictly defined file format, with standards for how a parser should
interpret the data and abort on bad data. Tolerant XML parsers are
considered to be a bad thing.
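To illustrate the "abort on bad data" behaviour, here is a rough sketch
using the standard library's xml.etree.ElementTree rather than the
Biopython BLAST parser. The function name first_xml_error is made up for
this example. It streams through the file and reports where the parser
gives up, which may help decide which lines to extract:

```python
import xml.etree.ElementTree as ET


def first_xml_error(handle):
    """Return (line, column) of the first XML error, or None if well formed.

    Hypothetical helper for illustration; handle is an open file object
    (or a filename).
    """
    try:
        for _event, element in ET.iterparse(handle):
            # Discard the element's contents once it has been read, to
            # limit how much of a large file is held in memory.
            element.clear()
    except ET.ParseError as err:
        # A conforming parser stops at the first error instead of guessing.
        return err.position
    return None
```

Something like first_xml_error(open("really_big.xml", "rb")) should then
return the position of the first problem. Memory use on a 1 GB file has
not been checked here.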
What should be possible is a simple script that removes the broken
section of the file, giving a partial but valid XML file covering most
of the sequences. It might be more effort than just re-doing the search
(in parts this time).
Peter