[Bioperl-l] Failure when parsing a massive Entrez (GenBank) query.

Guillermo Fernández gylz.mail at gmail.com
Sat Apr 2 03:21:32 UTC 2011


Hello,

I am trying to extract the CDS sequences for a list of GenBank DNA files,
using Bio::SeqIO, Bio::DB::Query::GenBank and Bio::DB::GenBank.
The Entrez query is as follows:

'(acetolactate synthase[All Fields] AND "green plants"[porgn]) AND
"flowering plants"[porgn] AND "complete cds"[All Fields]';

The Perl code can be seen at https://gist.github.com/898456 (extractFiles.pl).

It runs smoothly until an error is raised. (The output is included below the
source; I recorded it both with the line "$stream->verbose(2);" commented out
and with it uncommented.) After the error, the program dies before all the
sequences have been parsed. I do not know how to recover from this kind of
error and resume processing the sequences that remain after that point. In
http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/SeqIO.html#POD4
it is said that resuming parsing after catching an exception thrown by
"next_seq" cannot be assumed to work.

As a consequence, I am looking for alternatives. I downloaded the list of GI
numbers for that query from the NCBI (sequence.gi.txt) and piped it to the
script shown at https://gist.github.com/899123 : "cat sequence.gi.txt |
xargs ./extractFileByGI.pl". It works and gets past badly formatted sequences
without compromising the remaining ones, but it is really slow compared
with the first script.
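
One compromise I can imagine is to process the GI list in batches and to fall
back to one-by-one processing only for a batch that fails. The following is
only a rough shell sketch of that idea, not something I have run against the
NCBI. It assumes that extractFileByGI.pl accepts several GI numbers as
arguments, writes its results to standard output and exits with a non-zero
status on failure; process_in_batches, BATCH_SIZE and FAILED_LOG are
hypothetical names made up for the illustration.

```shell
#!/bin/sh
# Hypothetical helper: run a command on batches of IDs read from stdin.
# A failing batch is retried one ID at a time, so a single bad record
# only costs that record. BATCH_SIZE and FAILED_LOG are illustrative names.
BATCH_SIZE="${BATCH_SIZE:-50}"
FAILED_LOG="${FAILED_LOG:-failed.gi.txt}"

process_in_batches() {
    # xargs without a command echoes its input, BATCH_SIZE IDs per line.
    xargs -n "$BATCH_SIZE" | while read -r batch; do
        tmp=$(mktemp)
        # $batch is left unquoted on purpose: it holds space-separated IDs.
        # stdin is redirected so the command cannot swallow the ID list.
        if "$@" $batch < /dev/null > "$tmp"; then
            # Whole batch succeeded: emit its buffered output.
            cat "$tmp"
        else
            # Discard the partial output and retry each ID on its own,
            # logging the IDs that still fail.
            for gi in $batch; do
                "$@" "$gi" < /dev/null || echo "$gi" >> "$FAILED_LOG"
            done
        fi
        rm -f "$tmp"
    done
}

# Intended usage (under the assumptions stated above):
#   process_in_batches ./extractFileByGI.pl < sequence.gi.txt > cds.out
```

The output of each batch is buffered in a temporary file so that a batch that
fails halfway does not leave duplicated sequences once its GI numbers are
retried individually.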

Could you suggest an efficient solution?

Thank you in advance.

Guillermo.

*P.S.* Running extractFiles.pl on a small number of sequences, starting a few
sequences before the one that seems to fail and ending a few sequences after
it, results in a surprisingly correct run. (To reproduce, replace the original
NCBI query, $queryString, with this sublist of GI numbers separated by white
space: "115446302 297179983 30693053 223945818 188529638 188529636 188529634
297600179 167118 188529632") Is this a bug? Should it have failed again, as I
expected?


