[Bioperl-l] Failure when parsing a massive Entrez (GenBank) query.

Sat Apr 2 03:44:57 UTC 2011

One alternative is to download the raw GenBank files and parse them locally; it's very possible that one of them is breaking the parser (if so, please report it).  One way to do this is with Bio::DB::EUtilities; the cookbook has a few examples:

http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook
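
Once the raw records are saved to a local file, one bad record need not stop the rest. The following is only a rough sketch of that idea, written in Python to stay independent of any particular parser: it splits concatenated GenBank flat-file text on the "//" line that ends each record, then parses each record on its own and collects the failures. The names split_genbank_records, parse_each and locus_name are hypothetical, and locus_name is a toy stand-in for a real record parser.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def split_genbank_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield one record at a time from concatenated GenBank flat-file text.

    Relies on the GenBank convention that each record ends with a line
    containing only "//".
    """
    buf: List[str] = []
    for line in lines:
        if not buf and not line.strip():
            continue  # skip blank lines between records
        buf.append(line)
        if line.rstrip() == "//":
            yield "".join(buf)
            buf = []
    if buf:
        # Unterminated trailing text is still yielded so it can be reported.
        yield "".join(buf)


def parse_each(
    records: Iterable[str], parse_one: Callable[[str], object]
) -> Tuple[List[object], List[Tuple[int, str]]]:
    """Parse every record independently.

    Returns (parsed, failed), where failed holds (record index, error message)
    for each record that parse_one rejected.
    """
    parsed: List[object] = []
    failed: List[Tuple[int, str]] = []
    for index, text in enumerate(records):
        try:
            parsed.append(parse_one(text))
        except Exception as exc:  # isolate the failure to this record
            failed.append((index, str(exc)))
    return parsed, failed


def locus_name(record_text: str) -> str:
    """Toy parser (assumption): return the name from the LOCUS line."""
    lines = record_text.strip().splitlines()
    fields = lines[0].split() if lines else []
    if len(fields) < 2 or fields[0] != "LOCUS":
        raise ValueError("record does not start with a LOCUS line")
    return fields[1]
```

With a file opened as fh, parse_each(split_genbank_records(fh), locus_name) would return the parsed results plus a list of the records that failed, which are the ones worth reporting.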

chris

On Apr 1, 2011, at 10:21 PM, Guillermo Fernández wrote:

> Hello,
> 
> I am trying to extract the CDS sequences for a list of GenBank DNA files. I
> have used Bio::SeqIO, Bio::DB::Query::GenBank and Bio::DB::GenBank for it.
> The Entrez query is as follows:
> 
> '(acetolactate synthase[All Fields] AND "green plants"[porgn]) AND
> "flowering plants"[porgn] AND "complete cds"[All Fields]';
> 
> The Perl code can be seen at https://gist.github.com/898456 (extractFiles.pl).
> 
> It runs smoothly until an error arises (the output is included next to the
> source; I recorded the output both with the line "$stream->verbose(2);"
> commented out and with it uncommented). After that, the program dies before
> all the sequences have been parsed. I do not know how to recover from this
> kind of error and resume processing the sequences that remain after that
> point. The documentation at
> http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/SeqIO.html#POD4
> says that resuming parsing after catching an exception thrown by
> "next_seq" cannot be assumed.
> 
> As a consequence, I'm looking for alternatives. I downloaded the list of GI
> numbers from the NCBI for that query (sequence.gi.txt) and piped it to the
> script shown at https://gist.github.com/899123 : "cat sequence.gi.txt |
> xargs ./extractFileByGI.pl". It works and gets past badly formatted sequences
> without compromising the remaining sequences, but it is really slow compared
> with the first script.
> 
> Could you suggest an efficient solution?
> 
> Thank you in advance.
> 
> Guillermo.
> 
> *P.S.* Running extractFiles.pl on a small number of sequences, starting a few
> sequences before the one that seems to fail and ending a few sequences after
> it, surprisingly results in a correct run. (Replace the original NCBI query
> -$queryString- with the sublist of GI numbers separated by spaces:
> "115446302 297179983 30693053 223945818 188529638 188529636 188529634
> 297600179 167118 188529632") Is it a bug? Should it have failed again (as I
> expected)?
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l