[Biojava-l] RichSequenceIterator.nextSequence does not move to next sequence when an exception is thrown

Thu Jul 10 11:30:32 UTC 2008

Ooooh.  That's nasty.  I just re-wrote one of our "loaders" because it
was doing exactly that, breaking the file up into records and then
using the parser to parse each one individually.  I guess that's why
they were doing that.  I'll have to back out my changes.  Good to
know!  Perhaps they should have put in a comment?! :)
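
For what it's worth, the record-splitting half of that approach needs nothing from BioJava. Below is a minimal sketch, assuming every GenBank record ends with a line that starts with "//". GenBankRecordSplitter and nextRecord are hypothetical names made up for illustration, not BioJava classes. It reads one record at a time, so the whole file never has to sit in memory.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of BioJava: hands back one GenBank record at a time.
final class GenBankRecordSplitter {
    private final BufferedReader in;

    GenBankRecordSplitter(Reader source) {
        this.in = (source instanceof BufferedReader)
                ? (BufferedReader) source
                : new BufferedReader(source);
    }

    // Returns the next record, including its "//" line, or null at end of input.
    // Assumption: a line starting with "//" terminates a record.
    String nextRecord() {
        StringBuilder record = new StringBuilder();
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (record.length() == 0 && line.trim().isEmpty()) {
                    continue; // skip blank lines between records
                }
                record.append(line).append('\n');
                if (line.startsWith("//")) {
                    return record.toString();
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Trailing text without a terminator is still returned, so nothing is silently dropped.
        return record.length() > 0 ? record.toString() : null;
    }

    public static void main(String[] args) {
        String twoRecords = "LOCUS A\nORIGIN\n//\n\nLOCUS B\nORIGIN\n//\n";
        GenBankRecordSplitter splitter =
                new GenBankRecordSplitter(new StringReader(twoRecords));
        int count = 0;
        for (String rec = splitter.nextRecord(); rec != null; rec = splitter.nextRecord()) {
            count++; // a real loader would parse rec here inside its own try/catch
        }
        System.out.println("records read: " + count);
    }
}
```

Each returned string can then be wrapped in a StringReader and BufferedReader and passed to RichSequence.IOTools.readGenbankDNA, as in Martin's second example below, so a BioException costs only that one record.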

On Thu, Jul 10, 2008 at 6:21 AM, Richard Holland
<dicknetherlands at gmail.com> wrote:
> Hello. You appear to have hit a bit of a limitation with the system.
> The sequence iterator doesn't know how to skip over bad records (in
> fact, the parsers themselves do not - they just give up at the first
> sign of a failed line). I'll have to have a think about how to fix
> this, as it's not immediately obvious (although it definitely needs to
> be done).
>
> cheers,
> Richard
>
> 2008/7/10 Martin Jones <martin.jones at ed.ac.uk>:
>> Hi,
>>
>> I have a file containing GenBank records, and I want to process them thus:
>>
>> RichSequenceIterator seqs =
>>     RichSequence.IOTools.readGenbankDNA(myReader, null);
>> while (seqs.hasNext()) {
>>     RichSequence seq = seqs.nextRichSequence();
>>     // processing code
>> }
>>
>> However, some records cannot be parsed by BioJava. This is to be expected,
>> as I'm processing half a million records and some are bound to be wonky.
>> So I use a try-catch to skip over troublesome records:
>>
>>
>> RichSequenceIterator seqs =
>>     RichSequence.IOTools.readGenbankDNA(myReader, null);
>> while (seqs.hasNext()) {
>>     try {
>>         RichSequence seq = seqs.nextRichSequence();
>>         // processing code
>>     } catch (BioException e) {
>>         System.out.println("record could not be parsed!");
>>     }
>> }
>>
>> However, it seems that the position in the input file is not changed if an
>> exception is thrown during parsing.  If I run the above code on a file
>> containing a single un-parseable record, it gets stuck in a non-terminating
>> loop - i.e. each time seqs.nextRichSequence() is called, an exception is
>> thrown, but seqs.hasNext() still returns true.  Is there a correct way to
>> deal with this?  I could split up my input file into multiple records and do
>> something like:
>>
>> ArrayList<String> records = splitGenBankFileIntoRecords();
>> for (String singleRecord : records) {
>>     BufferedReader singleRecordReader =
>>         new BufferedReader(new StringReader(singleRecord));
>>     RichSequenceIterator seqs =
>>         RichSequence.IOTools.readGenbankDNA(singleRecordReader, null);
>>     try {
>>         RichSequence seq = seqs.nextRichSequence();
>>         // processing code
>>     } catch (BioException e) {
>>         System.out.println("record could not be parsed!");
>>     }
>> }
>>
>> but this seems inefficient, as I have to instantiate a new StringReader,
>> BufferedReader and RichSequenceIterator for every record (half a million
>> cycles of object creation/destruction!).
>>
>> Any ideas?
>>
>>
>>
>> --
>> ------------------------
>>
>> Martin Jones
>> School of Biological Sciences,
>> Ashworth Laboratories, King's Buildings
>> Edinburgh, EH9 3JT, UK
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>