[Biojava-l] RichSequenceIterator.nextSequence does not move to next sequence when an exception is thrown

Thu Jul 10 10:21:30 UTC 2008

Hello. You appear to have hit a bit of a limitation with the system.
The sequence iterator doesn't know how to skip over bad records (in
fact, the parsers themselves do not - they just give up at the first
sign of a failed line). I'll have to have a think about how to fix
this, as it's not immediately obvious (although it definitely needs to
be done).
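
In the meantime, the per-record approach can be done lazily, so the whole
file never has to sit in memory. Below is a minimal sketch, assuming every
GenBank record ends with a line containing only "//". The class name
GenbankRecordSplitter is hypothetical and not part of BioJava; it uses only
the standard library and has not been tried on real half-million-record files.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

/**
 * Hypothetical helper (not part of BioJava): reads a GenBank flat file and
 * hands back one record at a time as a String. Assumes each record is
 * terminated by a line containing only "//".
 */
class GenbankRecordSplitter implements Iterator<String> {
    private final BufferedReader in;
    private String next; // the next complete record, or null when exhausted

    GenbankRecordSplitter(BufferedReader in) {
        this.in = in;
        advance();
    }

    /** Reads lines up to and including the next "//" terminator. */
    private void advance() {
        StringBuilder sb = new StringBuilder();
        boolean sawContent = false;
        try {
            String line;
            while ((line = in.readLine()) != null) {
                // Skip blank lines that sit between records.
                if (!sawContent && line.trim().isEmpty()) {
                    continue;
                }
                sawContent = true;
                sb.append(line).append('\n');
                if (line.trim().equals("//")) {
                    break; // end of this record
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // A trailing record without "//" is still returned as-is.
        next = sawContent ? sb.toString() : null;
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public String next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        String record = next;
        advance();
        return record;
    }
}
```

Each string it returns can then be wrapped in a StringReader and
BufferedReader and passed to RichSequence.IOTools.readGenbankDNA inside a
try/catch, as in your second snippet. Because the splitter always advances
past the "//" line, a record that fails to parse cannot stall the loop.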

cheers,
Richard

2008/7/10 Martin Jones <martin.jones at ed.ac.uk>:
> Hi,
>
> I have a file containing GenBank records, and I want to process them thus:
>
> RichSequenceIterator seqs = RichSequence.IOTools.readGenbankDNA(myReader, null);
> while (seqs.hasNext()) {
>     RichSequence seq = seqs.nextRichSequence();
>     // processing code
> }
>
> However, some records cannot be parsed by BioJava. This is to be expected,
> as I'm processing half a million records and some are bound to be wonky. So I
> use a try-catch to skip over troublesome records:
>
>
> RichSequenceIterator seqs = RichSequence.IOTools.readGenbankDNA(myReader, null);
> while (seqs.hasNext()) {
>     try{
>         RichSequence seq = seqs.nextRichSequence();
>         // processing code
>     } catch (BioException e){
>         System.out.println("record could not be parsed!");
>     }
> }
>
> However, it seems that the position in the input file is not changed if an
> exception is thrown during parsing.  If I run the above code on a file
> containing a single un-parseable record, it gets stuck in a non-terminating
> loop - i.e. each time seqs.nextRichSequence() is called, an exception is
> thrown, but seqs.hasNext() still returns true.  Is there a correct way to
> deal with this?  I could split up my input file into multiple records and do
> something like:
>
> ArrayList<String> records = splitGenBankFileIntoRecords();
> for (String singleRecord : records){
>     BufferedReader singleRecordReader = new BufferedReader(new StringReader(singleRecord));
>     RichSequenceIterator seqs = RichSequence.IOTools.readGenbankDNA(singleRecordReader, null);
>     try {
>         RichSequence seq = seqs.nextRichSequence();
>         // processing code
>     } catch (BioException e) {
>         System.out.println("record could not be parsed!");
>     }
> }
>
> but this seems inefficient, as I have to instantiate a new StringReader,
> BufferedReader and RichSequenceIterator for every record (half a million
> cycles of object creation/destruction!)
>
> Any ideas?
>
>
>
> --
> ------------------------
>
> Martin Jones
> School of Biological Sciences,
> Ashworth Laboratories, King's Buildings
> Edinburgh, EH9 3JT, UK
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>