[Biojava-l] RichSequenceIterator.nextSequence does not move to next sequence when an exception is thrown
James Carman
james at carmanconsulting.com
Thu Jul 10 11:30:32 UTC 2008
Ooooh. That's nasty. I just re-wrote one of our "loaders" because it
was doing exactly that, breaking the file up into records and then
using the parser to parse each one individually. I guess that's why
they were doing that. I'll have to back out my changes. Good to
know! Perhaps they should have put in a comment?! :)
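
For reference, here is a minimal sketch of the record-splitting workaround that streams one record at a time instead of loading the whole file into a list first. It assumes each GenBank record ends with a line beginning with "//" (the flat-file record terminator). GenBankRecordSplitter and nextRecord are hypothetical names for illustration, not part of BioJava, and the sketch uses only java.io.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical helper (not part of BioJava): reads one GenBank record at a time. */
final class GenBankRecordSplitter {

    private GenBankRecordSplitter() {}

    /**
     * Returns the text of the next record, including its "//" terminator line,
     * or null once the reader is exhausted.
     */
    static String nextRecord(BufferedReader in) {
        StringBuilder record = new StringBuilder();
        try {
            String line;
            while ((line = in.readLine()) != null) {
                // Skip blank lines between records so they do not form empty records.
                if (record.length() == 0 && line.trim().isEmpty()) {
                    continue;
                }
                record.append(line).append('\n');
                // Assumption: a line beginning with "//" terminates a record.
                if (line.startsWith("//")) {
                    return record.toString();
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // End of input: return a trailing unterminated record if there is one.
        return record.length() > 0 ? record.toString() : null;
    }
}
```

Each returned string could then be wrapped in a BufferedReader over a StringReader and passed to RichSequence.IOTools.readGenbankDNA(reader, null) inside its own try/catch, as in the quoted example below. A record that fails to parse is simply dropped, and the next call to nextRecord picks up at the following record, so only one record is held in memory at a time.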
On Thu, Jul 10, 2008 at 6:21 AM, Richard Holland
<dicknetherlands at gmail.com> wrote:
> Hello. You appear to have hit a bit of a limitation with the system.
> The sequence iterator doesn't know how to skip over bad records (in
> fact, the parsers themselves do not - they just give up at the first
> sign of a failed line). I'll have to have a think about how to fix
> this, as it's not immediately obvious (although it definitely needs to
> be done).
>
> cheers,
> Richard
>
> 2008/7/10 Martin Jones <martin.jones at ed.ac.uk>:
>> Hi,
>>
>> I have a file containing GenBank records, and I want to process them thus:
>>
>> RichSequenceIterator seqs = RichSequence.IOTools.readGenbankDNA(myReader, null);
>> while (seqs.hasNext()) {
>>     RichSequence seq = seqs.nextRichSequence();
>>     // processing code
>> }
>>
>> However, some records cannot be parsed by BioJava... this is to be expected
>> as I'm processing half a million records - some are bound to be wonky. So I
>> use a try-catch to skip over troublesome records:
>>
>>
>> RichSequenceIterator seqs = RichSequence.IOTools.readGenbankDNA(myReader, null);
>> while (seqs.hasNext()) {
>>     try {
>>         RichSequence seq = seqs.nextRichSequence();
>>         // processing code
>>     } catch (BioException e) {
>>         System.out.println("record could not be parsed!");
>>     }
>> }
>>
>> However, it seems that the position in the input file is not changed if an
>> exception is thrown during parsing. If I run the above code on a file
>> containing a single un-parseable record, it gets stuck in a non-terminating
>> loop - i.e. each time seqs.nextRichSequence() is called, an exception is
>> thrown, but seqs.hasNext() still returns true. Is there a correct way to
>> deal with this? I could split up my input file into multiple records and do
>> something like:
>>
>> ArrayList<String> records = splitGenBankFileIntoRecords();
>> for (String singleRecord : records) {
>>     BufferedReader singleRecordReader =
>>         new BufferedReader(new StringReader(singleRecord));
>>     RichSequenceIterator seqs =
>>         RichSequence.IOTools.readGenbankDNA(singleRecordReader, null);
>>     try {
>>         RichSequence seq = seqs.nextRichSequence();
>>         // processing code
>>     } catch (BioException e) {
>>         System.out.println("record could not be parsed!");
>>     }
>> }
>>
>> but this seems inefficient, as I have to instantiate a new StringReader,
>> BufferedReader and RichSequenceIterator for every record (half a million
>> cycles of object creation/destruction!).
>>
>> Any ideas?
>>
>>
>>
>> --
>> ------------------------
>>
>> Martin Jones
>> School of Biological Sciences,
>> Ashworth Laboratories, King's Buildings
>> Edinburgh, EH9 3JT, UK
>> _______________________________________________
>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>