[Biojava-l] FASTA parsing bug ?

Tue May 12 20:40:40 UTC 2009

Hi all,

I apologize for missing the "next day or so" window but here is my patch.  I'm attaching a patch
file for FastaFormat.java (1.7 tagged branch), the full source file, and a test class.  It seems to
work and performance is on par with the previous approach in my measurements.  One problem that I
couldn't quite figure out a way around is with the guessSymbolTokenization(BufferedInputStream
stream) method.  If the first sequence of the stream has a header length plus first line of sequence
length longer than 2000 characters it will fail to reset properly.

Cheers,
Josh

Richard Holland wrote:
> I'd love to see a proper solution to this that doesn't involve upping
> the read-ahead limit. I was aware that it might be the issue, but had no
> idea why it was not failing for other similar long sequences. I look
> forward to seeing your suggested fix!
> 
> thanks,
> Richard
> 
> Josh Goodman wrote:
>> Hi Richard and JP,
>>
>> I think I can be of some help as I'm the FlyBase developer responsible for 
>> generating these troublesome FASTA files :-).  The cause of this problem 
>> appears to be the description line length for the record FBpp0145470.
>>
>> The trouble lies in org.biojavax.bio.seq.io.FastaFormat in the while loop 
>> at line 196.  Biojava correctly reads in FBpp0145468 but throws an error 
>> when trying to parse FBpp0145469.  There is nothing wrong in FBpp0145469 
>> but when biojava reaches the end of the sequence it reads in the header 
>> for the next record (FBpp0145470).  It then tries to reset the 
>> BufferedReader to the start of FBpp0145470 but that is where the exception 
>> is thrown because line 197 sets the read ahead limit to 500 characters and 
>> the reader.readLine() command exceeds that limit.
>>
>> What isn't obvious to me is why other large definition lines that precede 
>> that line don't throw the same error (e.g. FBpp0157909).  I guess the 
>> javadoc on BufferedReader.mark() does say "may fail" but I assumed it 
>> would be more predictable than that.
>>
>> The file in question can be downloaded from 
>> ftp://ftp.flybase.net/genomes/Drosophila_grimshawi/dgri_r1.3_FB2008_07/fasta/dgri-all-translation-r1.3.fasta.gz.
>>
>> If there is interest in a solution that doesn't involve simply upping the 
>> read ahead limit I can put a patch file together in the next day or so.
>>
>> Cheers,
>> Josh
>>
>> On Tue, 28 Apr 2009, Richard Holland wrote:
>>
>>> You're right, doesn't look like newlines.
>>>
>>> The "Mark invalid" happens when the parser looks too far ahead in the
>>> file attempting to seek out the next valid sequence to parse. I'm not
>>> sure why this is happening.
>>>
>>> I don't have the time to test right now but if you could post the link
>>> to where someone could download the same FASTA as you're using, then it
>>> would make it possible for someone else to investigate in more detail.
>>>
>>> thanks,
>>> Richard
>>>
>>> JP wrote:
>>>> Thanks Richard for your prompt reply.
>>>>
>>>> I will not attach the fasta file I am parsing (12MB) its
>>>> dgri-all-translation-r1.3.fasta from the flybase project.
>>>>
>>>> If the file had any extra new lines I would see them when I loaded it in
>>>> a text editor - no ?
>>>>
>>>> I implemented the whole thing without using Biojava (for this part)
>>>>
>>>>     fr = new FileReader(fastaProteinFileName);
>>>>     br = new BufferedReader(fr);
>>>>     String fastaLine;
>>>>     String startAccession = '>' + accessionId.trim();
>>>>     String fastaEntry = "";
>>>>     boolean record = false;
>>>>     while ((fastaLine = br.readLine()) != null) {
>>>>         fastaLine = fastaLine.trim() + '\n';
>>>>         if (fastaLine.startsWith(startAccession)) {
>>>>             record = true;
>>>>         } else if (record && fastaLine.startsWith(">")) {
>>>>             record = false;
>>>>             break;
>>>>         }
>>>>         if (record) {
>>>>             fastaEntry += fastaLine;
>>>>         }
>>>>     }
>>>>
>>>>
>>>> Notice - I do not use regex - since I'd need to read the whole file and
>>>> then regex upon it (if the record is the first one - I just read that one).
>>>>
>>>> Cheers
>>>> JP
>>>>
>>>>
>>>> On Tue, Apr 28, 2009 at 3:27 PM, Richard Holland
>>>> <holland at eaglegenomics.com <mailto:holland at eaglegenomics.com>> wrote:
>>>>
>>>>     The "Mark invalid" exception is indicating that the parser has gone too
>>>>     far ahead in the file looking for a valid header. I'm not sure why but
>>>>     looking at your original query, there may be extra newlines embedded
>>>>     into your FASTA header line? That would definitely confuse it.
>>>>
>>>>     The parser is not able to currently pull out just one sequence - in
>>>>     effect this is a search facility, which it doesn't have. :(
>>>>
>>>>     thanks,
>>>>     Richard
>>>>
>>>>     JP wrote:
>>>>     > Hi all at BioJava,
>>>>     >
>>>>     > I am trying to parse several FASTA files using the following code:
>>>>     >
>>>>     > fr = new FileReader(fastaProteinFileName);
>>>>     >> br = new BufferedReader(fr);
>>>>     >>
>>>>     >> RichSequenceIterator protIter = IOTools.readFastaProtein(br, null);
>>>>     >> while (protIter.hasNext()) {
>>>>     >>      BioEntry bioEntry = protIter.nextBioEntry();
>>>>     >>      System.out.println (fastaProteinFileName + " == " +
>>>>     accessionId + " =
>>>>     >> " + bioEntry.getAccession());
>>>>     >> }
>>>>     >
>>>>     >
>>>>     > At particular points in my fasta file - I get the following exception:
>>>>     >
>>>>     > 14:53:42,546 ERROR FastaFileProcessing  - File parsing exception (from
>>>>     >> biojava library)
>>>>     >> org.biojava.bio.BioException: Could not read sequence
>>>>     >>     at
>>>>     >>
>>>>     org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
>>>>     >>     at
>>>>     >>
>>>>     org.biojavax.bio.seq.io.RichStreamReader.nextBioEntry(RichStreamReader.java:99)
>>>>     >>     at
>>>>     >>
>>>>     edu.imperial.msc.orthologue.fasta.FastaFileProcessing.getProteinSequenceFromFASTAFile(FastaFileProcessing.java:60)
>>>>     >>     at
>>>>     >>
>>>>     edu.imperial.msc.orthologue.core.OrthologueFinder.getFASTAEntries(OrthologueFinder.java:64)
>>>>     >>     at
>>>>     >>
>>>>     edu.imperial.msc.orthologue.core.OrthologueFinder.<init>(OrthologueFinder.java:51)
>>>>     >>     at
>>>>     >>
>>>>     edu.imperial.msc.orthologue.launcher.OrthologueFinderLauncher.main(OrthologueFinderLauncher.java:60)
>>>>     >> Caused by: java.io.IOException: Mark invalid
>>>>     >>     at java.io.BufferedReader.reset(Unknown Source)
>>>>     >>     at
>>>>     >>
>>>>     org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:202)
>>>>     >>     at
>>>>     >>
>>>>     org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
>>>>     >>     ... 5 more
>>>>     >
>>>>     >
>>>>     > Interestingly if I delete the header portion of the header line (from
>>>>     > type=protein... till the end of the line ...Dgri;)
>>>>     >
>>>>     >> FBpp0145468 type=protein;
>>>>     >>
>>>>     loc=scaffold_15252:join(13219687..13219727,13219972..13220279,13220507..13220798,13220861..13221180,13221286..13221467,13222258..13222629,13226331..13226463,13226531..13226658);
>>>>     >> ID=FBpp0145468; name=Dgri\GH11562-PA; parent=FBgn0119042,FBtr0146976;
>>>>     >> dbxref=FlyBase:FBpp0145468,FlyBase_Annotation_IDs:GH11562-PA;
>>>>     >> MD5=c8dc38c7197a0d3c93c78b08059e2604; length=591; release=r1.3;
>>>>     >> species=Dgri;
>>>>     >>
>>>>     >
>>>>     > It works - but I have a number of these exceptions (and I do not
>>>>     want to
>>>>     > edit the original data).  Mind you I have longer headers in my
>>>>     file which
>>>>     > are parsed OK (strange!).
>>>>     >
>>>>     > Any ideas anyone ?  Alternatively - is there a better way how to
>>>>     get ONE
>>>>     > SINGLE sequence from the whole fasta file give that I have the
>>>>     accession id
>>>>     > (FBpp0145468) ?
>>>>     >
>>>>     > Many Thanks
>>>>     > JP
>>>>     > _______________________________________________
>>>>     > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>     <mailto:Biojava-l at lists.open-bio.org>
>>>>     > http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>     >
>>>>
>>>>     --
>>>>     Richard Holland, BSc MBCS
>>>>     Finance Director, Eagle Genomics Ltd
>>>>     T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
>>>>     <mailto:holland at eaglegenomics.com>
>>>>     http://www.eaglegenomics.com/
>>>>
>>>>
>>> -- 
>>> Richard Holland, BSc MBCS
>>> Finance Director, Eagle Genomics Ltd
>>> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
>>> http://www.eaglegenomics.com/
>>> _______________________________________________
>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
> 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: FastaFormat.patch
Type: text/x-patch
Size: 3402 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biojava-l/attachments/20090512/260c0ecf/attachment-0006.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: FastaFormat.java
Type: text/x-java
Size: 11394 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biojava-l/attachments/20090512/260c0ecf/attachment-0007.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: FastaFormatTest.java
Type: text/x-java
Size: 3517 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biojava-l/attachments/20090512/260c0ecf/attachment-0008.bin>