[Biojava-l] FASTA parsing bug ?

Josh Goodman jogoodma at indiana.edu
Wed Apr 29 03:08:43 UTC 2009


Hi Richard and JP,

I think I can be of some help as I'm the FlyBase developer responsible for 
generating these troublesome FASTA files :-).  The cause of this problem 
appears to be the description line length for the record FBpp0145470.

The trouble lies in org.biojavax.bio.seq.io.FastaFormat in the while loop 
at line 196.  Biojava correctly reads in FBpp0145468 but throws an error 
when trying to parse FBpp0145469.  There is nothing wrong in FBpp0145469 
but when biojava reaches the end of the sequence it reads in the header 
for the next record (FBpp0145470).  It then tries to reset the 
BufferedReader to the start of FBpp0145470 but that is where the exception 
is thrown because line 197 sets the read ahead limit to 500 characters and 
the reader.readLine() command exceeds that limit.

What isn't obvious to me is why other large definition lines that precede 
that line don't throw the same error (e.g. FBpp0157909).  I guess the 
javadoc on BufferedReader.mark() does say "may fail" but I assumed it 
would be more predictable than that.

The file in question can be downloaded from 
ftp://ftp.flybase.net/genomes/Drosophila_grimshawi/dgri_r1.3_FB2008_07/fasta/dgri-all-translation-r1.3.fasta.gz.

If there is interest in a solution that doesn't involve simply upping the 
read ahead limit I can put a patch file together in the next day or so.

Cheers,
Josh

On Tue, 28 Apr 2009, Richard Holland wrote:

> You're right, doesn't look like newlines.
> 
> The "Mark invalid" happens when the parser looks too far ahead in the
> file attempting to seek out the next valid sequence to parse. I'm not
> sure why this is happening.
> 
> I don't have the time to test right now but if you could post the link
> to where someone could download the same FASTA as you're using, then it
> would make it possible for someone else to investigate in more detail.
> 
> thanks,
> Richard
> 
> JP wrote:
> > Thanks Richard for your prompt reply.
> > 
> > I will not attach the fasta file I am parsing (12MB) its
> > dgri-all-translation-r1.3.fasta from the flybase project.
> > 
> > If the file had any extra new lines I would see them when I loaded it in
> > a text editor - no ?
> > 
> > I implemented the whole thing without using Biojava (for this part)
> > 
> >     fr = new FileReader(fastaProteinFileName);
> >     br = new BufferedReader(fr);
> >     String fastaLine;
> >     String startAccession = '>' + accessionId.trim();
> >     String fastaEntry = "";
> >     boolean record = false;
> >     while ((fastaLine = br.readLine()) != null) {
> >         fastaLine = fastaLine.trim() + '\n';
> >         if (fastaLine.startsWith(startAccession)) {
> >             record = true;
> >         } else if (record && fastaLine.startsWith(">")) {
> >             record = false;
> >             break;
> >         }
> >         if (record) {
> >             fastaEntry += fastaLine;
> >         }
> >     }
> > 
> > 
> > Notice - I do not use regex - since I'd need to read the whole file and
> > then regex upon it (if the record is the first one - I just read that one).
> > 
> > Cheers
> > JP
> > 
> > 
> > On Tue, Apr 28, 2009 at 3:27 PM, Richard Holland
> > <holland at eaglegenomics.com <mailto:holland at eaglegenomics.com>> wrote:
> > 
> >     The "Mark invalid" exception is indicating that the parser has gone too
> >     far ahead in the file looking for a valid header. I'm not sure why but
> >     looking at your original query, there may be extra newlines embedded
> >     into your FASTA header line? That would definitely confuse it.
> > 
> >     The parser is not able to currently pull out just one sequence - in
> >     effect this is a search facility, which it doesn't have. :(
> > 
> >     thanks,
> >     Richard
> > 
> >     JP wrote:
> >     > Hi all at BioJava,
> >     >
> >     > I am trying to parse several FASTA files using the following code:
> >     >
> >     > fr = new FileReader(fastaProteinFileName);
> >     >> br = new BufferedReader(fr);
> >     >>
> >     >> RichSequenceIterator protIter = IOTools.readFastaProtein(br, null);
> >     >> while (protIter.hasNext()) {
> >     >>      BioEntry bioEntry = protIter.nextBioEntry();
> >     >>      System.out.println (fastaProteinFileName + " == " +
> >     accessionId + " =
> >     >> " + bioEntry.getAccession());
> >     >> }
> >     >
> >     >
> >     > At particular points in my fasta file - I get the following exception:
> >     >
> >     > 14:53:42,546 ERROR FastaFileProcessing  - File parsing exception (from
> >     >> biojava library)
> >     >> org.biojava.bio.BioException: Could not read sequence
> >     >>     at
> >     >>
> >     org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
> >     >>     at
> >     >>
> >     org.biojavax.bio.seq.io.RichStreamReader.nextBioEntry(RichStreamReader.java:99)
> >     >>     at
> >     >>
> >     edu.imperial.msc.orthologue.fasta.FastaFileProcessing.getProteinSequenceFromFASTAFile(FastaFileProcessing.java:60)
> >     >>     at
> >     >>
> >     edu.imperial.msc.orthologue.core.OrthologueFinder.getFASTAEntries(OrthologueFinder.java:64)
> >     >>     at
> >     >>
> >     edu.imperial.msc.orthologue.core.OrthologueFinder.<init>(OrthologueFinder.java:51)
> >     >>     at
> >     >>
> >     edu.imperial.msc.orthologue.launcher.OrthologueFinderLauncher.main(OrthologueFinderLauncher.java:60)
> >     >> Caused by: java.io.IOException: Mark invalid
> >     >>     at java.io.BufferedReader.reset(Unknown Source)
> >     >>     at
> >     >>
> >     org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:202)
> >     >>     at
> >     >>
> >     org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
> >     >>     ... 5 more
> >     >
> >     >
> >     > Interestingly if I delete the header portion of the header line (from
> >     > type=protein... till the end of the line ...Dgri;)
> >     >
> >     >> FBpp0145468 type=protein;
> >     >>
> >     loc=scaffold_15252:join(13219687..13219727,13219972..13220279,13220507..13220798,13220861..13221180,13221286..13221467,13222258..13222629,13226331..13226463,13226531..13226658);
> >     >> ID=FBpp0145468; name=Dgri\GH11562-PA; parent=FBgn0119042,FBtr0146976;
> >     >> dbxref=FlyBase:FBpp0145468,FlyBase_Annotation_IDs:GH11562-PA;
> >     >> MD5=c8dc38c7197a0d3c93c78b08059e2604; length=591; release=r1.3;
> >     >> species=Dgri;
> >     >>
> >     >
> >     > It works - but I have a number of these exceptions (and I do not
> >     want to
> >     > edit the original data).  Mind you I have longer headers in my
> >     file which
> >     > are parsed OK (strange!).
> >     >
> >     > Any ideas anyone ?  Alternatively - is there a better way how to
> >     get ONE
> >     > SINGLE sequence from the whole fasta file give that I have the
> >     accession id
> >     > (FBpp0145468) ?
> >     >
> >     > Many Thanks
> >     > JP
> >     > _______________________________________________
> >     > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> >     <mailto:Biojava-l at lists.open-bio.org>
> >     > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >     >
> > 
> >     --
> >     Richard Holland, BSc MBCS
> >     Finance Director, Eagle Genomics Ltd
> >     T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> >     <mailto:holland at eaglegenomics.com>
> >     http://www.eaglegenomics.com/
> > 
> > 
> 
> -- 
> Richard Holland, BSc MBCS
> Finance Director, Eagle Genomics Ltd
> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
> 



More information about the Biojava-l mailing list