[Biojava-l] FASTA parsing bug ?

Wed Apr 29 07:13:02 UTC 2009

This is why we all love the internet and the community.
What is the chance of this happening ?  You are speaking about World Peace,
and Kofi Annan butts in. :)

I found that strange also (that there are larger headers preceding the
troublesome one).  Maybe (and this is a long shot) there is some buffer
which gets filled at that particular record or point in the file ?  (Does the
error move to a different record if we delete a couple of the initial FASTA
entries ?)
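
A minimal sketch to try out the buffer guess, under some assumptions: the
class and method names are made up for the demo (nothing here is BioJava
code), the 8192 buffer size is simply passed in explicitly, and the behaviour
relies on how the OpenJDK BufferedReader appears to work, where the
read-ahead limit is only checked when the internal buffer has to be refilled.
If that holds, a line longer than the limit can still be reset when it sits
entirely inside the buffer, and fails only when enough of it lies before a
buffer boundary.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class; none of these names come from BioJava.
class MarkLimitDemo {
    static final int BUFFER_SIZE = 8192; // assumed size, passed explicitly below

    static String repeat(char c, int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    /**
     * Reads a filler line of prefixLength chars (newline included), marks the
     * reader with the given limit, reads a line of lineLength chars and then
     * reports whether reset() still works.
     */
    static boolean resetSurvives(int prefixLength, int lineLength, int readAheadLimit) {
        String text = repeat('p', prefixLength - 1) + "\n" + repeat('x', lineLength) + "\n";
        try (BufferedReader br = new BufferedReader(new StringReader(text), BUFFER_SIZE)) {
            br.readLine();            // consume the filler line
            br.mark(readAheadLimit);  // same idea as the mark before a header
            br.readLine();            // read a line longer than the limit
            try {
                br.reset();
                return true;
            } catch (IOException markInvalid) {
                return false;         // the "Mark invalid" case
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Same 1000-char line and 500-char limit, different positions in the file.
        for (int prefix : new int[] {100, 7500, 8000}) {
            System.out.println("prefix " + prefix + ": reset ok = "
                    + resetSurvives(prefix, 1000, 500));
        }
    }
}
```

If the assumption is right, only the middle case should fail: there more than
500 chars of the long line sit before the buffer boundary, so the mark is
dropped on the refill. That would also explain why the error follows a
position in the file rather than the longest header.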

Mind you this is NOT the only flybase fasta file I get errors with (same
happens with dpse one v2.3 - and I am sure there are others).

I am interested in the solution, so are a ton of other people who use
biojava and particularly verbose fasta files.
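
For what it is worth, one way to avoid the read-ahead limit altogether might
be to never call mark()/reset(), and instead keep the already-read header line
of the next record in a field. This is only a sketch of that idea: the class
LookaheadFastaReader and its methods are hypothetical, it is not BioJava code
and not Josh's patch, and it returns plain strings rather than RichSequence
objects.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not BioJava code: FASTA reading without mark()/reset().
class LookaheadFastaReader {
    private final BufferedReader in;
    private String pendingHeader; // header of the next record, already consumed

    LookaheadFastaReader(BufferedReader in) throws IOException {
        this.in = in;
        advanceToHeader(null); // skip anything before the first '>' line
    }

    // Reads lines until the next '>' line (kept in pendingHeader) or end of
    // input. Non-header lines are appended to seq when seq is not null.
    private void advanceToHeader(StringBuilder seq) throws IOException {
        pendingHeader = null;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith(">")) {
                pendingHeader = line;
                return;
            }
            if (seq != null) {
                seq.append(line.trim());
            }
        }
    }

    boolean hasNext() {
        return pendingHeader != null;
    }

    /** Returns {header without '>', sequence} for the next record. */
    String[] next() throws IOException {
        if (pendingHeader == null) {
            throw new IllegalStateException("no more records");
        }
        String header = pendingHeader.substring(1);
        StringBuilder seq = new StringBuilder();
        advanceToHeader(seq); // stops at, and remembers, the next header
        return new String[] {header, seq.toString()};
    }

    /** Convenience for trying the reader on an in-memory string. */
    static List<String[]> readAll(String fasta, int bufferSize) {
        List<String[]> records = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new StringReader(fasta), bufferSize)) {
            LookaheadFastaReader reader = new LookaheadFastaReader(br);
            while (reader.hasNext()) {
                records.add(reader.next());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }
}
```

Since the header is held as an ordinary String, its length does not matter,
whatever the buffer size of the underlying reader.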

I love flybase and biojava
JP

On Wed, Apr 29, 2009 at 4:08 AM, Josh Goodman <jogoodma at indiana.edu> wrote:

>
> Hi Richard and JP,
>
> I think I can be of some help as I'm the FlyBase developer responsible for
> generating these troublesome FASTA files :-).  The cause of this problem
> appears to be the description line length for the record FBpp0145470.
>
> The trouble lies in org.biojavax.bio.seq.io.FastaFormat in the while loop
> at line 196.  Biojava correctly reads in FBpp0145468 but throws an error
> when trying to parse FBpp0145469.  There is nothing wrong in FBpp0145469
> but when biojava reaches the end of the sequence it reads in the header
> for the next record (FBpp0145470).  It then tries to reset the
> BufferedReader to the start of FBpp0145470 but that is where the exception
> is thrown because line 197 sets the read ahead limit to 500 characters and
> the reader.readLine() command exceeds that limit.
>
> What isn't obvious to me is why other large definition lines that precede
> that line don't throw the same error (e.g. FBpp0157909).  I guess the
> javadoc on BufferedReader.mark() does say "may fail" but I assumed it
> would be more predictable than that.
>
> The file in question can be downloaded from
>
> ftp://ftp.flybase.net/genomes/Drosophila_grimshawi/dgri_r1.3_FB2008_07/fasta/dgri-all-translation-r1.3.fasta.gz
> .
>
> If there is interest in a solution that doesn't involve simply upping the
> read ahead limit I can put a patch file together in the next day or so.
>
> Cheers,
> Josh
>
> On Tue, 28 Apr 2009, Richard Holland wrote:
>
> > You're right, doesn't look like newlines.
> >
> > The "Mark invalid" happens when the parser looks too far ahead in the
> > file attempting to seek out the next valid sequence to parse. I'm not
> > sure why this is happening.
> >
> > I don't have the time to test right now but if you could post the link
> > to where someone could download the same FASTA as you're using, then it
> > would make it possible for someone else to investigate in more detail.
> >
> > thanks,
> > Richard
> >
> > JP wrote:
> > > Thanks Richard for your prompt reply.
> > >
> > > I will not attach the FASTA file I am parsing (12MB); it's
> > > dgri-all-translation-r1.3.fasta from the FlyBase project.
> > >
> > > If the file had any extra newlines I would see them when I loaded it in
> > > a text editor - no ?
> > >
> > > I implemented the whole thing without using Biojava (for this part)
> > >
> > >     // Scan the file once and keep only the record for accessionId.
> > >     fr = new FileReader(fastaProteinFileName);
> > >     br = new BufferedReader(fr);
> > >     String fastaLine;
> > >     String startAccession = '>' + accessionId.trim();
> > >     StringBuilder fastaEntry = new StringBuilder();
> > >     boolean record = false;
> > >     while ((fastaLine = br.readLine()) != null) {
> > >         fastaLine = fastaLine.trim();
> > >         if (fastaLine.startsWith(">")) {
> > >             if (record) {
> > >                 break;  // next header reached: the record is complete
> > >             }
> > >             // Compare the whole ID token, so that one accession does
> > >             // not match a longer one that merely starts with it.
> > >             record = fastaLine.split("\\s+", 2)[0].equals(startAccession);
> > >         }
> > >         if (record) {
> > >             fastaEntry.append(fastaLine).append('\n');
> > >         }
> > >     }
> > >
> > >
> > > Notice - I do not use regex - since I'd need to read the whole file and
> > > then regex upon it (if the record is the first one - I just read that one).
> > >
> > > Cheers
> > > JP
> > >
> > >
> > > On Tue, Apr 28, 2009 at 3:27 PM, Richard Holland
> > > <holland at eaglegenomics.com> wrote:
> > >
> > >     The "Mark invalid" exception is indicating that the parser has gone too
> > >     far ahead in the file looking for a valid header. I'm not sure why but
> > >     looking at your original query, there may be extra newlines embedded
> > >     into your FASTA header line? That would definitely confuse it.
> > >
> > >     The parser is not able to currently pull out just one sequence - in
> > >     effect this is a search facility, which it doesn't have. :(
> > >
> > >     thanks,
> > >     Richard
> > >
> > >     JP wrote:
> > >     > Hi all at BioJava,
> > >     >
> > >     > I am trying to parse several FASTA files using the following code:
> > >     >
> > >     >     fr = new FileReader(fastaProteinFileName);
> > >     >     br = new BufferedReader(fr);
> > >     >
> > >     >     RichSequenceIterator protIter = IOTools.readFastaProtein(br, null);
> > >     >     while (protIter.hasNext()) {
> > >     >         BioEntry bioEntry = protIter.nextBioEntry();
> > >     >         System.out.println(fastaProteinFileName + " == " + accessionId
> > >     >                 + " = " + bioEntry.getAccession());
> > >     >     }
> > >     >
> > >     >
> > >     > At particular points in my fasta file - I get the following exception:
> > >     >
> > >     > 14:53:42,546 ERROR FastaFileProcessing  - File parsing exception (from biojava library)
> > >     > org.biojava.bio.BioException: Could not read sequence
> > >     >     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
> > >     >     at org.biojavax.bio.seq.io.RichStreamReader.nextBioEntry(RichStreamReader.java:99)
> > >     >     at edu.imperial.msc.orthologue.fasta.FastaFileProcessing.getProteinSequenceFromFASTAFile(FastaFileProcessing.java:60)
> > >     >     at edu.imperial.msc.orthologue.core.OrthologueFinder.getFASTAEntries(OrthologueFinder.java:64)
> > >     >     at edu.imperial.msc.orthologue.core.OrthologueFinder.<init>(OrthologueFinder.java:51)
> > >     >     at edu.imperial.msc.orthologue.launcher.OrthologueFinderLauncher.main(OrthologueFinderLauncher.java:60)
> > >     > Caused by: java.io.IOException: Mark invalid
> > >     >     at java.io.BufferedReader.reset(Unknown Source)
> > >     >     at org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:202)
> > >     >     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
> > >     >     ... 5 more
> > >     >
> > >     >
> > >     > Interestingly, if I delete the description portion of the header line
> > >     > (from type=protein... to the end of the line, ...Dgri;)
> > >     >
> > >     >     >FBpp0145468 type=protein; loc=scaffold_15252:join(13219687..13219727,13219972..13220279,13220507..13220798,13220861..13221180,13221286..13221467,13222258..13222629,13226331..13226463,13226531..13226658); ID=FBpp0145468; name=Dgri\GH11562-PA; parent=FBgn0119042,FBtr0146976; dbxref=FlyBase:FBpp0145468,FlyBase_Annotation_IDs:GH11562-PA; MD5=c8dc38c7197a0d3c93c78b08059e2604; length=591; release=r1.3; species=Dgri;
> > >     >
> > >     > it works - but I have a number of these exceptions (and I do not want to
> > >     > edit the original data).  Mind you, I have longer headers in my file which
> > >     > are parsed OK (strange!).
> > >     >
> > >     > Any ideas anyone ?  Alternatively - is there a better way to get ONE
> > >     > SINGLE sequence from the whole fasta file given that I have the
> > >     > accession id (FBpp0145468) ?
> > >     >
> > >     > Many Thanks
> > >     > JP
> > >     > _______________________________________________
> > >     > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > >     > http://lists.open-bio.org/mailman/listinfo/biojava-l
> > >     >
> > >
> > >     --
> > >     Richard Holland, BSc MBCS
> > >     Finance Director, Eagle Genomics Ltd
> > >     T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> > >     http://www.eaglegenomics.com/
> > >
> > >
> >
> > --
> > Richard Holland, BSc MBCS
> > Finance Director, Eagle Genomics Ltd
> > T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> > http://www.eaglegenomics.com/
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
>