[Biojava-l] FASTA parsing bug ?

Wed Apr 29 08:31:00 UTC 2009

People who know me will know I am not a big fan of FASTA format.  Sure
it was useful in the days of FORTRAN but we really need to move on.
I'm not sure the people who started the format foresaw the kind of
abuse that the "format" would get.

What I would much prefer is something that looks like the BioEntry
table of BioSQL plus the BioSequence information in somekind of (dare
I say it) XML format. It would certainly be tidier and vastly more
machine readable, for a start not all the metadata would need to be on
the description line in no specific order. I think by limiting it to
those two tables you get most of the key metadata without all the
cruft that comes with more extensive XML formats.

It would be a bit less user friendly for people pasting sequences into
webforms (although I think FASTA is fine for that) but much better for
data distribution, webservices, machine processing etc.

Anyhow, that's enough venting. I don't wan't to start somekind of holy
war or anything...

- Mark

ps. Sorry for getting off topic.

On Wed, Apr 29, 2009 at 3:13 PM, JP <jp at javaclass.co.uk> wrote:
>
> This is why we all love the internet and the community.
> What is the chance of this happening ?  You are speaking about World Peace,
> and Kofi Annan butts in. :)
>
> I found that strange also (that there are larger headers preceding the
> troublesome one).  Maybe (and this is a long shot) there is some buffer
> which gets filled at that particular record or point in file ?  (Does the
> error move record if we delete a couple of initial Fasta entries ?)
>
> Mind you this is NOT the only flybase fasta file I get errors with (same
> happens with dpse one v2.3 - and I am sure there are others).
>
> I am interested in the solution, so are a ton of other people who use
> biojava and particularly verbose fasta files.
>
> I love flybase and biojava
> JP
>
> On Wed, Apr 29, 2009 at 4:08 AM, Josh Goodman <jogoodma at indiana.edu> wrote:
>
> >
> > Hi Richard and JP,
> >
> > I think I can be of some help as I'm the FlyBase developer responsible for
> > generating these troublesome FASTA files :-).  The cause of this problem
> > appears to be the description line length for the record FBpp0145470.
> >
> > The trouble lies in org.biojavax.bio.seq.io.FastaFormat in the while loop
> > at line 196.  Biojava correctly reads in FBpp0145468 but throws an error
> > when trying to parse FBpp0145469.  There is nothing wrong in FBpp0145469
> > but when biojava reaches the end of the sequence it reads in the header
> > for the next record (FBpp0145470).  It then tries to reset the
> > BufferedReader to the start of FBpp0145470 but that is where the exception
> > is thrown because line 197 sets the read ahead limit to 500 characters and
> > the reader.readLine() command exceeds that limit.
> >
> > What isn't obvious to me is why other large definition lines that precede
> > that line don't throw the same error (e.g. FBpp0157909).  I guess the
> > javadoc on BufferedReader.mark() does say "may fail" but I assumed it
> > would be more predictable than that.
> >
> > The file in question can be downloaded from
> >
> > ftp://ftp.flybase.net/genomes/Drosophila_grimshawi/dgri_r1.3_FB2008_07/fasta/dgri-all-translation-r1.3.fasta.gz
> > .
> >
> > If there is interest in a solution that doesn't involve simply upping the
> > read ahead limit I can put a patch file together in the next day or so.
> >
> > Cheers,
> > Josh
> >
> > On Tue, 28 Apr 2009, Richard Holland wrote:
> >
> > > You're right, doesn't look like newlines.
> > >
> > > The "Mark invalid" happens when the parser looks too far ahead in the
> > > file attempting to seek out the next valid sequence to parse. I'm not
> > > sure why this is happening.
> > >
> > > I don't have the time to test right now but if you could post the link
> > > to where someone could download the same FASTA as you're using, then it
> > > would make it possible for someone else to investigate in more detail.
> > >
> > > thanks,
> > > Richard
> > >
> > > JP wrote:
> > > > Thanks Richard for your prompt reply.
> > > >
> > > > I will not attach the fasta file I am parsing (12MB) its
> > > > dgri-all-translation-r1.3.fasta from the flybase project.
> > > >
> > > > If the file had any extra new lines I would see them when I loaded it
> > in
> > > > a text editor - no ?
> > > >
> > > > I implemented the whole thing without using Biojava (for this part)
> > > >
> > > >     fr = new FileReader(fastaProteinFileName);
> > > >     br = new BufferedReader(fr);
> > > >     String fastaLine;
> > > >     String startAccession = '>' + accessionId.trim();
> > > >     String fastaEntry = "";
> > > >     boolean record = false;
> > > >     while ((fastaLine = br.readLine()) != null) {
> > > >         fastaLine = fastaLine.trim() + '\n';
> > > >         if (fastaLine.startsWith(startAccession)) {
> > > >             record = true;
> > > >         } else if (record && fastaLine.startsWith(">")) {
> > > >             record = false;
> > > >             break;
> > > >         }
> > > >         if (record) {
> > > >             fastaEntry += fastaLine;
> > > >         }
> > > >     }
> > > >
> > > >
> > > > Notice - I do not use regex - since I'd need to read the whole file and
> > > > then regex upon it (if the record is the first one - I just read that
> > one).
> > > >
> > > > Cheers
> > > > JP
> > > >
> > > >
> > > > On Tue, Apr 28, 2009 at 3:27 PM, Richard Holland
> > > > <holland at eaglegenomics.com <mailto:holland at eaglegenomics.com>> wrote:
> > > >
> > > >     The "Mark invalid" exception is indicating that the parser has gone
> > too
> > > >     far ahead in the file looking for a valid header. I'm not sure why
> > but
> > > >     looking at your original query, there may be extra newlines
> > embedded
> > > >     into your FASTA header line? That would definitely confuse it.
> > > >
> > > >     The parser is not able to currently pull out just one sequence - in
> > > >     effect this is a search facility, which it doesn't have. :(
> > > >
> > > >     thanks,
> > > >     Richard
> > > >
> > > >     JP wrote:
> > > >     > Hi all at BioJava,
> > > >     >
> > > >     > I am trying to parse several FASTA files using the following
> > code:
> > > >     >
> > > >     > fr = new FileReader(fastaProteinFileName);
> > > >     >> br = new BufferedReader(fr);
> > > >     >>
> > > >     >> RichSequenceIterator protIter = IOTools.readFastaProtein(br,
> > null);
> > > >     >> while (protIter.hasNext()) {
> > > >     >>      BioEntry bioEntry = protIter.nextBioEntry();
> > > >     >>      System.out.println (fastaProteinFileName + " == " +
> > > >     accessionId + " =
> > > >     >> " + bioEntry.getAccession());
> > > >     >> }
> > > >     >
> > > >     >
> > > >     > At particular points in my fasta file - I get the following
> > exception:
> > > >     >
> > > >     > 14:53:42,546 ERROR FastaFileProcessing  - File parsing exception
> > (from
> > > >     >> biojava library)
> > > >     >> org.biojava.bio.BioException: Could not read sequence
> > > >     >>     at
> > > >     >>
> > > >
> > org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
> > > >     >>     at
> > > >     >>
> > > >
> > org.biojavax.bio.seq.io.RichStreamReader.nextBioEntry(RichStreamReader.java:99)
> > > >     >>     at
> > > >     >>
> > > >
> > edu.imperial.msc.orthologue.fasta.FastaFileProcessing.getProteinSequenceFromFASTAFile(FastaFileProcessing.java:60)
> > > >     >>     at
> > > >     >>
> > > >
> > edu.imperial.msc.orthologue.core.OrthologueFinder.getFASTAEntries(OrthologueFinder.java:64)
> > > >     >>     at
> > > >     >>
> > > >
> > edu.imperial.msc.orthologue.core.OrthologueFinder.<init>(OrthologueFinder.java:51)
> > > >     >>     at
> > > >     >>
> > > >
> > edu.imperial.msc.orthologue.launcher.OrthologueFinderLauncher.main(OrthologueFinderLauncher.java:60)
> > > >     >> Caused by: java.io.IOException: Mark invalid
> > > >     >>     at java.io.BufferedReader.reset(Unknown Source)
> > > >     >>     at
> > > >     >>
> > > >
> > org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:202)
> > > >     >>     at
> > > >     >>
> > > >
> > org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
> > > >     >>     ... 5 more
> > > >     >
> > > >     >
> > > >     > Interestingly if I delete the header portion of the header line
> > (from
> > > >     > type=protein... till the end of the line ...Dgri;)
> > > >     >
> > > >     >> FBpp0145468 type=protein;
> > > >     >>
> > > >
> > loc=scaffold_15252:join(13219687..13219727,13219972..13220279,13220507..13220798,13220861..13221180,13221286..13221467,13222258..13222629,13226331..13226463,13226531..13226658);
> > > >     >> ID=FBpp0145468; name=Dgri\GH11562-PA;
> > parent=FBgn0119042,FBtr0146976;
> > > >     >> dbxref=FlyBase:FBpp0145468,FlyBase_Annotation_IDs:GH11562-PA;
> > > >     >> MD5=c8dc38c7197a0d3c93c78b08059e2604; length=591; release=r1.3;
> > > >     >> species=Dgri;
> > > >     >>
> > > >     >
> > > >     > It works - but I have a number of these exceptions (and I do not
> > > >     want to
> > > >     > edit the original data).  Mind you I have longer headers in my
> > > >     file which
> > > >     > are parsed OK (strange!).
> > > >     >
> > > >     > Any ideas anyone ?  Alternatively - is there a better way how to
> > > >     get ONE
> > > >     > SINGLE sequence from the whole fasta file give that I have the
> > > >     accession id
> > > >     > (FBpp0145468) ?
> > > >     >
> > > >     > Many Thanks
> > > >     > JP
> > > >     > _______________________________________________
> > > >     > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > > >     <mailto:Biojava-l at lists.open-bio.org>
> > > >     > http://lists.open-bio.org/mailman/listinfo/biojava-l
> > > >     >
> > > >
> > > >     --
> > > >     Richard Holland, BSc MBCS
> > > >     Finance Director, Eagle Genomics Ltd
> > > >     T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> > > >     <mailto:holland at eaglegenomics.com>
> > > >     http://www.eaglegenomics.com/
> > > >
> > > >
> > >
> > > --
> > > Richard Holland, BSc MBCS
> > > Finance Director, Eagle Genomics Ltd
> > > T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> > > http://www.eaglegenomics.com/
> > > _______________________________________________
> > > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > > http://lists.open-bio.org/mailman/listinfo/biojava-l
> > >
> >
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l