[Biojava-l] FASTA parsing bug ?

Wed Apr 29 19:53:10 UTC 2009

A quick note to add one more data point.

While looking at this, keep in mind that NCBI's nonredundant database
FASTA files (nr.fa and nt.fa) use ctrl-A characters to concatenate
multiple descriptions.  These concatenated descriptions can be thousands
of characters long.  The one I use as a test case has 378,260 characters
(5204 concatenated descriptions).  It's a 98-residue sequence for
"NADH dehydrogenase subunit 4L".
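
To make the ctrl-A convention concrete, here is a minimal sketch of
splitting one such header line into its individual descriptions.  The
class and method names (CtrlAHeaderSplit, splitDescriptions) and the
sample header are made up for illustration; none of this is BioJava API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of BioJava.
class CtrlAHeaderSplit {

    // NCBI nr/nt headers join multiple descriptions with ctrl-A (U+0001).
    private static final char CTRL_A = (char) 1;

    /** Splits one FASTA header line into its ctrl-A separated descriptions. */
    static List<String> splitDescriptions(String headerLine) {
        // Drop the leading '>' if the caller passed the raw header line.
        String body = headerLine.startsWith(">") ? headerLine.substring(1) : headerLine;
        return Arrays.asList(body.split(String.valueOf(CTRL_A)));
    }

    public static void main(String[] args) {
        // Made-up header with two concatenated descriptions.
        String header = ">id1 first description" + CTRL_A + "id2 second description";
        for (String description : splitDescriptions(header)) {
            System.out.println(description);
        }
    }
}
```

The point is only that the whole concatenation sits on a single line, so
any parser that reads the header with readLine() sees all of it at once.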

I'm not saying it's right, just that cases like this do exist.
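
For anyone who wants to reproduce the "Mark invalid" failure from the
quoted thread without downloading the FlyBase file, here is a small
sketch using only java.io.  The class name, the made-up header, and the
buffer sizes are my assumptions; only the 500-character limit comes from
Josh's description of FastaFormat.  As far as I can tell, the OpenJDK
BufferedReader only invalidates a mark when it has to refill its buffer,
which may explain why some long headers get through and others don't.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical demo class for illustration only; not BioJava code.
class MarkLimitDemo {

    /** Returns true if reset() still works after reading one line past mark(limit). */
    static boolean markSurvives(String text, int bufferSize, int limit) throws IOException {
        try (BufferedReader reader = new BufferedReader(new StringReader(text), bufferSize)) {
            reader.mark(limit);
            reader.readLine(); // may consume more than 'limit' characters
            try {
                reader.reset();
                return true;
            } catch (IOException e) {
                // "Mark invalid": the mark was dropped while reading ahead.
                return false;
            }
        }
    }

    /** Builds a made-up header line of the given length plus a short sequence line. */
    static String recordWithHeaderLength(int length) {
        StringBuilder sb = new StringBuilder(">");
        while (sb.length() < length) {
            sb.append('x');
        }
        return sb.append("\nMKV\n").toString();
    }

    public static void main(String[] args) throws IOException {
        String record = recordWithHeaderLength(2000);
        // Whole header already fits in the reader's buffer: no refill is needed.
        System.out.println("buffer 8192: " + markSurvives(record, 8192, 500));
        // Small buffer forces refills while the header is being read.
        System.out.println("buffer 16:   " + markSurvives(record, 16, 500));
    }
}
```

With a 378,260 character header no realistic fixed limit is going to be
safe, so a fix that doesn't depend on mark/reset seems preferable to
raising the number.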

Scott

Scott Markel, Ph.D.
Principal Bioinformatics Architect  email:  smarkel at accelrys.com
Accelrys (SciTegic R&D)             mobile: +1 858 205 3653
10188 Telesis Court, Suite 100      voice:  +1 858 799 5603
San Diego, CA 92121                 fax:    +1 858 799 5222
USA                                 web:    http://www.accelrys.com

http://www.linkedin.com/in/smarkel
Vice President, Board of Directors:
    International Society for Computational Biology
Co-chair: ISCB Publications Committee
Associate Editor: PLoS Computational Biology
Editorial Board: Briefings in Bioinformatics

> -----Original Message-----
> From: biojava-l-bounces at lists.open-bio.org On Behalf Of Josh Goodman
> Sent: Tuesday, 28 April 2009 8:09 PM
> To: Richard Holland
> Cc: biojava-l at biojava.org
> Subject: Re: [Biojava-l] FASTA parsing bug ?
> 
> 
> Hi Richard and JP,
> 
> I think I can be of some help as I'm the FlyBase developer responsible for
> generating these troublesome FASTA files :-).  The cause of this problem
> appears to be the description line length for the record FBpp0145470.
> 
> The trouble lies in org.biojavax.bio.seq.io.FastaFormat in the while loop
> at line 196.  Biojava correctly reads in FBpp0145468 but throws an error
> when trying to parse FBpp0145469.  There is nothing wrong in FBpp0145469
> but when biojava reaches the end of the sequence it reads in the header
> for the next record (FBpp0145470).  It then tries to reset the
> BufferedReader to the start of FBpp0145470 but that is where the exception
> is thrown because line 197 sets the read ahead limit to 500 characters and
> the reader.readLine() command exceeds that limit.
> 
> What isn't obvious to me is why other large definition lines that precede
> that line don't throw the same error (e.g. FBpp0157909).  I guess the
> javadoc on BufferedReader.mark() does say "may fail" but I assumed it
> would be more predictable than that.
> 
> The file in question can be downloaded from
> ftp://ftp.flybase.net/genomes/Drosophila_grimshawi/dgri_r1.3_FB2008_07/fasta/dgri-all-translation-r1.3.fasta.gz.
> 
> If there is interest in a solution that doesn't involve simply upping the
> read ahead limit I can put a patch file together in the next day or so.
> 
> Cheers,
> Josh
> 
> On Tue, 28 Apr 2009, Richard Holland wrote:
> 
> > You're right, doesn't look like newlines.
> >
> > The "Mark invalid" happens when the parser looks too far ahead in the
> > file attempting to seek out the next valid sequence to parse. I'm not
> > sure why this is happening.
> >
> > I don't have the time to test right now but if you could post the link
> > to where someone could download the same FASTA as you're using, then it
> > would make it possible for someone else to investigate in more detail.
> >
> > thanks,
> > Richard
> >
> > JP wrote:
> > > Thanks Richard for your prompt reply.
> > >
> > > I will not attach the FASTA file I am parsing (12MB); it's
> > > dgri-all-translation-r1.3.fasta from the FlyBase project.
> > >
> > > If the file had any extra newlines I would see them when I loaded it
> > > in a text editor - no ?
> > >
> > > I implemented the whole thing without using Biojava (for this part)
> > >
> > >     // Scan the file line by line and keep only the requested record.
> > >     fr = new FileReader(fastaProteinFileName);
> > >     br = new BufferedReader(fr);
> > >     String fastaLine;
> > >     String startAccession = '>' + accessionId.trim();
> > >     String fastaEntry = "";
> > >     boolean record = false;
> > >     while ((fastaLine = br.readLine()) != null) {
> > >         fastaLine = fastaLine.trim() + '\n';
> > >         if (fastaLine.startsWith(startAccession)) {
> > >             // Header of the wanted record (prefix match on the accession).
> > >             record = true;
> > >         } else if (record && fastaLine.startsWith(">")) {
> > >             // Header of the next record: the wanted entry is complete.
> > >             record = false;
> > >             break;
> > >         }
> > >         if (record) {
> > >             fastaEntry += fastaLine;
> > >         }
> > >     }
> > >
> > >
> > > Notice - I do not use regex - since I'd need to read the whole file and
> > > then regex upon it (if the record is the first one - I just read that one).
> > >
> > > Cheers
> > > JP
> > >
> > >
> > > On Tue, Apr 28, 2009 at 3:27 PM, Richard Holland
> > > <holland at eaglegenomics.com> wrote:
> > >
> > >     The "Mark invalid" exception is indicating that the parser has gone too
> > >     far ahead in the file looking for a valid header. I'm not sure why but
> > >     looking at your original query, there may be extra newlines embedded
> > >     into your FASTA header line? That would definitely confuse it.
> > >
> > >     The parser is not able to currently pull out just one sequence - in
> > >     effect this is a search facility, which it doesn't have. :(
> > >
> > >     thanks,
> > >     Richard
> > >
> > >     JP wrote:
> > >     > Hi all at BioJava,
> > >     >
> > >     > I am trying to parse several FASTA files using the following code:
> > >     >
> > >     > fr = new FileReader(fastaProteinFileName);
> > >     > br = new BufferedReader(fr);
> > >     >
> > >     > RichSequenceIterator protIter = IOTools.readFastaProtein(br, null);
> > >     > while (protIter.hasNext()) {
> > >     >      BioEntry bioEntry = protIter.nextBioEntry();
> > >     >      System.out.println (fastaProteinFileName + " == " + accessionId
> > >     >          + " = " + bioEntry.getAccession());
> > >     > }
> > >     >
> > >     >
> > >     > At particular points in my fasta file - I get the following exception:
> > >     >
> > >     > 14:53:42,546 ERROR FastaFileProcessing  - File parsing exception (from biojava library)
> > >     > org.biojava.bio.BioException: Could not read sequence
> > >     >     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
> > >     >     at org.biojavax.bio.seq.io.RichStreamReader.nextBioEntry(RichStreamReader.java:99)
> > >     >     at edu.imperial.msc.orthologue.fasta.FastaFileProcessing.getProteinSequenceFromFASTAFile(FastaFileProcessing.java:60)
> > >     >     at edu.imperial.msc.orthologue.core.OrthologueFinder.getFASTAEntries(OrthologueFinder.java:64)
> > >     >     at edu.imperial.msc.orthologue.core.OrthologueFinder.<init>(OrthologueFinder.java:51)
> > >     >     at edu.imperial.msc.orthologue.launcher.OrthologueFinderLauncher.main(OrthologueFinderLauncher.java:60)
> > >     > Caused by: java.io.IOException: Mark invalid
> > >     >     at java.io.BufferedReader.reset(Unknown Source)
> > >     >     at org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:202)
> > >     >     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
> > >     >     ... 5 more
> > >     >
> > >     >
> > >     > Interestingly if I delete the header portion of the header line (from
> > >     > type=protein... till the end of the line ...Dgri;)
> > >     >
> > >     > FBpp0145468 type=protein; loc=scaffold_15252:join(13219687..13219727,13219972..13220279,13220507..13220798,13220861..13221180,13221286..13221467,13222258..13222629,13226331..13226463,13226531..13226658); ID=FBpp0145468; name=Dgri\GH11562-PA; parent=FBgn0119042,FBtr0146976; dbxref=FlyBase:FBpp0145468,FlyBase_Annotation_IDs:GH11562-PA; MD5=c8dc38c7197a0d3c93c78b08059e2604; length=591; release=r1.3; species=Dgri;
> > >     >
> > >     > It works - but I have a number of these exceptions (and I do not want to
> > >     > edit the original data).  Mind you I have longer headers in my file which
> > >     > are parsed OK (strange!).
> > >     >
> > >     > Any ideas anyone ?  Alternatively - is there a better way to get ONE
> > >     > SINGLE sequence from the whole fasta file given that I have the accession
> > >     > id (FBpp0145468) ?
> > >     >
> > >     > Many Thanks
> > >     > JP
> > >     > _______________________________________________
> > >     > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > >     > http://lists.open-bio.org/mailman/listinfo/biojava-l
> > >     >
> > >
> > >     --
> > >     Richard Holland, BSc MBCS
> > >     Finance Director, Eagle Genomics Ltd
> > >     T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> > >     http://www.eaglegenomics.com/
> > >
> > >
> >