[Biojava-l] FASTA parsing bug ?

Tue Apr 28 14:59:40 UTC 2009

Thanks Richard for your prompt reply.

I will not attach the FASTA file I am parsing (12MB); it is
dgri-all-translation-r1.3.fasta from the FlyBase project.

If the file had any extra newlines, I would see them when I loaded it in a
text editor, no?
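
For what it's worth, the "Mark invalid" message itself comes from the JDK
(java.io.BufferedReader.reset in the stack trace), not from BioJava. As far as
I understand it, reset() fails this way once the reader has had to refill its
buffer after more characters were read than the limit passed to mark(). Here
is a minimal sketch of that behaviour outside BioJava. The class name
MarkInvalidDemo, the helper readPastMark, the 16-character buffer and the
sample text are my own assumptions for illustration; I have not checked what
limit FastaFormat actually passes to mark().

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class MarkInvalidDemo {

    /**
     * Marks the start of the text with the given read-ahead limit, reads up to
     * charsToRead characters, then calls reset(). Returns null if reset()
     * succeeds, otherwise the message of the IOException that was thrown.
     */
    static String readPastMark(String text, int readAheadLimit, int charsToRead) {
        // Deliberately tiny 16-char buffer (an assumption for this demo) so the
        // reader must refill, and notice the exceeded limit, very early.
        try (BufferedReader br = new BufferedReader(new StringReader(text), 16)) {
            br.mark(readAheadLimit);
            for (int i = 0; i < charsToRead && br.read() != -1; i++) {
                // just consume characters
            }
            br.reset();
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Made-up header text; only the accession is taken from my file.
        String text = ">FBpp0145468 type=protein; a long header line, well past the limit\nMKV\n";
        System.out.println(readPastMark(text, 4, 40)); // reads far past the mark
        System.out.println(readPastMark(text, 4, 2));  // stays within the limit
    }
}
```

So a parser that marks the stream and then reads ahead further than its own
limit would hit this regardless of whether the file contains stray newlines.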

I implemented the whole thing without using BioJava (for this part):

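// Stream the file line by line and keep only the record for accessionId.
BufferedReader br = new BufferedReader(new FileReader(fastaProteinFileName));
String startAccession = '>' + accessionId.trim();
StringBuilder entryBuilder = new StringBuilder();
boolean record = false;
try {
    String fastaLine;
    while ((fastaLine = br.readLine()) != null) {
        fastaLine = fastaLine.trim() + '\n';
        if (fastaLine.startsWith(startAccession)) {
            // Header of the wanted record: start collecting lines.
            record = true;
        } else if (record && fastaLine.startsWith(">")) {
            // Header of the next record: the wanted entry is complete.
            break;
        }
        if (record) {
            entryBuilder.append(fastaLine);
        }
    }
} finally {
    br.close(); // also closes the underlying FileReader
}
String fastaEntry = entryBuilder.toString();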

Note that I do not use a regex, since I would need to read the whole file
first and then run the regex over it. Reading line by line, I stop as soon as
the record ends (if the record is the first one, I read only that one).
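
One thing the loop above does not guard against: startsWith would also accept
a longer accession that merely shares the prefix (FBpp01454680 when I ask for
FBpp0145468). Here is a self-contained sketch of the same streaming lookup
with that check added. The names FastaLookup, findEntry and isHeaderFor are
made up for this example, and the sample records in main are invented apart
from the accession itself.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class FastaLookup {

    /** Returns the record whose header ID equals accessionId, or "" if absent. */
    static String findEntry(Reader source, String accessionId) {
        String wanted = '>' + accessionId.trim();
        StringBuilder entry = new StringBuilder();
        boolean recording = false;
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                line = line.trim();
                if (line.startsWith(">")) {
                    if (recording) {
                        break; // header of the next record: stop reading
                    }
                    recording = isHeaderFor(line, wanted);
                }
                if (recording) {
                    entry.append(line).append('\n');
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entry.toString();
    }

    /** True if the header is exactly the wanted ID or the ID followed by whitespace. */
    static boolean isHeaderFor(String header, String wanted) {
        if (!header.startsWith(wanted)) {
            return false;
        }
        return header.length() == wanted.length()
                || Character.isWhitespace(header.charAt(wanted.length()));
    }

    public static void main(String[] args) {
        String fasta = ">FBpp01454680 type=protein;\nAAAA\n"
                + ">FBpp0145468 type=protein; species=Dgri;\nMKV\nLLT\n"
                + ">FBpp0145469 type=protein;\nCCCC\n";
        System.out.print(findEntry(new StringReader(fasta), "FBpp0145468"));
    }
}
```

For the real file I would pass a FileReader on fastaProteinFileName instead of
the StringReader.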

Cheers
JP

On Tue, Apr 28, 2009 at 3:27 PM, Richard Holland
<holland at eaglegenomics.com> wrote:

> The "Mark invalid" exception is indicating that the parser has gone too
> far ahead in the file looking for a valid header. I'm not sure why but
> looking at your original query, there may be extra newlines embedded
> into your FASTA header line? That would definitely confuse it.
>
> The parser is not able to currently pull out just one sequence - in
> effect this is a search facility, which it doesn't have. :(
>
> thanks,
> Richard
>
> JP wrote:
> > Hi all at BioJava,
> >
> > I am trying to parse several FASTA files using the following code:
> >
> >> fr = new FileReader(fastaProteinFileName);
> >> br = new BufferedReader(fr);
> >>
> >> RichSequenceIterator protIter = IOTools.readFastaProtein(br, null);
> >> while (protIter.hasNext()) {
> >>      BioEntry bioEntry = protIter.nextBioEntry();
> >>      System.out.println(fastaProteinFileName + " == " + accessionId
> >>              + " = " + bioEntry.getAccession());
> >> }
> >
> >
> > At particular points in my fasta file - I get the following exception:
> >
> >> 14:53:42,546 ERROR FastaFileProcessing  - File parsing exception (from biojava library)
> >> org.biojava.bio.BioException: Could not read sequence
> >>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
> >>     at org.biojavax.bio.seq.io.RichStreamReader.nextBioEntry(RichStreamReader.java:99)
> >>     at edu.imperial.msc.orthologue.fasta.FastaFileProcessing.getProteinSequenceFromFASTAFile(FastaFileProcessing.java:60)
> >>     at edu.imperial.msc.orthologue.core.OrthologueFinder.getFASTAEntries(OrthologueFinder.java:64)
> >>     at edu.imperial.msc.orthologue.core.OrthologueFinder.<init>(OrthologueFinder.java:51)
> >>     at edu.imperial.msc.orthologue.launcher.OrthologueFinderLauncher.main(OrthologueFinderLauncher.java:60)
> >> Caused by: java.io.IOException: Mark invalid
> >>     at java.io.BufferedReader.reset(Unknown Source)
> >>     at org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:202)
> >>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
> >>     ... 5 more
> >
> >
> > Interestingly, if I delete the rest of the header line after the accession
> > (from type=protein... to the end of the line, ...Dgri;)
> >
> >> >FBpp0145468 type=protein; loc=scaffold_15252:join(13219687..13219727,13219972..13220279,13220507..13220798,13220861..13221180,13221286..13221467,13222258..13222629,13226331..13226463,13226531..13226658); ID=FBpp0145468; name=Dgri\GH11562-PA; parent=FBgn0119042,FBtr0146976; dbxref=FlyBase:FBpp0145468,FlyBase_Annotation_IDs:GH11562-PA; MD5=c8dc38c7197a0d3c93c78b08059e2604; length=591; release=r1.3; species=Dgri;
> >>
> >
> > It works - but I have a number of these exceptions (and I do not want to
> > edit the original data).  Mind you I have longer headers in my file which
> > are parsed OK (strange!).
> >
> > Any ideas, anyone? Alternatively, is there a better way to get ONE
> > SINGLE sequence from the whole FASTA file, given that I have the
> > accession ID (FBpp0145468)?
> >
> > Many Thanks
> > JP
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
>
> --
> Richard Holland, BSc MBCS
> Finance Director, Eagle Genomics Ltd
> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/
>