[Biojava-l] FASTA parsing bug ?

Tue Apr 28 15:21:25 UTC 2009

You're right, doesn't look like newlines.

The "Mark invalid" error happens when the parser reads further ahead in
the file than its marked read-ahead limit allows while seeking out the
next valid sequence to parse. I'm not sure why this is happening.
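
To illustrate the mechanism (this is plain java.io behaviour, not
BioJava's actual parsing code): a BufferedReader only guarantees that
reset() works while no more than the mark's read-ahead limit has been
read since mark(). The sketch below uses a deliberately tiny buffer and
limit, both chosen just for the demo, and the class and method names are
hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class MarkInvalidDemo {

    /**
     * Marks the start of the stream with the given read-ahead limit, reads
     * up to charsToRead characters, then tries to rewind. Returns "ok" if
     * reset() succeeds, otherwise the message of the IOException.
     */
    static String markReadReset(String text, int readAheadLimit, int charsToRead) {
        // A 4-char buffer (illustrative only) forces frequent refills, which
        // is when the reader notices that the mark has been overrun.
        try (BufferedReader br = new BufferedReader(new StringReader(text), 4)) {
            br.mark(readAheadLimit);
            for (int i = 0; i < charsToRead; i++) {
                if (br.read() == -1) {
                    break; // end of input
                }
            }
            br.reset(); // fails if we read past the read-ahead limit
            return "ok";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Made-up two-record FASTA text, only used as something to read.
        String fasta = ">id1 some header\nMKV\n>id2\nMKL\n";
        System.out.println(markReadReset(fasta, 64, 10)); // limit not exceeded
        System.out.println(markReadReset(fasta, 2, 10));  // limit exceeded
    }
}
```

On an OpenJDK-based runtime the first call should print ok and the second
Mark invalid, the same message as in the stack trace.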

I don't have time to test right now, but if you could post a link to
where the same FASTA file can be downloaded, someone else could
investigate in more detail.

thanks,
Richard

JP wrote:
> Thanks Richard for your prompt reply.
> 
> I will not attach the FASTA file I am parsing (12MB); it's
> dgri-all-translation-r1.3.fasta from the FlyBase project.
> 
> If the file had any extra newlines I would see them when I loaded it in
> a text editor - no?
> 
> I implemented the whole thing without using Biojava (for this part)
> 
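>     // Scan the file line by line and collect only the wanted record.
>     fr = new FileReader(fastaProteinFileName);
>     br = new BufferedReader(fr);
>     String fastaLine;
>     // Note: a prefix match also accepts longer IDs starting with accessionId.
>     String startAccession = '>' + accessionId.trim();
>     StringBuilder fastaEntry = new StringBuilder();
>     boolean record = false;
>     while ((fastaLine = br.readLine()) != null) {
>         fastaLine = fastaLine.trim() + '\n';
>         if (fastaLine.startsWith(startAccession)) {
>             record = true;   // header of the wanted record
>         } else if (record && fastaLine.startsWith(">")) {
>             record = false;  // next header: the record is complete
>             break;
>         }
>         if (record) {
>             fastaEntry.append(fastaLine);
>         }
>     }
>     br.close();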
> 
> 
> Note that I do not use a regex, since I'd need to read the whole file
> first and then match against it. This way, if the record is the first
> one, I read only that one.
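
A hedged sketch of the same hand-rolled approach as a self-contained
method, assuming the accession ID is the text between '>' and the first
whitespace in the header. Unlike a prefix test, it compares the whole ID,
so FBpp0145468 does not also match a longer ID that merely starts with
it. The class and method names (FastaRecordFinder, findRecord) and the
sample records in main are made up for illustration; nothing here is
BioJava API.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class FastaRecordFinder {

    /**
     * Returns the record (header plus sequence lines) whose header ID equals
     * accessionId, or null if no such record exists. Reading stops at the
     * header that follows the wanted record.
     */
    static String findRecord(Reader in, String accessionId) {
        StringBuilder entry = new StringBuilder();
        boolean recording = false;
        try (BufferedReader br = new BufferedReader(in)) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.startsWith(">")) {
                    if (recording) {
                        break; // next header: the wanted record is complete
                    }
                    // ID = first whitespace-delimited token after '>'
                    String id = line.substring(1).trim().split("\\s+", 2)[0];
                    recording = id.equals(accessionId);
                }
                if (recording) {
                    entry.append(line.trim()).append('\n');
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return recording ? entry.toString() : null;
    }

    public static void main(String[] args) {
        // Made-up sequences; the first ID only shares a prefix with the target.
        String fasta = ">FBpp01454680 type=protein;\nAAAA\n"
                     + ">FBpp0145468 type=protein;\nMKV\nLLT\n"
                     + ">FBpp0000001\nCCCC\n";
        System.out.print(findRecord(new StringReader(fasta), "FBpp0145468"));
    }
}
```

For a real file, the StringReader would be replaced by a FileReader on
the FASTA path.
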
> 
> Cheers
> JP
> 
> 
> On Tue, Apr 28, 2009 at 3:27 PM, Richard Holland
> <holland at eaglegenomics.com> wrote:
> 
>     The "Mark invalid" exception indicates that the parser has gone too
>     far ahead in the file looking for a valid header. I'm not sure why, but
>     looking at your original query, could there be extra newlines embedded
>     in your FASTA header line? That would definitely confuse it.
> 
>     The parser is not currently able to pull out just one sequence - in
>     effect that is a search facility, which it doesn't have. :(
> 
>     thanks,
>     Richard
> 
>     JP wrote:
>     > Hi all at BioJava,
>     >
>     > I am trying to parse several FASTA files using the following code:
>     >
>     >> fr = new FileReader(fastaProteinFileName);
>     >> br = new BufferedReader(fr);
>     >>
>     >> RichSequenceIterator protIter = IOTools.readFastaProtein(br, null);
>     >> while (protIter.hasNext()) {
>     >>      BioEntry bioEntry = protIter.nextBioEntry();
>     >>      System.out.println(fastaProteinFileName + " == " + accessionId
>     >>              + " = " + bioEntry.getAccession());
>     >> }
>     >
>     >
>     > At particular points in my fasta file - I get the following exception:
>     >
>     >> 14:53:42,546 ERROR FastaFileProcessing  - File parsing exception (from biojava library)
>     >> org.biojava.bio.BioException: Could not read sequence
>     >>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
>     >>     at org.biojavax.bio.seq.io.RichStreamReader.nextBioEntry(RichStreamReader.java:99)
>     >>     at edu.imperial.msc.orthologue.fasta.FastaFileProcessing.getProteinSequenceFromFASTAFile(FastaFileProcessing.java:60)
>     >>     at edu.imperial.msc.orthologue.core.OrthologueFinder.getFASTAEntries(OrthologueFinder.java:64)
>     >>     at edu.imperial.msc.orthologue.core.OrthologueFinder.<init>(OrthologueFinder.java:51)
>     >>     at edu.imperial.msc.orthologue.launcher.OrthologueFinderLauncher.main(OrthologueFinderLauncher.java:60)
>     >> Caused by: java.io.IOException: Mark invalid
>     >>     at java.io.BufferedReader.reset(Unknown Source)
>     >>     at org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:202)
>     >>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
>     >>     ... 5 more
>     >
>     >
>     > Interestingly, if I delete the description portion of the header line
>     > (from type=protein... to the end of the line, ...Dgri;)
>     >
>     >> FBpp0145468 type=protein;
>     >> loc=scaffold_15252:join(13219687..13219727,13219972..13220279,13220507..13220798,13220861..13221180,13221286..13221467,13222258..13222629,13226331..13226463,13226531..13226658);
>     >> ID=FBpp0145468; name=Dgri\GH11562-PA; parent=FBgn0119042,FBtr0146976;
>     >> dbxref=FlyBase:FBpp0145468,FlyBase_Annotation_IDs:GH11562-PA;
>     >> MD5=c8dc38c7197a0d3c93c78b08059e2604; length=591; release=r1.3;
>     >> species=Dgri;
>     >>
>     >
>     > It works - but I have a number of these exceptions (and I do not want
>     > to edit the original data).  Mind you I have longer headers in my file
>     > which are parsed OK (strange!).
>     >
>     > Any ideas anyone?  Alternatively - is there a better way to get ONE
>     > SINGLE sequence from the whole fasta file, given that I have the
>     > accession id (FBpp0145468)?
>     >
>     > Many Thanks
>     > JP
>     >
> 
>     --
>     Richard Holland, BSc MBCS
>     Finance Director, Eagle Genomics Ltd
>     T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
>     http://www.eaglegenomics.com/
> 
> 

-- 
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/