[Biojava-l] FASTA parsing bug ?

Tue Apr 28 14:01:04 UTC 2009

Hi all at BioJava,

I am trying to parse several FASTA files using the following code:

fr = new FileReader(fastaProteinFileName);
br = new BufferedReader(fr);

RichSequenceIterator protIter = IOTools.readFastaProtein(br, null);
while (protIter.hasNext()) {
     BioEntry bioEntry = protIter.nextBioEntry();
     System.out.println(fastaProteinFileName + " == " + accessionId + " = " + bioEntry.getAccession());
}

At particular points in my FASTA file I get the following exception:

14:53:42,546 ERROR FastaFileProcessing  - File parsing exception (from biojava library)
org.biojava.bio.BioException: Could not read sequence
    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
    at org.biojavax.bio.seq.io.RichStreamReader.nextBioEntry(RichStreamReader.java:99)
    at edu.imperial.msc.orthologue.fasta.FastaFileProcessing.getProteinSequenceFromFASTAFile(FastaFileProcessing.java:60)
    at edu.imperial.msc.orthologue.core.OrthologueFinder.getFASTAEntries(OrthologueFinder.java:64)
    at edu.imperial.msc.orthologue.core.OrthologueFinder.<init>(OrthologueFinder.java:51)
    at edu.imperial.msc.orthologue.launcher.OrthologueFinderLauncher.main(OrthologueFinderLauncher.java:60)
Caused by: java.io.IOException: Mark invalid
    at java.io.BufferedReader.reset(Unknown Source)
    at org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:202)
    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
    ... 5 more
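
For reference, the inner "Mark invalid" message can be reproduced with plain java.io, without BioJava. The following is only a minimal sketch: the class name, the 200-character filler input, the 16-character buffer and the read-ahead limit of 4 are all made up for illustration and are not the values FastaFormat uses. It shows that BufferedReader.reset() can fail once more characters have been read than the limit passed to mark(); the Javadoc only says that such a reset "may fail", so whether it does can depend on where the reader's internal buffer happens to be refilled.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;

final class MarkInvalidDemo {

    /**
     * Marks the start of a 200-character stream, reads charsToRead characters,
     * then calls reset(). Returns the exception message if reset() fails,
     * or null if it succeeds.
     */
    static String resetAfterReading(int bufferSize, int readAheadLimit, int charsToRead) {
        char[] filler = new char[200];          // hypothetical input, content is irrelevant
        Arrays.fill(filler, 'x');
        try (BufferedReader br =
                 new BufferedReader(new StringReader(new String(filler)), bufferSize)) {
            br.mark(readAheadLimit);
            for (int i = 0; i < charsToRead; i++) {
                br.read();
            }
            try {
                br.reset();
                return null;                    // the mark was still valid
            } catch (IOException e) {
                return e.getMessage();          // the mark was dropped
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Tiny buffer: the buffer is refilled after the limit was exceeded.
        System.out.println(resetAfterReading(16, 4, 40));   // prints "Mark invalid"
        // Large buffer: all 40 reads are served from a single buffer fill.
        System.out.println(resetAfterReading(8192, 4, 40));
    }
}
```

If something similar happens inside FastaFormat, it might explain why the failure depends on where a record sits in the file rather than on the header length alone, but that is a guess and not something the stack trace confirms.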

Interestingly, if I delete most of the header line below (everything from
type=protein to the end of the line, ...Dgri;):

>FBpp0145468 type=protein; loc=scaffold_15252:join(13219687..13219727,13219972..13220279,13220507..13220798,13220861..13221180,13221286..13221467,13222258..13222629,13226331..13226463,13226531..13226658); ID=FBpp0145468; name=Dgri\GH11562-PA; parent=FBgn0119042,FBtr0146976; dbxref=FlyBase:FBpp0145468,FlyBase_Annotation_IDs:GH11562-PA; MD5=c8dc38c7197a0d3c93c78b08059e2604; length=591; release=r1.3; species=Dgri;

It works, but I have a number of these exceptions (and I do not want to
edit the original data).  Mind you, I have longer headers in my file which
are parsed OK (strange!).

Any ideas, anyone?  Alternatively, is there a better way to get ONE
SINGLE sequence from the whole FASTA file, given that I have the accession id
(FBpp0145468)?
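
A minimal sketch of such a single-record lookup in plain Java, without BioJava, in case it helps frame the question. It assumes the accession id is the first whitespace-delimited token after ">" on the header line, as in the header shown above. FastaLookup and findSequence are hypothetical names, and the residues in the demo input are made up.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class FastaLookup {

    /**
     * Scans FASTA text and returns the residues of the first record whose
     * header starts with the given id, or null if no such record exists.
     */
    static String findSequence(Reader in, String id) {
        try (BufferedReader br = new BufferedReader(in)) {
            StringBuilder seq = null;           // non-null once the record is found
            String line;
            while ((line = br.readLine()) != null) {
                if (line.startsWith(">")) {
                    if (seq != null) {
                        break;                  // reached the next record: done
                    }
                    // First token of the header, e.g. "FBpp0145468"
                    String firstToken = line.substring(1).trim().split("\\s+", 2)[0];
                    if (firstToken.equals(id)) {
                        seq = new StringBuilder();
                    }
                } else if (seq != null) {
                    seq.append(line.trim());    // join wrapped sequence lines
                }
            }
            return seq == null ? null : seq.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up records; only the header layout mirrors the real file.
        String fasta = ">FBpp0000001 type=protein; species=Dgri;\nMKV\nLAA\n"
                     + ">FBpp0145468 type=protein; species=Dgri;\nMST\nNPK\n";
        System.out.println(findSequence(new StringReader(fasta), "FBpp0145468")); // prints "MSTNPK"
    }
}
```

For a real file, a FileReader on the FASTA file would be passed in place of the StringReader. The header is never parsed beyond its first token, so its length should not matter here.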

Many Thanks
JP