[Biojava-l] FASTA parsing bug ?

Tue Apr 28 14:27:36 UTC 2009

The "Mark invalid" exception indicates that the parser has read too far
ahead in the file while looking for a valid header. I'm not sure why, but
looking at your original query, could there be extra newlines embedded
in your FASTA header line? That would definitely confuse it.
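
To illustrate where that message comes from: the stack trace shows
FastaFormat.readRichSequence calling BufferedReader.reset(), and reset()
fails with "Mark invalid" once the reader has had to refill its buffer
beyond the read-ahead limit passed to mark(). The sketch below reproduces
this with plain java.io only. The class name, the 8-character buffer and
the limit of 4 are made up for the demonstration and are not the values
BioJava uses.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class MarkInvalidDemo {
    // Marks the start of the stream, reads charsToRead characters,
    // then tries to rewind. Returns "reset ok" or the IOException message.
    static String resetAfterReading(int charsToRead) {
        String text = "0123456789abcdefghijklmnopqrstuvwxyz";
        // Deliberately tiny buffer (8 chars) so a short input shows the effect.
        try (BufferedReader br = new BufferedReader(new StringReader(text), 8)) {
            br.mark(4); // promise to read at most 4 chars before reset()
            for (int i = 0; i < charsToRead; i++) {
                br.read();
            }
            br.reset(); // throws if the mark was invalidated by a buffer refill
            return "reset ok";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(resetAfterReading(3));  // stays within the limit
        System.out.println(resetAfterReading(20)); // reads well past the limit
    }
}
```

Reading 3 characters leaves the mark intact, while reading 20 forces a
refill past the limit and the reset fails, which is the same failure mode
as in the trace.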

The parser currently cannot pull out just one sequence. That would in
effect be a search facility, which it doesn't have. :(
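
As a possible workaround outside BioJava, a single record can be found
with a plain line-by-line scan. This is only a sketch: FastaLookup and its
methods are hypothetical helpers, not part of any library, and it assumes
the accession is the first token on the header line (as in
">FBpp0145468 type=protein; ..."). It never uses mark()/reset(), so the
header length does not matter.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class FastaLookup {
    // Returns the residues of the first record whose header starts with
    // the given accession, or null if no such record exists.
    static String findSequence(BufferedReader in, String accession) throws IOException {
        StringBuilder seq = null; // non-null once the wanted record is found
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith(">")) {
                if (seq != null) {
                    break; // next record begins, so the wanted one is complete
                }
                // First token of the header, ending at whitespace or ';'
                String id = line.substring(1).trim().split("[\\s;]+", 2)[0];
                if (id.equals(accession)) {
                    seq = new StringBuilder();
                }
            } else if (seq != null) {
                seq.append(line.trim()); // sequence lines may be wrapped
            }
        }
        return seq == null ? null : seq.toString();
    }

    // Convenience wrapper for in-memory FASTA text.
    static String findInString(String fasta, String accession) {
        try (BufferedReader in = new BufferedReader(new StringReader(fasta))) {
            return findSequence(in, accession);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a file, pass new BufferedReader(new FileReader(fastaProteinFileName))
to findSequence. The result is a raw string, so it would still have to be
turned into a sequence object if the rest of the code needs one.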

thanks,
Richard

JP wrote:
> Hi all at BioJava,
> 
> I am trying to parse several FASTA files using the following code:
> 
> fr = new FileReader(fastaProteinFileName);
> br = new BufferedReader(fr);
>
> RichSequenceIterator protIter = IOTools.readFastaProtein(br, null);
> while (protIter.hasNext()) {
>     BioEntry bioEntry = protIter.nextBioEntry();
>     System.out.println(fastaProteinFileName + " == " + accessionId + " = " + bioEntry.getAccession());
> }
> 
> 
> At particular points in my FASTA file, I get the following exception:
> 
> 14:53:42,546 ERROR FastaFileProcessing  - File parsing exception (from biojava library)
> org.biojava.bio.BioException: Could not read sequence
>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
>     at org.biojavax.bio.seq.io.RichStreamReader.nextBioEntry(RichStreamReader.java:99)
>     at edu.imperial.msc.orthologue.fasta.FastaFileProcessing.getProteinSequenceFromFASTAFile(FastaFileProcessing.java:60)
>     at edu.imperial.msc.orthologue.core.OrthologueFinder.getFASTAEntries(OrthologueFinder.java:64)
>     at edu.imperial.msc.orthologue.core.OrthologueFinder.<init>(OrthologueFinder.java:51)
>     at edu.imperial.msc.orthologue.launcher.OrthologueFinderLauncher.main(OrthologueFinderLauncher.java:60)
> Caused by: java.io.IOException: Mark invalid
>     at java.io.BufferedReader.reset(Unknown Source)
>     at org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:202)
>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
>     ... 5 more
> 
> 
> Interestingly, if I delete the rest of the header line after the ID (from
> type=protein... to the end of the line, ...Dgri;)
> 
>> FBpp0145468 type=protein;
>> loc=scaffold_15252:join(13219687..13219727,13219972..13220279,13220507..13220798,13220861..13221180,13221286..13221467,13222258..13222629,13226331..13226463,13226531..13226658);
>> ID=FBpp0145468; name=Dgri\GH11562-PA; parent=FBgn0119042,FBtr0146976;
>> dbxref=FlyBase:FBpp0145468,FlyBase_Annotation_IDs:GH11562-PA;
>> MD5=c8dc38c7197a0d3c93c78b08059e2604; length=591; release=r1.3;
>> species=Dgri;
>>
> 
> It works - but I get a number of these exceptions (and I do not want to
> edit the original data).  Mind you, I have longer headers in my file which
> are parsed OK (strange!).
> 
> Any ideas, anyone?  Alternatively, is there a better way to get ONE
> SINGLE sequence from the whole FASTA file, given that I have the accession ID
> (FBpp0145468)?
> 
> Many Thanks
> JP

-- 
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/