[Biojava-l] Error parsing ipi.HUMAN.fasta file

Fri Dec 18 15:53:25 UTC 2009

I'm trying to parse a FASTA file obtained from IPI using the code at 
the bottom of this message, but I get the following error:

org.biojava.bio.BioException: Could not read sequence
	at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
	at test.readFasta(test.java:39)
	at test.main(test.java:18)
Caused by: java.io.IOException: Mark invalid
	at java.io.BufferedReader.reset(BufferedReader.java:485)
	at org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:202)
	at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
	... 2 more

Looking at the FASTA file itself and running some tests, the parser seems 
to fail consistently one or two entries /before/ an entry with a very long 
description line, e.g.:
 >IPI:IPI00021421.4|SWISS-PROT:Q9UMR5-1|TREMBL:B0S868|ENSEMBL:ENSP00000382748;ENSP00000382749;ENSP00000382750;ENSP00000387679;ENSP00000388341;ENSP00000388618;ENSP00000389930;ENSP00000392885;ENSP00000393009;ENSP00000395242;ENSP00000395562;ENSP00000397025;ENSP00000399879;ENSP00000403820;ENSP00000406496;ENSP00000406566;ENSP00000408703;ENSP00000411007;ENSP00000411625;ENSP00000412827|REFSEQ:NP_005146|VEGA:OTTHUMP00000014775;OTTHUMP00000014776;OTTHUMP00000014778;OTTHUMP00000175028;OTTHUMP00000175029;OTTHUMP00000175030;OTTHUMP00000193135;OTTHUMP00000193136;OTTHUMP00000193138;OTTHUMP00000193964;OTTHUMP00000193965;OTTHUMP00000193967;OTTHUMP00000194391;OTTHUMP00000194392;OTTHUMP00000194394 Tax_Id=9606 Gene_Symbol=PPT2 Isoform 1 of Lysosomal thioesterase PPT2
MLGLWGQRLPAAWVLLLLPFLPLLLLAAPAPHRASYKPVIVVHGLFDSSYSFRHLLEYIN
ETHPGTVVTVLDLFDGRESLRPLWEQVQGFREAVVPIMAKAPQGVHLICYSQGGLVCRAL
LSVMDDHNVDSFISLSSPQMGQYGDTDYLKWLFPTSMRSNLYRICYSPWGQEFSICNYWH
DPHHDDLYLNASSFLALINGERDHPNATVWRKNFLRVGHLVLIGGPDDGVITPWQSSFFG
FYDANETVLEMEEQLVYLRDSFGLKTLLARGAIVRCPMAGISHTAWHSNRTLYETCIEPW
LS

Deleting the large entries allows the code to continue until it reaches 
another long description line.

It also seems to happen only with large FASTA files: reading the above 
sequence alone, or as part of a small file, works fine.
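
The "Mark invalid" at BufferedReader.reset can be reproduced with plain 
java.io, which may explain why only large files are affected. The sketch 
below assumes the parser calls mark() with a fixed read-ahead limit, reads 
one line to peek at the next record, and then calls reset(). The limit of 8 
and the buffer sizes are made-up values for illustration, not the ones 
FastaFormat actually uses, and the class name MarkInvalidDemo is 
hypothetical. The mark is lost only when the peeked line is longer than the 
limit and the reader has to refill its buffer part-way through that line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class MarkInvalidDemo {

    // Builds a string of n copies of c (stands in for a very long header line).
    static String repeat(char c, int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    // Marks the reader, reads one line, then tries to reset.
    // Returns true if reset() throws, i.e. the mark was invalidated.
    // bufferSize and readAheadLimit are illustrative values (assumptions).
    static boolean resetFails(String text, int bufferSize, int readAheadLimit) {
        BufferedReader br = new BufferedReader(new StringReader(text), bufferSize);
        try {
            br.mark(readAheadLimit);
            br.readLine();
        } catch (IOException e) {
            // A StringReader should not fail here; treat it as a bug in the demo.
            throw new IllegalStateException("unexpected read failure", e);
        }
        try {
            br.reset();
            return false;
        } catch (IOException e) {
            // BufferedReader reports "Mark invalid" in this case.
            return true;
        }
    }

    public static void main(String[] args) {
        String longHeader = repeat('x', 100);

        // Long line, small buffer: the buffer is refilled mid-line, mark is lost.
        System.out.println(resetFails(longHeader + "\nACGT\n", 16, 8));   // true

        // Short line: fits within the limit, reset works.
        System.out.println(resetFails("short\nACGT\n", 16, 8));           // false

        // Long line, but the whole input fits in one buffer: no refill, reset works.
        System.out.println(resetFails(longHeader + "\nACGT\n", 8192, 8)); // false
    }
}
```

With the default 8192-character buffer a refill only happens every few 
kilobytes, so a short file never triggers it. In a large file it would 
depend on where the long header falls relative to a buffer boundary.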

Is this a known problem, or am I doing something wrong? BTW, I'm using 
BioJava 1.7 and Java 1.6.0_17.
Any help would be most appreciated.
Cheers.

code:
import java.io.*;

import org.biojava.bio.*;
import org.biojavax.*;
import org.biojavax.bio.seq.*;

public class test {
    private static PrintStream o = System.out;

    public static void main(String[] args) {
       // expects the path of the FASTA file as the first argument
       readFasta(args[0]);
    }

    public static void readFasta(String filename) {
       try {
          o.println("Reading file: " + filename);
          //prepare a BufferedReader for file io
          BufferedReader br = new BufferedReader(new FileReader(filename));

          // read Fasta file as BioJava RichSequence object
          Namespace ns = RichObjectFactory.getDefaultNamespace();
          RichSequenceIterator iter =
                RichSequence.IOTools.readFastaProtein(br, ns);

          int numProteins = 0;
          while(iter.hasNext()) {
             ++numProteins;

             // Retrieve sequence and description data
             RichSequence seq = iter.nextRichSequence();
             String ipi = seq.getName().substring(4,15);
             o.println(ipi);

          }
          o.println("Found " + numProteins + " in Fasta file");
       } catch (FileNotFoundException ex) {
          // can't find file specified by args[0]
          ex.printStackTrace();
       } catch (BioException ex) {
          // error parsing requested format
          ex.printStackTrace();
       }
    }
}
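
If only the accession is needed, the header lines can also be scanned with 
plain java.io, without going through the BioJava parser at all. This is a 
minimal sketch rather than a replacement for the parser: it assumes every 
record header starts with ">IPI:" and that the accession ends at the first 
'.', '|' or whitespace, and the names IpiHeaderScan and readAccessions are 
hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

public class IpiHeaderScan {

    // Collects the accession (without version suffix) from each ">IPI:" header.
    // Sequence lines are skipped; no mark()/reset() is involved.
    static List<String> readAccessions(Reader in) throws IOException {
        List<String> ids = new ArrayList<String>();
        BufferedReader br = new BufferedReader(in);
        String line;
        while ((line = br.readLine()) != null) {
            line = line.trim();
            if (!line.startsWith(">IPI:")) {
                continue; // sequence data or an unrelated line
            }
            // ">IPI:IPI00021421.4|..." -> "IPI00021421"
            int start = ">IPI:".length();
            int end = start;
            while (end < line.length()) {
                char c = line.charAt(end);
                if (c == '.' || c == '|' || Character.isWhitespace(c)) {
                    break;
                }
                end++;
            }
            ids.add(line.substring(start, end));
        }
        return ids;
    }
}
```

Calling readAccessions(new FileReader(filename)) and printing the list 
yields the same identifiers as the substring(4,15) call in the code above, 
as long as the headers follow the IPI layout shown earlier.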