[Biojava-l] Error parsing ipi.HUMAN.fasta file

Fri Dec 18 16:58:27 UTC 2009

The FASTA parser uses a buffer to read ahead to the end of the next complete line and then back up, so that it can parse that line properly on a second pass (this is what allows it to do things like hasNext()). The exception shows that the read-ahead exceeded the size of that buffer, so the parser could not back up again afterwards.
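
As a minimal sketch of that failure, here is the same mark/read-ahead/reset pattern using plain java.io. This is not BioJava's actual code: the class and method names are made up for illustration, and the 16-character internal buffer is an assumption chosen only to make the failure easy to trigger. Note that the argument to mark() is counted in characters. As far as I can tell from the OpenJDK implementation, BufferedReader only invalidates a mark when it has to refill its internal buffer after reading past the limit, which would fit the observation that the same entry reads fine on its own or in a small file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class MarkResetDemo {

    /** Reads the next line, then rewinds so the caller can read it again. */
    static String peekLine(BufferedReader reader, int readAheadLimit) throws IOException {
        reader.mark(readAheadLimit);      // remember the current position
        String line = reader.readLine();  // read ahead to the end of the line
        reader.reset();                   // back up; throws if the mark was invalidated
        return line;
    }

    /** Builds a FASTA-like record whose header line has the given length. */
    static String record(int headerLength) {
        StringBuilder sb = new StringBuilder(">");
        while (sb.length() < headerLength) {
            sb.append('X');
        }
        return sb.append("\nMLGLWGQRLP\n").toString();
    }

    /** Returns "ok" if the peek succeeds, otherwise the IOException message. */
    static String tryPeek(int headerLength, int readAheadLimit) {
        // Deliberately tiny internal buffer (assumption for the demo) to force refills.
        BufferedReader reader =
                new BufferedReader(new StringReader(record(headerLength)), 16);
        try {
            peekLine(reader, readAheadLimit);
            return "ok";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryPeek(100, 8));     // limit shorter than the header: Mark invalid
        System.out.println(tryPeek(100, 1024));  // limit longer than the header: ok
    }
}
```

With a read-ahead limit larger than the longest header line, reset() succeeds and the line can be read again.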

There are two cures. One is to rewrite the FASTA parser to buffer things in a different way. The other is to open org/biojavax/bio/seq/io/FastaFormat.java in a text editor, find the line where it sets the buffer (somewhere around line 202 according to the exception, in the readRichSequence() method; the call to look for is 'mark'), and increase the buffer size to something suitably large (it's currently set at 500 bytes). Then recompile BioJava and it should work.
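
For the first cure, one possible way to buffer differently is to keep the line that was read ahead in a field instead of relying on mark() and reset(), so the length of a header line no longer matters. The sketch below is only an illustration under that assumption: PeekableLineReader is a hypothetical helper, not part of BioJava, and it is not how FastaFormat is actually written.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical helper (not part of BioJava): one-line lookahead without mark/reset. */
class PeekableLineReader {
    private final BufferedReader in;
    private String next;     // the line read ahead, if any
    private boolean primed;  // true while 'next' holds an unconsumed read

    PeekableLineReader(BufferedReader in) {
        this.in = in;
    }

    /** Returns the next line without consuming it, or null at end of input. */
    String peek() throws IOException {
        if (!primed) {
            next = in.readLine();
            primed = true;
        }
        return next;
    }

    /** Returns and consumes the next line, or null at end of input. */
    String readLine() throws IOException {
        String line = peek();
        primed = false;
        next = null;
        return line;
    }

    /** True if at least one more line is available. */
    boolean hasNext() throws IOException {
        return peek() != null;
    }

    /** Small demo: peeks and reads a record whose header has the given length. */
    static String demo(int headerLength) {
        StringBuilder sb = new StringBuilder(">");
        while (sb.length() < headerLength) {
            sb.append('X');
        }
        sb.append("\nMLGLWGQRLP\n");
        try {
            PeekableLineReader r = new PeekableLineReader(
                    new BufferedReader(new StringReader(sb.toString()), 16));
            String peeked = r.peek();   // look at the header without consuming it
            String read = r.readLine(); // now consume it
            return peeked.length() + " " + peeked.equals(read)
                    + " " + r.readLine() + " " + r.hasNext();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(5000)); // 5000 true MLGLWGQRLP false
    }
}
```

The trade-off is that only whole lines can be looked at ahead of time, which is enough for spotting the '>' that starts the next record.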

cheers,
Richard

On 18 Dec 2009, at 15:53, Chris Cole wrote:

> I'm wanting to parse a fasta file obtained from IPI using the code at the bottom of this message, but I get the following error:
> 
> org.biojava.bio.BioException: Could not read sequence
> 	at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
> 	at test.readFasta(test.java:39)
> 	at test.main(test.java:18)
> Caused by: java.io.IOException: Mark invalid
> 	at java.io.BufferedReader.reset(BufferedReader.java:485)
> 	at org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:202)
> 	at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
> 	... 2 more
> 
> Looking at the Fasta file itself and doing some tests, it seems to fail consistently at one or two entries /preceding/ an entry with a very long description line e.g.:
> >IPI:IPI00021421.4|SWISS-PROT:Q9UMR5-1|TREMBL:B0S868|ENSEMBL:ENSP00000382748;ENSP00000382749;ENSP00000382750;ENSP00000387679;ENSP00000388341;ENSP00000388618;ENSP00000389930;ENSP00000392885;ENSP00000393009;ENSP00000395242;ENSP00000395562;ENSP00000397025;ENSP00000399879;ENSP00000403820;ENSP00000406496;ENSP00000406566;ENSP00000408703;ENSP00000411007;ENSP00000411625;ENSP00000412827|REFSEQ:NP_005146|VEGA:OTTHUMP00000014775;OTTHUMP00000014776;OTTHUMP00000014778;OTTHUMP00000175028;OTTHUMP00000175029;OTTHUMP00000175030;OTTHUMP00000193135;OTTHUMP00000193136;OTTHUMP00000193138;OTTHUMP00000193964;OTTHUMP00000193965;OTTHUMP00000193967;OTTHUMP00000194391;OTTHUMP00000194392;OTTHUMP00000194394 Tax_Id=9606 Gene_Symbol=PPT2 Isoform 1 of Lysosomal thioesterase PPT2
> MLGLWGQRLPAAWVLLLLPFLPLLLLAAPAPHRASYKPVIVVHGLFDSSYSFRHLLEYIN
> ETHPGTVVTVLDLFDGRESLRPLWEQVQGFREAVVPIMAKAPQGVHLICYSQGGLVCRAL
> LSVMDDHNVDSFISLSSPQMGQYGDTDYLKWLFPTSMRSNLYRICYSPWGQEFSICNYWH
> DPHHDDLYLNASSFLALINGERDHPNATVWRKNFLRVGHLVLIGGPDDGVITPWQSSFFG
> FYDANETVLEMEEQLVYLRDSFGLKTLLARGAIVRCPMAGISHTAWHSNRTLYETCIEPW
> LS
> 
> Deleting the large entries allows the code to continue until it reaches another long description line.
> 
> It also seems to be a feature of large Fasta files as reading the above sequence alone or as part of a small file is fine.
> 
> Is this a known problem or am I doing something wrong? BTW I'm using biojava 1.7 and Java 1.6.0_17.
> Any help would be most appreciated.
> Cheers.
> 
> code:
> import java.io.*;
> 
> import org.biojava.bio.*;
> import org.biojavax.*;
> import org.biojavax.bio.seq.*;
> 
> public class test {
>   private static PrintStream o = System.out;
> 
>   public static void main(String[] args) {
>      // TODO Auto-generated method stub
>      readFasta(args[0]);
>   }
> 	
>   public static void readFasta(String filename) {
>      try {
>         o.println("Reading file: " + filename);
>         //prepare a BufferedReader for file io
>         BufferedReader br = new BufferedReader(new FileReader(filename));
> 
>         // read Fasta file as BioJava RichSequence object
>         Namespace ns = RichObjectFactory.getDefaultNamespace();
>         RichSequenceIterator iter = RichSequence.IOTools.readFastaProtein(br,ns);
> 
>         int numProteins = 0;
>         while(iter.hasNext()) {
>            ++numProteins;
> 
>            // Retrieve sequence and description data
>            RichSequence seq = iter.nextRichSequence();
>            String ipi = seq.getName().substring(4,15);
>            o.println(ipi);
> 			
>         }
>         o.println("Found " + numProteins + " in Fasta file");
>      } catch (FileNotFoundException ex) {
>         //can't find file specified by args[0]
>         ex.printStackTrace();
>      } catch (BioException ex) {
>         //error parsing requested format
>         ex.printStackTrace();
>      }
>   }
> }

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/