[Biojava-l] Error parsing ipi.HUMAN.fasta file
Chris Cole
chris at compbio.dundee.ac.uk
Fri Dec 18 15:53:25 UTC 2009
I'm trying to parse a Fasta file obtained from IPI using the code at
the bottom of this message, but I get the following error:
org.biojava.bio.BioException: Could not read sequence
    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
    at test.readFasta(test.java:39)
    at test.main(test.java:18)
Caused by: java.io.IOException: Mark invalid
    at java.io.BufferedReader.reset(BufferedReader.java:485)
    at org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:202)
    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
    ... 2 more
Looking at the Fasta file itself and doing some tests, it seems to fail
consistently at one or two entries /preceding/ an entry with a very long
description line e.g.:
>IPI:IPI00021421.4|SWISS-PROT:Q9UMR5-1|TREMBL:B0S868|ENSEMBL:ENSP00000382748;ENSP00000382749;ENSP00000382750;ENSP00000387679;ENSP00000388341;ENSP00000388618;ENSP00000389930;ENSP00000392885;ENSP00000393009;ENSP00000395242;ENSP00000395562;ENSP00000397025;ENSP00000399879;ENSP00000403820;ENSP00000406496;ENSP00000406566;ENSP00000408703;ENSP00000411007;ENSP00000411625;ENSP00000412827|REFSEQ:NP_005146|VEGA:OTTHUMP00000014775;OTTHUMP00000014776;OTTHUMP00000014778;OTTHUMP00000175028;OTTHUMP00000175029;OTTHUMP00000175030;OTTHUMP00000193135;OTTHUMP00000193136;OTTHUMP00000193138;OTTHUMP00000193964;OTTHUMP00000193965;OTTHUMP00000193967;OTTHUMP00000194391;OTTHUMP00000194392;OTTHUMP00000194394 Tax_Id=9606 Gene_Symbol=PPT2 Isoform 1 of Lysosomal thioesterase PPT2
MLGLWGQRLPAAWVLLLLPFLPLLLLAAPAPHRASYKPVIVVHGLFDSSYSFRHLLEYIN
ETHPGTVVTVLDLFDGRESLRPLWEQVQGFREAVVPIMAKAPQGVHLICYSQGGLVCRAL
LSVMDDHNVDSFISLSSPQMGQYGDTDYLKWLFPTSMRSNLYRICYSPWGQEFSICNYWH
DPHHDDLYLNASSFLALINGERDHPNATVWRKNFLRVGHLVLIGGPDDGVITPWQSSFFG
FYDANETVLEMEEQLVYLRDSFGLKTLLARGAIVRCPMAGISHTAWHSNRTLYETCIEPW
LS
Deleting the large entries allows the code to continue until it reaches
another long description line.
It also seems to happen only with large Fasta files: reading the above
sequence alone, or as part of a small file, works fine.
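The "Mark invalid" message in the trace is raised by
java.io.BufferedReader.reset(). The following is a minimal sketch,
independent of BioJava, of one way that message can arise: a mark is set
with a small read-ahead limit and a long line is then read before calling
reset(). The class name MarkDemo, the 16-character buffer and the
8-character limit are illustrative assumptions, not values taken from
FastaFormat.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class MarkDemo {
    // Returns null if reset() succeeds, otherwise the exception message.
    static String tryReset(int lineLength) throws IOException {
        // Build a single line of the requested length, newline-terminated.
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < lineLength; i++) {
            text.append('A');
        }
        text.append('\n');

        // Deliberately tiny buffer (16 chars) so the effect shows on short input.
        BufferedReader br = new BufferedReader(new StringReader(text.toString()), 16);
        try {
            br.mark(8);     // allow at most 8 chars to be read before reset()
            br.readLine();  // may consume far more than 8 chars
            br.reset();     // throws IOException if the mark was dropped
            return null;
        } catch (IOException ex) {
            return ex.getMessage();
        } finally {
            br.close();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("short line: " + tryReset(4));
        System.out.println("long line:  " + tryReset(100));
    }
}
```

In this sketch the mark is only dropped when the reader has to refill its
buffer after the limit has been passed, so a long line does not always
trigger the error. That would fit a failure that depends on both the
header length and where the entry falls in a large file, but whether
FastaFormat does the equivalent has not been checked against the BioJava
source.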
Is this a known problem or am I doing something wrong? BTW I'm using
biojava 1.7 and Java 1.6.0_17.
Any help would be most appreciated.
Cheers.
code:
import java.io.*;
import org.biojava.bio.*;
import org.biojavax.*;
import org.biojavax.bio.seq.*;

public class test {
    private static PrintStream o = System.out;

    public static void main(String[] args) {
        // TODO Auto-generated method stub
        readFasta(args[0]);
    }

    public static void readFasta(String filename) {
        try {
            o.println("Reading file: " + filename);
            //prepare a BufferedReader for file io
            BufferedReader br = new BufferedReader(new FileReader(filename));
            // read Fasta file as BioJava RichSequence object
            Namespace ns = RichObjectFactory.getDefaultNamespace();
            RichSequenceIterator iter =
                RichSequence.IOTools.readFastaProtein(br,ns);
            int numProteins = 0;
            while(iter.hasNext()) {
                ++numProteins;
                // Retrieve sequence and description data
                RichSequence seq = iter.nextRichSequence();
                String ipi = seq.getName().substring(4,15);
                o.println(ipi);
            }
            o.println("Found " + numProteins + " in Fasta file");
        } catch (FileNotFoundException ex) {
            //can't find file specified by args[0]
            ex.printStackTrace();
        } catch (BioException ex) {
            //error parsing requested format
            ex.printStackTrace();
        }
    }
}