[Biojava-l] Error parsing ipi.HUMAN.fasta file

Josh Goodman jogoodma at indiana.edu
Fri Dec 18 16:44:56 UTC 2009


Hi Chris,

I've run into this problem before.
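
The "Mark invalid" at the bottom of your trace comes from java.io.BufferedReader.reset(). A mark set with mark(readAheadLimit) is only guaranteed to survive until that many characters have been read, and the trace shows FastaFormat calling reset(). A very long header line is a plausible way to overrun that limit. In practice the mark is dropped only when the buffer has to be refilled past the limit, which may be why small files get away with it. Here is a minimal sketch of the failure in plain Java, independent of BioJava. The class name, method name, buffer size and limits are made up for illustration, and this is not meant to mirror what FastaFormat actually does:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class MarkResetDemo {
    // Marks the stream, reads one long line, then tries to rewind to the mark.
    // Returns the re-read line, or the exception message if reset() fails.
    public static String markReadReset(int readAheadLimit) throws IOException {
        StringBuilder longLine = new StringBuilder(">");
        for (int i = 0; i < 100; i++) {
            longLine.append('X'); // stand-in for a very long description line
        }
        // A deliberately tiny 16-char buffer forces refills while the line is read.
        BufferedReader br =
            new BufferedReader(new StringReader(longLine + "\nMLGL\n"), 16);

        br.mark(readAheadLimit);
        br.readLine(); // reads 101 chars, possibly more than the limit allows
        try {
            br.reset();
            return br.readLine();
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        // Limit far smaller than the line: the mark is lost.
        System.out.println(markReadReset(4));
        // Limit larger than the line: reset() works and the line is read again.
        System.out.println(markReadReset(200).length());
    }
}
```

With the small limit the first call reports "Mark invalid", the same message as in your trace; with the larger limit the 101-character line is read a second time.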

See http://lists.open-bio.org/pipermail/biojava-l/2009-May/006834.html for details and some
unofficial patches that fix the problem.
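
If patching BioJava is not an option and all you need is the accession from each header (as in your loop), one stopgap is to bypass the parser and scan the header lines directly. A rough sketch, assuming every entry starts with ">IPI:<accession>.<version>|" like the one you quoted. The class and method names are mine, not part of BioJava, and it ignores the sequence data entirely:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

class IpiHeaderScan {
    // Returns the IPI accession (version suffix removed) of every header line.
    public static List<String> accessions(Reader in) throws IOException {
        List<String> ids = new ArrayList<String>();
        BufferedReader br = new BufferedReader(in);
        String line;
        while ((line = br.readLine()) != null) {
            if (!line.startsWith(">")) {
                continue; // sequence line, not needed here
            }
            // First '|'-separated field, e.g. "IPI:IPI00021421.4"
            String field = line.substring(1).split("\\|", 2)[0];
            if (field.startsWith("IPI:")) {
                field = field.substring(4);
            }
            int dot = field.indexOf('.');
            ids.add(dot < 0 ? field : field.substring(0, dot));
        }
        return ids;
    }

    public static void main(String[] args) throws IOException {
        Reader in = new FileReader(args[0]); // path to the FASTA file
        try {
            List<String> ids = accessions(in);
            for (String id : ids) {
                System.out.println(id);
            }
            System.out.println("Found " + ids.size() + " headers");
        } finally {
            in.close();
        }
    }
}
```

Since it only uses readLine() and never calls mark() or reset(), the length of the description line should not matter here.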

Josh


On 12/18/2009 10:53 AM, Chris Cole wrote:
> I want to parse a FASTA file obtained from IPI using the code at
> the bottom of this message, but I get the following error:
> 
> org.biojava.bio.BioException: Could not read sequence
>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
>     at test.readFasta(test.java:39)
>     at test.main(test.java:18)
> Caused by: java.io.IOException: Mark invalid
>     at java.io.BufferedReader.reset(BufferedReader.java:485)
>     at org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:202)
>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
>     ... 2 more
> 
> Looking at the Fasta file itself and doing some tests, it seems to fail
> consistently at one or two entries /preceding/ an entry with a very long
> description line e.g.:
> >IPI:IPI00021421.4|SWISS-PROT:Q9UMR5-1|TREMBL:B0S868|ENSEMBL:ENSP00000382748;ENSP00000382749;ENSP00000382750;ENSP00000387679;ENSP00000388341;ENSP00000388618;ENSP00000389930;ENSP00000392885;ENSP00000393009;ENSP00000395242;ENSP00000395562;ENSP00000397025;ENSP00000399879;ENSP00000403820;ENSP00000406496;ENSP00000406566;ENSP00000408703;ENSP00000411007;ENSP00000411625;ENSP00000412827|REFSEQ:NP_005146|VEGA:OTTHUMP00000014775;OTTHUMP00000014776;OTTHUMP00000014778;OTTHUMP00000175028;OTTHUMP00000175029;OTTHUMP00000175030;OTTHUMP00000193135;OTTHUMP00000193136;OTTHUMP00000193138;OTTHUMP00000193964;OTTHUMP00000193965;OTTHUMP00000193967;OTTHUMP00000194391;OTTHUMP00000194392;OTTHUMP00000194394 Tax_Id=9606 Gene_Symbol=PPT2 Isoform 1 of Lysosomal thioesterase PPT2
> MLGLWGQRLPAAWVLLLLPFLPLLLLAAPAPHRASYKPVIVVHGLFDSSYSFRHLLEYIN
> ETHPGTVVTVLDLFDGRESLRPLWEQVQGFREAVVPIMAKAPQGVHLICYSQGGLVCRAL
> LSVMDDHNVDSFISLSSPQMGQYGDTDYLKWLFPTSMRSNLYRICYSPWGQEFSICNYWH
> DPHHDDLYLNASSFLALINGERDHPNATVWRKNFLRVGHLVLIGGPDDGVITPWQSSFFG
> FYDANETVLEMEEQLVYLRDSFGLKTLLARGAIVRCPMAGISHTAWHSNRTLYETCIEPW
> LS
> 
> Deleting the large entries allows the code to continue until it reaches
> another long description line.
> 
> The failure also seems to be specific to large FASTA files: reading the
> above sequence alone, or as part of a small file, works fine.
> 
> Is this a known problem, or am I doing something wrong? BTW, I'm using
> BioJava 1.7 and Java 1.6.0_17.
> Any help would be most appreciated.
> Cheers.
> 
> code:
> import java.io.*;
> 
> import org.biojava.bio.*;
> import org.biojavax.*;
> import org.biojavax.bio.seq.*;
> 
> public class test {
>    private static PrintStream o = System.out;
> 
>    public static void main(String[] args) {
>       // expects the path to the FASTA file as the first argument
>       readFasta(args[0]);
>    }
>     
>    public static void readFasta(String filename) {
>       try {
>          o.println("Reading file: " + filename);
>          //prepare a BufferedReader for file io
>          BufferedReader br = new BufferedReader(new FileReader(filename));
> 
>          // read Fasta file as BioJava RichSequence object
>          Namespace ns = RichObjectFactory.getDefaultNamespace();
>          RichSequenceIterator iter =
>             RichSequence.IOTools.readFastaProtein(br, ns);
> 
>          int numProteins = 0;
>          while(iter.hasNext()) {
>             ++numProteins;
> 
>             // Retrieve sequence and description data
>             RichSequence seq = iter.nextRichSequence();
>             String ipi = seq.getName().substring(4,15);
>             o.println(ipi);
>            
>          }
>          o.println("Found " + numProteins + " in Fasta file");
>       } catch (FileNotFoundException ex) {
>          // can't find file specified by args[0]
>          ex.printStackTrace();
>       } catch (BioException ex) {
>          // error parsing requested format
>          ex.printStackTrace();
>       }
>    }
> }


