[Biojava-l] Reading a fasta file which is not encoded in ansi

Richard Holland richard.holland at ebi.ac.uk
Fri Apr 28 13:19:30 UTC 2006


Thinking about this a bit more, I think you meant ASCII when you said
ANSI?

FASTA format is very strictly defined. It is a file containing a number of
sequences, each with its own header, which starts with a '>' symbol.
You can indeed use any character you like within the header, which ends
at the first new-line after the '>' (newline is ASCII 10 or 13, or both,
depending on your OS). No whitespace is allowed at the start or end of
the file or between or within sequences.
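
To make the layout described above concrete, here is a minimal Java sketch
that checks that the text begins with '>' and collects the header lines. The
class and method names (FastaSketch, headers) are made up for illustration
and are not part of BioJava; it is not a full FASTA parser.

```java
import java.util.ArrayList;
import java.util.List;

public class FastaSketch {

    /**
     * Returns every header line (without the leading '>') found in the text.
     * Rejects input whose very first character is not '>', which is the
     * situation described in this thread.
     */
    public static List<String> headers(String fasta) {
        if (fasta.isEmpty() || fasta.charAt(0) != '>') {
            throw new IllegalArgumentException(
                "Text does not start with a '>' header");
        }
        List<String> result = new ArrayList<>();
        // A newline may be LF (10), CR (13), or CR followed by LF.
        for (String line : fasta.split("\r\n|\r|\n")) {
            if (line.startsWith(">")) {
                result.add(line.substring(1));
            }
        }
        return result;
    }
}
```

Anything in front of the first '>' makes the check fail, even if the
characters are invisible in a text editor.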

The problem with your file is that the unusual characters are appearing
at the start of the file before the first header, and maybe also during
the sequence itself although I didn't look that closely. Hence it breaks
the FASTA format specification.

The problem here lies with the program that is generating your FASTA
file. BioJava is behaving correctly.
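
A note on the "ÿþ" in the error message: that is how the bytes 0xFF 0xFE
appear when displayed as Latin-1, and those two bytes are the UTF-16
little-endian byte-order mark (BOM). So the file was quite possibly saved as
UTF-16 rather than ASCII. If regenerating the file is not an option, one
possible workaround is to choose the charset from the BOM and decode
accordingly. The following is only a sketch using the plain JDK; the names
BomAwareFasta and open are invented here and are not BioJava API, and the
fallback to UTF-8 when no BOM is found is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomAwareFasta {

    /**
     * Wraps the stream in a reader whose charset is chosen from a leading
     * byte-order mark. The BOM itself is consumed. UTF-32 is not handled.
     */
    public static BufferedReader open(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop.
        while (n < head.length) {
            int r = pb.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }

        Charset charset = StandardCharsets.UTF_8; // assumed default, no BOM
        int bomLength = 0;
        if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            charset = StandardCharsets.UTF_16LE;
            bomLength = 2;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            charset = StandardCharsets.UTF_16BE;
            bomLength = 2;
        } else if (n == 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            charset = StandardCharsets.UTF_8;
            bomLength = 3;
        }

        // Put back whatever was read beyond the BOM.
        if (n > bomLength) {
            pb.unread(head, bomLength, n - bomLength);
        }
        return new BufferedReader(new InputStreamReader(pb, charset));
    }
}
```

The returned BufferedReader could then be handed to
SeqIOTools.readFastaDNA, assuming the overload that takes a BufferedReader
is available in your BioJava version. Fixing the program that writes the
file remains the cleaner solution.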

cheers,
Richard

On Fri, 2006-04-28 at 15:00 +0200, Ilhami Visne wrote:
> I had already thought about converting the file to ANSI. The sequence
> part must contain only ANSI characters, but the header or other
> annotation need not contain only ANSI characters. If I convert it to
> ANSI, might that not cause some data to be lost?
> 
> On 4/28/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
>         I've no idea what binary format that file is in - it contains
>         some very strange characters. It appears to contain _some_ ANSI
>         data but with extra binary bits added to the start and end. I
>         think you need to check the program that generated the file as
>         it is obviously not doing what it is supposed to.
>         
>         Your best bet is to convert the file to ANSI or some other
>         format understood out-of-the-box by Java.
>         
>         cheers,
>         Richard
>         
>         On Fri, 2006-04-28 at 11:09 +0200, Ilhami Visne wrote:
>         > I have a file in FASTA format which is not encoded in ANSI, but
>         > it seems OK. It can be downloaded here:
>         > http://stud3.tuwien.ac.at/~e0125935/try3.fasta
>         > I tried to read it with SeqIOTools.readFastaDNA and this
>         > exception was thrown:
>         >
>         > org.biojava.bio.BioException: Could not read sequence
>         >     at org.biojava.bio.seq.io.StreamReader.nextSequence(StreamReader.java:104)
>         > ..............
>         > ..............
>         > Caused by: java.io.IOException: Stream does not appear to contain FASTA formatted data: ÿþ>
>         >     at org.biojava.bio.seq.io.FastaFormat.readSequence(FastaFormat.java:112)
>         >     at org.biojava.bio.seq.io.StreamReader.nextSequence(StreamReader.java:101)
>         >
>         > No line like "ÿþ>" is visible in the file, but it seems to be
>         > hidden.
>         >
>         > How should I handle such files?
>         >
>         > Thanks in advance.
>         >
>         > _______________________________________________
>         > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>         > http://lists.open-bio.org/mailman/listinfo/biojava-l
>         >
>         --
>         Richard Holland (BioMart Team)
>         EMBL-EBI
>         Wellcome Trust Genome Campus
>         Hinxton
>         Cambridge CB10 1SD
>         UNITED KINGDOM
>         Tel: +44-(0)1223-494416
>         
> 
-- 
Richard Holland (BioMart Team)
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UNITED KINGDOM
Tel: +44-(0)1223-494416



