[Biojava-l] WriteFasta

Saif Ur-Rehman su24 at st-andrews.ac.uk
Fri Oct 5 13:44:29 UTC 2007


Setting the System properties solved the problem.

Thanks a lot,

Saif

Quoting Richard Holland <holland at ebi.ac.uk>:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Great, thanks.
>
> The initial analysis shows that the text file generated contains four
> extra characters at the beginning of the file, and is using '\n' as the
> line separator.
>
> This is a hex dump of the file:
>
> 00000000  ac ed 00 05 3e 67 69 7c  31 38 33 39 38 33 39 30  |....>gi|18398390|
> 00000010  7c 6c 63 6c 7c 4e 50 5f  35 36 35 34 31 33 2e 31  ||lcl|NP_565413.1|
> 00000020  7c 4e 50 5f 35 36 35 34  31 33 20 75 6e 6b 6e 6f  ||NP_565413 unkno|
> 00000030  77 6e 20 70 72 6f 74 65  69 6e 20 5b 41 72 61 62  |wn protein [Arab|
> 00000040  69 64 6f 70 73 69 73 20  74 68 61 6c 69 61 6e 61  |idopsis thaliana|
> 00000050  5d 0a 4d 53 4c 52 49 4b  4c 56 56 44 4b 46 56 45  |].MSLRIKLVVDKFVE|
> 00000060  45 4c 4b 51 41 4c 44 41  44 49 51 44 52 49 4d 4b  |ELKQALDADIQDRIMK|
> 00000070  45 52 45 4d 51 53 59 49  58 58 58 58 58 58 58 58  |EREMQSYIXXXXXXXX|
> 00000080  58 58 58 58 58 57 4b 41  45 4c 53 52 52 45 54 45  |XXXXXWKAELSRRETE|
> 00000090  49 41 52 51 45 41 52 4c  4b 4d 45 52 45 4e 4c 45  |IARQEARLKMERENLE|
> 000000a0  4b 45 0a 4b 53 56 4c 4d  47 54 41 53 4e 51 44 4e  |KE.KSVLMGTASNQDN|
> 000000b0  51 44 47 41 4c 45 49 54  56 53 47 45 4b 59 52 43  |QDGALEITVSGEKYRC|
> 000000c0  4c 52 46 53 4b 41 4b 4b  0a                       |LRFSKAKK.|
>
>
> The four extra characters are hex #ac #ed #00 #05 and these are showing
> as question marks in your text editor because that's how text editors
> handle unprintable characters.
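
A quick way to confirm what a file really starts with, independent of any
text editor, is to print its leading bytes as hex. The following is only a
minimal sketch; the class and method names (LeadingBytes, hexPrefix,
startsWithFastaHeader) are made up for illustration and are not part of
BioJava.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of BioJava.
final class LeadingBytes {

    /** Returns up to count leading bytes of the file as space-separated lowercase hex. */
    static String hexPrefix(Path file, int count) throws IOException {
        byte[] buf = new byte[count];
        int n = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested, so loop until done or EOF.
            while (n < count) {
                int r = in.read(buf, n, count - n);
                if (r < 0) {
                    break;
                }
                n += r;
            }
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", buf[i] & 0xff));
        }
        return sb.toString();
    }

    /** A FASTA file is expected to begin with '>' (0x3e). */
    static boolean startsWithFastaHeader(Path file) throws IOException {
        return hexPrefix(file, 1).equals("3e");
    }
}
```

For a file like the one dumped above, hexPrefix(file, 4) would give
"ac ed 00 05" and startsWithFastaHeader(file) would be false.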
>
> Does anyone recognise these characters? There is no code in BioJava
> which writes anything like this, in fact there is no output code at all
> before the initial write of the first > symbol in the file. Something
> tells me that these symbols are being inserted by the VM or the OS
> somewhere under the hood, possibly due to internationalisation?
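
For what it is worth, ac ed 00 05 is exactly the header that
java.io.ObjectOutputStream writes when it is constructed: the two-byte
ObjectStreamConstants.STREAM_MAGIC (0xaced) followed by the two-byte
STREAM_VERSION (5). Whether an ObjectOutputStream is involved anywhere in
this particular write path is an assumption that the thread does not
confirm. The sketch below (the class name SerializationHeader is
hypothetical) just shows where those four bytes can come from.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

// Hypothetical demo class, not part of BioJava.
final class SerializationHeader {

    /** Returns the bytes emitted by merely constructing and flushing an ObjectOutputStream. */
    static byte[] headerBytes() throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // The constructor writes the serialization stream header to the sink.
        ObjectOutputStream oos = new ObjectOutputStream(sink);
        oos.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (byte b : headerBytes()) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        System.out.println(sb.toString().trim()); // prints "ac ed 00 05"
    }
}
```

No object needs to be written: wrapping an output stream in an
ObjectOutputStream is enough to put these bytes in front of whatever is
written to the underlying stream afterwards.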
>
> I strongly suspect this is an internationalisation problem. It seems
> probable that Java has been set up on your system to use a language or
> character encoding that causes Java by default to write these extra
> characters at the start of files to indicate the encoding. Check the
> output of:
>
> System.getProperty("file.encoding");
>
> to see if it is using something other than UTF-8. If it is, then chances
> are that this is the problem.
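
To inspect these settings from inside the same JVM that produces the
files, something like the following could be used. It is a sketch only;
note that the standard property name is file.encoding, and the class name
EncodingCheck is hypothetical.

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic class, not part of BioJava.
final class EncodingCheck {

    /** Renders each char of s as four hex digits, e.g. "\n" becomes "000a". */
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    /** Summarises the encoding-related settings of the running JVM. */
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name()
                + ", line.separator=" + hex(System.getProperty("line.separator"));
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

The line separator is printed as hex because it is invisible otherwise;
000a is "\n" and 000d 000a is "\r\n".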
>
> We've had internationalisation problems before with BioJava. Hopefully
> these will be addressed in future development, but there is no current
> activity in that area due to lack of resources. In the meantime the best
> workaround is to set every setting you can find to a Western European
> character set/character mapping and UTF-8 file encoding, in the hope
> that it will all match up nicely and work.
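
Independently of platform settings, plain Java code can avoid the platform
default by naming the charset and the line separator explicitly when
writing. The sketch below does not use BioJava at all;
ExplicitEncodingWriter and writeRecord are hypothetical names, and the
line width is a caller-supplied assumption, not something from this
thread.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of BioJava.
final class ExplicitEncodingWriter {

    /** Writes one FASTA record as UTF-8 with '\n' line ends, wrapping residues at width. */
    static void writeRecord(Path target, String header, String residues, int width)
            throws IOException {
        if (width <= 0) {
            throw new IllegalArgumentException("width must be positive");
        }
        // Charset is named explicitly, so file.encoding has no influence here.
        try (Writer out = new BufferedWriter(new OutputStreamWriter(
                Files.newOutputStream(target), StandardCharsets.UTF_8))) {
            out.write('>');
            out.write(header);
            out.write('\n');
            for (int i = 0; i < residues.length(); i += width) {
                out.write(residues, i, Math.min(width, residues.length() - i));
                out.write('\n');
            }
        }
    }
}
```

Because nothing here depends on the default charset or on
line.separator, the output bytes are the same on every platform.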
>
> cheers,
> Richard
>
> Saif Ur-Rehman wrote:
> > Dear Richard,
> >
> > The input file is just the entire set of RefSeq proteins for Arabidopsis
> > thaliana and is too large for me to send as an attachment. But it can be
> > downloaded from NCBI using the query
> > "Arabidopsis thaliana [orgn] srcdb_refseq[prop]".
> >
> > Cheers,
> >
> > Saif
> >
> >
> >
> > Quoting Richard Holland <holland at ebi.ac.uk>:
> >
> > Interesting. Could you send your input file as well?
> >
> > cheers,
> > Richard
> >
> > Saif Ur-Rehman wrote:
> >>>> Dear Richard,
> >>>>
> >>>> The sequences are being read by SeqIOTools.readFastaProtein. The code
> >>>> from read to write is as follows. Essentially the program reads in a
> >>>> fasta file containing all the protein sequences in a given organism and
> >>>> splits them up into one file per protein.
> >>>>
> >>>>
> >>>> BufferedReader br = null;
> >>>> try
> >>>> {
> >>>>     br = new BufferedReader(new FileReader(filename));
> >>>> }
> >>>> catch (FileNotFoundException e1)
> >>>> {
> >>>>     e1.printStackTrace();
> >>>> }
> >>>>
> >>>> SequenceIterator stream = SeqIOTools.readFastaProtein(br);
> >>>> while (stream.hasNext())
> >>>> {
> >>>>     try
> >>>>     {
> >>>>         Sequence seq = stream.nextSequence();
> >>>>         File scriptFile1 = new File("///Users/Saif/Organisms/RunTemp/"
> >>>>                 + name + "/" + seq.getName());
> >>>>
> >>>>         try
> >>>>         {
> >>>>             scriptFile1.createNewFile();
> >>>>         }
> >>>>         catch (IOException e1)
> >>>>         {
> >>>>             e1.printStackTrace();
> >>>>         }
> >>>>
> >>>>         try
> >>>>         {
> >>>>             FileWriter fstream = new FileWriter(scriptFile1.getAbsolutePath());
> >>>>             BufferedWriter out = new BufferedWriter(fstream);
> >>>>
> >>>>             FileOutputStream f = new FileOutputStream(scriptFile1);
> >>>>
> >>>>             RichSequence rs = RichSequence.Tools.enrich(seq);
> >>>>
> >>>>             try
> >>>>             {
> >>>>                 RichSequence.IOTools.writeFasta(
> >>>>                         f,
> >>>>                         rs,
> >>>>                         RichObjectFactory.getDefaultNamespace()
> >>>>                         );
> >>>>             }
> >>>>             catch (IOException ioe) {}
> >>>>
> >>>> An example of an outputted fasta file from this code is attached.
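
For comparison, the split-into-one-file-per-record step described above
can be sketched in plain Java without BioJava. This is a swapped-in
approach, not the poster's code: it copies header and sequence lines
verbatim and does no alphabet checking. FastaSplitter, split and
fileNameFor are hypothetical names, and the file-naming scheme is an
assumption.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of BioJava.
final class FastaSplitter {

    /** Writes each record of a multi-FASTA file to its own file; returns the record count. */
    static int split(Path input, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        int count = 0;
        BufferedWriter out = null;
        // ISO-8859-1 maps every byte to exactly one char, so bytes pass through unchanged.
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.ISO_8859_1)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(">")) {
                    if (out != null) {
                        out.close(); // finish the previous record's file
                    }
                    count++;
                    out = Files.newBufferedWriter(
                            outDir.resolve(fileNameFor(line, count)),
                            StandardCharsets.ISO_8859_1);
                }
                if (out != null) { // lines before the first '>' are skipped
                    out.write(line);
                    out.write('\n');
                }
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
        return count;
    }

    /** Builds a file name from the first header token, replacing unsafe characters. */
    static String fileNameFor(String headerLine, int index) {
        String id = headerLine.substring(1).trim().split("\\s+")[0];
        String safe = id.replaceAll("[^A-Za-z0-9._-]", "_");
        return (safe.isEmpty() ? "record" : safe) + "_" + index + ".fasta";
    }
}
```

The record index is appended to each file name so that two records with
the same identifier cannot overwrite each other.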
> >>>>
> >>>>
> >>>>
> >>>> Thanks a lot for your time.
> >>>>
> >>>> Saif
> >>>>
> >>>>
> >>>> Quoting Richard Holland <holland at ebi.ac.uk>:
> >>>>
> >>>> Where are the input sequences coming from? i.e. what method are you
> >>>> using to construct them or read them from a file.
> >>>>
> >>>> Also, what do you mean by the 'front' of each write? Could you send me
> >>>> an example of an entire FASTA file containing the problem? (It'd be best
> >>>> to attach the file to an email to me personally as this list will not
> >>>> accept attachments, and copying-and-pasting from a text editor to an
> >>>> email client may obscure the underlying problem).
> >>>>
> >>>> It'd be good also to see your entire code from the point the sequences
> >>>> are read or created to the point where they are written out. Or, a
> >>>> sample program which exhibits the same behaviour would suffice.
> >>>>
> >>>> I suspect that the sequences themselves contain the incorrect data,
> >>>> although technically this should be impossible as the sequence alphabet
> >>>> should prevent it.
> >>>>
> >>>> We recently had an issue reported here regarding BioJava not being able
> >>>> to do certain sequence tasks on platforms using non-Western-European
> >>>> character mappings. If your machine is running such a mapping, try it
> >>>> again on a machine with an English or other Western European language
> >>>> set up by default. If it works there but not on your machine, then
> >>>> this'll be the same problem. (There is no solution yet, but at least
> >>>> you'll know what's wrong).
> >>>>
> >>>> cheers,
> >>>> Richard
> >>>>
> >>>> Saif Ur-Rehman wrote:
> >>>>>>> Dear Richard,
> >>>>>>>
> >>>>>>> I have tried the RichSequence.IOTools.writeFasta method and this method
> >>>>>>> is still appending the characters "??" to the front of each write. I am
> >>>>>>> using a FileOutputStream and a Sequence object as inputs to the method,
> >>>>>>> like so:
> >>>>>>>
> >>>>>>>
> >>>>>>>  Sequence seq; // read in from File
> >>>>>>>  FileOutputStream f =new FileOutputStream (fileName);
> >>>>>>>
> >>>>>>>
> >>>>>>> 			   try{
> >>>>>>>
> >>>>>>> 			    	RichSequence.IOTools.writeFasta(f,
> >>>>>>> 			    	        seq,
> >>>>>>> 			    	        RichObjectFactory.getDefaultNamespace()
> >>>>>>> 			    	        );
> >>>>>>>
> >>>>>>>
> >>>>>>> 			    }
> >>>>>>>
> >>>>>>>
> >>>>>>> Thanks a lot for your time
> >>>>>>>
> >>>>>>> Sincerely,
> >>>>>>>
> >>>>>>> Saif
> >>>>>>>
> >>>>>>> Quoting Richard Holland <holland at ebi.ac.uk>:
> >>>>>>>
> >>>>>>> SeqIOTools is deprecated.
> >>>>>>>
> >>>>>>> Try RichSequence.IOTools.writeFasta() instead to see if that helps.
> >>>>>>>
> >>>>>>> e.g.:
> >>>>>>>
> >>>>>>> RichSequence.IOTools.writeFasta(
> >>>>>>> 	System.out,
> >>>>>>> 	seq,
> >>>>>>> 	RichObjectFactory.getDefaultNamespace()
> >>>>>>> 	);
> >>>>>>>
> >>>>>>> where seq is either a Sequence or a SequenceIterator.
> >>>>>>>
> >>>>>>> cheers,
> >>>>>>> Richard
> >>>>>>>
> >>>>>>> Saif Ur-Rehman wrote:
> >>>>>>>>>> Dear All,
> >>>>>>>>>>
> >>>>>>>>>> I was writing to ask about the SeqIOTools.writeFasta() Method. I am
> >>>>>>>>>> currently trying to break up Fasta Files of whole organisms into one
> >>>>>>>>>> file per gene for further analysis. However the writeFasta method
> >>>>>>>>>> appears to append the characters
> >>>>>>>>>> "¨Ì
> >>>>>>>>>>
> >>>>>>>>>> ------------------------------------------------------------------
> >>>>>>>>>> University of St Andrews Webmail: https://webmail.st-andrews.ac.uk
> >>>>>>>>>>
> >>>>>>>>>> _______________________________________________
> >>>>>>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> >>>>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>>>>>>>>>
> >>>>>>> -------------------------------------------------------------------------------
> >>>>>>> Saif Ur-Rehman
> >>>>>>> Research Student
> >>>>>>> The Centre for Evolution, Genes & Genomics (CEGG)
> >>>>>>> Dyers Brae
> >>>>>>> School of Biology
> >>>>>>> The University of St Andrews
> >>>>>>> St Andrews,
> >>>>>>> Fife
> >>>>>>> Scotland,UK
> >>>>>>> ------------------------------------------------------------------
> >>>>>>> University of St Andrews Webmail: https://webmail.st-andrews.ac.uk
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iD8DBQFHBinR4C5LeMEKA/QRAqs9AJ9yzLmta3jFDoKWLVTXKgrdADnswQCeNDmb
> pxAPAybISoRQgbvQ1wyzqVg=
> =MS7P
> -----END PGP SIGNATURE-----
>





