[Biojava-l] BioJava-X parsing of RichSequences

mark.schreiber at novartis.com mark.schreiber at novartis.com
Wed May 3 01:19:02 UTC 2006


Ola Spjuth <ola.spjuth at farmbio.uu.se>
Sent by: biojava-l-bounces at lists.open-bio.org
05/02/2006 09:15 PM

 
        To:     biojava-l <biojava-l at biojava.org>
        cc:     (bcc: Mark Schreiber/GP/Novartis)
        Subject:        [Biojava-l] BioJava-X parsing of RichSequences


> 1) I'd like to use Biojava-X with Bioclipse. Are there any problems
> running it with Java 1.5 (as is required by Bioclipse)?

Shouldn't be a problem. BioJava-X doesn't use any Java 1.5 language 
features, but JDK 1.5 (JRE 5.0) can compile and run BioJava.

>2) I would propose the addition of a readStream(...) method in
>RichSequence.IOTools in addition to readFile(...). For the Bioclipse
>project it would be most useful to be able to guess the format of a
>Stream. As IOTools is marked final it cannot be subclassed.

The reason there is no such method is that format guessing involves 
reading some data from the source and then either pushing it back or 
re-opening the source once the format has been guessed. You cannot 
guarantee that a Stream supports pushback, and you cannot guarantee that 
you could re-open it. As a hack you could read the stream into a temp file 
and pass that to IOTools. You may also be able to read it into a byte 
array (for example via a ByteArrayOutputStream) and read that back as a 
Stream.
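
The two workarounds above could look roughly like the sketch below. It 
uses only java.io; the class StreamBuffering and its method names are 
made up for illustration and are not part of BioJava.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper (not a BioJava class): makes a one-shot stream re-readable.
final class StreamBuffering {

    private StreamBuffering() {}

    // Copy every remaining byte of 'in' to 'out'; neither stream is closed here.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
    }

    // Workaround 1: spool the stream to a temp file, which can be re-opened at will.
    static File toTempFile(InputStream in) throws IOException {
        File tmp = File.createTempFile("seq", ".tmp");
        tmp.deleteOnExit();
        OutputStream out = new FileOutputStream(tmp);
        try {
            copy(in, out);
        } finally {
            out.close();
        }
        return tmp;
    }

    // Workaround 2: keep the whole stream in memory as a byte array.
    static byte[] toByteArray(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        copy(in, out);
        return out.toByteArray();
    }
}
```

The File returned by toTempFile(...) is the kind of argument readFile(...) 
expects, and a fresh ByteArrayInputStream can be wrapped around the byte 
array as many times as needed. The in-memory variant is only sensible when 
the input is known to be small.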

>3) Is HashBioEntryDB a suitable base object for storing 1-N
>RichSequences in memory or should I use RichSequence[]? Which solution
>has the simplest toByte() method for writing to e.g. a File?
>
>So, basically I am looking for the most convenient way of doing:
>
>i)   Read byte[] (from a File containing 1-N sequences) into a base
>object in memory (HashBioEntryDB or RichSequence[])
>ii) Write the (HashBioEntryDB or RichSequence[]) to byte[] (and then
>later to File using Bioclipse-methods)
>

The simplest way to read in and write out directly is to take the 
RichSequenceIterator you get from the IOTools read method and pass it 
directly to the IOTools write method of choice. If you want to manipulate 
data in between, a RichSequence[] is probably smaller in memory but not as 
user-friendly as a DB object.

You should also be aware that RichSequenceIterators are lazy, i.e. they 
only read data from a file for each request to nextRichSequence(), so you 
can manipulate each sequence as it comes in without having to worry about 
running out of memory.
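
To illustrate what "lazy" means here, the sketch below reads one 
FASTA-style record per call using only java.io. It is not how 
RichSequenceIterator is implemented; LazyFastaReader and its methods are 
hypothetical names chosen for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.NoSuchElementException;

// Hypothetical illustration of lazy reading (not BioJava code).
final class LazyFastaReader {
    private final BufferedReader reader;
    private String pendingHeader; // header line of the next unread record, or null at end

    LazyFastaReader(BufferedReader reader) throws IOException {
        this.reader = reader;
        // Skip anything before the first '>' header; no sequence data is read yet.
        String line;
        while ((line = reader.readLine()) != null && !line.startsWith(">")) {
            // ignore leading lines that are not headers
        }
        pendingHeader = line;
    }

    boolean hasNext() {
        return pendingHeader != null;
    }

    // Reads exactly one record and returns { name, residues }.
    // Input is consumed only up to the header line of the following record.
    String[] next() throws IOException {
        if (pendingHeader == null) {
            throw new NoSuchElementException();
        }
        String name = pendingHeader.substring(1).trim();
        StringBuilder residues = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null && !line.startsWith(">")) {
            residues.append(line.trim());
        }
        pendingHeader = line;
        return new String[] { name, residues.toString() };
    }
}
```

Because only one record is held at a time, a caller can process and 
discard each record inside a while (reader.hasNext()) loop, which is the 
same pattern that keeps memory use flat with nextRichSequence().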

Hope this helps,

- Mark





