[Biojava-dev] Flat File genomic Indexing with OBDA

Matthew Pocock matthew_pocock at yahoo.co.uk
Thu Mar 27 21:29:52 EST 2003


Hi,

Do you have a stack trace for this? The error message
is ... interesting.

Fasta has record start identifiers (> at the start of
a line) but no record end (the record ends when the
next one starts or you hit EOF). Also, header lines
can be very varied in length. So, we do some
skulduggery to work out when header lines end and
sequence lines run into them. It is possible that the
code handles sequence lines too cautiously and so
overflows the push-back buffer when a line is very
long. Odd.
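
To make the record-boundary point concrete, here is a rough sketch of
finding fasta record starts by streaming bytes, so that line length never
matters and no push-back buffer is involved. This is not the BioJava
indexer code; the class and method names (FastaOffsetSketch,
indexOffsets) are made up for illustration, and it assumes the record ID
is the first whitespace-delimited token immediately after the '>'.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not part of BioJava.
public class FastaOffsetSketch {

    /** Maps each record ID to the byte offset of the '>' that starts the record. */
    public static Map<String, Long> indexOffsets(InputStream raw) throws IOException {
        Map<String, Long> offsets = new LinkedHashMap<>();
        InputStream in = new BufferedInputStream(raw);
        long pos = 0;               // offset of the byte currently being examined
        long recordStart = 0;       // offset of the most recent '>' at a line start
        boolean atLineStart = true; // true for the first byte and after any line break
        boolean inHeader = false;   // true while we are inside a '>' header line
        boolean idDone = false;     // true once the ID token has ended
        StringBuilder id = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            boolean lineBreak = (b == '\n' || b == '\r');
            if (inHeader) {
                if (lineBreak) {
                    offsets.put(id.toString(), recordStart);
                    inHeader = false;
                } else if (!idDone) {
                    if (b == ' ' || b == '\t') {
                        idDone = true;
                    } else {
                        id.append((char) b);
                    }
                }
            } else if (atLineStart && b == '>') {
                // A record starts here; the previous record (if any) ends implicitly.
                inHeader = true;
                idDone = false;
                recordStart = pos;
                id.setLength(0);
            }
            atLineStart = lineBreak;
            pos++;
        }
        if (inHeader) {
            // The file ended inside a header line with no trailing newline.
            offsets.put(id.toString(), recordStart);
        }
        return offsets;
    }

    public static void main(String[] args) throws IOException {
        String fasta = ">a desc\nACGT\n>b\nTTGA\n";
        InputStream in = new ByteArrayInputStream(fasta.getBytes(StandardCharsets.US_ASCII));
        System.out.println(indexOffsets(in));
    }
}
```

Only the header ID is ever held in memory, so a 230 kb sequence on one
line costs no more than a short one.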

Enlarging the buffer (as you did) is a bit of a hack.
It will obviously be bad for applications that are
reading short-line fasta files on small-footprint VMs.

Anyway, a stack trace, a dummy file and an example
command line would be helpful. It's hopefully
something dumb. If you send me a tarball (or url for
one) then I'll take a look as I get time.

As for resource usage on your server, you should be
fine. DNA sequences that are a few megabases in size
get packed as binary internally. I've heard of people
holding entire small genomes (e.g. Drosophila) in
memory for web services.
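
For a feel of why packing helps, here is a minimal sketch of one common
scheme: two bits per base, four bases per byte. I am not claiming this is
the exact layout BioJava uses internally; the class name
(TwoBitPackSketch) is made up, and the sketch assumes an unambiguous
A/C/G/T alphabet, so symbols such as N are rejected.

```java
// Hypothetical sketch, not part of BioJava.
public class TwoBitPackSketch {

    /** Returns the 2-bit code for an unambiguous DNA base. */
    private static int code(char base) {
        switch (Character.toUpperCase(base)) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default:
                throw new IllegalArgumentException("Unsupported symbol: " + base);
        }
    }

    /** Packs a DNA string into ceil(length / 4) bytes. */
    public static byte[] pack(String dna) {
        byte[] packed = new byte[(dna.length() + 3) / 4];
        for (int i = 0; i < dna.length(); i++) {
            // Base i occupies bits (i % 4) * 2 and (i % 4) * 2 + 1 of byte i / 4.
            packed[i / 4] |= code(dna.charAt(i)) << ((i % 4) * 2);
        }
        return packed;
    }

    /** Reads back the base at the given 0-based position. */
    public static char baseAt(byte[] packed, int index) {
        int bits = (packed[index / 4] >> ((index % 4) * 2)) & 0x3;
        return "ACGT".charAt(bits);
    }

    public static void main(String[] args) {
        byte[] packed = pack("ACGTAC");
        System.out.println(packed.length + " bytes, base 2 = " + baseAt(packed, 2));
    }
}
```

At two bits per base, a few megabases fit in roughly a megabyte, against
one byte or more per base for a plain string.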

Matthew

--- "Sicotte, Hugues (NIH/NCI)" <sicotteh at mail.nih.gov> wrote:
>
> I tried to use the indexer org.biojava.app.BioFlatIndex on a really
> long genomic sequence and it doesn't work.
> (It worked on my small test sequences, but it doesn't like 230 kb
> sequences!)
> 
> I'm running on a Solaris machine with 4 GB of RAM and an extra 5 GB of
> swap space.
> I run java as follows (to increase the memory of the JVM to 2 GB):
> java -Xms2000m -Xmx2000m org.biojava.app.BioFlatIndex -c -a dna -l
> /usr/tmp/ -d humgen -i flat -f fasta /usr/tmp/long.fa
> 
> My test sequence is 230 thousand nucleotides long, on a single line.
> The error message is '46'.
> I added code to catch 'out of memory' errors, and it's not that.
> 
> I want to write a servlet to retrieve small chunks of the human genome.
> I want to use the indexing to get the offset into a file, and then use
> the start/stop to figure out an additional offset into the file. [I
> already wrote a class that implements that method by extending
> FlatSequenceDB.java, but that is beyond the scope of this bug.]
> 
> So my fasta files have all the sequence on one really LONG line (like
> formatdb for blast does).
> [I was sad that the specification didn't require storing fasta on a
> single line. :( ]
> 
> 
> Do you have any idea which section of biojava has a limitation on
> line length (and would spit out an error code '46')?
> I started debugging, but it's a real nightmare since I am not too
> familiar with the biojava data model.
> 
> Hugues Sicotte
> 
> p.s. I'm using the most recent biojava cvs dump from
> last week.
