[Biojava-l] [Biojava-dev] [Fwd: large genbank data]

Mark Schreiber markjschreiber at gmail.com
Fri Jul 18 13:17:28 UTC 2008


Was looking on the internet ...

So the Java spec says nothing about an upper limit on String length.
However, the Sun JDK implements String as a char[] (behind the scenes),
and Java arrays are indexed by int. Therefore I think that on the Sun
JDK, with the right amount of RAM, you could go to 2^31 - 1 (except for
string literals, as James mentioned), which is 2,147,483,647
characters. So a string of a sequence should be able to get to about 2
billion bases.
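
To make the arithmetic concrete, here is a small sketch. The class and
method names are made up for illustration, and it assumes a
char[]-backed String at 2 bytes per char, as on the Sun JDK discussed
here.

```java
class StringLimit {
    // Largest length an int-indexed Java array can have, so also the
    // ceiling for a char[]-backed String: 2^31 - 1.
    static final int MAX_CHARS = Integer.MAX_VALUE;

    // Bytes needed for the characters alone, assuming 2 bytes per char.
    // Object headers and any copies made along the way are not counted.
    static long charBytes(long bases) {
        return bases * 2L;
    }

    public static void main(String[] args) {
        System.out.println(MAX_CHARS);            // 2147483647
        System.out.println(charBytes(MAX_CHARS)); // 4294967294
    }
}
```

So a String at the maximum length would need close to 4GB of heap for
its characters alone.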

Of course, if you don't assign enough memory to the JVM (e.g. -Xmx4G)
you won't be able to get close. And even if you can assign that much,
that doesn't account for all the other Java overhead and all the stuff
Hibernate is doing with proxy classes etc. Also, BioSQL usually
defines the sequence as a CLOB, so depending on your DB implementation
there may be a limit on that. On a 32-bit machine a single process can
address at most 4GB (and the JVM heap usually gets less than that), so
you would have issues trying to do anything bigger.
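
If you want a quick sanity check before loading, something along these
lines would do. It is only a rough sketch: the class and method names
are hypothetical, and the 2-bytes-per-base figure is a lower bound for
the String alone. The BioJava symbol list and whatever Hibernate
copies will need more on top of that.

```java
class HeapCheck {
    // Lower bound on memory for a sequence held as a char[]-backed
    // String: 2 bytes per base (an assumption, see above).
    static long estimateBytes(long bases) {
        return 2L * bases;
    }

    // True if the estimate is strictly below the given heap ceiling.
    static boolean fitsInHeap(long bases, long maxHeapBytes) {
        return estimateBytes(bases) < maxHeapBytes;
    }

    public static void main(String[] args) {
        // The most the JVM will try to use, as governed by -Xmx.
        long maxHeap = Runtime.getRuntime().maxMemory();
        long bases = 5528445L; // length of AE005174 in the log quoted below
        System.out.println("max heap bytes: " + maxHeap);
        System.out.println("estimated bytes: " + estimateBytes(bases));
        System.out.println("fits: " + fitsInHeap(bases, maxHeap));
    }
}
```

For a 5.5 Mb sequence the estimate is only about 11MB, so the String
length limit itself is unlikely to be what is failing in that case.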

Anyhow, I know I have stored human chromosome 1 (approx. 250 million
bases) in memory.



- Mark

On Fri, Jul 18, 2008 at 6:45 PM, James Carman
<james at carmanconsulting.com> wrote:
> That is a limitation for string literals, not any string.  Correct?
>
> On Fri, Jul 18, 2008 at 4:47 AM, Richard Holland
> <dicknetherlands at gmail.com> wrote:
>> In order to persist to BioSQL, BioJava has to convert the symbol list
>> into a string so that it can pass it to JDBC via Hibernate. Therefore
>> the maximum length of a sequence you wish to persist to BioSQL is the
>> maximum length of a string in Java, which is 65536 (2^16) if you are
>> working in a UTF-8 environment.
>>
>> 2008/7/18 Rey Vincent Babilonia <rvincent at asti.dost.gov.ph>:
>>> Hi Mark,
>>>
>>> What is the maximum sequence length that a RichSequence can handle?
>>>
>>> java -Xms1024m -Xmx1256m -jar loader.jar
>>> .
>>> 16:09:00,173  INFO Loader:296 - D:\AE005174.gbk is readable.
>>> 16:09:06,704  INFO Loader:326 - Loading sequence AE005174 with identifier
>>> 56384585, length 5528445 and alphabet DNA...
>>> org.hibernate.PropertyAccessException: Exception occurred inside getter of
>>> org.biojavax.bio.seq.SimpleRichSequence.sequenceLength
>>>
>>> Rey Vincent Babilonia wrote:
>>>>
>>>> Hi Mark,
>>>>
>>>> At first it throws an out of memory exception. My workaround is to
>>>> subdivide the sequence file into individual GenBank files.
>>>>
>>>> The error now is that if a GenBank sequence has an 'empty alphabet', it
>>>> does not get loaded to BioSQL. My workaround is to check if
>>>> sequence.getAlphabet().getName() is DNA.
>>>>
>>>> Thanks.
>>>>
>>>> Mark Schreiber wrote:
>>>>>
>>>>> Hi -
>>>>>
>>>>> Is the code throwing an exception or running out of memory??
>>>>>
>>>>> Can you send an example program and the problem you encounter to the
>>>>> list.
>>>>> - Mark
>>>>>
>>>>> On Thu, May 29, 2008 at 9:53 AM, Rey Vincent Babilonia
>>>>> <rvincent at asti.dost.gov.ph> wrote:
>>>>>>
>>>>>> -------- Original Message --------
>>>>>> Subject: large genbank data
>>>>>> Date: Wed, 28 May 2008 18:02:48 +0800
>>>>>> From: Rey Vincent Babilonia <rvincent at asti.dost.gov.ph>
>>>>>> To: biojava-l at biojava.org
>>>>>>
>>>>>> hi,
>>>>>>
>>>>>> anybody tried uploading a large genbank data (e.g.
>>>>>> ftp://bio-mirror.net/biomirror/genbank/gbbct1.seq.gz) to biosql?
>>>>>> load_seqdatabase.pl of bioperl can do this. i'm switching to biojava and
>>>>>> it can't read the sequence (maybe because it has 30000+ sequences).
>>>>>>
>>>>>> thanks.
>>>>>>
>>>>>> --
>>>>>> /**
>>>>>>  * @author   Rey Vincent P. Babilonia
>>>>>>  * @number   +63 2 426 9760 local 1302
>>>>>>  * @pgp      0x383454CF <at> pgp.mit.edu
>>>>>>  * @project  Philippine Bioinformatics Solutions
>>>>>>  * @program  Philippine e-Science Grid
>>>>>>  * @division Research and Development Division
>>>>>>  * @agency   Advanced Science and Technology Institute
>>>>>>  * @url      http://www.psigrid.gov.ph
>>>>>>  */
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> biojava-dev mailing list
>>>>>> biojava-dev at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>>>>>
>>>>>
>>>>
>>>
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>


