[Biojava-l] [Biojava-dev] [Fwd: large genbank data]

Richard Holland dicknetherlands at gmail.com
Fri Jul 18 15:44:49 UTC 2008


Hmm, in that case it must be something else.

Your original mail only posted the first couple of lines of the stack
trace. Could you post the whole thing so we can take a closer look?
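
If the full trace is awkward to capture, the cause chain alone is often enough, since a wrapper such as PropertyAccessException usually hides the real error one or two levels down. A minimal sketch in plain Java, assuming nothing about Hibernate beyond the standard Throwable.getCause() contract; the class name CauseChain and the stand-in exceptions in main are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

class CauseChain {

    // Returns one "class: message" line per exception, outermost first.
    static List<String> describe(Throwable top) {
        List<String> lines = new ArrayList<String>();
        Throwable t = top;
        // The depth cap guards against a self-referencing cause chain.
        for (int depth = 0; t != null && depth < 50; depth++) {
            lines.add(t.getClass().getName() + ": " + t.getMessage());
            t = t.getCause();
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-ins for the real exceptions; not the actual Hibernate types.
        Exception inner = new IllegalStateException("inner problem");
        Exception outer = new RuntimeException("exception inside getter", inner);
        for (String line : describe(outer)) {
            System.out.println(line);
        }
    }
}
```

In the loader, calling describe(e) in the catch block around the persist call and logging each line should show what the getter actually threw.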

2008/7/18 Mark Schreiber <markjschreiber at gmail.com>:
> Was looking on the internet ...
>
> So the Java spec says nothing about an upper limit; however, the Sun JDK
> implements String as a char[] behind the scenes, and Java arrays are
> indexed by int. Therefore I think that on the Sun JDK, with enough RAM,
> you could go to 2^31 - 1 characters (except for string literals, as
> mentioned above), which is 2,147,483,647. So a string holding a sequence
> should be able to reach about 2 billion bases.
>
> Of course, if you don't assign enough memory to the JVM (e.g. -Xmx4G) you
> won't get close, since each char takes 2 bytes. Even if you can assign
> that much, it doesn't account for the other Java overhead and all the
> stuff Hibernate is doing with proxy classes etc. Also, BioSQL usually
> defines the sequence as a CLOB, so depending on your DB implementation
> there may be a limit on that too. On a 32-bit machine 4 GB is the most a
> single process can address, so you would have issues trying to do
> anything bigger.
>
> Anyhow, I know I have stored human chromosome 1 (approx. 250 million
> bases) in memory.
>
>
>
> - Mark
>
> On Fri, Jul 18, 2008 at 6:45 PM, James Carman
> <james at carmanconsulting.com> wrote:
>> That is a limitation for string literals, not any string.  Correct?
>>
>> On Fri, Jul 18, 2008 at 4:47 AM, Richard Holland
>> <dicknetherlands at gmail.com> wrote:
>>> In order to persist to BioSQL, BioJava has to convert the symbol list
>>> into a string so that it can pass it to JDBC via Hibernate. Therefore
>>> the maximum length of a sequence you wish to persist to BioSQL is the
>>> maximum length of a string in Java, which is 65536 (2^16) if you are
>>> working in a UTF-8 environment.
>>>
>>> 2008/7/18 Rey Vincent Babilonia <rvincent at asti.dost.gov.ph>:
>>>> Hi Mark,
>>>>
>>>> What is the maximum sequence length that a RichSequence can handle?
>>>>
>>>> java -Xms1024m -Xmx1256m -jar loader.jar
>>>> .
>>>> 16:09:00,173  INFO Loader:296 - D:\AE005174.gbk is readable.
>>>> 16:09:06,704  INFO Loader:326 - Loading sequence AE005174 with identifier
>>>> 56384585, length 5528445 and alphabet DNA...
>>>> org.hibernate.PropertyAccessException: Exception occurred inside getter of
>>>> org.biojavax.bio.seq.SimpleRichSequence.sequenceLength
>>>>
>>>> Rey Vincent Babilonia wrote:
>>>>>
>>>>> Hi Mark,
>>>>>
>>>>> At first it threw an out-of-memory error. My workaround was to
>>>>> split the sequence file into individual GenBank files.
>>>>>
>>>>> The error now is that if a GenBank sequence has an 'empty alphabet', it
>>>>> does not get loaded into BioSQL. My workaround is to check whether
>>>>> sequence.getAlphabet().getName() is DNA.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Mark Schreiber wrote:
>>>>>>
>>>>>> Hi -
>>>>>>
>>>>>> Is the code throwing an exception or running out of memory?
>>>>>>
>>>>>> Can you send an example program, and the problem you encounter, to
>>>>>> the list?
>>>>>> - Mark
>>>>>>
>>>>>> On Thu, May 29, 2008 at 9:53 AM, Rey Vincent Babilonia
>>>>>> <rvincent at asti.dost.gov.ph> wrote:
>>>>>>>
>>>>>>> -------- Original Message --------
>>>>>>> Subject: large genbank data
>>>>>>> Date: Wed, 28 May 2008 18:02:48 +0800
>>>>>>> From: Rey Vincent Babilonia <rvincent at asti.dost.gov.ph>
>>>>>>> To: biojava-l at biojava.org
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Has anybody tried uploading a large GenBank data file (e.g.
>>>>>>> ftp://bio-mirror.net/biomirror/genbank/gbbct1.seq.gz) to BioSQL?
>>>>>>> load_seqdatabase.pl from BioPerl can do this. I'm switching to BioJava
>>>>>>> and it can't read the file (maybe because it has 30000+ sequences).
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> --
>>>>>>> /**
>>>>>>>  * @author   Rey Vincent P. Babilonia
>>>>>>>  * @number   +63 2 426 9760 local 1302
>>>>>>>  * @pgp      0x383454CF <at> pgp.mit.edu
>>>>>>>  * @project  Philippine Bioinformatics Solutions
>>>>>>>  * @program  Philippine e-Science Grid
>>>>>>>  * @division Research and Development Division
>>>>>>>  * @agency   Advanced Science and Technology Institute
>>>>>>>  * @url      http://www.psigrid.gov.ph
>>>>>>>  */
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> biojava-dev mailing list
>>>>>>> biojava-dev at lists.open-bio.org
>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>> _______________________________________________
>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
>>
>
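
On the string-length question in the quoted thread, a minimal sketch in plain Java (no BioJava or Hibernate involved; the class name StringLimitCheck and the helper repeatBase are made up for illustration). It builds a runtime String the same length as AE005174, which is far past the 65,535-byte limit that applies only to string constants in class files, and prints the int bound on String.length() together with the heap the JVM was actually given via -Xmx:

```java
import java.util.Arrays;

class StringLimitCheck {

    // Builds a runtime String of n copies of 'A' (hypothetical helper).
    static String repeatBase(int n) {
        char[] bases = new char[n];
        Arrays.fill(bases, 'A');
        return new String(bases);
    }

    public static void main(String[] args) {
        // Same length as the sequence in the log; well over 65,535.
        String seq = repeatBase(5528445);
        System.out.println("length = " + seq.length());

        // String.length() returns an int, so this is the hard ceiling.
        System.out.println("upper bound = " + Integer.MAX_VALUE);

        // Heap available to this JVM, as set by -Xmx or the default.
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("max heap (MB) = " + maxHeapMb);
    }
}
```

If this runs cleanly with the same -Xms/-Xmx settings as the loader, the String length itself is unlikely to be what breaks the sequenceLength getter.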


