[Biojava-l] Large RichSequence collection

Thu Aug 1 10:37:46 UTC 2013

Hi,

Thanks for your proposal ;)

I have no problem reading the sequences from the FASTA file; the RichSequence iterator does that job very well.
I am processing the input sequences one by one. Each RichSequence is annotated and added to a specific group (an ArrayList) based on the annotation results. All annotated sequences are kept in memory and re-processed later, which prevents the GC from reclaiming the heap.
I could serialize the processed sequences, but the I/O also has performance issues.

I can inspect the heap with Eclipse Memory Analyzer. The SimpleRichSequence objects consume a lot of memory.

Best

khalil

-----

Confidentiality Notice: This e-mail and any files transmitted with it are private and confidential and are solely for the use of the addressee. It may contain material which is legally privileged. If you are not the addressee or the person responsible for delivering to the addressee, please notify that you have received this e-mail in error and that any use of it is strictly prohibited. It would be helpful if you could notify the author by replying to it.

On 01 Aug 2013, at 07:58, Amr AL-HOSSARY wrote:

> If your problem is parsing/loading all the sequences into memory first,
> before managing them, I created a method public LinkedHashMap<String,S>
> process(int max) in class FastaReader in BioJava 3.0.6. It reads at most
> max sequences per call, then reads the next sequences in a subsequent call.
> You can use it. If you need a similar one in BioJava 1, I can make it for
> you.
> 
> Otherwise, you will need to modify your algorithm to deal with smaller
> clusters, based on the task you are doing.
> 
> Amr
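
The batching idea described above (read at most max records per call, continue on the next call) can be sketched with the Java standard library alone. This is a minimal illustration, not the BioJava FastaReader code: the class BatchedFastaReader and the method nextBatch are hypothetical names, records are returned as plain strings rather than sequence objects, and returning null at end of input is an assumption of this sketch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.LinkedHashMap;

/** Hypothetical helper (not part of BioJava): reads FASTA records in batches. */
class BatchedFastaReader {
    private final BufferedReader in;
    // Header of the next record, if its '>' line has already been consumed.
    private String pendingHeader;

    BatchedFastaReader(Reader reader) {
        this.in = new BufferedReader(reader);
    }

    /**
     * Reads up to max records, keyed by header line (without the leading '>').
     * Returns null once the input is exhausted. Records with identical
     * headers inside one batch overwrite each other.
     */
    LinkedHashMap<String, String> nextBatch(int max) throws IOException {
        if (max < 1) {
            throw new IllegalArgumentException("max must be at least 1");
        }
        // Skip any lines before the first header; stop at end of input.
        while (pendingHeader == null) {
            String line = in.readLine();
            if (line == null) {
                return null;
            }
            if (line.startsWith(">")) {
                pendingHeader = line.substring(1).trim();
            }
        }
        LinkedHashMap<String, String> batch = new LinkedHashMap<>();
        while (pendingHeader != null && batch.size() < max) {
            String header = pendingHeader;
            pendingHeader = null;
            StringBuilder sequence = new StringBuilder();
            String line;
            // Collect sequence lines until the next header or end of input.
            while ((line = in.readLine()) != null) {
                if (line.startsWith(">")) {
                    pendingHeader = line.substring(1).trim();
                    break;
                }
                sequence.append(line.trim());
            }
            batch.put(header, sequence.toString());
        }
        return batch;
    }
}
```

A caller would loop until nextBatch returns null, process each batch, and drop its reference before requesting the next one, so that only one batch of records is reachable at a time.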
> 
> -----Original Message-----
> From: biojava-l-bounces at lists.open-bio.org
> [mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Khalil El
> Mazouari
> Sent: Thursday, August 01, 2013 1:17 AM
> To: Biojava-l at lists.open-bio.org
> Subject: [Biojava-l] Large RichSequence collection
> 
> Hi,
> 
> I have to process a large dataset of DNA sequences (>= 120,000 sequences).
> The sequences are first annotated, then clustered ... I end up with a huge
> collection of SimpleRichSequence objects consuming a lot of RAM.
> 
> Any suggestions on how to deal effectively with a large collection of
> RichSequence objects are welcome.
> 
> Thanks in advance.
> 
> khalil
> 
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l