[Bioperl-l] Placement of LargePrimarySeq

David Block dblock@gene.pbi.nrc.ca
Mon, 18 Sep 2000 13:14:09 -0600 (CST)


We've been working at the chromosome level for a while now, with Arabidopsis
(chr 2 and 4 were released last year).  What I did (in pure Perl
fashion) was to read the sequence into a string once, then use a simple
regex to save it in lines with an equal number of characters in
each line (I think I used 70).  Then I saved that file (chr2.clean).  I
read in chr2.clean and store it as an array of lines, so finding any given
subsequence is a simple arithmetical calculation and an array lookup.  We are
working with 128 or 256 MB RAM machines here, and everything is fine.
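
Roughly, the arithmetic looks like the sketch below.  This is not my actual
code: the sub names wrap_sequence and subseq_from_lines are made up for
illustration, the width of 70 is the value I think I used, and the lines
are built in memory here rather than read from chr2.clean.  Coordinates are
assumed to be 1-based and inclusive.

```perl
use strict;
use warnings;

my $WIDTH = 70;    # assumed line width

# Split a raw sequence string into fixed-width lines (last line may be short).
sub wrap_sequence {
    my ( $seq, $width ) = @_;
    return [ $seq =~ /(.{1,$width})/g ];
}

# Return the subsequence from $start to $end (1-based, inclusive).
sub subseq_from_lines {
    my ( $lines, $width, $start, $end ) = @_;

    # Which array elements hold the first and last requested bases.
    my $first = int( ( $start - 1 ) / $width );
    my $last  = int( ( $end - 1 ) / $width );

    # Join only the lines that are needed, then trim to the exact range.
    my $chunk  = join '', @{$lines}[ $first .. $last ];
    my $offset = ( $start - 1 ) - $first * $width;
    return substr( $chunk, $offset, $end - $start + 1 );
}

# A 200-base toy sequence standing in for a chromosome.
my @bases = qw(A C G T);
my $seq   = join '', map { $bases[ $_ % 4 ] } 0 .. 199;
my $lines = wrap_sequence( $seq, $WIDTH );

# Bases 65..80 span the boundary between the first and second line.
print subseq_from_lines( $lines, $WIDTH, 65, 80 ), "\n";
```

Only the lines that overlap the requested range are ever joined, so the cost
of a lookup depends on the length of the subsequence, not the chromosome.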

I even store the array in a closure so multiple sequence objects can read
from it at the same time.  It takes about 10 seconds or so to load up on
my machine, but then it's there for the life of the process, and
retrieving subsequences takes no time.
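
The closure trick is something like the sketch below.  Again, this is an
illustration rather than my real code: make_line_cache and the hash-based
"objects" are hypothetical names, and the loader returns three hard-coded
lines where the real one would read chr2.clean.

```perl
use strict;
use warnings;

# Build a closure that loads the lines on first use and keeps them
# for the life of the process.  Also returns a counter for checking
# how many times the loader actually ran.
sub make_line_cache {
    my ($loader) = @_;
    my ( $lines, $loads ) = ( undef, 0 );
    my $get = sub {
        unless ( defined $lines ) {
            $lines = $loader->();    # the slow step, done once
            $loads++;
        }
        return $lines;
    };
    return ( $get, sub { return $loads } );
}

# Stand-in loader; a real one would read the lines of the cleaned file.
my ( $get_lines, $load_count ) = make_line_cache(
    sub { return [ 'ACGTACGTAC', 'GTACGTACGT', 'TTTT' ] }
);

# Two independent "sequence objects" holding the same closure.
my @objects = map { +{ name => $_, lines => $get_lines } } qw(gene1 gene2);

for my $obj (@objects) {
    my $first_line = $obj->{lines}->()->[0];
    print "$obj->{name}: $first_line\n";
}
print "loads: ", $load_count->(), "\n";
```

Every object sees the same array reference, so the load cost is paid once
no matter how many sequence objects are created.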

Wanna see the code?  It's not SeqIO compliant, but it's part of my
Sequence object which implements Bio::SeqI.  It depends for now on knowing
the file location, which is bad.  All of this will be part of the final
version of Workbench, which will run on perl/MySQL.

HTH,

Dave "It's Monday" Block

On Mon, 18 Sep 2000, James Gilbert wrote:

> 
> 
> Ewan,
> 
> I've looked at the problem, and it isn't where I
> thought it was in the code.
> 
> I made a test sequence 40Mbp long.  I can read it
> into a string, but when I try to copy the string,
> I get the "Out of memory!" error.  (And this is on
> a machine with 1 GB RAM).
> 
> Perhaps Perl's memory allocator is calculating a
> silly number.  It might be possible to write a
> PrimarySeqI object as a C extension, with a more
> conservative memory allocation scheme.
> 
> 	James
> 
> On Mon, 18 Sep 2000, James Gilbert wrote:
> 
> > Ewan,
> > 
> > This reminds me that I should put in a fix I've
> > thought of in SeqIO::fasta to stop the memory
> > exploding on very large sequences.
> > 
> > 	James
> > 
> > On Sun, 17 Sep 2000, Ewan Birney wrote:
> > 
> > > 
> > > Tomorrow I have to do some comparisons of very large sequence files
> > > (around chromosome 1 size, if people are interested...). Although I could
> > > potentially use bioperl sequences on a machine with a huge amount of real
> > > memory, I decided to make a quick module that stores a sequence in a
> > > file in /tmp/ and then executes the subseq command by using seek and read
> > > commands.
> > > 
> > > I have this object as Bio::LargePrimarySeq. Does anyone have any
> > > objections about having this object in the Bio:: area directly, or should
> > > I put it somewhere else? (Basically, what do people feel about cluttering
> > > up the top level Bio:: area, or should I make a Bio::Seq:: directory?
> > > NB - there might be some other extensions, like Bio::CachePrimarySeq which
> > > can cache subseq calls to improve performance for LargePrimarySeq and
> > > the Ensembl database equivalents...)
> > > 
> > > 
> > > I need to write a SeqIO system for making this and also writing out very
> > > large fasta files. (it should step through the sequence one MB at a time
> > > using the subseq method, rather than getting the whole thing out as a 
> > > seq). Options:
> > > 
> > > 	(a) make a new Bio::SeqIO::bigfasta module, and ->next_seq would
> > > make sequences with LargePrimarySeq and ->write_seq would write with
> > > this subseq method
> > > 
> > > 	(b) parameterise Bio::SeqIO::fasta for both of these. (have to 
> > > handle boring don't use $/ stuff as reading can't put everything between
> > > '>' as a string, as the whole point is not to have the entire sequence as
> > > a string in memory)
> > > 
> > > I prefer (a) to (b).
> > > 
> > > 
> > > 
> > > I've got to do this tomorrow, so if people have a view, make sure that view
> > > gets back to me soon....
> > > 
> > > 
> > > 
> > > 
> > > Of course this is all main trunk stuff, not on the branch.
> > > 
> > > 
> > > 
> > > 
> > > 
> > > -----------------------------------------------------------------
> > > Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
> > > <birney@ebi.ac.uk>. 
> > > -----------------------------------------------------------------
> > > 
> > > 
> > > _______________________________________________
> > > Bioperl-l mailing list
> > > Bioperl-l@bioperl.org
> > > http://bioperl.org/mailman/listinfo/bioperl-l
> > > 
> > 
> > James G.R. Gilbert
> > The Sanger Centre
> > Wellcome Trust Genome Campus
> > Hinxton
> > Cambridge                        Tel: 01223 494906
> > CB10 1SA                         Fax: 01223 494919
> > 
> 
> James G.R. Gilbert
> The Sanger Centre
> Wellcome Trust Genome Campus
> Hinxton
> Cambridge                        Tel: 01223 494906
> CB10 1SA                         Fax: 01223 494919
> 