[Bioperl-l] Memory-mapped sequence object
Jason Stajich
jason@cgt.mc.duke.edu
Mon, 12 Aug 2002 08:49:35 -0400 (EDT)
Re: Bio::DB::Fasta
Bio::DB::Fasta does a good job with lookups on large sequences, but it
will suffer just as much as the current implementation when you try to
bring a large sequence all the way into memory:
my $reallylargeseq = $db->seq('CHROMOSOME_I', 1 => 10_000_000);
But a memory-mapped seq object would still be a good thing. If it could
replace the slow Bio::Seq::LargePrimarySeq implementation, I'd love to see
it in the toolkit.
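For what it's worth, the fixed-width layout Jeremy describes below is what makes the lookup cheap: a sequence coordinate turns into a byte offset with plain arithmetic, with no scanning. Here is a rough sketch of that arithmetic, not code from Bioperl. The names fasta_offset, fasta_subseq and chrTest are made up for illustration, and an ordinary in-memory string stands in for the mapped file. With a module such as Sys::Mmap, the same substr() call would presumably run against the mapped scalar instead.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bioperl): turn a 1-based sequence
# coordinate into a 0-based byte offset in a single-record Fasta buffer
# whose sequence lines all hold $width residues.
sub fasta_offset {
    my ($pos, $header_len, $width, $eol_len) = @_;
    my $i = $pos - 1;                      # 0-based residue index
    return $header_len                     # skip the ">..." line and its newline
         + $i                              # residues before this one
         + int($i / $width) * $eol_len;    # line breaks before this one
}

# Fetch residues $start..$end (1-based, inclusive) from the buffer.
# $buf is a reference to the scalar holding the file contents.
sub fasta_subseq {
    my ($buf, $start, $end, $width, $eol_len) = @_;
    $eol_len = 1 unless defined $eol_len;        # assume "\n" line endings
    my $header_len = index($$buf, "\n") + 1;     # header line incl. newline
    my $from = fasta_offset($start, $header_len, $width, $eol_len);
    my $to   = fasta_offset($end,   $header_len, $width, $eol_len);
    my $raw  = substr($$buf, $from, $to - $from + 1);
    $raw =~ tr/\r\n//d;                          # drop embedded line breaks
    return $raw;
}

# Tiny stand-in for a chromosome file: one record, 10 residues per line.
my $fasta = ">chrTest\nACGTACGTAC\nGGGGCCCCTT\nAACC\n";
print fasta_subseq(\$fasta, 8, 13, 10), "\n";    # prints TACGGG
```

Only the bytes between the two offsets are copied, so the cost depends on the length of the requested subsequence rather than on the size of the chromosome. The sketch breaks down as soon as the file holds more than one record or the line widths vary, which is the limitation Jeremy points out.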
-jason
On Sun, 11 Aug 2002, Jeremy Semeiks wrote:
> Hi list,
>
> I'm working on a project that requires fast random access to
> subsequences of the UCSC human draft chromosomes, and I'd like to use
> the Bioperl toolset to retrieve and manipulate these subsequences. At
> first glance, the Bio::Seq::LargePrimarySeq class (and its associated
> SeqIO::largefasta module) might seem to handle this
> situation. However, LargePrimarySeq::subseq() is actually pretty slow
> at retrieving random subsequences because it relies on file-based
> access using seek() and friends.
>
> One way of solving this problem would be to implement a sequence
> object that accesses a memory-mapped Fasta file through a module such
> as Sys::Mmap. Random subsequence access is then blazingly fast, if the
> sequence file is laid out right. The biggest drawback that I can see
> to this approach is that it's not stream-based. In particular, for
> fast retrieval to work correctly, the memory-mapped file must contain
> exactly one Fasta sequence, and the sequence columns must all be the
> same width. But this is exactly the way in which the UCSC chromosome
> files are laid out.
>
> I've already started work on a memory-mapped sequence object and Fasta
> IO object (tentatively named Bio::Seq::Huge[Primary]Seq and
> Bio::SeqIO::hugefasta, respectively). I'm using LargePrimarySeq and
> largefasta as general templates for this -- as with the LargeSeq set
> of objects, I can't quite see any way with memory-mapping to fully
> separate the interface details from the sequence. Comments and advice
> are welcome. In particular, is there a better solution to my problem
> than memory-mapping? Am I overlooking any solutions that have already
> been implemented in Bioperl or elsewhere? If not, has anyone else seen
> a need for fast subsequence access from huge files?
>
> - Jeremy
--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu