[Bioperl-l] Memory-mapped sequence object

Sun, 11 Aug 2002 23:44:43 -0700

Hi list,

I'm working on a project that requires fast random access to
subsequences of the UCSC human draft chromosomes, and I'd like to use
the Bioperl toolset to retrieve and manipulate these subsequences.  At
first glance, the Bio::Seq::LargePrimarySeq class (and its associated
SeqIO::largefasta module) might seem to handle this
situation. However, LargePrimarySeq::subseq() is actually pretty slow
at retrieving random subsequences because it relies on file-based
access using seek() and friends.

One way of solving this problem would be to implement a sequence
object that accesses a memory-mapped Fasta file through a module such
as Sys::Mmap. Random subsequence access then becomes a simple offset
calculation into the mapped file, which is very fast provided the
sequence file is laid out right. The biggest drawback that I can see
to this approach is that it's not stream-based. In particular, for
fast retrieval to work correctly, the memory-mapped file must contain
exactly one Fasta sequence, and every sequence line (except possibly
the last) must be the same width, so that a base's file offset can be
computed directly from its position. But this is exactly the way in
which the UCSC chromosome files are laid out.
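
To make the layout requirement concrete, here is a minimal sketch of
the offset arithmetic. It is written in Python purely for
illustration and is not Bioperl or Sys::Mmap code. The function names
open_fasta_mmap and subseq are hypothetical, and the sketch assumes a
single Fasta record, Unix newlines, and equal-width sequence lines.

```python
import mmap


def open_fasta_mmap(path):
    """Map a single-record, fixed-width Fasta file read-only.

    Assumes one header line, one-byte newlines, and at least one
    non-empty sequence line (hypothetical helper, for illustration).
    """
    with open(path, "rb") as fh:
        # The mapping keeps its own handle, so the file can be closed.
        mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    header_len = mm.find(b"\n") + 1        # bytes in the '>' header line
    line_end = mm.find(b"\n", header_len)  # end of the first sequence line
    if line_end == -1:                     # no trailing newline at all
        line_end = len(mm)
    line_width = line_end - header_len     # bases per full line
    return mm, header_len, line_width


def subseq(mm, header_len, line_width, start, end):
    """Return bases start..end (1-based, inclusive)."""

    def offset(pos):
        # pos is a 0-based base index; each full line before it
        # contributes one extra newline byte to the file offset.
        return header_len + pos + pos // line_width

    first = offset(start - 1)
    last = offset(end - 1)
    # The slice may span line breaks, so drop the embedded newlines.
    return mm[first:last + 1].replace(b"\n", b"").decode("ascii")
```

With 10-base lines, for example, subseq(mm, header_len, line_width, 9, 12)
reads a five-byte slice that crosses one newline and returns four
bases. If the lines were of unequal width, the pos // line_width term
would no longer count the preceding newlines correctly, which is why
the fixed-width layout matters.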

I've already started work on a memory-mapped sequence object and Fasta
IO object (tentatively named Bio::Seq::Huge[Primary]Seq and
Bio::SeqIO::hugefasta, respectively). I'm using LargePrimarySeq and
largefasta as general templates for this -- as with the LargeSeq set
of objects, I can't quite see any way with memory-mapping to fully
separate the interface details from the sequence. Comments and advice
are welcome. In particular, is there a better solution to my problem
than memory-mapping? Am I overlooking any solutions that have already
been implemented in Bioperl or elsewhere? If not, has anyone else seen
a need for fast subsequence access from huge files?

- Jeremy