[BioPython] poor man's databases for large sequence files
Peter <biopython at maubp.freeserve.co.uk>
Mon Sep 24 21:47:13 UTC 2007
  
I've been thinking about extending Bio.SeqIO to support a read-only, 
dictionary-like interface for large sequence files, WITHOUT holding 
every record in memory.
Some of the older format-specific Biopython modules have an 
index_file function and a matching Dictionary class to do this (based 
internally on either Martel/Mindy or a DIY Biopython indexer built on 
pickle).
For a format-agnostic SeqRecord dictionary, the "Shelf" object from 
Python's built-in shelve module looks like a good choice.  I could add 
a Bio.SeqIO.to_shelf() function similar to the existing 
Bio.SeqIO.to_dict() function.
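
To make the idea concrete, here is a rough sketch of what such a helper 
might look like.  The name to_shelf, its arguments and the duplicate-key 
check are all assumptions for illustration; nothing like this exists in 
Bio.SeqIO at the moment.  It uses only the standard library and assumes 
each record pickles cleanly and has a string identifier.

```python
import shelve


def to_shelf(records, filename, key_function=None):
    """Hypothetical helper: write records to a shelf, return it read-only.

    records      -- any iterable of picklable objects (e.g. SeqRecords)
    filename     -- base filename for the shelf (the dbm backend may add
                    its own extension)
    key_function -- optional callable mapping a record to a string key;
                    by default the record's .id attribute is used
    """
    # flag="n" always creates a new, empty database for writing.
    with shelve.open(filename, flag="n") as shelf:
        for record in records:
            key = key_function(record) if key_function else record.id
            # shelve keys must be strings; refuse silent overwrites.
            if key in shelf:
                raise ValueError("Duplicate key %r" % key)
            shelf[key] = record
    # Re-open read-only, so only the requested records get unpickled.
    return shelve.open(filename, flag="r")
```

Because records are consumed one at a time from the iterator, only one 
needs to be in memory while the shelf is built.  With Biopython the 
input could be the iterator returned by Bio.SeqIO.parse(handle, format), 
assuming SeqRecord objects pickle without trouble.
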
The only downside I've thought of so far concerns updating a shelf 
database.  shelve supports this, but with a few gotchas when the stored 
values are mutable objects (like dictionaries): in-place changes are 
not saved unless the entry is reassigned or the shelf was opened with 
writeback=True.  The need I am thinking about addressing is less 
flexible: read-only, low-memory access to a large collection of 
SeqRecords (typically from a large sequence file).
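
For anyone who hasn't run into the update gotcha, here is a small 
illustration using plain dictionaries as stand-ins for records.  The 
function name and the "rec1" key are made up for the example.

```python
import shelve


def shelf_update_gotcha(path):
    """Show that in-place edits to a shelved value are not saved."""
    with shelve.open(path, flag="n") as shelf:
        shelf["rec1"] = {"annotations": {}}

        # shelf["rec1"] unpickles a fresh copy; this edit changes only
        # that temporary copy and is never written back.
        shelf["rec1"]["annotations"]["source"] = "test"
        lost = dict(shelf["rec1"]["annotations"])

        # Fetch, modify, then reassign to store the change.
        entry = shelf["rec1"]
        entry["annotations"]["source"] = "test"
        shelf["rec1"] = entry
        kept = dict(shelf["rec1"]["annotations"])
    return lost, kept
```

The first edit is silently dropped and the second one sticks.  A 
read-only shelf avoids the whole issue.
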
Does anyone already use Python's shelve library with sequence data?
Peter
    
    