[Biopython] Indexing large sequence files
Peter
biopython at maubp.freeserve.co.uk
Fri Jun 19 11:53:09 UTC 2009
On Fri, Jun 19, 2009 at 12:12 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Either way, having an index file holding even compressed
> pickled versions of SeqRecord objects takes at least three
> times as much space as the original FASTQ file.
>
> So, for millions of records, I am going off the shelve/pickle
> idea. Storing offsets in the original sequence file does seem
> more practical here.
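(As a rough way to check that overhead yourself, here is a minimal sketch, not from the original post; "example.fastq" is just a placeholder filename. It compares the raw FASTQ file size against the total pickled size of the SeqRecord objects.)

#Start of code
import cPickle as pickle
from Bio import SeqIO

#Raw size of the FASTQ file in bytes
raw_size = len(open("example.fastq", "rb").read())
#Total size of the individually pickled SeqRecord objects;
#the quoted figure above suggests this can be several times larger
pickled_size = 0
for record in SeqIO.parse(open("example.fastq", "rU"), "fastq"):
    pickled_size += len(pickle.dumps(record, pickle.HIGHEST_PROTOCOL))
print "Raw %i bytes, pickled %i bytes (%0.1fx)" \
      % (raw_size, pickled_size, pickled_size / float(raw_size))
#End of code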
How does the following code work for you? Everything is held in memory;
there are no index files on disk. I've been testing it on uniprot_sprot.fasta,
which has only 470,369 records (this example takes about 8s), but the same
approach also works on a FASTQ file with seven million records, taking
about 1 min (a sketch of the FASTQ variant follows the code below). These
times cover building the index and accessing two records for testing.
#Start of code
from Bio import SeqIO

class FastaDict(object):
    """Read only dictionary interface to a FASTA file.

    Keeps the keys in memory, reads the file to access
    entries as SeqRecord objects using Bio.SeqIO."""
    def __init__(self, filename, alphabet=None):
        #TODO - Take a handle instead, provided it has
        #seek and tell methods?
        self._index = dict()
        self._alphabet = alphabet
        handle = open(filename, "rU")
        while True:
            pos = handle.tell()
            line = handle.readline()
            if not line:
                break #End of file
            if line.startswith(">"):
                #Record the offset of each title line, keyed by
                #the first word of the identifier
                self._index[line[1:].rstrip().split(None, 1)[0]] = pos
        handle.seek(0)
        self._handle = handle

    def keys(self):
        return self._index.keys()

    def __len__(self):
        return len(self._index)

    def __getitem__(self, key):
        #Seek to the record's title line, then parse just that record
        handle = self._handle
        handle.seek(self._index[key])
        return SeqIO.parse(handle, "fasta", self._alphabet).next()

import time
start = time.time()
my_dict = FastaDict("uniprot_sprot.fasta")
print len(my_dict)
print my_dict["sp|Q197F8|002R_IIV3"].format("fasta") #first
print my_dict["sp|B2ZDY1|Z_WWAVU"].format("fasta") #last
print "Took %0.2fs" % (time.time() - start)
#End of code
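For FASTQ the idea is the same, although you have to be more careful
finding the record offsets, since "@" is a valid quality character and
can start a quality line too. Here is a minimal sketch of what I mean;
FastqDict is just an illustrative name, and it assumes plain four-line
records with no line wrapping:

#Start of code
from Bio import SeqIO

class FastqDict(object):
    """Read only dictionary interface to a simple FASTQ file.

    Assumes each record is exactly four lines (title, sequence,
    plus, quality), which is typical for Illumina output but not
    guaranteed by the FASTQ format in general."""
    def __init__(self, filename):
        self._index = dict()
        handle = open(filename, "rU")
        while True:
            pos = handle.tell()
            line = handle.readline()
            if not line:
                break #End of file
            assert line.startswith("@"), "Bad title line: %r" % line
            self._index[line[1:].rstrip().split(None, 1)[0]] = pos
            #Skip the sequence, plus and quality lines
            handle.readline()
            handle.readline()
            handle.readline()
        handle.seek(0)
        self._handle = handle

    def __len__(self):
        return len(self._index)

    def __getitem__(self, key):
        self._handle.seek(self._index[key])
        return SeqIO.parse(self._handle, "fastq").next()
#End of code

Skipping exactly three lines per record side-steps the "@" ambiguity,
at the cost of rejecting line-wrapped FASTQ files.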
Peter