[Biopython-dev] [BioPython] Bio.SCOP.FileIndex

Michiel de Hoon mjldehoon at yahoo.com
Sat Jun 28 02:21:53 UTC 2008


--- On Fri, 6/27/08, Peter <biopython at maubp.freeserve.co.uk> wrote:
Are you talking about Bio/SCOP/FileIndex.py? The whole design seems to be geared to indexing the position of records in a file - down to the fact that it takes a filename rather than a handle. Why does it need "fixing"?

FileIndex pulls out records from the iterator one by one, and then calls .tell() on the file handle to find the starting position of each record. The problem is that (due to buffered reading from the file handle) .tell() does not correspond to the record starting positions.

Taking the essential pieces of FileIndex:

>>> input = open("mydatafile.txt")
>>> while True:
...     next_line = input.next()
...     print input.tell()
... 
8192
8192
8192
8192
8192
...
8192
8192
18432
18432
18432
...
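
For readers on Python 3, a small sketch of the same pattern (the temporary file and its contents are made up for illustration). In text mode, CPython 3 is expected to refuse .tell() once iteration with next() has started, rather than return buffer-aligned positions like the ones above:

```python
import os
import tempfile

# Write a small throwaway file (made-up content, for illustration only).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("line one\nline two\nline three\n")
    path = tmp.name

tell_failed = False
with open(path) as handle:
    next(handle)  # pull one line through the iterator protocol
    try:
        handle.tell()
    except OSError:
        # Text-mode files disable tell() while iteration is in progress.
        tell_failed = True

os.remove(path)
print(tell_failed)
```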

FileIndex currently works because the iterators that are actually used in Bio.SCOP call readline() internally, which reads exactly one line, so that .tell() returns the expected answer.
However, requiring the iterator to call readline() is a limitation (e.g., you cannot run it on a list of lines).
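
A minimal sketch of the readline()-based pattern, assuming a throwaway file with made-up content; line_offsets is a hypothetical helper, not part of Bio.SCOP. It calls .tell() before each readline(), so every stored offset is the start of a line:

```python
import os
import tempfile


def line_offsets(handle):
    """Return the starting offset of every line, using readline() and tell()."""
    offsets = []
    while True:
        position = handle.tell()  # taken before the line is read
        line = handle.readline()
        if not line:  # empty result means end of file
            break
        offsets.append(position)
    return offsets


# Throwaway data file (hypothetical content).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"first record\nsecond record\nthird record\n")
    path = tmp.name

with open(path, "rb") as handle:
    offsets = line_offsets(handle)
    handle.seek(offsets[1])  # jump straight to the second line
    second = handle.readline()

os.remove(path)
print(offsets)
print(second)
```

Passing a plain list of lines to line_offsets fails, since a list has neither readline() nor .tell(); that is the limitation of relying on readline().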

Another option is to let FileIndex itself call readline():

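class FileIndex(dict):
    def __init__(self, filename, record_gen, key_gen):
        dict.__init__(self)
        self.record_gen = record_gen
        self.f = open(filename)
        while True:
            location = self.f.tell()  # start of the line about to be read
            line = self.f.readline()
            if not line:  # end of file
                break
            # assumed: the key is derived from the parsed record
            key = key_gen(record_gen(line))
            self[key] = location  # store location

    def __getitem__(self, key):
        location = dict.__getitem__(self, key)
        self.f.seek(location)
        line = self.f.readline()
        return self.record_gen(line)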

This works, but it means changing how users call FileIndex.
That is also OK, but before modifying FileIndex it would be good to know whether anybody is actually using this functionality.

--Michiel.



      


