[Biopython-dev] New Bio.SeqIO code

Sun Oct 29 06:09:14 UTC 2006

Well, let's first decide which functions we want in Bio.SeqIO, and then 
decide how to name them.

I'm fine with the idea of having a function that can guess the file 
format from the extension. I also agree that a parser that can guess the 
file format from the file contents is not needed at this point.
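
To make the extension idea concrete, here is a rough sketch of what I
have in mind. The name guess_format, the table of extensions, and the
format names are all made up for illustration; none of this is in the
current code.

```python
import os

# Hypothetical table: these extensions and format names are only examples.
_EXTENSION_TO_FORMAT = {
    ".fasta": "fasta",
    ".fa": "fasta",
    ".nexus": "nexus",
    ".nex": "nexus",
}


def guess_format(filename):
    """Return a format name based on the file extension.

    Raises ValueError when the extension is not in the table, so the
    caller knows the format must be given explicitly.
    """
    extension = os.path.splitext(filename)[1].lower()
    try:
        return _EXTENSION_TO_FORMAT[extension]
    except KeyError:
        raise ValueError(
            "Cannot guess the format of %r; please state it explicitly"
            % filename
        )
```

A name like "nexus_seqs.txt" then fails with a ValueError instead of
being guessed wrongly, and the caller has to state the format.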

 > That was one thing I wanted to discuss - having a SequenceDict and
 > SequenceList class would let us add doc strings and perhaps methods
 > like maxlength, minlength, totallength, ...
 >
 > Or, I can just use simple list and dict objects in the functions
 > File2SequenceList and File2SequenceDict.
 >
 > I have no strong preference on this issue - so unless someone else
 > speaks up, I'll go back to simple lists and dictionaries - keeps
 > things simple.

If we go back to simple lists and dictionaries, do we still need the 
functions File2SequenceList and File2SequenceDict? I'd like to avoid 
software bloat as much as possible, so if we don't need these two 
functions, so much the better.
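
As a sketch of why I think the built-in types are enough: given any
iterator of records, the list, the dictionary, and the length
statistics are each one line. The Record type and toy_iterator below
are stand-ins I invented so the example runs on its own; they are not
part of Bio.SeqIO.

```python
from collections import namedtuple

# Stand-in for a sequence record; only the id and seq attributes matter here.
Record = namedtuple("Record", ["id", "seq"])


def toy_iterator():
    """Stand-in for any record iterator, such as one returned by a parser."""
    yield Record("alpha", "ACGT")
    yield Record("beta", "GGCCA")


# A plain list of records.
records = list(toy_iterator())

# A plain dictionary keyed on the record id.
record_dict = dict((record.id, record) for record in toy_iterator())

# The statistics a SequenceList class might have offered as methods.
total_length = sum(len(record.seq) for record in records)
max_length = max(len(record.seq) for record in records)
min_length = min(len(record.seq) for record in records)
```

Note that the dictionary line silently keeps the last record when two
records share an id; a dedicated function could check for that, which
is perhaps the one argument for keeping it.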

About file handles:

 > The File2SequenceIterator() function (and friends) can take a
 > filename, handle, or a string containing the contents of a file (in
 > addition to the format).  However, these are done as three separate
 > arguments.
 >
 > I could have one argument that takes a file name or handle, and works 
 > it out on its own.  Bio.Nexus tries to do this for example.  Having
 > the individual iterators also do this trick would be pretty simple
 > (using a shared utility function).
 >
 > The "contents of a file" string argument was handy when testing, but I
 > imagine this is not going to be a common situation.  If people need
 > this, they can use Python's StringIO module to turn their data string
 > into a handle easily enough.

I like the idea of one argument that takes a file name or handle. I 
believe that is how other Biopython functions work.
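
The shared utility function could be as small as the sketch below. The
name as_handle and its return convention are my own assumptions, not
existing Biopython code: a string is treated as a file name and opened,
and anything else is assumed to be a handle already.

```python
import io


def as_handle(source, mode="r"):
    """Return (handle, opened_here) for a file name or an open handle.

    Hypothetical helper: opened_here tells the caller whether it is
    responsible for closing the handle afterwards.
    """
    if isinstance(source, str):
        return open(source, mode), True
    return source, False


# A data string can still be used by wrapping it in a StringIO handle.
handle, opened_here = as_handle(io.StringIO(">demo\nACGT\n"))
first_line = handle.readline()
```

Returning the opened_here flag is one way to make sure we only close
handles that we opened ourselves, and leave the user's handles alone.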

--Michiel.

Peter wrote:
> Michiel de Hoon wrote:
>> Thanks, Peter!
>> It looks very nice. Actually, I have been using an earlier version of 
>> the new SeqIO module (from your code on Bugzilla) and found it to work 
>> quite well.
> 
> Thank you - and good to hear the (old version) is working OK.
> 
>> A few short comments:
>>
>> To parse a Fasta file using the new SeqIO looks like this:
>>
>> from Bio.SeqIO import File2SequenceIterator
>> for record in File2SequenceIterator("example.fasta") :
>>      print record.id
>>      print record.seq
>>
>> I would rather have something like this:
>>
>> from Bio.SeqIO import Fasta
>> for record in Fasta.parse(open("example.fasta")):
>>      print record.id
>>      print record.seq
>>
>> where Fasta.parse returns a FastaIterator object, and the argument is 
>> either a file object or a file name.
> 
> I think you have raised two issues - file names/handles (discussed 
> below), and the use of a generic function versus a format specific one 
> (or at least the naming conventions).
> 
> I like the idea of a generic function File2SequenceIterator() which can 
> be used on lots of different file formats, just by changing the 
> arguments.  However, there is nothing to stop you using the underlying 
> format specific iterators directly:
> 
> from Bio.SeqIO.FastaIO import FastaIterator
> for record in FastaIterator(open("example.fasta")):
>      print record.id
>      print record.seq
> 
> (which is similar to your suggestion above)
> 
> As long as you don't need to use any file format specific options, then 
> for every file format the style of the code is the same - but switching 
> file formats takes a little more work:
> 
> from Bio.SeqIO.NexusIO import NexusIterator
> for record in NexusIterator(open("example.nexus")):
>      print record.id
>      print record.seq
> 
> versus:
> 
> from Bio.SeqIO import File2SequenceIterator
> for record in File2SequenceIterator("example.nexus") :
>      print record.id
>      print record.seq
> 
> or, to give an example where the file extension is no use and the format 
> must be explicitly stated:
> 
> from Bio.SeqIO import File2SequenceIterator
> for record in File2SequenceIterator("nexus_seqs.txt", format="nexus") :
>      print record.id
>      print record.seq
> 
> I expect the "helper functions" like File2SequenceIterator() to be used 
> for the simple cases where the user does not care about the minor 
> options we might offer for individual file formats (this would cover 
> beginners).
> 
> They are also nice for writing multiple file format test cases ;)
> 
> I see later in your email you suggested a generic Bio.SeqIO.parse(file) 
> function which would cope with multiple file formats.  Was your point 
> more about what we call things?
> 
> I'm happy to go from File2SequenceIterator() to something like 
> SequenceIterator(), SequenceIter(), SeqRecordIter, or just SeqIter() - 
> with matching versions like SeqList() and SeqDict()
> 
> However, I'm not so keen on "parse()" because it gives no clue as to 
> what it will return.
> 
>                                ---
> 
> On the other point, filenames/handles.  Right now, the individual 
> iterators only take a handle.  This was a simplification I made to make 
> my life as straightforward as possible.
> 
> The File2SequenceIterator() function (and friends) can take a filename, 
> handle, or a string containing the contents of a file (in addition to 
> the format).  However, these are done as three separate arguments.
> 
> I could have one argument that takes a file name or handle, and works it 
> out on its own.  Bio.Nexus tries to do this for example.  Having the 
> individual iterators also do this trick would be pretty simple (using a 
> shared utility function).
> 
> The "contents of a file" string argument was handy when testing, but I 
> imagine this is not going to be a common situation.  If people need 
> this, they can use Python's StringIO module to turn their data string 
> into a handle easily enough.
> 
>> You can in addition have a function
>> Bio.SeqIO.parse that guesses the file type from the file name 
>> extension (as you have now for File2SequenceIterator), though that 
>> wouldn't work for file handles.
> 
> When dealing with a file handle, converting it to an undo file handle 
> would probably work - if we had code to guess the file format.  I have 
> tried to raise a syntax error when a parser is given an invalid file - 
> which would mean we could just try some common file formats in order 
> until one works without a syntax error.
> 
> But I felt this was not needed right away, so I put it off.
> 
>> On a related note, I don't think we need the SequenceList and 
>> SequenceDict class. To make a list, one can do
>>
>> from Bio.SeqIO import Fasta
>> records = [record for record in Fasta.parse(open("example.fasta"))]
> 
> Currently that would be written:
> 
> from Bio.SeqIO.FastaIO import FastaIterator
> records = [record for record in FastaIterator(open("example.fasta"))]
> 
> Or even just the following, which I find simpler:
> 
> from Bio.SeqIO.FastaIO import FastaIterator
> records = list(FastaIterator(open("example.fasta")))
> 
> Versus the alternatives:
> 
> from Bio.SeqIO import File2SequenceList
> records = File2SequenceList("example.fasta")
> 
> from Bio.SeqIO import File2SequenceDict
> record_dict = File2SequenceDict("example.fasta")
> 
>> To convert an iterator to a dictionary takes one line more, and is 
>> probably more straightforward than SequenceDict.
> 
> That was one thing I wanted to discuss - having a SequenceDict and 
> SequenceList class would let us add doc strings and perhaps methods like 
> maxlength, minlength, totallength, ...
> 
> Or, I can just use simple list and dict objects in the functions 
> File2SequenceList and File2SequenceDict.
> 
> I have no strong preference on this issue - so unless someone else 
> speaks up, I'll go back to simple lists and dictionaries - keeps things 
> simple.
> 
> Peter
>