[Biopython-dev] New Bio.SeqIO code

Thu Nov 2 12:49:30 UTC 2006

Chris Lasher wrote:
> I'd like to pitch in a few comments here.
> 
> Peter wrote:
>> One point against names like File2SequenceIterator is the pun on 
>> two versus to (i.e. convert) will not be so obvious to non-native 
>> English speakers.
> 
> I'd like to second that. It's cute, sure, but FileToSequenceIterator
>  isn't that much more difficult, and leaves no room for confusion. 
> (e.g., Where's the File1SequenceIterator?)

I would be happy with FileToSequenceIterator, or even
FileToSequenceIter.  FileToSeqIter is shorter but we don't actually
return Seq objects so I would avoid that.

Does anyone else have any suggestions?

> Michiel wrote:
>> I like the idea of one argument that takes a file name or handle. I
>>  believe that that is how other Biopython functions work.

I've had a little look, and the only case I found is the recent
Bio.Nexus parser - and this choked on a StringIO handle on my machine
(fix checked in).

Chris Lasher wrote:
> Yikes! Are you serious? Why not make it easier and require a 
> file-like object? I would definitely not be for it taking a plain 
> string. This seems implicit rather than explicit. "Takes a file... or
> a file-like object... or a string containing a filename... or just a
> string containing the file contents... or a brief description of the
> data that's in your file... or a bunch of smiley emoticons, if 
> you're in a good mood..." File-like objects are testable and leave 
> little room for surprise. Anything else seems like it's asking for a 
> headache.

Trying to distinguish between an (invalid) filename and the contents of
a sequence file is just too much to ask - more a migraine than a headache.

As an experiment, I've implemented (but not checked in) automatic
handle/filename detection.  It seems to work (but I have not yet tried
exotic arguments like file names in Unicode, or random classes with a
__str__ method).  Still, it's messy.
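
To make the idea concrete, here is a minimal sketch of handle/filename
detection in modern Python.  It is not the unchecked-in experiment
mentioned above; the helper name open_if_filename and its return
convention are hypothetical.  A plain string is always treated as a
filename and never as file contents.

```python
import io


def open_if_filename(source, mode="r"):
    """Return (handle, should_close) for a filename or an open handle.

    Hypothetical helper for illustration; not part of Bio.SeqIO.
    """
    if isinstance(source, str):
        # A string is always a filename, never the sequence data itself.
        return open(source, mode), True
    if hasattr(source, "read"):
        # Anything with a read() method is accepted as a handle,
        # which covers io.StringIO as well as real files.
        return source, False
    raise TypeError(
        "Expected a filename or a file-like object, got %r" % type(source)
    )


# Usage: an in-memory handle is passed through untouched.
handle, should_close = open_if_filename(io.StringIO(">id\nACGT\n"))
```

The should_close flag tells the caller whether it owns the handle, which
is one of the details that makes this approach messier than simply
requiring a handle.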

While it does sound like a nice idea for the end user, the distinction
between filenames and handles is pretty important in Python, and maybe we
shouldn't worry about forcing newcomers to deal with handles.  After all,
the SeqIO system will make them deal with iterators and SeqRecords which
I think are far more complicated!

What do you think, Michiel?

Chris Lasher wrote:
> Which brings me to the issue of "guessing" a file's format. Yikes, 
> again! I'd expect that kind of "magickery" from Perl, but once again,
> explicit is better than implicit. I honestly think it's not too much
> to expect the user to know what filetype they're expecting BioPython
> to deal with. Could you guys please explain the motivation behind 
> this to me? As I see it right now, the last thing I want is BioPython
> incorrectly guessing my file format, and particularly, assuming that
> I have put the proper extension to represent the file format. The 
> unified sequence object is what's beautiful about SeqIO, but the
> guesswork that you are discussing having SeqIO's classes do is scary,
> to me.

For comparison this quote is from the BioPerl SeqIO How-To:
>> [BioPerl's] SeqIO can try to guess based on known file extensions 
>> or content, ... it is a good idea to get into the practice of 
>> always specifying the format.

I want to stress that as written, the user can specify the file format
to the File2SequenceIterator function (and its variants).  Maybe we
should encourage people to explicitly supply the format in any Bio.SeqIO
documentation....

You asked about motivation for guessing the file format.  I break that
down into guessing the file format based on the file extension, or based
on the file's contents (see later).

I personally am perfectly happy with using a file extension to file
format mapping.  Maybe this reflects my computing background (more
DOS/Windows background than Unix/Linux).

Note that if the format is not specified, and the file extension is not
on the known list (e.g. "txt" or "data" which could be anything) then
the call to File2SequenceIterator function (or its variants) will fail
with an invalid format message/exception.
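
A rough sketch of that lookup, assuming a small illustrative mapping
(the actual list of known extensions is not given in this thread, and
the names EXTENSION_TO_FORMAT and resolve_format are hypothetical).  An
explicit format always wins; an unknown extension raises rather than
guesses.

```python
import os

# Illustrative entries only; not the real Bio.SeqIO list.
EXTENSION_TO_FORMAT = {
    "fasta": "fasta",
    "fas": "fasta",
    "gbk": "genbank",
    "aln": "clustal",
}


def resolve_format(filename, format=None):
    """Return the explicit format if given, else map the file extension."""
    if format is not None:
        return format
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    if extension not in EXTENSION_TO_FORMAT:
        # "txt", "data" and friends could be anything, so refuse to guess.
        raise ValueError(
            "Unknown file extension %r; please specify the format" % extension
        )
    return EXTENSION_TO_FORMAT[extension]
```

ValueError is just a placeholder here for the "invalid format
message/exception" described above.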

Assuming we don't make the format a required argument, and we keep the
extension to format mappings, then I should make a point of including
deliberate mismatches in the test suite - and check that they abort
with a SyntaxError.

Regarding guessing the format based on file contents:

For some applications, having a format guesser built into BioPython
might actually be very useful - the example given on the BioPerl website
is the back end of a web tool that took sequence input, where maybe you
can't trust the actual end user to know exactly what file format their
data is in.

Doing this for some file formats isn't too hard; often all you need to
see is the first line.  For other file formats it's very tricky and best
not attempted.  But is partial guess support even worth implementing -
especially as it may be less than perfect and get it wrong sometimes?
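
For the easy cases, a first-line check might look something like the
sketch below.  The function name and the choice of formats are
assumptions for illustration; it only covers formats with a distinctive
first line and returns None otherwise, which is exactly the "partial
guess support" in question.

```python
def guess_format_from_first_line(first_line):
    """Return a format name, or None if the first line is not recognised.

    Hypothetical helper; covers only a few formats with distinctive headers.
    """
    if first_line.startswith(">"):
        return "fasta"
    if first_line.startswith("LOCUS "):
        return "genbank"
    if first_line.startswith("CLUSTAL"):
        return "clustal"
    if first_line.startswith("# STOCKHOLM"):
        return "stockholm"
    # No distinctive header seen: better to give up than to guess wrongly.
    return None
```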

I think Michiel and I were happy to leave this question for later...

Chris Lasher wrote:
> And I think by now it's predictable that I'm a fan of Peter's 
> suggestion to have an exception raised upon the attempt to create a 
> dictionary with identical IDs; all other options are, again, too 
> implicit for my tastes.

Good.  Michiel agreed in another email:
>> 
>> You're probably right. I'm fine with raising an exception.
>> 

Have you been following the rest of that SeqRecord dictionary discussion,
Chris?
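
For anyone skimming, the behaviour agreed on above (raise an exception
on identical IDs) amounts to something like this sketch.  The function
name records_to_dict, the key_function argument, and the stand-in Record
type are all hypothetical; a real implementation would work on
SeqRecord objects.

```python
from collections import namedtuple

# Minimal stand-in for a SeqRecord, for illustration only.
Record = namedtuple("Record", ["id", "seq"])


def records_to_dict(records, key_function=lambda record: record.id):
    """Build a dict of records, refusing to overwrite a duplicate key."""
    result = {}
    for record in records:
        key = key_function(record)
        if key in result:
            # Explicit failure instead of silently keeping first or last.
            raise ValueError("Duplicate key %r" % (key,))
        result[key] = record
    return result


# Usage: unique IDs build a normal dictionary.
by_id = records_to_dict([Record("alpha", "ACGT"), Record("beta", "GGCC")])
```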

> Thanks very much for developing SeqIO and discussing it so much, 
> guys. I think this will be a fantastic asset to BioPython! Keep on 
> rockin' it!
> 
> Chris

Thank you for your passionate feedback :)

Peter