[Biopython-dev] New Bio.SeqIO code

Tue Nov 14 00:49:02 UTC 2006

Iddo Friedberg wrote:
> 3) The last argument against rigid filename extensions is 
> interoperability with other applications that generate those files. 
> Suppose you have one application that generates fasta files with a
> .tfa extension, and another with a .fa extension and yet a third with
> .pfa extensions... and those extensions are important to you for
> other reasons, like knowing which is a nucleic acid file and which is
> protein. Actually, all the NCBI genomic files are built like this...
> :)

Interesting tidbit.

If you are using "exotic" file extensions, then you would have to
explicitly tell my Bio.SeqIO code the file's format.

Although "fa" is currently a known extension mapped to fasta format in
Bio.SeqIO, your other examples are not.  Are these other extensions used
outside the internal systems of the NCBI?

> OK, three arguments. I think that relying on filename extensions for
> content is rather DOS-ish and places an extra burden on the user.

I'm not trying to force anyone into using specific filename extensions -
I'm trying to make life easier for people who already do this (or who
download their data from online sources like the NCBI or PFAM - which do
seem to be consistent in their naming conventions).

> I'm suffering enough on my Windows machine with Rasmol trying to open
> all my .pdb files. Including those where pdb stands for "Palm Pilot
> database" rather than Protein Data Bank.

Yes - multiple interpretations of a given file extension are a problem.
I've noticed that same PDB extension clash too (but I don't use a Palm
Pilot any more).

Can anyone think of any common extensions used for more than one file
format?  I know Clustal uses *.aln for its alignments which is perhaps
asking for trouble...

> We could add the format as an OPTIONAL keyword argument, with a "None"
> default value. And have the parser recognize the format from a
> lookahead using a magic regexp for each format. The user passed
> format overrides the parser guesswork. Shouldn't be too hard to
> implement, as file formats are very distinct.

The format is already an optional keyword argument defaulting to None.
When it is omitted, I use a limited mapping from filename extension to
format (assuming the filename is available) to guess the format.
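
As a rough sketch of that lookup (this is not the actual Bio.SeqIO code;
the names guess_format and EXTENSION_TO_FORMAT are made up for
illustration, and only the "fa" entry comes from this thread):

```python
import os

# Hypothetical extension table for illustration only. The thread states
# just one real mapping: "fa" is treated as fasta.
EXTENSION_TO_FORMAT = {
    "fa": "fasta",
}


def guess_format(filename, format=None):
    """Return a format name, preferring an explicit value over guesswork."""
    # A format passed by the user always wins over the filename.
    if format is not None:
        return format
    # Otherwise fall back on the (lower-cased) filename extension.
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    try:
        return EXTENSION_TO_FORMAT[extension]
    except KeyError:
        raise ValueError(
            "Cannot guess the format of %r; please give the format explicitly"
            % filename
        )
```

With a table like this, guess_format("genome.fa") gives "fasta", while an
unknown extension such as "genome.tfa" raises an error unless the caller
supplies the format, e.g. guess_format("genome.tfa", format="fasta").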

Peter