[Biopython-dev] Cleaning up Bio.SeqUtils

Fri Sep 26 09:38:57 UTC 2008

On Thu, Sep 25, 2008 at 11:47 PM, Thomas Sicheritz-Ponten
<thomas at cbs.dtu.dk> wrote:
> Peter, can you check in the corrected version of quick_FASTA_reader for me?
> I added the changes which were suggested in earlier posts (changes not
> affecting speed and simplicity)
>
> def quick_FASTA_reader(file):
>    "simple and quick FASTA reader to be used on large FASTA files"
>    from os import linesep
>    txt = open(file).read()
>    entries = []
>    splitter = "%s>" % linesep
>    for entry in txt.split(splitter):
>        name,seq= entry.split(linesep,1)
>        if name[0]=='>': name = name[1:]
>        seq = seq.replace('\n','').replace(' ','').upper()
>        entries.append((name, seq))
>    return entries

I'm pretty sure we shouldn't be using os.linesep in this way.  I'd
have to double check on a Windows box to confirm this, but I believe
from memory that when a file is read in text mode, any CRLF in it
becomes just a \n in Python.
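
As a rough illustration of the newline point (a sketch only, assuming
a current Python 3 where open() defaults to universal newlines; the
helper name read_text_mode is made up for this example):

```python
import os
import tempfile


def read_text_mode(raw_bytes):
    """Write raw bytes to a temporary file, then read them back in text mode."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(raw_bytes)
        # Text mode with the default newline handling translates CRLF to "\n".
        with open(path) as handle:
            return handle.read()
    finally:
        os.remove(path)


# A small FASTA file with Windows-style line endings.
text = read_text_mode(b">seq1\r\nACGT\r\n>seq2\r\nTTGA\r\n")
print(repr(text))
```

If the text really does come back without any "\r", then splitting on
os.linesep + ">" would find nothing to split on under Windows, where
os.linesep is "\r\n".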

The basic idea is we want to split on "\n>" so that any additional ">"
characters inside a name are ignored.  This then means the first record
in the file is a special case.  You've also added an extra if statement
in the loop, I assume to cope with the fact that a split on "\n>"
leaves a leading ">" on the first record's name, but this would go
wrong if the name itself started with a ">" too (i.e. a line starting
with ">>...", which would be unusual).
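
To make the ">>" concern concrete, here is a small sketch (not the
code from the post; split_with_strip is a hypothetical helper that
splits on a newline followed by ">" and then applies the same
"strip a leading >" check to every name):

```python
def split_with_strip(txt):
    """Return record names, stripping one leading '>' from each name found."""
    names = []
    for entry in txt.split("\n>"):
        name = entry.split("\n", 1)[0]
        # This check is meant for the first record only, but it runs on all of them.
        if name[0] == ">":
            name = name[1:]
        names.append(name)
    return names


# The second record's title line is ">>second", so its name should be ">second".
names = split_with_strip(">first\nACGT\n>>second\nTTGA\n")
print(names)
```

With this input the second name loses its own leading ">", which is
the unusual case described above.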

Perhaps instead, as a typical FASTA file starts immediately with ">",
we can just prepend "\n" to the contents of the file and then split on
"\n>".  I've updated CVS based on this, and added a minimal test for
quick_FASTA_reader (and GC) to test_SeqUtils.py as well.
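
Roughly, the idea looks like the sketch below.  This is only an
illustration and not necessarily what is in revision 1.17: the name
parse_fasta_text is hypothetical, it takes the file contents as a
string rather than a filename, and it uses partition() so that a
title line with no sequence after it does not raise an error.

```python
def parse_fasta_text(txt):
    """Return a list of (name, sequence) tuples from FASTA-formatted text."""
    entries = []
    # Prepend a newline so the first record is also preceded by "\n>".
    # The split then yields an empty first chunk, which is skipped.
    for entry in ("\n" + txt).split("\n>")[1:]:
        name, _, seq = entry.partition("\n")
        seq = seq.replace("\n", "").replace(" ", "").upper()
        entries.append((name, seq))
    return entries


records = parse_fasta_text(">>odd name\nacgt\nAC GT\n>second >x\nttga\n")
print(records)
```

No per-record if statement is needed, and any ">" inside or at the
start of a name is left alone.  Anything in the file before the first
title line is silently dropped by the [1:] slice.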

Checking in Bio/SeqUtils/__init__.py;
/home/repository/biopython/biopython/Bio/SeqUtils/__init__.py,v  <--
__init__.py
new revision: 1.17; previous revision: 1.16
done
Checking in Tests/test_SeqUtils.py;
/home/repository/biopython/biopython/Tests/test_SeqUtils.py,v  <--
test_SeqUtils.py
new revision: 1.2; previous revision: 1.1
done
Checking in Tests/output/test_SeqUtils;
/home/repository/biopython/biopython/Tests/output/test_SeqUtils,v  <--
 test_SeqUtils
new revision: 1.2; previous revision: 1.1
done

Could you have a look at Bio/SeqUtils/__init__.py revision 1.17 for
review?  It will be up on ViewCVS shortly...
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/SeqUtils/__init__.py?cvsroot=biopython

Do you think I should remove the "OBSOLETE" tag in the docstring for
the quick_FASTA_reader function?

> Concerning the seq3 function, I am not sure where it came from, I don't
> think I have added it.

OK, thanks.

Peter