[Biopython-dev] Cleaning up Bio.SeqUtils

Fri Sep 26 09:54:12 UTC 2008

Ok, fair enough :-)
Please also remove the OBSOLETE tag, as Bio.SeqIO.parse is not really a
substitute for quick_FASTA_reader.

cheers
-thomas


Peter wrote:
> On Thu, Sep 25, 2008 at 11:47 PM, Thomas Sicheritz-Ponten
> <thomas at cbs.dtu.dk> wrote:
>> Peter, can you check in the corrected version of quick_FASTA_reader for me?
>> I added the changes which were suggested in earlier posts (changes not
>> affecting speed and simplicity)
>>
>> def quick_FASTA_reader(file):
>>    "simple and quick FASTA reader to be used on large FASTA files"
>>    from os import linesep
>>    txt = open(file).read()
>>    entries = []
>>    splitter = "%s>" % linesep
>>    for entry in txt.split(splitter):
>>        name,seq= entry.split(linesep,1)
>>        if name[0]=='>': name = name[1:]
>>        seq = seq.replace('\n','').replace(' ','').upper()
>>        entries.append((name, seq))
>>    return entries
> 
> I'm pretty sure we shouldn't be using os.linesep in this way.  I'd
> have to double check on a Windows box to confirm this, but I believe
> from memory that any CRLF in the file becomes just a \n in python.
> 
> The basic idea is we want to split on "\n>" so that any additional ">"
> inside a name are ignored.  This then means the first record in the
> file is a special case.  You've also added an extra if statement in
> the loop - I assume to cope with the fact that using a split on "\n>"
> would leave a leading ">" on the first record's name -- but this would
> go wrong if the name itself started with a ">" too (i.e. a line
> starting with ">>..." which would be unusual).
> 
> Perhaps instead, as a typical FASTA file starts immediately with ">"
> we can just do the split on "\n"+contents of file.  I've updated CVS
> based on this, and added a minimal test for quick_FASTA_reader (and
> GC) to test_SeqUtils.py as well.
> 
> Checking in Bio/SeqUtils/__init__.py;
> /home/repository/biopython/biopython/Bio/SeqUtils/__init__.py,v  <--
> __init__.py
> new revision: 1.17; previous revision: 1.16
> done
> Checking in Tests/test_SeqUtils.py;
> /home/repository/biopython/biopython/Tests/test_SeqUtils.py,v  <--
> test_SeqUtils.py
> new revision: 1.2; previous revision: 1.1
> done
> Checking in Tests/output/test_SeqUtils;
> /home/repository/biopython/biopython/Tests/output/test_SeqUtils,v  <--
>  test_SeqUtils
> new revision: 1.2; previous revision: 1.1
> done
> 
> Could you have a look at Bio/SeqUtils/__init__.py revision 1.17 for
> review?  It will be up on ViewCVS shortly...
> http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/SeqUtils/__init__.py?cvsroot=biopython
> 
> Do you think I should remove the "OBSOLETE" tag in the docstring for
> the quick_FASTA_reader function?
> 
>> Concerning the seq3 function, I am not sure where it came from, I don't
>> think I have added it.
> 
> OK, thanks.
> 
> Peter

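For reference, the split-on-"\n>" idea Peter describes above could be sketched roughly as follows. This is only an illustration and not the code checked in as revision 1.17: the function name quick_fasta_reader_sketch is hypothetical, and it takes the file contents as a string (assumed to already use "\n" line endings) rather than a filename.

```python
def quick_fasta_reader_sketch(text):
    """Return a list of (name, sequence) tuples from FASTA-formatted text.

    Hypothetical sketch of the approach discussed in this thread,
    not the actual Bio.SeqUtils implementation.
    """
    entries = []
    # Prepending "\n" makes the first record start with "\n>" like all the
    # others, so no special case is needed inside the loop. Anything before
    # the first ">" ends up in element 0 of the split and is skipped.
    for entry in ("\n" + text).split("\n>")[1:]:
        # The first line is the name; everything after it is sequence.
        name, _, seq = entry.partition("\n")
        seq = seq.replace("\n", "").replace(" ", "").upper()
        entries.append((name, seq))
    return entries


example = ">first>odd name\nacgt\nAC GT\n>>second\nttaa\n"
records = quick_fasta_reader_sketch(example)
```

Because only "\n>" acts as the record separator, a ">" inside a name is kept, and a line starting with ">>" gives a name that begins with ">". The sketch assumes the text was read in text mode, so that any CRLF line endings have already been translated to "\n".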

-- 
Sicheritz-Ponten Thomas, Associate Professor, Ph.D       (
Head of Metagenomics, Technical University of Denmark     \
Center for Biological Sequence Analysis, BioCentrum        )
CBS: +45 45 252422      Building 208, DK-2800 Lyngby  ##----->
Fax: +45 45 931585      http://www.cbs.dtu.dk/~thomas      )
                                                           /
      ... damn arrow eating trees ...                     (