[Biopython-dev] Reading sequences: FormatIO, SeqIO, etc

Mon Jul 31 09:59:47 UTC 2006

Hi all,

On Fri, 2006-07-28 at 16:05 +0100, Peter (BioPython Dev) wrote:
> Jeffrey Chang wrote:
> > ...  However, Biopython already has at least 
> > 3 Fasta parsers!
> >    Bio/Fasta
> >    Bio/SeqIO/FASTA
> >    Bio/expressions/fasta
> > 
> > Bio/Fasta, the one you compared against, is easily the slowest one.  
> > Bio/SeqIO/FASTA is very similar to your implementation and not likely 
> > to be significantly faster or slower.  Bio/expressions/fasta uses 
> > Martel.  I don't know how well that will perform.  The parsing part 
> > should be blazingly fast (since it is mostly in C), but building the 
> > object will be slow.  It might be a wash.

Just to add to the confusion, when parsing large FASTA sequence files, I
have been using a home-rolled Flex/Pyrex parser (if you'd like a copy,
drop me a line).  I've used Peter's test framework on the same input
file (NC_000913.ffn), using Biopython 1.41, with Python 2.4 on Fedora
Core 3 (up-to-date, eh? ;) ) to get the following typical results:

4.07s FormatIO/SeqRecord (for record in iterator)
4.05s FormatIO/SeqRecord (iterator.next)
0.32s SeqIO.FASTA.FastaReader (for record in iterator)
0.30s SeqIO.FASTA.FastaReader (iterator.next)
0.31s SeqIO.FASTA.FastaReader (iterator[i])
5.53s Fasta.RecordParser (for record in iterator)
5.00s Fasta.SequenceParser (for record in iterator)
4.80s Fasta.SequenceParser (iterator.next)
0.18s SeqUtils/quick_FASTA_reader
0.11s pyfastaseqlexer/next_record
0.09s pyfastaseqlexer/quick_FASTA_reader
0.19s SeqUtils/quick_FASTA_reader (conversion to Seq)
0.14s pyfastaseqlexer/next_record (conversion to Seq)
0.11s pyfastaseqlexer/quick_FASTA_reader (conversion to Seq)
0.32s SeqUtils/quick_FASTA_reader (conversion to SeqRecord)
0.17s pyfastaseqlexer/next_record (conversion to SeqRecord)
0.16s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord)
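
For anyone wanting to collect timings of this kind themselves, a minimal
harness might look like the sketch below.  This is not Peter's test
framework: the name time_parser is made up for illustration, and it
assumes each parser can be wrapped as a callable that takes a filename
and returns an iterable of records.  It keeps the fastest of a few runs,
whereas the figures above are typical runs.

```python
import time


def time_parser(parse, filename, repeats=3):
    """Return (best_seconds, record_count) for parse(filename).

    `parse` is any callable that takes a filename and returns an
    iterable of records (a hypothetical interface, for illustration).
    """
    best = None
    count = 0
    for _ in range(repeats):
        start = time.perf_counter()
        # Consume every record so lazy iterators do all their work.
        count = sum(1 for _ in parse(filename))
        elapsed = time.perf_counter() - start
        if best is None or elapsed < best:
            best = elapsed
    return best, count
```

Counting the records as they are consumed also gives a quick check that
every parser saw the same number of entries.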

pyfastaseqlexer is my Flex/Pyrex combination, which has a number of
methods for reading in FASTA sequences.  Here I've used two of them: one
that corresponds to the Bio.SeqUtils.quick_FASTA_reader method
(overlooked in the original list, but included here for comparison), and
one that corresponds to the iterator approach Peter used in his tests.
Since these extra methods don't return Bio.Seq or Bio.SeqRecord objects,
but instead lists of (name, sequence) tuples, I've also included test
functions that carry out the conversion in Python, and their timings.
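
To make the two steps concrete, here is a rough pure-Python sketch of
reading (name, sequence) tuples and then converting them to record
objects.  It is neither the pyfastaseqlexer code nor the
Bio.SeqUtils.quick_FASTA_reader implementation: read_fasta_tuples,
to_records and the Record class are hypothetical names, and Record is
only a stand-in for Bio.SeqRecord so the example runs without Biopython.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical stand-in for Bio.SeqRecord (not the Biopython class)."""
    id: str
    description: str
    seq: str


def read_fasta_tuples(handle):
    """Return a list of (name, sequence) tuples from a FASTA handle."""
    entries = []
    name, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new title line closes the previous entry, if any.
            if name is not None:
                entries.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line.strip())
    if name is not None:
        entries.append((name, "".join(chunks)))
    return entries


def to_records(entries):
    """Wrap each tuple in a Record; the id is the first word of the title."""
    records = []
    for name, seq in entries:
        parts = name.split(None, 1)
        rec_id = parts[0] if parts else ""
        records.append(Record(id=rec_id, description=name, seq=seq))
    return records
```

The conversion is a separate pass over plain strings, which is why it
can be timed on its own in the listings above.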

It's probably not a surprise that a dedicated Flex-based parser shows
such a dramatic speed improvement over the Martel-based parsers.  The
improvement over SeqIO.FASTA.FastaReader and SeqUtils.quick_FASTA_reader
is only marginal, though (a factor of approximately two when conversion
to SeqRecord is taken into account).  

Since we've been discussing the need to use only strings to represent
sequences recently, it's interesting to note that
SeqUtils.quick_FASTA_reader is about twice as fast as
SeqIO.FASTA.FastaReader if there is no conversion of sequences from
strings to Seq or SeqRecord objects.

While the Flex-based parser is the fastest in these tests, the time
saved is marginal unless a large FASTA file is being parsed.  Using a
file with over 72000 entries (Phytophthora infestans ESTs), my typical
timings become:

51.22s FormatIO/SeqRecord (for record in iterator)
45.64s FormatIO/SeqRecord (iterator.next)
4.26s SeqIO.FASTA.FastaReader (for record in iterator)
4.10s SeqIO.FASTA.FastaReader (iterator.next)
4.30s SeqIO.FASTA.FastaReader (iterator[i])
58.39s Fasta.RecordParser (for record in iterator)
59.97s Fasta.SequenceParser (for record in iterator)
58.70s Fasta.SequenceParser (iterator.next)
2.20s SeqUtils/quick_FASTA_reader
1.13s pyfastaseqlexer/next_record
0.56s pyfastaseqlexer/quick_FASTA_reader
2.20s SeqUtils/quick_FASTA_reader (conversion to Seq)
1.53s pyfastaseqlexer/next_record (conversion to Seq)
0.84s pyfastaseqlexer/quick_FASTA_reader (conversion to Seq)
2.97s SeqUtils/quick_FASTA_reader (conversion to SeqRecord)
2.11s pyfastaseqlexer/next_record (conversion to SeqRecord)
1.35s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord)

The Martel-based parsers become almost unworkable when dealing with
files of this size.  Note that the conversion of strings to SeqRecord
objects is pretty much a constant overhead for the Bio.SeqUtils and
pyfastaseqlexer methods (taking around 1s), but that there are
apparently additional overheads in the SeqIO.FASTA.FastaReader method.
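
The conversion overhead can be read off the larger-file timings by
subtracting each strings-only time from the matching "conversion to
SeqRecord" time.  A small sketch of that arithmetic, using only the
figures quoted above (the variable names are mine):

```python
# Timings in seconds from the 72000-entry run above:
# (strings only, after conversion to SeqRecord)
timings = {
    "SeqUtils/quick_FASTA_reader": (2.20, 2.97),
    "pyfastaseqlexer/next_record": (1.13, 2.11),
    "pyfastaseqlexer/quick_FASTA_reader": (0.56, 1.35),
}

# Overhead of building SeqRecord objects for each reader.
overheads = {
    name: round(with_record - raw, 2)
    for name, (raw, with_record) in timings.items()
}

for name, overhead in overheads.items():
    print("%-36s %.2fs" % (name, overhead))
```

The differences come out at 0.77s, 0.98s and 0.79s respectively, so the
overhead is roughly the same whichever reader produced the strings.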

Of course, the hassles of including a Flex-based parser in a general
Biopython release probably outweigh the marginal time-saving benefits
(see MMCIFlex for details ;) ).  I think SeqIO.FASTA.FastaReader and
SeqUtils.quick_FASTA_reader do a good, quick job as they stand, and beat
the inclusion of a Flex-based parser hands-down in terms of
maintainability and portability.

L.

-- 
Dr Leighton Pritchard AMRSC
D131, Plant-Pathogen Interactions, Scottish Crop Research Institute
Invergowrie, Dundee, Scotland, DD2 5DA, UK
T: +44 (0)1382 562731 x2405 F: +44 (0)1382 568578
E: lpritc at scri.sari.ac.uk   W: http://bioinf.scri.sari.ac.uk/lp
GPG/PGP: FEFC205C E58BA41B  http://www.keyserver.net             