[Biopython-dev] Reading sequences: FormatIO, SeqIO, etc

Mon Jul 31 14:15:54 UTC 2006

On Mon, 2006-07-31 at 13:12 +0100, Peter (BioPython Dev) wrote:
> I imagine this file is much much larger than what most of our users work 
> with - but it does clearly show that the Martel parsers do not scale well.

I noticed the scaling problem mostly for GenBank files.  Your new
GenBank parser is a welcome improvement in speed.

> Out of interest, are the sequences in this file split into multiple 
> lines (e.g. max length 80) or are they all single (long) lines?  I would 
> expect the latter to be quicker to load due to fewer string operations.

They're split over multiple lines with a maximum length of 50, and the
whole file is 33MB.  It's not the largest FASTA sequence file I'm working
with; that one is 353MB (530801 sequences, most of a eukaryotic genome,
with sequences split into multiple lines), so I ran your test script on
it, just to see what happened:

419.42s FormatIO/SeqRecord (for record in interator)
389.05s FormatIO/SeqRecord (iterator.next)
35.46s SeqIO.FASTA.FastaReader (for record in interator)
33.73s SeqIO.FASTA.FastaReader (iterator.next)
36.19s SeqIO.FASTA.FastaReader (iterator[i])
490.19s Fasta.RecordParser (for record in interator)
555.43s Fasta.SequenceParser (for record in interator)
546.87s Fasta.SequenceParser (iterator.next)
37.94s SeqUtils/quick_FASTA_reader
12.84s pyfastaseqlexer/next_record
6.06s pyfastaseqlexer/quick_FASTA_reader
24.08s SeqUtils/quick_FASTA_reader (conversion to Seq)
12.27s pyfastaseqlexer/next_record (conversion to Seq)
8.71s pyfastaseqlexer/quick_FASTA_reader (conversion to Seq)
24.20s SeqUtils/quick_FASTA_reader (conversion to SeqRecord)
18.10s pyfastaseqlexer/next_record (conversion to SeqRecord)
13.45s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord)

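This is only one run - my patience has limits <grin>  Again, scaling is
a big problem for some methods: the FormatIO and Fasta.*Parser readers
take roughly 6.5 to 9 minutes here, against under 40s for everything
else.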
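
The test script itself isn't included in this message.  As a rough
sketch only, timings in the format above could be collected along these
lines; time_reader is a hypothetical helper, not part of Biopython, and
it is written for current Python rather than the 2006 interpreter:

```python
import time


def time_reader(label, make_iterator):
    """Time one full pass over an iterator and report it benchmark-style.

    make_iterator is any zero-argument callable returning a fresh
    iterator of records (hypothetical; substitute the reader under test).
    Returns the formatted timing line and the number of records seen.
    """
    start = time.perf_counter()
    count = sum(1 for _ in make_iterator())  # consume every record
    elapsed = time.perf_counter() - start
    return "%.2fs %s" % (elapsed, label), count


line, count = time_reader("demo (for record in iterator)",
                          lambda: iter(range(1000)))
print(line, count)
```

A single pass like this is noisy, so repeated runs would be needed
before reading much into small differences.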

> The SeqUtils/quick_FASTA_reader is interesting in that it loads the 
> entire file into memory in one go, and then parses it.  On the other 
> hand it's not perfect: I would use "\n>" as the split marker rather than 
> ">" which could appear in the description of a sequence.

I agree (not that it's bitten me, yet), but I'd be inclined to go with
"%s>" % os.linesep as the split marker, just in case.
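
To make the split-marker point concrete, here is a minimal sketch that
reads the whole text and splits on newline + ">".  The function name
split_fasta is made up for illustration.  It assumes the file may have
been written on a different platform from the one reading it, so rather
than depending on os.linesep it normalises line endings first:

```python
def split_fasta(text):
    """Split FASTA text into (title, sequence) pairs on newline + '>'."""
    # Normalise "\r\n" and bare "\r" so one marker works for any origin.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Prepend a newline so the first record is delimited like the rest,
    # then drop whatever came before the first marker.
    chunks = ("\n" + text).split("\n>")[1:]
    records = []
    for chunk in chunks:
        lines = chunk.split("\n")
        records.append((lines[0], "".join(lines[1:])))
    return records


example = ">seq1 length>10\nACGT\nACGT\n>seq2\nGGCC\n"
print(split_fasta(example))
# Splitting on a bare ">" cuts the first title in two:
print([c for c in example.split(">") if c])
```

With the bare ">" marker the example yields three chunks instead of two
records, because the ">" inside the first description is treated as a
record boundary.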

> Do we need to worry about the size of the raw file in memory - allowing 
> the parsers to load it into memory could make things much faster...

I use very few FASTA files where that would be a problem, so long as the
sequences remain as strings - when they're converted to
SeqRecords/SeqFeatures is where I start to get nervous about memory use.
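
For the cases where memory does matter, the alternative to loading the
whole file is to stream it one record at a time while still keeping the
sequences as plain strings.  A minimal sketch, assuming a text-mode file
handle; iter_fasta is a hypothetical name, not an existing Biopython
function:

```python
import io


def iter_fasta(handle):
    """Yield (title, sequence) string pairs, one record at a time."""
    title = None
    parts = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if title is not None:
                yield title, "".join(parts)
            title = line[1:]
            parts = []
        elif title is not None:
            # Sequence line; anything before the first header is ignored.
            parts.append(line.strip())
    if title is not None:
        yield title, "".join(parts)


handle = io.StringIO(">a first\nAC\nGT\n>b\nTT\n")
for title, sequence in iter_fasta(handle):
    print(title, len(sequence))
```

Only the current record is held in memory, so converting to heavier
objects can be deferred to the records that actually need it.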

L.

-- 
Dr Leighton Pritchard AMRSC
D131, Plant-Pathogen Interactions, Scottish Crop Research Institute
Invergowrie, Dundee, Scotland, DD2 5DA, UK
T: +44 (0)1382 562731 x2405 F: +44 (0)1382 568578
E: lpritc at scri.sari.ac.uk   W: http://bioinf.scri.sari.ac.uk/lp
GPG/PGP: FEFC205C E58BA41B  http://www.keyserver.net             