[Biopython] fastq manipulations speed
Peter Cock
p.j.a.cock at googlemail.com
Sun Mar 17 21:24:33 UTC 2013
On Sun, Mar 17, 2013 at 8:22 PM, Chris Mitchell <chris.mit7 at gmail.com> wrote:
> Hi Natassa,
>
> First, I wouldn't bother indexing. This seems a one-and-done operation and
> indexing is thus a waste of time. Have the list of stuff you want to find
> first, then iterate through the fasta file looking for what you want.
You might be able to do a paired iteration between the trimmed
FASTA file and the untrimmed quality file. I'll reply separately
with comments on the current code...
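A rough sketch of the paired-iteration idea, assuming the trimmed records appear in the same relative order as in the untrimmed file. The function name `paired_records` and the plain `(record_id, data)` tuples are hypothetical stand-ins for illustration; with Biopython the records would more likely come from `SeqIO.parse` and be matched on `record.id`.

```python
def paired_records(trimmed, untrimmed):
    """Yield (record_id, trimmed_data, untrimmed_data) for matching IDs.

    Assumption: both inputs yield (record_id, data) tuples, and every
    trimmed ID occurs in the untrimmed input in the same relative order.
    """
    untrimmed_iter = iter(untrimmed)
    for t_id, t_data in trimmed:
        # Resume scanning the untrimmed records from where we last stopped.
        for u_id, u_data in untrimmed_iter:
            if u_id == t_id:
                yield (t_id, t_data, u_data)
                break
        else:
            # The untrimmed input ran out before a match was found.
            raise ValueError("no untrimmed record found for %r" % (t_id,))


# Toy data standing in for the two files (hypothetical IDs and values).
trimmed = [("r1", "ACG"), ("r3", "TT")]
untrimmed = [("r1", "IIII"), ("r2", "IIII"), ("r3", "IIII")]

for rec_id, seq, qual in paired_records(trimmed, untrimmed):
    print(rec_id, seq, qual)
```

Each file is read once and nothing is held in memory beyond the current records, which is the appeal over building an index for a one-off job.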
> One comment on the code that will speed it up:
> don't use if record in fq_dict.keys(). That returns a list which is going
> to have a lookup time proportional to the list size. Do:
> fq_keys = set(fq_dict.keys()) and then if record in fq_keys, this will be
> O(1) lookup time.
>
> Chris
That's an excellent point, but there is no need to build a set:
both dictionaries and sets use hash-based lookups, so testing
membership on the dictionary directly should be about as fast.
I.e. instead of this:
    if record in fq_dict.keys():
        # do stuff
Use this:
    if record in fq_dict:
        # do stuff
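To illustrate the difference, here is a small timing sketch. It assumes a made-up dictionary of 100000 keys, and it uses `list(fq_dict)` to stand in for the plain list that `keys()` returns on Python 2 (on Python 3, `keys()` returns a view whose membership test is also hash based).

```python
import timeit

# Hypothetical data: 100000 read names mapped to dummy values.
fq_dict = {"read%d" % i: i for i in range(100000)}
key_list = list(fq_dict)   # a plain list of keys, as Python 2's keys() gives
probe = "read99999"        # last key, so the list scan has to walk everything

# Membership in a list is a linear scan; in a dict it is a hash lookup.
list_time = timeit.timeit(lambda: probe in key_list, number=100)
dict_time = timeit.timeit(lambda: probe in fq_dict, number=100)

print("list scan:   %.6f s" % list_time)
print("dict lookup: %.6f s" % dict_time)
```

Both tests give the same answer; only the cost differs, and the gap grows with the number of keys.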
That is also considered better style. Another related point,
rather than:
    for record in fasta_dict.keys():
        # do stuff
this would typically be written as:
    for record in fasta_dict:
        # do stuff
This is a little faster, since there is no need to call the
keys method, and it does exactly the same thing.
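A quick check that the two loops visit the same keys, using a made-up two-entry dictionary:

```python
# Hypothetical example data.
fasta_dict = {"seq1": "ACGT", "seq2": "GGCC"}

via_keys = [record for record in fasta_dict.keys()]
direct = [record for record in fasta_dict]

# Iterating over a dict yields its keys, so both lists match.
print(via_keys == direct)
```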
Peter