[Biopython] matching headers and then writing the seq record

Dilara Ally dilara.ally at gmail.com
Tue Jul 31 18:53:27 UTC 2012


Thanks Peter, that sped it up considerably!  I appreciate the fast replies on this listserv.


On Jul 28, 2012, at 1:48 PM, Peter Cock wrote:

> On Thu, Jul 26, 2012 at 6:48 PM, Dilara Ally <dilara.ally at gmail.com> wrote:
>> ... It seems as if set undoes the elegance of using a generator.
>> Any advice is greatly appreciated! ...
>> 
>> headers_read1 = set(...)
>> headers_read2 = set(...)
>> header_matches = [x for x in headers_read1 if x in headers_read2]
> 
> I would expect that using the built in set's intersection operation would
> be faster than this list comprehension solution to create header_matches.
> 
> Also, you should use a set not a list for header_matches because testing
> membership with a set is much faster than a list. i.e. Try:
> 
> header_matches = headers_read1.intersection(headers_read2)
> 
> This might be a tiny change, but I expect it to be noticeably faster.
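
For illustration, here is a minimal sketch of the two approaches side by side. The header strings are made up (the real ones come from the FASTQ files, which are not shown in this thread); the point is only that both approaches find the same headers, while the intersection returns a set, which keeps later membership tests fast.

```python
# Hypothetical header strings standing in for the real FASTQ read names.
headers_read1 = {"read_001", "read_002", "read_003", "read_005"}
headers_read2 = {"read_002", "read_003", "read_004", "read_005"}

# Original approach: a list comprehension, so the result is a list.
matches_list = [x for x in headers_read1 if x in headers_read2]

# Suggested approach: the built-in set intersection, so the result is a set.
header_matches = headers_read1.intersection(headers_read2)

# Both hold the same headers; only the container type differs.
assert set(matches_list) == header_matches
print(sorted(header_matches))
```

The same result can also be written as headers_read1 & headers_read2.
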
> 
> Also, here:
> 
>> def matched_records(records, pairType, header_matches):
>>   for rec in records:
>>       id = get_header(rec)
>>       result = id in header_matches
>>       if (result == True):
>>           newrec = replace_header(rec,pairType)
>>           yield newrec
> 
> If you don't mind my style comments, you don't really need
> to create the variables 'id', 'result', and 'newrec' - I would
> just do:
> 
> def matched_records(records, pairType, header_matches):
>    for rec in records:
>        if get_header(rec) in header_matches:
>            yield replace_header(rec,pairType)
> 
> And at that point you could write the whole thing as a
> generator expression, which you may or may not find
> more pleasing (I'm not sure if it makes any significant
> difference to the speed). i.e.
> 
> records = SeqIO.parse(sys.argv[1], "fastq")
> pairType = 1
> wanted = (replace_header(rec, pairType)
>           for rec in records
>           if get_header(rec) in header_matches)
> count = SeqIO.write(wanted, sys.argv[3], "fastq")
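
As a rough check that the generator function and the generator expression select the same records, here is a small sketch that runs without Biopython. Everything in it is an assumption made for illustration: the records are plain (header, sequence) tuples rather than SeqRecord objects, and get_header and replace_header are hypothetical stand-ins, since the real helpers are not shown in this thread.

```python
# Stand-in records: (header, sequence) tuples instead of SeqRecord objects.
records = [("read_001/1", "ACGT"), ("read_002/1", "GGCA"), ("read_003/1", "TTAG")]
header_matches = {"read_002", "read_003"}


def get_header(rec):
    """Hypothetical helper: drop the trailing '/1' or '/2' pair suffix."""
    return rec[0].rsplit("/", 1)[0]


def replace_header(rec, pair_type):
    """Hypothetical helper: rebuild the header with the given pair number."""
    return ("%s/%d" % (get_header(rec), pair_type), rec[1])


def matched_records(records, pair_type, header_matches):
    """Generator function: yield only records whose header is wanted."""
    for rec in records:
        if get_header(rec) in header_matches:
            yield replace_header(rec, pair_type)


# The same filter written as a generator function and as a generator expression.
from_function = list(matched_records(records, 1, header_matches))
from_expression = list(
    replace_header(rec, 1) for rec in records if get_header(rec) in header_matches
)

assert from_function == from_expression
```

Both forms are lazy, so with a real parser only one record needs to be held in memory at a time; the lists here exist only to compare the results.
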
> 
> I hope that helps,
> 
> Peter




