[Biopython] Shuffler for wrapped fastas

Mon Sep 21 09:20:44 UTC 2009

Hi Peter,

Thanks for the pointer to FastqGeneralIterator, I'll definitely take a 
look. (Good point about interlacing as well.)

Cheers,
Tanya

Peter wrote:
> Hi Tanya,
> 
> Sorry for the slight delay - your email didn't appear in my inbox
> for a couple of days. Odd.
> 
> On Wed, Sep 16, 2009 at 1:01 PM, Tanya Golubchik wrote:
>> A single write is definitely better, though it's still so much slower than
>> plain text shuffling that it's not ideal for millions of short reads unless
>> we want to do something useful like convert the scores to Phred in the
>> process. In that case we'd be using 'format' anyway, I assume, unless there
>> is a neat trick to reformat a whole lot of reads at once.
> 
> If guess you mean the SeqRecord's format method? It isn't intended
> for output of multiple records to a file, but rather is a convenience
> method for a single record. The (slow) approach using a loop and
> many calls to SeqRecord.format(...) is also less general than using
> a single call to Bio.SeqIO.write(...). Consider interleaved file formats
> or those with a header - the for loop won't work here.
> 
> Using Bio.SeqIO and combining the parse and write functions already
> allows simple conversion of a range of sequence file formats, including
> the three FASTQ variants. This is covered in the tutorial and the wiki,
> http://biopython.org/wiki/SeqIO
> 
> The soon to be released Biopython 1.52 will make this even easier
> (and in some cases like FASTQ conversion also faster) with the
> addition of a Bio.SeqIO.convert(...) function.
> 
>> In general I find myself using Biopython for longer sequences (fasta or
>> fastq), because of the neatness and flexibility of Biopython, but sticking
>> to plain text for short reads because of the overheads.
> 
> In some cases that is the best thing to do. If you haven't already
> done so, have a look at the FastqGeneralIterator function in
> Bio.SeqIO.QualityIO which returns a tuple of three strings (so
> no overhead from Seq and SeqRecord objects).
> 
>> BTW, itertools.izip does exactly what your interleave method
>> does, so I'm not sure there's any need to rewrite it.
> 
> No it doesn't. The builtin function zip, and itertools.izip both
> return tuples (pairs of entries). Consider:
> 
>>>> a = range(0,10,2)
>>>> b = range(1, 10, 2)
>>>> zip(a,b)
> [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]
> 
> Or, using itertools,
> 
>>>> import itertools
>>>> list(itertools.izip(a,b))
> [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]
> 
> Using the (not very general) interlace function I wrote earlier:
> http://lists.open-bio.org/pipermail/biopython/2009-September/005583.html
> 
>>>> def interlace(iter1, iter2) :
> ...     while True :
> ...         yield iter1.next()
> ...         yield iter2.next()
> ...
>>>> list(interlace(iter(a),iter(b)))
> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
> 
> I hope that illustrates the difference. Here you get back ten items,
> but with zip or izip you get five pairs of iterms. Via Google you
> can easily find much more general interlace functions in Python.
> 
> Peter