[Bioperl-l] randomizing fastq sequences

simon andrews (BI) simon.andrews at bbsrc.ac.uk
Wed Feb 9 08:35:34 UTC 2011


On 8 Feb 2011, at 16:48, shalu sharma wrote:

> Hi All,
>    Thanks for all the suggestions.
> @Simon Andrew and Roy:
>   Your method worked perfect but now memory is the issue.
> Now i have to select 50K fastq sequences from a illumina data (around 70 mil
> reads) randomly , so is there again any module that can select random
> sequences from fastq file?

The simple approach to this is to do it in two passes.  

In the first pass you simply find out how many fastq entries you have in your file.  You then randomly select 50K distinct numbers from 1..[number of fastq seqs in file], so that no entry gets picked twice.

In the second pass you pull out any sequences at an index position you randomly selected.  If you don't mind them staying in the same order as in the original file then you can just write them out immediately and use virtually no memory, or you could put them in an array and shuffle them before writing (using the same memory as the 50K experiment).
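
As a rough illustration of the two passes, here is a minimal sketch in Python rather than Bioperl.  It assumes every fastq entry is exactly four lines (no wrapped sequence lines), and the function names and the seed parameter are made up for this example.

```python
import random


def count_fastq_records(path):
    """First pass: count entries, assuming exactly four lines per entry."""
    with open(path) as handle:
        line_count = sum(1 for _ in handle)
    return line_count // 4


def sample_fastq(in_path, out_path, sample_size, seed=None):
    """Second pass: write out the entries whose index was randomly selected.

    Entries are written in their original file order, so only the set of
    chosen indices is held in memory.
    """
    total = count_fastq_records(in_path)
    rng = random.Random(seed)
    # Sample without replacement so no entry is selected twice.
    chosen = set(rng.sample(range(total), min(sample_size, total)))
    with open(in_path) as src, open(out_path, "w") as dst:
        index = 0
        while True:
            record = [src.readline() for _ in range(4)]
            if not record[0]:  # empty string means end of file
                break
            if index in chosen:
                dst.writelines(record)
            index += 1
    return len(chosen)
```

Something like sample_fastq("reads.fastq", "subset.fastq", 50000) would then give the 50K subset, where both file names are placeholders.  If you want the output in random order, collect the selected records in a list and shuffle that before writing.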

If you're going to be doing a lot of shuffling on the same dataset then it would be worth looking into doing a proper indexing of your file, as others have suggested, but if you're only going to do this once per dataset then it might not save you any time.


> Also at some point i need to shuffle the  fastq reads (order of
> nucleotides).

Same basic process - extract the sequence, split it into an array, shuffle the array, reassign the sequence.  If you want to keep the original quality scores associated with the same bases you'll need to shuffle indices and then reassemble both the shuffled sequences and qualities at the same time.
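
A minimal sketch of the index-shuffling idea, again in Python rather than Bioperl.  The function name shuffle_read is hypothetical, and it assumes the sequence and quality strings have already been extracted from the fastq entry and are the same length.

```python
import random


def shuffle_read(sequence, quality, rng=random):
    """Shuffle the base order, moving each quality character with its base."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must be the same length")
    # Shuffle the index positions once, then rebuild both strings from
    # the same ordering so base/quality pairs stay together.
    order = list(range(len(sequence)))
    rng.shuffle(order)
    shuffled_seq = "".join(sequence[i] for i in order)
    shuffled_qual = "".join(quality[i] for i in order)
    return shuffled_seq, shuffled_qual
```

Passing a seeded random.Random instance as rng makes the shuffle reproducible, which can be handy when comparing runs.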


Simon.







