[Bioperl-l] randomizing fastq sequences

Tue Feb 8 17:29:56 UTC 2011

Shalu,

(Note: this isn't a Perl solution.) I think this problem has been more or less solved in R/Bioconductor, if you have it installed, by the ShortRead package (see 'Sampler-class' in the ShortRead docs).

I think using Perl and the current BioPerl Bio::Index::Fastq indexing for FASTQ will be problematic/slow for very large files with millions of sequences (i.e. pretty much anything rolling out of modern-day sequencing pipelines). The current implementation uses a very simple indexing scheme based on DB_File, originally designed years ago for much smaller sequencing samples.  Think: Sanger sequencing.

Of course, this is with the caveat that I haven't tested this out personally, but I recall some complaints about this in the past (Jason?).  There was an effort to deal with this at one point with AnyDBM_File (which allows SQLite now) but I don't think it progressed very far, primarily b/c there simply hasn't been enough demand.  Most users seem to sample randomly from BAM files instead, which are conveniently accessible via samtools/Picard/bamtools/etc (bamtools has a 'random' option for this purpose).
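For a one-off random subset, an index may not be needed at all. A minimal sketch, assuming strict 4-line FASTQ records (no wrapped sequence or quality lines), no tab characters in the file, and GNU coreutils `shuf` being available; the function name `sample_fastq` is just an illustration, and I haven't tried this on a 70-million-read file:

```shell
# Sample N whole records from a FASTQ file in a single pass, without an index.
# Assumes each record is exactly 4 lines and no line contains a tab.
sample_fastq() {
    local fastq=$1 n=$2
    # paste joins every 4 lines into one tab-separated line,
    # shuf -n picks n of those joined lines at random,
    # tr turns each picked line back into 4 lines.
    paste - - - - < "$fastq" | shuf -n "$n" | tr '\t' '\n'
}

# Usage (hypothetical file names):
#   sample_fastq reads.fastq 50000 > sample_50k.fastq
```

The selected records also come out in random order, so the same pipeline covers the original 50K reordering question. How much memory `shuf -n` needs on very large input depends on the implementation, so it is worth checking on a smaller file first.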

chris

On Feb 8, 2011, at 10:48 AM, shalu sharma wrote:

> Hi All,
>     Thanks for all the suggestions.
> @Simon Andrew and Roy:
>    Your method worked perfectly, but now memory is the issue.
> Now I have to randomly select 50K FASTQ sequences from an Illumina data set (around 70 million reads), so is there a module that can select random sequences from a FASTQ file?
> 
> I can still use the same methods on 50K sequences, but getting 50K from a huge data set is the problem.
> Also, at some point I need to shuffle the FASTQ reads (order of nucleotides).
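Shuffling the nucleotides within each read can be sketched with awk. This assumes strict 4-line records and that each quality character should travel with its base (an assumption; the request doesn't say). The function name `shuffle_bases` and its optional seed argument are hypothetical, chosen for illustration:

```shell
# Shuffle the order of bases within every read of a FASTQ file.
# $1: FASTQ file; $2: optional non-zero integer seed for repeatable output.
shuffle_bases() {
    awk -v seed="${2:-0}" '
        BEGIN { if (seed) srand(seed); else srand() }
        NR % 4 == 1 { header = $0; next }
        NR % 4 == 2 { seq = $0; next }
        NR % 4 == 3 { plus = $0; next }
        {
            qual = $0
            n = length(seq)
            # Split the sequence and quality strings into per-position arrays.
            for (i = 1; i <= n; i++) {
                s[i] = substr(seq, i, 1)
                q[i] = substr(qual, i, 1)
            }
            # Fisher-Yates shuffle, swapping base and quality together.
            for (i = n; i > 1; i--) {
                j = int(rand() * i) + 1
                t = s[i]; s[i] = s[j]; s[j] = t
                t = q[i]; q[i] = q[j]; q[j] = t
            }
            new_seq = ""; new_qual = ""
            for (i = 1; i <= n; i++) {
                new_seq = new_seq s[i]
                new_qual = new_qual q[i]
            }
            print header; print new_seq; print plus; print new_qual
        }
    ' "$1"
}

# Usage (hypothetical file names):
#   shuffle_bases sample_50k.fastq > shuffled_bases.fastq
```

Each read keeps its header and its base composition; only the positions change.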
> 
> I am really sorry for asking so many things; I know I am really bad at handling FASTQ sequences.
> I would really appreciate your suggestions.
> 
> Thanks
> Shalu 
> 
> On Tue, Feb 8, 2011 at 10:53 AM, Chris Fields <cjfields at illinois.edu> wrote:
> Just to note, I have been thinking about wrapping this (cdbfasta/cdbyank) for fast indexing and retrieval of FASTQ for BioPerl (this came up in a prior thread, with the same suggestion from Malcolm IIRC).
> 
> chris
> 
> On Feb 8, 2011, at 9:12 AM, Cook, Malcolm wrote:
> 
> > Gotta chime in....
> >
> > If
> >       you're working with fastq files
> >       are working in unix and have the `shuf` command available
> >
> > I recommend installing cdbfasta (http://sourceforge.net/projects/cdbfasta/), which provides the cdbfasta and cdbyank tools for indexing fasta and fastq files and giving random access to them.
> >
> > Index the fastq, then extract the IDs with cdbyank, pipe them through `shuf`, and then through cdbyank again to pull out the sequences.
> >
> > Like this example, which uses a test fastq from my local install of bioperl:
> >
> >> cd ~/local/src/bioperl-live/t/data/fastq/
> >> cdbfasta -Q example.fastq
> > 3 entries from file example.fastq were indexed in file example.fastq.cidx
> >> cdbyank -l example.fastq.cidx | shuf | cdbyank example.fastq.cidx > shuf_example.fastq
> >
> > There would be issues if your IDs are not unique.
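A quick way to check for that before indexing, as a sketch: it assumes strict 4-line records and that the ID is the first whitespace-delimited token of each header line (the function name `duplicate_fastq_ids` is made up for illustration):

```shell
# Print every read ID that appears more than once in a FASTQ file.
# Assumes 4-line records; the ID is the first token after the leading "@".
duplicate_fastq_ids() {
    awk 'NR % 4 == 1 { sub(/^@/, ""); print $1 }' "$1" | sort | uniq -d
}

# Usage (file name from the example above):
#   duplicate_fastq_ids example.fastq
```

Empty output means the IDs are unique under that assumption.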
> >
> > Malcolm Cook
> > Stowers Institute for Medical Research -  Bioinformatics
> > Kansas City, Missouri  USA
> >
> >
> >
> >> -----Original Message-----
> >> From: bioperl-l-bounces at lists.open-bio.org
> >> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
> >> shalu sharma
> >> Sent: Monday, February 07, 2011 4:08 PM
> >> To: bioperl-l at lists.open-bio.org
> >> Subject: [Bioperl-l] randomizing fastq sequences
> >>
> >> Hi,
> >>   i am trying to test one program for which i need to change
> >> order of sequences in a fastq file.
> >> My fastq file contains about 50,000 sequences.
> >> Is there any script that can do this task?
> >>
> >> Thanks
> >> Shalu
> >> _______________________________________________
> >> Bioperl-l mailing list
> >> Bioperl-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>