[Bioperl-l] Select random sequences from a fasta file

Jason Stajich jason.stajich at gmail.com
Wed Mar 21 20:07:58 UTC 2012


Hi - 

If they are short reads that are all the same length (e.g. one line per sequence), you can do this in plain Perl with seek and a random number generator: jump to a random offset in the file and read 2 lines (header and sequence) from there.
The problem with trying to do this in BioPerl is that indexing the multi-FASTA file ends up being really slow once you get past ~4-5M IDs in the hash structure that is used. Plus there isn't a nice way to do the random selection other than to generate the full list of IDs, shuffle it, and pop off a few thousand to look up. I think this is overkill for the problem you are trying to solve.
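
A rough sketch of the seek-and-RNG idea, written in Python here only to keep it short (Perl's seek, tell and rand map onto it directly). It assumes every record is exactly two lines (header, then sequence) and that records are about the same length, so a random byte offset lands in each record with roughly equal probability. The function name and its parameters are made up for illustration and are not part of BioPerl or any other library.

```python
import os
import random

def sample_fasta_by_seek(path, n, seed=None):
    """Return up to n distinct (header, sequence) pairs from a FASTA file
    that has exactly two lines per record (hypothetical helper)."""
    rng = random.Random(seed)
    size = os.path.getsize(path)
    picked = {}            # header byte offset -> (header, sequence)
    max_tries = 100 * n    # give up eventually if the file has fewer than n records
    tries = 0
    with open(path, "rb") as fh:
        while len(picked) < n and tries < max_tries:
            tries += 1
            fh.seek(rng.randrange(size))
            fh.readline()                  # discard the line we landed in
            pos = fh.tell()
            header = fh.readline()
            if header and not header.startswith(b">"):
                # We landed in a header, so this line is its sequence;
                # the line after it is the next record's header.
                pos = fh.tell()
                header = fh.readline()
            if not header:
                # Ran off the end of the file: wrap around to the first
                # record so that it can be selected too.
                fh.seek(0)
                pos = 0
                header = fh.readline()
            seq = fh.readline()
            if pos not in picked:          # skip records we already have
                picked[pos] = (header.decode().rstrip(), seq.decode().rstrip())
    return list(picked.values())
```

Landing anywhere in one record selects the record that follows it, so the sampling is only uniform when records have similar byte lengths. With variable-length headers or wrapped sequence lines the draw becomes biased and a different approach would be needed.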

There is a nice utility to do this as part of the Celera Assembler: once you have built a store with the gatekeeper tool, there is an option to dump a random subselection of the data.

Jason

On Mar 21, 2012, at 12:42 PM, shalabh sharma wrote:

> Hi All,
>          Is there a way to select random sequences from a multi fasta
> file? I am using some method (not that sophisticated).
> Is there any module in bioperl that can do that?
> 
> I have a fasta file containing around 10 million reads, and I want to get
> a few thousand sequences out of it (randomly selected).
> 
> Thanks
> Shalabh
> 
> -- 
> Shalabh Sharma
> Scientific Computing Professional Associate (Bioinformatics Specialist)
> Department of Marine Sciences
> University of Georgia
> Athens, GA 30602-3636
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Jason Stajich
jason.stajich at gmail.com
jason at bioperl.org