[Bioperl-l] fastq splitter - working but not before xmas!!

Thu May 24 19:23:54 UTC 2012

Thanks Chris. I'd be happy to give it a whirl whenever it's ready.
I used a one liner grep command which got the job done:

grep -A 3 '1:[NY]:0' <in.fastq> > <out.fastq_1>
grep -A 3 '2:[NY]:0' <in.fastq> > <out.fastq_2>

Sean.

On 24 May 2012 15:05, Fields, Christopher J <cjfields at illinois.edu> wrote:
> Sean,
>
> I have been working on a XS-based interface that basically just returns hashrefs, this uses Heng Li's kseq.h library.  I can probably push this out to CPAN sometime in the next week or so.  I did some initial (very rough) benchmarks and when using a simple count it parsed 30M reads in about 25-30 seconds.
>
> chris
>
> On May 16, 2012, at 1:05 PM, Sean O'Keeffe wrote:
>
>> So now I've got a bunch of fastq's all about 17GB in size. The script is
>> puttering away but this is tediously slow.
>> I tried the the fastq-dump tool from sra toolkit but it didn't like my
>> commands (fastq-dump --split-files <input_fastq_file> ) - my ignorance no
>> doubt.
>>
>> Any ideas out there on speeding up Bio::SeqIO::fastq output?
>> Thanks.
>>
>> On 1 March 2012 03:16, Joel Martin <j_martin at lbl.gov> wrote:
>>
>>> Just a caution to double check that the read1 and read2 names match after
>>> splitting.  I don't know if this thread jinxed me or what, but I just for
>>> the first time received a concatenated fastq file formatted as you
>>> describe, except the first read1 doesn't match the first read2.  zut alores!
>>>
>>> came up with converting to scarf, /usr/bin/sort the scarf, then read that
>>> with tossing into single or paired files and reconverting to fastq in the
>>> process.  it wasn't too bad, but I don't think bioperl has a scarf
>>> conversion, it's basically fastq with : substituted for \n.  most
>>> delimeters that aren't : would work better but i already had a fastq2scarf
>>> from early solexa days ( i think ).
>>>
>>> # this was the last step, if it's handy for this plague of hideous files,
>>> the fixed fields for : would need adjusting
>>> use strict;
>>>
>>> open( my $oph, '>', 'paired.fq' ) or die $!;
>>> open( my $osh, '>', 'single.fq' ) or die $!;
>>>
>>> my ( $pend, $pname, $pline );
>>>
>>> while ( <>) {
>>>  my ( $name, $end ) = /^(\S+)\s(\d)/;
>>>
>>>  if ( $end == 1 ) {
>>>    if ( $pend ) {
>>>      print_reads( $osh, $pline );
>>>    }
>>>    $pend = $end;
>>>    $pname = $name;
>>>    $pline = $_;
>>>  }
>>>  elsif ( $end == 2 ) {
>>>    my $fh = $pend == 1 && $pname eq $name ? $oph : $osh;
>>>    print_reads( $fh, $pline, $_ );
>>>    $pend = '';
>>>  }
>>>  else {
>>>    die "ERROR: can't interpret line $. $_";
>>>  }
>>> }
>>> sub print_reads {
>>>  my ( $fh, @reads ) = @_;
>>>  for my $scarf ( @reads ) {
>>>    my @stuff = split /:/,$scarf,12;
>>>    print $fh '@',join(':', at stuff[0..9]),"\n$stuff[10]\n+\n$stuff[11]";
>>>  }
>>> }
>>>
>>> Joel
>>>
>>> On Wed, Feb 29, 2012 at 11:52 AM, George Hartzell <hartzell at alerce.com>wrote:
>>>
>>>> Fields, Christopher J writes:
>>>>> Just want to say, if you can set up a local perl and local::lib it
>>>>> makes your life a LOT easier.  Particularly if you are running jobs
>>>>> on older versions of RHEL, which notoriously stuck with
>>>>> outdated/broken versions of perl (as well as other tools).
>>>>> [...]
>>>>
>>>> And Perlbrew takes away your last excuse for not building perls and
>>>> setting up local::lib's.
>>>>
>>>> http://perlbrew.pl/
>>>>
>>>> g.
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>
>>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>