[Bioperl-l] fastq splitter

Wed Feb 29 17:00:39 UTC 2012

Key part of that is the second descriptor (e.g. the one for the qual line) is *optional*, otherwise if there is anything present it must match the descriptor for the sequence.

My concern is the error you mentioned doesn't exist in that version.  The only explanation is the wrong fastq parser version is being used.  Is it possible you have two versions of bioperl installed (maybe a system one and a local one)?  Or is the script you have invoked directly as an executable, maybe calling a specific version of perl with it's own @INC?

chris

On Feb 29, 2012, at 10:30 AM, Sean O'Keeffe wrote:

> Hi Chris,
> Here's the perldoc for fastq - it does seem to indicate that the optional descriptor (+) must match the first header. (See DESCRIPTION). 
> Maybe this version doesn't include the updated code you mention which Peter Cock has worked on.
> 
> Sean.
> 
> ======================================
>  
> Bio::SeqIO::fastq(3)  User Contributed Perl Documentation Bio::SeqIO::fastq(3)
> 
> NAME
>        Bio::SeqIO::fastq - fastq sequence input/output stream
> 
> SYNOPSIS
>          ################## pertains to FASTQ parsing only ##################
> 
>          # grabs the FASTQ parser, specifies the Illumina variant
>          my $in = Bio::SeqIO->new(-format    => ’fastq-illumina’,
>                                   -file      => ’mydata.fq’);
> 
>          # simple ’fastq’ format defaults to ’sanger’ variant
>          my $out = Bio::SeqIO->new(-format    => ’fastq’,
>                                   -file      => ’>mydata.fq’);
> 
>          # $seq is a Bio::Seq::Quality object
>          while (my $seq = $in->next_seq) {
>              $out->write_seq($seq);  # convert Illumina 1.3 to Sanger format
>          }
> 
>          # for 4x faster parsing, one can do something like this for raw data
>          use Bio::Seq::Quality;
> 
>          # $data is a hash reference containing all arguments to be passed to
>          # the Bio::Seq::Quality constructor
>          while (my $data = $in->next_dataset) {
>              # process $data, such as trim, etc
>              my $seq = Bio::Seq::Quality->new(%$data);
> 
>              # for now, write_seq only accepts Bio::Seq::Quality, but may be modified
>              # to allow raw hash references for speed
>              $out->write_seq($data);
>          }
> 
> DESCRIPTION
>        This object can transform Bio::Seq and Bio::Seq::Quality objects to and from FASTQ flat file databases.
> 
>        FASTQ is a file format used frequently at the Sanger Centre and in next-gen sequencing to bundle a FASTA sequence and its quality data. A typical FASTQ entry takes the from:
> 
>          @HCDPQ1D0501
>          GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT.....
>          +HCDPQ1D0501
>          !’’*((((***+))%%%++)(%%%%).1***-+*’’))**55CCF>>>>>>CCCCCCC65.....
> 
>        where:
> 
>          @ = descriptor, followed by one or more sequence lines
>          + = optional descriptor (if present, must match first one), followed by one or
>              more qual lines
> 
>        FASTQ and Bio::Seq::Quality mapping
> 
>        FASTQ files have sequence and quality data on single line or multiple lines, and the quality values are single-byte encoded. Data are mapped very simply to Bio::Seq::Quality instances:
> 
>            Data                                        Bio::Seq::Quality method
>            ------------------------------------------------------------------------
>            first non-whitespace chars in descriptor    id^
>            descriptor line                             desc^
>            sequence lines                              seq
>            quality                                     qual*
>            FASTQ variant                               namespace
> 
>            ^ first nonwhitespace chars are id(), everything else after (to end of line)
>              is in desc()
>            * Converted to PHRED quality scores where applicable (’solexa’)
> 
>        FASTQ variants
> 
>        This parser supports all variants of FASTQ, including Illumina v 1.0 and 1.3:
> 
>            variant                note
>            -----------------------------------------------------------
>            sanger                 original
>            solexa                 Solexa, Inc. (2004), aka Illumina 1.0
>            illumina               Illumina 1.3
> 
>        The variant can be specified by passing by either passing the additional -variant parameter to the constructor:
> 
>          my $in = Bio::SeqIO->new(-format    => ’fastq’,
>                                   -variant   => ’solexa’,
>                                   -file      => ’mysol.fq’);
> 
>        or by passing the format and variant together (Bio::SeqIO will now handle this and convert it accordingly to the proper argument):
> 
>          my $in = Bio::SeqIO->new(-format    => ’fastq-solexa’,
>                                   -file      => ’mysol.fq’);
> 
>        Variants can be converted back and forth from one another; however, due to the difference in scaling for solexa quality reads, converting from ’illumina’ or ’sanger’ FASTQ to solexa is not recommended.
> ....
> 
> ============================================
> 
> On 28 February 2012 21:40, Fields, Christopher J <cjfields at illinois.edu> wrote:
> That should work.  Can you send the output of 'perldoc Bio::SeqIO::fastq'?  That should indicate what is being called.
> 
> chris
> 
> On Feb 28, 2012, at 5:50 PM, Sean O'Keeffe wrote:
> 
> > $ perl -MBio::Root::Version -e 'print $Bio::Root::Version::VERSION,"\n"'
> > 1.006001
> >
> > Isn't that 1.6.1 - does it need upgrading ?
> >
> > On 28 February 2012 18:36, Sean O'Keeffe <limericksean at gmail.com> wrote:
> > Could be. I'll check.
> >
> > On 28 February 2012 17:17, Fields, Christopher J <cjfields at illinois.edu> wrote:
> > That's a bit odd.  Are you using an old version of the FASTQ parser?  It was revised a while ago, prior to the v1.6.1 release (the error matches one in the older parser)
> >
> > chris
> >
> > On Feb 28, 2012, at 4:01 PM, Sean O'Keeffe wrote:
> >
> > > Hi Chris,
> > > Unfortunately the read pairs are not consecutive. It seems they are cat'd together.
> > > I could use split -l on the line number that they're glued together I guess.
> > > If this is an overnight job for a bunch of files, I can wait so don't mind using the module if it worked.
> > >
> > > Someone pointed out I need to switch $seqin->desc to $inseq->desc.
> > > However, now it spits out fasta output instead of fastq and returns a bunch of warnings: Seq/Qual descriptions don't match; using sequence description
> > >
> > > Hmm.
> > >
> > > On 28 February 2012 16:50, Fields, Christopher J <cjfields at illinois.edu> wrote:
> > > Sean,
> > >
> > > If you trust the data enough, in that:
> > >
> > > 1) each record is 4 lines,
> > > 2) mate pairs are consecutive in the file, and
> > > 3) that read 1 always preceeds read 2 in the pair,
> > >
> > > then I would simply iterate through 4 lines at a time and dump to the two separate files, maybe using a flip-flop or simple record count and modulus switch.  You can always run a check on the header with a regex if you don't trust it completely.
> > >
> > > Just from the sanity point-of-view, unless you're doing a lot of validation I wouldn't use Bio::SeqIO::fastq, unless you have some time on your hands and a relatively low number of seqs (it's notoriously slow at the moment).
> > >
> > > chris
> > >
> > > On Feb 28, 2012, at 3:11 PM, Sean O'Keeffe wrote:
> > >
> > > > Hi,
> > > > I'm trying to write a quick script to separate one large PE fastq file into
> > > > 2 separate files, one for each mate pair
> > > >
> > > > The file is of the format (mate1)
> > > > @HWI-ST156:445:C0EDLACXX:4:1101:1496:1039 1:N:0:ATCACG
> > > > CTGCTGGTAGTGCCCAAAGACCTCGAATACAATGGGCTTGGTTTTGATGT
> > > > +
> > > > BCCFFFFEHHHHHJJJJJHIIJIJJIIGIJJJJJJJIJJJI?FHJJIIJA
> > > >
> > > > && (mate2)
> > > >
> > > > @HWI-ST156:445:C0EDLACXX:4:2308:20877:199811 2:Y:0:ATCACG
> > > > TCATAAAAATAACAAAACCACCACCCCATACAAACTCTACTCATCTCCAC
> > > > +
> > > > ##################################################
> > > >
> > > >
> > > > My idea is to separate using a regex such that / 1:/ would be the first
> > > > mate pair and / 2:/ would go in the second mate file.
> > > > I implemented the code below but each output file is empty. Can someone
> > > > spot my error?
> > > >
> > > > Thanks,
> > > > Sean.
> > > >
> > > > my $infile   = shift;
> > > > my $outfile1 = $infile."_1";
> > > > my $outfile2 = $infile."_2";
> > > >
> > > > my $seqin = Bio::SeqIO->new(
> > > >                             -file   => "<$infile",
> > > >                             -format => "fastq",
> > > >                             );
> > > > my $seqout1 = Bio::SeqIO->new(
> > > >                              -file   => ">$outfile1",
> > > >                              -format => "fastq",
> > > >                              );
> > > >
> > > > my $seqout2 = Bio::SeqIO->new(
> > > >                              -file   => ">$outfile2",
> > > >                              -format => "fastq",
> > > >                              );
> > > > while (my $inseq = $seqin->next_seq) {
> > > >    if ($seqin->desc =~ / 1:/){
> > > >      $seqout1->write_seq($inseq);
> > > >    } else {
> > > >      $seqout2->write_seq($inseq);
> > > >    }
> > > > }
> > > > _______________________________________________
> > > > Bioperl-l mailing list
> > > > Bioperl-l at lists.open-bio.org
> > > > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> > >
> > >
> >
> >
> >
> 
>