[Bioperl-l] fastq splitter
Florent Angly
florent.angly at gmail.com
Wed Feb 29 22:33:04 UTC 2012
Also, the desc() method returns the part of the FASTQ header that comes
after the first whitespace, so the description never starts with a space.
Hence, instead of / 1:/, your regular expression should not have the
space and should be written /1:/. In fact, it would be even better
(faster) if it were written as an anchored regular expression that
matches only the beginning of the description, /^1:/.
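
To make the point concrete, here is a small plain-Perl sketch (no BioPerl needed). The split_header helper is made up for illustration and only approximates the ID/description split described above; it is not the parser's actual code.

```perl
use strict;
use warnings;

# Hypothetical helper: split a FASTQ header line into ID and description
# at the first run of whitespace (an approximation of what desc() returns).
sub split_header {
    my ($header) = @_;
    $header =~ s/^@//;                           # drop the leading '@'
    my ($id, $desc) = split /\s+/, $header, 2;   # at most two fields
    return ($id, defined $desc ? $desc : '');
}

my $header = '@HWI-ST156:445:C0EDLACXX:4:1101:1496:1039 1:N:0:ATCACG';
my ($id, $desc) = split_header($header);

# The separating space is not part of the description, so / 1:/ cannot
# match it, while /1:/ and the anchored /^1:/ both can.
printf "id   = %s\n", $id;
printf "desc = %s\n", $desc;
for my $pattern (qr/ 1:/, qr/1:/, qr/^1:/) {
    printf "%-12s %s\n", $pattern, ($desc =~ $pattern ? 'matches' : 'no match');
}
```

Inside the while loop of the script, the same test would be applied to the description of the sequence object, i.e. $inseq->desc =~ /^1:/.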
Note that you are apparently using the latest Illumina header format, which
does not follow the previous convention for paired-end read headers (older
files mark the mate with a /1 or /2 suffix on the read ID). Hence your
script will not work properly with paired-end files in the older format.
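
If you need to handle both conventions, something along these lines might do. It is only a sketch: mate_number is a made-up helper, not a BioPerl function, and it assumes that older files mark the mate with a trailing /1 or /2 on the read ID. The third test header is an invented older-style ID.

```perl
use strict;
use warnings;

# Hypothetical helper: return 1 or 2 for the mate number, or undef when
# neither header convention is recognised.
sub mate_number {
    my ($id, $desc) = @_;
    $desc = '' unless defined $desc;
    return $1 if $desc =~ /^([12]):/;    # newer style: description starts with "1:" or "2:"
    return $1 if $id   =~ m{/([12])$};   # assumed older style: ID ends in "/1" or "/2"
    return;
}

my @cases = (
    [ 'HWI-ST156:445:C0EDLACXX:4:1101:1496:1039',    '1:N:0:ATCACG' ],
    [ 'HWI-ST156:445:C0EDLACXX:4:2308:20877:199811', '2:Y:0:ATCACG' ],
    [ 'HWUSI-EAS100R:6:73:941:1973#0/1',             '' ],   # invented older-style ID
);

for my $case (@cases) {
    my ($id, $desc) = @$case;
    my $mate = mate_number($id, $desc);
    printf "%-45s mate %s\n", $id, defined $mate ? $mate : 'unknown';
}
```

With BioPerl, the two arguments would come from $inseq->display_id and $inseq->desc. Reads for which the helper returns undef are better reported than silently written to the second file.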
Florent
On 29/02/12 07:26, Michael Muratet wrote:
>
> On Feb 28, 2012, at 3:11 PM, Sean O'Keeffe wrote:
>
>> Hi,
>> I'm trying to write a quick script to separate one large PE fastq file
>> into 2 separate files, one for each mate pair.
>>
>> The file is of the format (mate1)
>> @HWI-ST156:445:C0EDLACXX:4:1101:1496:1039 1:N:0:ATCACG
>> CTGCTGGTAGTGCCCAAAGACCTCGAATACAATGGGCTTGGTTTTGATGT
>> +
>> BCCFFFFEHHHHHJJJJJHIIJIJJIIGIJJJJJJJIJJJI?FHJJIIJA
>>
>> && (mate2)
>>
>> @HWI-ST156:445:C0EDLACXX:4:2308:20877:199811 2:Y:0:ATCACG
>> TCATAAAAATAACAAAACCACCACCCCATACAAACTCTACTCATCTCCAC
>> +
>> ##################################################
>>
>>
>> My idea is to separate using a regex such that / 1:/ would be the first
>> mate pair and / 2:/ would go in the second mate file.
>> I implemented the code below but each output file is empty. Can someone
>> spot my error?
>>
>> Thanks,
>> Sean.
>>
>> my $infile = shift;
>> my $outfile1 = $infile."_1";
>> my $outfile2 = $infile."_2";
>>
>> my $seqin = Bio::SeqIO->new(
>>     -file => "<$infile",
>>     -format => "fastq",
>> );
>> my $seqout1 = Bio::SeqIO->new(
>>     -file => ">$outfile1",
>>     -format => "fastq",
>> );
>>
>> my $seqout2 = Bio::SeqIO->new(
>>     -file => ">$outfile2",
>>     -format => "fastq",
>> );
>> while (my $inseq = $seqin->next_seq) {
>>     if ($seqin->desc =~ / 1:/){
> Hi Sean
>
> You're calling the desc method on the stream ($seqin), not on the seq
> object ($inseq).
>
> Cheers
>
> Mike
>
>>         $seqout1->write_seq($inseq);
>>     } else {
>>         $seqout2->write_seq($inseq);
>>     }
>> }
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> Michael Muratet, Ph.D.
> Senior Scientist
> HudsonAlpha Institute for Biotechnology
> mmuratet at hudsonalpha.org
> (256) 327-0473 (p)
> (256) 327-0966 (f)
>
> Room 4005
> 601 Genome Way
> Huntsville, Alabama 35806