[Bioperl-l] extract nonoverlapping subsequences from a whole genome

Tue Apr 10 20:57:20 UTC 2007

Okay, I was bored!  This is a little shorter than that script:

my $seqin = Bio::SeqIO->new(-format => 'fasta',
                             -file => shift);

my $seqout = Bio::SeqIO->new(-format => 'fasta',
                             -file => '>split.fas');

while( my $seq = $seqin->next_seq ) {
     my $seqlength = $seq->length();
     print STDERR "Length is $seqlength\n";
     my $start = 1;
     my $end = 100;
     my $desc = $seq->description;
     CHUNK:
     while ($end <= $seqlength){
         my $ordseq = $seq->trunc($start,$end);
         $ordseq->description("$start-$end $desc");
         $seqout->write_seq($ordseq);
         last CHUNK if $end >= $seqlength;
         $start += 100;
         $end = ($end + 100 > $seqlength) ? $seqlength : $end + 100;
     }
}

chris

On Apr 10, 2007, at 2:42 AM, gopu_36 wrote:

>
> Hi,
> I am one of the newbee venturingout bioperl for my research  
> purposes. I have
> a whole genome sequence of a pathogen. I am trying to break them into
> non-overlapping 1000bps subsequences. For example if my whole genome
> sequence is 400000 bps length, then I should be beak them into 4000
> subsequences of each 1000 bps and they should be non-overlapping  
> but at the
> same time continous. To be precise, my first substring would be  
> from 1 to
> 1000 bps, second substing would be from 1001 to 2000 etcc.. Could  
> anyone
> help me.
> I tried with the following code but it gives me only the first  
> substring and
> rest are not! I would appreciate very much if someone could help me!
> .........
> .
> .
> my $start =1;
> my $finish =100;
> my $inseq  = Bio::SeqIO->new(-file => "$in_file");
> while( my $seq = $inseq->next_seq ) {
> 	
> 	my $cleseq = $seq->seq();
> 	
> 	$seqlength = $seq->length();
> 	if ($finish<$seqlength){	
> 	print "The length of the sequence is $seqlength\n";	
> 	my $ordseq = $cleseq->subseq($start,$finish);
>           push(@seq_array,$ordseq);
>           $start=+100;
>           $finish=+100;
>           $counter++;
>           next;          	
>        }
> }
> -- 
> View this message in context: http://www.nabble.com/extract- 
> nonoverlapping-subsequences-from-a-whole-genome- 
> tf3551560.html#a9915265
> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign