[Bioperl-l] Can't figure out restriction analysis

Sat Jan 8 15:05:28 UTC 2011

Good idea!
Meanwhile I managed to figure out the cut method which yields coordinates, but I'm waiting now for an hour to see completion of the job for one of the larger chromosomes - that is slow. On the other hand this job will run only once.

I think I have emboss on one of my machines,I will test whether it might be faster (I guess so as you most likely did not run a such time consuming job to generate above output).

Maxim

Von meinem iPod gesendet

Am Jan 8, 2011 um 1:14 PM schrieb Hans-Rudolf Hotz <hrh at fmi.ch>:

> Hi Maxim
> 
> It is not a Bioperl solution, but have you looked at the emboss tool 'restrict' to get the coordinates?
> 
> 
> -bash-3.2$ restrict chr1.fa -enzymes hindiii -sitelen 6 stdout
> Report restriction enzyme cleavage sites in a nucleotide sequence
> ########################################
> # Program: restrict
> # Rundate: Sat  8 Jan 2011 13:08:10
> # Commandline: restrict
> #    [-sequence] /work/gbioinfo/DB/genomes/hg19/chr1.fa
> #    -enzymes hindiii
> #    -sitelen 6
> #    [-outfile] stdout
> # Report_format: table
> # Report_file: stdout
> ########################################
> 
> #=======================================
> #
> # Sequence: chr1     from: 1   to: 249250621
> # HitCount: 64394
> #
> # Minimum cuts per enzyme: 1
> # Maximum cuts per enzyme: 2000000000
> # Minimum length of recognition site: 6
> # Blunt ends allowed
> # Sticky ends allowed
> # DNA is linear
> # Ambiguities allowed
> #
> #=======================================
> 
>  Start     End  Strand Enzyme_name Restriction_site    5prime 3prime 5primerev 3primerev
>  16007   16012       + HindIII     AAGCTT               16007 16011         .         .
>  24571   24576       + HindIII     AAGCTT               24571 24575         .         .
> ///
> 249224235 249224240       + HindIII     AAGCTT           249224235 249224239         .         .
> 249230987 249230992       + HindIII     AAGCTT           249230987 249230991         .         .
> 249231350 249231355       + HindIII     AAGCTT           249231350 249231354         .         .
> 
> #---------------------------------------
> #---------------------------------------
> 
> #---------------------------------------
> # Total_sequences: 1
> # Total_length: 249250621
> # Reported_sequences: 1
> # Reported_hitcount: 64394
> #---------------------------------------
> -bash-3.2$
> 
> 
> for more details see:
> http://emboss.sourceforge.net/apps/cvs/emboss/apps/restrict.html
> 
> 
> Hope this helps.
> Regards, Hans
> 
> 
> 
> On 01/07/2011 10:46 PM, Maxim wrote:
>> Hi,
>> 
>> I'm desperately trying to annotate all HindIII restriction sites for
>> different genomes. The only solution I was able to figure out (after
>> complicated self-made approaches using awk and grep) is shown below. I'm not
>> a programmer, so please excuse the bad coding (strict, warnings etc, this is
>> dedicated to be a "standalone" script). Below code reads the 25 fasta files
>> for human genome one after another. Then restriction analysis is performed,
>> the only value I was able to get returned so far is the fragment length. Am
>> I right, that the fragment length values returned by the @fragments array
>> can be mapped onto the reference genome in a linear fashion, i.e. the
>> fragment length values are ordered according to their location on the
>> reference sequence? If so, I can of course get the positions (coordinates)
>> by simply adding the fragment length values.
>> 
>> More questions:
>> I've seen the _make_cuts function, would this be more appropriate in order
>> to directly retrieve coordinates of cutting positions? Unfortunately I can't
>> get it work like
>> my $analysis = _make_cuts($seq->seq,'HindIII',0)
>> this returns:can't call method "_make_cuts" on an undefined value
>> 
>> Or are both approaches just nonsense and I overlooked another obvious
>> function?
>> Maxim
>> 
>> 
>> 
>> use Bio::Perl;
>> use Bio::SeqIO;
>> use Bio::Restriction::Analysis;
>> use Bio::Restriction::EnzymeCollection;
>> use Cwd;
>> 
>> $dir = getcwd;
>> 
>> $fasta_dir = "/data1/Genomes/HG18/";
>> opendir(DIR, $fasta_dir) or die "$!";
>> @chroms =  grep {/chr/}  readdir DIR;
>> print "found the following files:", join ("\n",  map {$_} @chroms), "\n";
>> #foreach $chrom (@chroms) {print "$chrom\n";}
>> 
>> foreach $file (@chroms)
>> {
>> $outname = $file;
>> @outname = split (/.fa/, $outname);
>> $outname = $dir . "/" . @outname[0] . "_fragments";
>> print "Output filename: $outname\n";
>> #exit;
>> open (OUT, ">$outname");
>> $filelink = $fasta_dir. "/" . $file;
>> my $seqio = Bio::SeqIO->new(-file =>  $filelink, '-format' =>  'Fasta');
>> while(my $seq = $seqio->next_seq)
>> {
>>     my $string = $seq->seq;
>>     #print $string,"\n";
>> 
>>     my $analysis = Bio::Restriction::Analysis->new(-seq =>  $seq);
>>     my  @fragments =  $analysis->fragments('HindIII');
>> 
>>     print OUT join("\n", map {length $_} @fragments), "\n";
>> }
>> close (OUT);
>> print "chromosome file: $file is done!\n";
>> }
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l