[Bioperl-l] extract ncDNA

Sun Feb 26 11:51:37 UTC 2006

Dear Bioperl group,

I have been working on extracting non-coding DNA (ncDNA) sequences
from an organism.

I tried extracting the intergenic sequences from the sense strand
after filtering out the features (CDS, gene, mRNA, tRNA, rRNA etc.)
listed in the EMBL feature table, using BioPerl and the additional
script shown below.

I have now realised that extracting the ncDNA sequences from the
negative strand is a problem. Any ideas?

To extract the ncDNAs from the negative strand, I thought of converting
the negative-strand coordinates to sense-strand coordinates and
adding them to the sense-strand coordinates. I would then filter out all
the features (discarding everything listed in the EMBL FT) so that only
the ncDNAs remain.
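
The idea above can be sketched in plain Perl. In an EMBL feature table,
a location such as complement(100..200) is already written in
forward-strand coordinates, so features from both strands can be pooled
without any conversion. The sketch below merges the pooled spans and
returns the gaps between them. The names merge_spans and gaps and the
feature coordinates are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Merge overlapping or touching [start, end] spans (1-based, inclusive).
sub merge_spans {
    my @sorted = sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] } @_;
    my @merged;
    for my $span (@sorted) {
        if (@merged && $span->[0] <= $merged[-1][1] + 1) {
            # overlaps or abuts the previous span: extend it if needed
            $merged[-1][1] = $span->[1] if $span->[1] > $merged[-1][1];
        }
        else {
            push @merged, [@$span];
        }
    }
    return @merged;
}

# Return the regions of a sequence of length $len not covered by any span.
sub gaps {
    my ($len, @spans) = @_;
    my @gaps;
    my $next = 1;    # first position not yet covered by a feature
    for my $span (merge_spans(@spans)) {
        push @gaps, [$next, $span->[0] - 1] if $span->[0] > $next;
        $next = $span->[1] + 1 if $span->[1] + 1 > $next;
    }
    push @gaps, [$next, $len] if $next <= $len;
    return @gaps;
}

# hypothetical features: two on the forward strand, one complement(...)
my @forward = ([5, 10], [20, 30]);
my @reverse = ([8, 15]);
my @ncdna   = gaps(40, @forward, @reverse);
print join(", ", map { "$_->[0]..$_->[1]" } @ncdna), "\n";
# prints: 1..4, 16..19, 31..40
```

Each gap can then be cut out of the flattened sequence with
substr($dna, $start - 1, $end - $start + 1), which leaves $dna itself
unchanged.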

Is there anything in the BioPerl toolkit that I am missing and could
use for this?
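
For reference, a minimal sketch of how the feature coordinates might be
pulled straight from the EMBL entry with Bio::SeqIO, rather than from a
separate coordinate file. The helper names feature_spans and read_embl
are hypothetical. The sketch assumes that the overall extent of each
feature (start to end) is what should be discarded, so the introns of a
spliced feature count as covered. It skips the "source" feature because
that one spans the whole record.

```perl
use strict;
use warnings;

# Tags to ignore: the "source" feature covers the entire entry.
my %skip_tag = (source => 1);

# Collect [start, end, strand] for every feature of a sequence object.
# Works with any object offering get_SeqFeatures, as Bio::Seq does.
sub feature_spans {
    my ($seq) = @_;
    my @spans;
    for my $feat ($seq->get_SeqFeatures) {
        next if $skip_tag{ $feat->primary_tag };
        my $strand = $feat->strand;
        $strand = 0 unless defined $strand;
        # start <= end on either strand, both in forward coordinates
        push @spans, [$feat->start, $feat->end, $strand];
    }
    return @spans;
}

# Read the first entry of an EMBL flat file and return its sequence object.
sub read_embl {
    my ($file) = @_;
    require Bio::SeqIO;    # loaded here so feature_spans has no dependency
    my $in = Bio::SeqIO->new(-file => $file, -format => 'EMBL');
    return $in->next_seq;
}

# Usage: perl script.pl Organism.embl   (the file name is an assumption)
if (@ARGV) {
    my $seq = read_embl($ARGV[0]);
    print join("\t", @$_), "\n" for feature_spans($seq);
}
```

Once the uncovered regions are known, $seq->subseq($start, $end) returns
the corresponding bases from the forward strand.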

##<<<code start>>
use strict;
use warnings;

# feature co-ordinates, one feature per line: start \t end
my $EMBL_cord_file = "Organism.feature.cords";
my $RAW_file = "Organism.raw";
my $ncDNA_file = "Organism.ncDNA";

open(EMBLCORD, '<', $EMBL_cord_file) or die "Cannot open $EMBL_cord_file: $!";
open(RAW, '<', $RAW_file) or die "Cannot open $RAW_file: $!";
open(OUT, '>', $ncDNA_file) or die "Cannot open $ncDNA_file: $!";

# read the whole sequence into one string and strip all whitespace
my @dna = <RAW>;
my $dna = join('', @dna);
$dna =~ s/\s//g;

# replace each feature, in file order, with a header line
while (<EMBLCORD>) {
	chomp;
	my @cords = split /\t/;
	my $start = $cords[0];
	my $end = $cords[1];
	my $replaceString = "\n>$start..$end\n";
	substr($dna, $start-1, $end-$start+1, $replaceString);
}
print OUT $dna, "\n";
##<<<code end>>

Another problem: since I am reading the whole file into a single
scalar, the script does not complete the extraction of all ncDNAs from
the sense strand. It appears that the features are parsed before the
266,000 nt sequence has been flattened into a single string.
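
One possible cause, separate from reading the file into a scalar: each
four-argument substr call changes the length of $dna, so the
coordinates of every later feature no longer point at the right bases,
and an offset that falls beyond the end of the shortened string makes
substr die. A minimal sketch of one way around this, assuming the
features do not overlap, is to apply the replacements from the highest
start coordinate downwards. The name mask_features and the toy data are
hypothetical.

```perl
use strict;
use warnings;

# Replace each feature with a FASTA-style header, working from the end of
# the sequence towards the start so earlier offsets are never shifted.
# Assumes the features do not overlap one another.
sub mask_features {
    my ($dna, @features) = @_;
    for my $f (sort { $b->[0] <=> $a->[0] } @features) {
        my ($start, $end) = @$f;
        substr($dna, $start - 1, $end - $start + 1, "\n>$start..$end\n");
    }
    return $dna;
}

# toy 20 nt sequence with two hypothetical features
my $dna = "AAAACCCCGGGGTTTTAAAA";
print mask_features($dna, [5, 8], [13, 16]), "\n";
```

With overlapping features this still goes wrong, so merging the spans
first and cutting out the gaps between them is likely the safer route.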

Any help would be appreciated.

-PO