[Bioperl-l] extract ncDNA
perlmails at gmail.com
Sun Feb 26 11:51:37 UTC 2006
Dear Bioperl group,
I have been working on extracting non-coding DNA (ncDNA) sequences
from an organism.
I tried extracting the intergenic sequences from the sense strand by
filtering out the features (CDS, gene, mRNA, tRNA, rRNA, etc.) listed
in the EMBL feature table, using BioPerl and the additional script
shown below.
Now I realise that I have a problem extracting the ncDNA sequences
from the negative strand. Any ideas?
To extract the ncDNAs from the negative strand, I thought of converting
the negative-strand co-ordinates to sense-strand co-ordinates and
adding these to the sense-strand co-ordinates. I would then filter out
all the features listed in the EMBL feature table (FT) and keep what
remains as the ncDNAs.
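The pooling step described above can be sketched in plain Perl. In EMBL feature tables, complement(...) locations are already numbered along the forward (sense) strand, so features from both strands can usually be pooled as they are. This is only a minimal sketch: it assumes the features are available as [start, end] pairs (1-based, inclusive) and that the sequence length is known. The sub names merge_intervals and gaps, and the toy data, are hypothetical.

```perl
use strict;
use warnings;

# Merge overlapping or adjacent [start, end] intervals (1-based, inclusive).
sub merge_intervals {
    my @sorted = sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] } @_;
    my @merged;
    for my $iv (@sorted) {
        if (@merged && $iv->[0] <= $merged[-1][1] + 1) {
            # Overlaps or touches the previous interval: extend it
            $merged[-1][1] = $iv->[1] if $iv->[1] > $merged[-1][1];
        }
        else {
            push @merged, [@$iv];
        }
    }
    return @merged;
}

# Return the regions of 1..$length that no interval covers.
sub gaps {
    my ($length, @intervals) = @_;
    my @gaps;
    my $next = 1;    # first position not yet covered by a feature
    for my $iv (merge_intervals(@intervals)) {
        push @gaps, [$next, $iv->[0] - 1] if $iv->[0] > $next;
        $next = $iv->[1] + 1 if $iv->[1] + 1 > $next;
    }
    push @gaps, [$next, $length] if $next <= $length;
    return @gaps;
}

# Toy data (assumed): features pooled from both strands, forward numbering
my @features = ([10, 20], [15, 30], [50, 60]);
print join(' ', map { "$_->[0]..$_->[1]" } gaps(100, @features)), "\n";
# prints: 1..9 31..49 61..100
```

Each gap can then be read from the flattened sequence with substr($dna, $start - 1, $end - $start + 1), which leaves the sequence itself unmodified.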
Is there anything in the BioPerl toolkit that I am missing and could use for this?
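As one possible starting point, BioPerl can read the EMBL entry directly and report each feature's coordinates and strand, which would avoid a separate coordinate file. This is a rough sketch, not a tested pipeline: the sub names feature_coords and print_coords_from_embl and the tag list are assumptions. For features with split (join) locations, start and end give the overall extent only.

```perl
use strict;
use warnings;

# Feature types to collect; adjust as needed (assumed list).
my %WANTED_TAG = map { $_ => 1 } qw(CDS gene mRNA tRNA rRNA);

# Collect [start, end, strand] for the wanted features of one sequence
# object. start/end are forward-strand coordinates; strand is 1 or -1.
# Other tags, including the 'source' feature that spans the whole entry,
# are skipped.
sub feature_coords {
    my ($seq) = @_;
    my @coords;
    for my $feat ($seq->get_SeqFeatures) {
        next unless $WANTED_TAG{ $feat->primary_tag };
        push @coords, [ $feat->start, $feat->end, $feat->strand ];
    }
    return sort { $a->[0] <=> $b->[0] } @coords;
}

# Read the first entry of an EMBL file and print its feature coordinates.
sub print_coords_from_embl {
    my ($file) = @_;
    require Bio::SeqIO;    # loaded here; needs BioPerl installed
    my $in  = Bio::SeqIO->new(-file => $file, -format => 'EMBL');
    my $seq = $in->next_seq or die "No sequence found in $file";
    print join("\t", @$_), "\n" for feature_coords($seq);
}

# Usage: perl script.pl Organism.embl
print_coords_from_embl($ARGV[0]) if @ARGV;
```

The third column shows which strand each feature lies on, while the first two can be pooled across strands because they already use forward numbering.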
##<<<code start>>
use strict;
use warnings;

# Feature co-ordinates, one feature per line: start \t end
my $EMBL_cord_file = "Organism.feature.cords";
my $RAW_file       = "Organism.raw";
my $ncDNA_file     = "Organism.ncDNA";

open(EMBLCORD, '<', $EMBL_cord_file) or die "Cannot open $EMBL_cord_file: $!";
open(RAW, '<', $RAW_file) or die "Cannot open $RAW_file: $!";
open(OUT, '>', $ncDNA_file) or die "Cannot open $ncDNA_file: $!";

# Read the raw sequence and flatten it into a single string
my @dna = <RAW>;
my $dna = join('', @dna);
$dna =~ s/\s//g;

while (<EMBLCORD>) {
    chomp;
    my @cords = split /\t/;
    my $start = $cords[0];
    my $end   = $cords[1];
    # Replace the feature (1-based, inclusive) with a header line
    my $replaceString = "\n>$start..$end\n";
    substr($dna, $start-1, $end-$start+1, $replaceString);
}
print OUT $dna, "\n";
close(OUT);
##<<<code end>>
Another thing: since I read the whole file into a single scalar, the
script does not finish extracting all the ncDNAs from the sense
strand. It looks as if the features are parsed before the 266,000 nt
sequence has been flattened into a single string.
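One possible cause worth checking (an assumption, not something confirmed for this data set): every four-argument substr() call changes the length of the string, so the co-ordinates of all later features no longer point at the intended bases. A minimal sketch of one way around this is to apply the replacements from the highest start position downwards. The sub name mask_features and the toy sequence are hypothetical, and the features are assumed not to overlap.

```perl
use strict;
use warnings;

# Replace each feature [start, end] (1-based, inclusive, non-overlapping)
# with a header line, working from the highest start downwards so that a
# replacement never shifts the offsets of features still to be processed.
sub mask_features {
    my ($dna, @features) = @_;
    for my $f (sort { $b->[0] <=> $a->[0] } @features) {
        my ($start, $end) = @$f;
        substr($dna, $start - 1, $end - $start + 1, "\n>$start..$end\n");
    }
    return $dna;
}

my $dna = 'AAAAACCCCCGGGGGTTTTT';    # toy 20 nt sequence (assumed)
print mask_features($dna, [6, 10], [16, 18]), "\n";
```

With the toy input, the C run (6..10) and the first three T bases (16..18) are replaced by header lines, and the A, G and remaining T runs are left as the non-coding stretches.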
Any help would be appreciated.
-PO