[Bioperl-l] ncDNA
Chris Fields
cjfields at uiuc.edu
Wed Mar 8 23:56:30 UTC 2006
You know, it works on smaller sequences, but I found a problem with some
genome sequences which I didn't notice before (particularly M.
tuberculosis). So use this one instead. The problem lies with the fact
that any CDS that don't have a 'gene' tag it will be overlooked. I think
using locus_tag is better to catch all genes, and if gene is present switch
it out. Some genome seqfeatures in GenBank files have gene = locus_tag and
some don't. Try this version, which pulls out everything and if 'gene' is
present will use it instead of the less-useful 'locus_tag':
------------------------------------------
use Bio::SeqIO;
use Bio::Seq;
use strict;
my $starttime = time;
my $file = shift @ARGV;
my $seqio = Bio::SeqIO->new(-file => "$file",
-format => 'genbank');
my $seqout = Bio::SeqIO->new(-fh => \*STDOUT,
-format => 'fasta');
my @feat = {};
my $seq = $seqio->next_seq;
# grab features, filter for those we want to keep
for my $feature ($seq->get_SeqFeatures) {
next unless $feature->primary_tag =~ /CDS|rRNA|tRNA/;
for my $value ($feature->get_tag_values('locus_tag')) {
push @feat, {'gene' => $feature->get_tag_values('gene')
? $feature->get_tag_values('gene')
: $value,
'start' => $feature->start,
'end' => $feature->end};
}
}
# extract and format subseqs
my $count = 0;
while($feat[$count]) {
if (defined $feat[$count]->{'gene'}) {
# get end of previous gene; if no gene (undef), set to 1 (startpoint)
my $start = $feat[$count-1]->{'end'} ? $feat[$count-1]->{'end'} + 1 :
1 ;
# get start of second gene (endpoint)
my $end = $feat[$count]->{'start'} - 1;
my $igrname = $feat[$count-1]->{'gene'}
?
$feat[$count-1]->{'gene'}."-".$feat[$count]->{'gene'}.' | '.
$start."..".$end
: $feat[$count]->{'gene'}.'|'.
$start."..".$end;
my $igr = $end - $start;
if($igr > 0 ) { # filter out negative or zero values
my $sub = $seq->subseq($start, $end);
my $subseq = Bio::Seq->new(-seq => $sub,
-desc => $igrname);
$seqout->write_seq($subseq);
}
}
$count++;
}
------------------------------------------
Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Chris Fields
> Sent: Wednesday, March 08, 2006 4:48 PM
> To: perlmails at gmail.com; bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] ncDNA
>
> It's a good idea to mail to the list as well; others may have better or
> alternative ways of doing this.
>
>
>
> Jason mentions masking all regions you don't want (CDS, rRNA, tRNA) by
> changing them to N's, then splitting on multiple N's:
>
>
>
> http://portal.open-bio.org/pipermail/bioperl-l/2005-January/018092.html
>
>
>
> Using substr (as Jason suggests) should also speed things up.
>
>
>
> Here's my solution using only bioperl objects; I use this for some work I
> do
> here. It works but isn't terribly fast with large sequences due to object
> creation. There is very likely a better way with larger sequences. You
> could probably use the first loop to extract seqfeatures and change the
> second part to simple perl to speed things up a bit or try straight regex
> matching combined with substr for maximum velocity. Be careful though;
> bioperl subseq() searches starting with 1 while I think perl substr starts
> at 0!
More information about the Bioperl-l
mailing list