[Bioperl-l] Re: help extracting CDS
Hilmar Lapp
hlapp@gnf.org
Wed, 18 Dec 2002 09:27:24 -0800
In case someone was interested, this in straight bioperl does just that.
#!/usr/bin/perl
use Bio::SeqIO;
use Bio::Seq;
$in = Bio::SeqIO->new(-fh => \*STDIN, -format => 'genbank');
$out = Bio::SeqIO->new(-fh => \*STDOUT, -format => 'fasta');
while($seq = $in->next_seq()) {
@ids = ();
foreach $cds (grep { $_->primary_tag eq 'CDS'; }
$seq->get_SeqFeatures()) {
push(@ids, grep { /^GI:/; } $cds->get_tag_values('db_xref'));
push(@ids,$cds->get_tag_values('product')) if
$cds->has_tag('product');
push(@ids,$cds->get_tag_values('name')) if $cds->has_tag('name');
push(@ids,$cds->get_tag_values('protein_id'));
($seq) = $cds->get_tag_values('translation');
$out->write_seq(Bio::Seq->new(-seq => $seq,
-display_id => join('|',@ids)));
}
}
Some remarks: not every CDS feature has a translation tag. The above
code will fail on these in a non-silent manner, but you can protect
against that by appending "if $cds->has_tag('translation')" to the
respective line. Also, double-quotes also aren't always there.
One of the problems with genbank is that it would be relatively trivial
to parse if they followed their format strictly.
-hilmar
On Wednesday, December 18, 2002, at 07:55 AM, Pedro Antonio Reche
wrote:
> Hi, the ways of programing with bioperl remains misterious to me.
> Despite the always excellent help of Jason Stajich I could not find
> simple way to get the CDS from a genbank record using bioperl. At least
> for now.
> Anyway, in case someone was interested in my original post, this code
> in
> straight perl does just that.
> Regards.
> #!/usr/sbin/perl -w
> use strict;
> $/ = "\n CDS";
> <>; # to skeep header
> while ( <> ) {
> my ($gname) = /product="([^"]+)"/;#sometimes /product= is replace
> by
> /name=
> $gname =~ s/\s+//g;
> my ($ref) = /protein_id="([\w.]+)"/;
> my ($gid) = /db_xref="(GI:\w+)"/;
> my ($seq) = /translation="([A-Z\s]+)"/;
> $seq =~ s/\s+//g;
>
> print ">$gid|$gname|$ref\n$seq\n";
> }
>>
>> Hi, I need to extract the CDS from a genbank genome record, saving
>> them
>> into file in fasta format, and I wonder if someone can let me know
>> how
>> to do this using bioperl.
>> Tanks in advance for any positive consideration.
>>
>> pedro
>>
>> *******************************************************************
>> PEDRO A. RECHE , pHD TL: 617 632 3824
>> Dana-Farber Cancer Institute, FX: 617 632 4569
>> Harvard Medical School, EM: reche@research.dfci.harvard.edu
>> 44 Binney Street, D1510A, EM: reche@mifoundation.org
>> Boston, MA 02115 URL: http://www.reche.org
>> *******************************************************************
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l@bioperl.org
>> http://bioperl.org/mailman/listinfo/bioperl-l
>
> --
> *******************************************************************
> PEDRO A. RECHE , pHD TL: 617 632 3824
> Dana-Farber Cancer Institute, FX: 617 632 4569
> Harvard Medical School, EM: reche@research.dfci.harvard.edu
> 44 Binney Street, D1510A, EM: reche@mifoundation.org
> Boston, MA 02115 URL: http://www.reche.org
> *******************************************************************
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
>
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------