[Bioperl-l] Re: help extracting CDS

Hilmar Lapp hlapp@gnf.org
Wed, 18 Dec 2002 09:27:24 -0800


In case someone was interested, this in straight bioperl does just that.

#!/usr/bin/perl
use Bio::SeqIO;
use Bio::Seq;
$in = Bio::SeqIO->new(-fh => \*STDIN, -format => 'genbank');
$out = Bio::SeqIO->new(-fh => \*STDOUT, -format => 'fasta');
while($seq = $in->next_seq()) {
	@ids = ();
	foreach $cds (grep { $_->primary_tag eq 'CDS'; } 
$seq->get_SeqFeatures()) {
		push(@ids, grep { /^GI:/; } $cds->get_tag_values('db_xref'));
		push(@ids,$cds->get_tag_values('product')) if 
$cds->has_tag('product');
		push(@ids,$cds->get_tag_values('name')) if $cds->has_tag('name');
		push(@ids,$cds->get_tag_values('protein_id'));
		($seq) = $cds->get_tag_values('translation');
		$out->write_seq(Bio::Seq->new(-seq => $seq,
		                              -display_id => join('|',@ids)));
	}
}

Some remarks: not every CDS feature has a translation tag. The above 
code will fail on these in a non-silent manner, but you can protect 
against that by appending "if $cds->has_tag('translation')" to the 
respective line. Also, double-quotes also aren't always there.

One of the problems with genbank is that it would be relatively trivial 
to parse if they followed their format strictly.

	-hilmar

On Wednesday, December 18, 2002, at 07:55  AM, Pedro Antonio Reche 
wrote:

> Hi, the ways of programing with bioperl remains misterious to me.
> Despite the always excellent help of Jason Stajich I could not find
> simple way to get the CDS from a genbank record using bioperl. At least
> for now.
> Anyway, in case someone was interested in my original post, this code 
> in
> straight perl does just that.
> Regards.
> #!/usr/sbin/perl -w
> use strict;
> $/ = "\n     CDS";
> <>;	# to skeep header
> while ( <> ) {
>     my ($gname) = /product="([^"]+)"/;#sometimes /product= is replace 
> by
> /name=
>     $gname      =~ s/\s+//g;
>     my ($ref)   = /protein_id="([\w.]+)"/;
>     my ($gid)   = /db_xref="(GI:\w+)"/;
>     my ($seq)   = /translation="([A-Z\s]+)"/;
>     $seq        =~ s/\s+//g;
>
>     print ">$gid|$gname|$ref\n$seq\n";
> }
>>
>> Hi, I need to extract the CDS from a genbank genome record, saving 
>> them
>> into file in  fasta format, and I wonder if someone can let me know 
>> how
>> to do this using bioperl.
>> Tanks in advance for any positive consideration.
>>
>> pedro
>>
>> *******************************************************************
>> PEDRO A. RECHE , pHD            TL: 617 632 3824
>> Dana-Farber Cancer Institute,   FX: 617 632 4569
>> Harvard Medical School,         EM: reche@research.dfci.harvard.edu
>> 44 Binney Street, D1510A,       EM: reche@mifoundation.org
>> Boston, MA 02115                URL: http://www.reche.org
>> *******************************************************************
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l@bioperl.org
>> http://bioperl.org/mailman/listinfo/bioperl-l
>
> -- 
> *******************************************************************
> PEDRO A. RECHE , pHD		TL: 617 632 3824
> Dana-Farber Cancer Institute,	FX: 617 632 4569
> Harvard Medical School,		EM: reche@research.dfci.harvard.edu
> 44 Binney Street, D1510A,	EM: reche@mifoundation.org		
> Boston, MA 02115		URL: http://www.reche.org   						
> *******************************************************************
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------