[Bioperl-l] Re: help extracting CDS

Hilmar Lapp hlapp@gnf.org
Wed, 18 Dec 2002 09:34:35 -0800


On Wednesday, December 18, 2002, at 09:27  AM, Hilmar Lapp wrote:

> In case someone was interested, this in straight bioperl does just 
> that.
>
> #!/usr/bin/perl
> use Bio::SeqIO;
> use Bio::Seq;
> $in = Bio::SeqIO->new(-fh => \*STDIN, -format => 'genbank');
> $out = Bio::SeqIO->new(-fh => \*STDOUT, -format => 'fasta');
> while($seq = $in->next_seq()) {
> 	@ids = ();

Bummer. This line needs to be at the beginning of the foreach loop, not 
here ...

> 	foreach $cds (grep { $_->primary_tag eq 'CDS'; } 
> $seq->get_SeqFeatures()) {

		@ids = ();

-hilmar

> 		push(@ids, grep { /^GI:/; } $cds->get_tag_values('db_xref'));
> 		push(@ids,$cds->get_tag_values('product')) if 
> $cds->has_tag('product');
> 		push(@ids,$cds->get_tag_values('name')) if $cds->has_tag('name');
> 		push(@ids,$cds->get_tag_values('protein_id'));
> 		($seq) = $cds->get_tag_values('translation');
> 		$out->write_seq(Bio::Seq->new(-seq => $seq,
> 		                              -display_id => join('|',@ids)));
> 	}
> }
>
> Some remarks: not every CDS feature has a translation tag. The above 
> code will fail on these in a non-silent manner, but you can protect 
> against that by appending "if $cds->has_tag('translation')" to the 
> respective line. Also, double-quotes also aren't always there.
>
> One of the problems with genbank is that it would be relatively 
> trivial to parse if they followed their format strictly.
>
> 	-hilmar
>
> On Wednesday, December 18, 2002, at 07:55  AM, Pedro Antonio Reche 
> wrote:
>
>> Hi, the ways of programing with bioperl remains misterious to me.
>> Despite the always excellent help of Jason Stajich I could not find
>> simple way to get the CDS from a genbank record using bioperl. At 
>> least
>> for now.
>> Anyway, in case someone was interested in my original post, this code 
>> in
>> straight perl does just that.
>> Regards.
>> #!/usr/sbin/perl -w
>> use strict;
>> $/ = "\n     CDS";
>> <>;	# to skeep header
>> while ( <> ) {
>>     my ($gname) = /product="([^"]+)"/;#sometimes /product= is replace 
>> by
>> /name=
>>     $gname      =~ s/\s+//g;
>>     my ($ref)   = /protein_id="([\w.]+)"/;
>>     my ($gid)   = /db_xref="(GI:\w+)"/;
>>     my ($seq)   = /translation="([A-Z\s]+)"/;
>>     $seq        =~ s/\s+//g;
>>
>>     print ">$gid|$gname|$ref\n$seq\n";
>> }
>>>
>>> Hi, I need to extract the CDS from a genbank genome record, saving 
>>> them
>>> into file in  fasta format, and I wonder if someone can let me know 
>>> how
>>> to do this using bioperl.
>>> Tanks in advance for any positive consideration.
>>>
>>> pedro
>>>
>>> *******************************************************************
>>> PEDRO A. RECHE , pHD            TL: 617 632 3824
>>> Dana-Farber Cancer Institute,   FX: 617 632 4569
>>> Harvard Medical School,         EM: reche@research.dfci.harvard.edu
>>> 44 Binney Street, D1510A,       EM: reche@mifoundation.org
>>> Boston, MA 02115                URL: http://www.reche.org
>>> *******************************************************************
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l@bioperl.org
>>> http://bioperl.org/mailman/listinfo/bioperl-l
>>
>> -- 
>> *******************************************************************
>> PEDRO A. RECHE , pHD		TL: 617 632 3824
>> Dana-Farber Cancer Institute,	FX: 617 632 4569
>> Harvard Medical School,		EM: reche@research.dfci.harvard.edu
>> 44 Binney Street, D1510A,	EM: reche@mifoundation.org		
>> Boston, MA 02115		URL: http://www.reche.org   						
>> *******************************************************************
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l@bioperl.org
>> http://bioperl.org/mailman/listinfo/bioperl-l
>>
> -- 
> -------------------------------------------------------------
> Hilmar Lapp                            email: lapp at gnf.org
> GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
> -------------------------------------------------------------
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------