[Bioperl-l] extracting CDS portion of RefSeqs

Thu Dec 15 14:06:58 EST 2005

Amit,

An issue which you seem to be ignoring is that the CDS will often be a
'Join'

i.e., from http://www.ncbi.nlm.nih.gov/projects/collab/FT/

A more complex description:
Key             Location/Qualifiers
CDS             join(544..589,688..>1032)
                /product="T-cell receptor beta-chain"

which materializes in BioPerl as $feat having a Bio::Location::Split as
its location.

So, simply taking the $start and $end can yield you intronic sequence,
which I think you don't want.

You can pass the cds's location to subseq though.

So, your code becomes:

...
		 my $seqstr   = $seq->subseq($feat->location);
...

Regarding performance, I've never tried it, but you might look at 
http://doc.bioperl.org/bioperl-live/Bio/Seq/SeqBuilder.html, which shows
you how to tell SeqIO that you only need to read sequence and features.

I'd be curious to know if it speeds things up for you...

If not, you could filter the input to remove all but essential lines.

Or, just run it and wait...

Malcolm Cook - mec at stowers-institute.org - 816-926-4449
Database Applications Manager - Bioinformatics
Stowers Institute for Medical Research - Kansas City, MO  USA 

>-----Original Message-----
>From: bioperl-l-bounces at portal.open-bio.org 
>[mailto:bioperl-l-bounces at portal.open-bio.org] On Behalf Of Amit Indap
>Sent: Wednesday, December 14, 2005 10:18 AM
>To: bioperl-l at portal.open-bio.org
>Subject: [Bioperl-l] extracting CDS portion of RefSeqs
>
>Sorry, I hit send before I finished my email
>
>Anyways, I want to extract out the CDS portion of human refseqs. I
>downloaded the most recent refseq release in genbank format. I was
>extracting out the CDS portion this way:
>
> foreach my $feat ( $seq->get_SeqFeatures() ) {
>             if( $feat->primary_tag eq 'CDS' ) {
>		 my $start = $feat->start;
>		 my $end = $feat->end;
>		
>		 my $seqstr   = $seq->subseq($start,$end);
>		 my $displayid = $seq->display_name;
>		 my $seqobj = Bio::Seq->new( -display_id => 
>"$displayid:$start..$end",
>					     -seq => $seqstr);
>		 my $out = Bio::SeqIO->new(-format => 'Fasta');
>		 $out->write_seq($seqobj);
>
>But this is quite slow since the refseq genbank file is quite large.
>Is there anyway to download the CDS portion of refseq from NCBI? Is
>there a quicker BioPerl solution than the one I have?
>
>Thanks for your help.
>
>Amit
>
>--
>Amit Indap
>http://www.bscb.cornell.edu/Homepages/Amit_Indap/
>
>_______________________________________________
>Bioperl-l mailing list
>Bioperl-l at portal.open-bio.org
>http://portal.open-bio.org/mailman/listinfo/bioperl-l
>