[Bioperl-l] How does bioperl handle Genbank sequence records? [buggy behavior?]

Danny Yoo dyoo@acoma.Stanford.EDU
Fri, 2 Aug 2002 12:27:49 -0700 (PDT)


Hi eveyrone,

I sent a message about this a week ago, but haven't gotten any responses
yet.  Can anyone check to see if the Genbank parser is supposed to handle
'join(a...b, c..d)' Locations?

Sorry about my impatience; I just want to know if I'm using the module
correctly.



---------- Forwarded message ----------
Date: Thu, 25 Jul 2002 15:37:02 -0700 (PDT)
From: Danny Yoo <dyoo@acoma.Stanford.EDU>
To: bioperl-l@bioperl.org
Cc: all@acoma.Stanford.EDU
Subject: How does bioperl handle Genbank sequence records?

Hi everyone,

I'm using Bioperl 1.0 to parse out some sequences from GenBank.  In
particular, I've been parsing:


http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=nucleotide&list_uids=13449290&dopt=GenBank

As I was going through the 'RPL2' gene and cds feature, I noticed a
discrepency in sequence lengths.  Here's the relevant coordinate records
in Genbank:

     gene            154744..157345
                     /gene="rpl2"
     mRNA            join(154744..155660,157213..157345)
                     /gene="rpl2"
     CDS             join(154744..155660,157213..157345)
                     /gene="rpl2"

How does Bioperl handle the 'join(154744..15660,157213..157345)'
description when extracting sequences?



Here's a test program I wrote to double check things:

######
use strict;
use Bio::DB::GenBank;

sub getRPL2 {
    my $gb = Bio::DB::GenBank->new();
    my $seqobj = $gb->get_Seq_by_acc('NC_001284');
    for my $feat ($seqobj->top_SeqFeatures()) {
	if ($feat->primary_tag eq 'CDS'
	    && $feat->gff_string =~ /gene "?rpl2"?/) {
	    return $feat;
	}
    }
}

my $rpl2 = getRPL2();
my $rpl2_seq = $rpl2->seq->seq;
print length($rpl2_seq), "\n";
######


When I run this program:

###
$ perl test_genbank.pl
-------------------- WARNING ---------------------
MSG: [NC_001284] is not a normal sequence database but a RefSeq entry.
Redirecting the request.

---------------------------------------------------
-------------------- WARNING ---------------------
MSG: Unbalanced quote in:
/gene="orf153a"
/codon_start=1
/protein_id="NP_085474.1"
/db_xref="GI:13449291"
/db_xref="SPTR
/translation="MSLLFQQTVPLSHLHRSLDPPLCFRTHILLILLLLSRHLPGFTG
SDCESADPSIVSAIAPGTATTSERDCPVRTAGSDPVPIGDSGTFFDVGTAAPELLSPN
RHHMITRAKDGIRKPNPRYNLFTQKYTPSEPKTITSASQDGDKLCKKRCRH"
No further qualifiers will be added for this feature
---------------------------------------------------
2602
###

I get a sequence of length 2602.  Am I using this correctly?


It appears that Bioperl is grabbing the sequence from '154744..157345'
(whose length is 157345 - 154744 + 1), rather than what I had expected
from 'join(154744..155660,157213..157345)' of length 1051.


Thank you for any help you can give about this.