[Bioperl-l] How does bioperl handle Genbank sequence records? [buggy behavior?]
Hilmar Lapp
hlapp@gnf.org
Fri, 2 Aug 2002 13:56:49 -0600
Danny,
many people from the list will currently be attending BOSC as am I ...
join() locations are split locations in Bioperl speak and should be
handled OK. Specifically, the $feat->location() will return a
Bio::Location::SplitLocation instance. There is a strategy object
handling how start() and end() are computed for a split location
(the default takes the widest range). So, you will get a sequence
spanning the entire region from the start of the first to the end of
the last location. At last night's BoF Ewan proposed an additional
method on features concatenating the sub-sequences for split
locations ... if you feel up to this, you're more than welcome to
make a stab at it.
-hilmar
BTW as for the error message, the problem lies elsewhere: the
indicated error is indeed there in the printed section: there's an
unmatched double-quote. However, it shouldn't affect the feature
you're interested in, as it gently skips over the affected one only.
On Friday, August 2, 2002, at 01:27 PM, Danny Yoo wrote:
>
> Hi eveyrone,
>
> I sent a message about this a week ago, but haven't gotten any
> responses
> yet. Can anyone check to see if the Genbank parser is supposed to
> handle
> 'join(a...b, c..d)' Locations?
>
> Sorry about my impatience; I just want to know if I'm using the module
> correctly.
>
>
>
> ---------- Forwarded message ----------
> Date: Thu, 25 Jul 2002 15:37:02 -0700 (PDT)
> From: Danny Yoo <dyoo@acoma.Stanford.EDU>
> To: bioperl-l@bioperl.org
> Cc: all@acoma.Stanford.EDU
> Subject: How does bioperl handle Genbank sequence records?
>
> Hi everyone,
>
> I'm using Bioperl 1.0 to parse out some sequences from GenBank. In
> particular, I've been parsing:
>
>
> http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=nucleotide&
> list_uids=13449290&dopt=GenBank
>
> As I was going through the 'RPL2' gene and cds feature, I noticed a
> discrepency in sequence lengths. Here's the relevant coordinate
> records
> in Genbank:
>
> gene 154744..157345
> /gene="rpl2"
> mRNA join(154744..155660,157213..157345)
> /gene="rpl2"
> CDS join(154744..155660,157213..157345)
> /gene="rpl2"
>
> How does Bioperl handle the 'join(154744..15660,157213..157345)'
> description when extracting sequences?
>
>
>
> Here's a test program I wrote to double check things:
>
> ######
> use strict;
> use Bio::DB::GenBank;
>
> sub getRPL2 {
> my $gb = Bio::DB::GenBank->new();
> my $seqobj = $gb->get_Seq_by_acc('NC_001284');
> for my $feat ($seqobj->top_SeqFeatures()) {
> if ($feat->primary_tag eq 'CDS'
> && $feat->gff_string =~ /gene "?rpl2"?/) {
> return $feat;
> }
> }
> }
>
> my $rpl2 = getRPL2();
> my $rpl2_seq = $rpl2->seq->seq;
> print length($rpl2_seq), "\n";
> ######
>
>
> When I run this program:
>
> ###
> $ perl test_genbank.pl
> -------------------- WARNING ---------------------
> MSG: [NC_001284] is not a normal sequence database but a RefSeq entry.
> Redirecting the request.
>
> ---------------------------------------------------
> -------------------- WARNING ---------------------
> MSG: Unbalanced quote in:
> /gene="orf153a"
> /codon_start=1
> /protein_id="NP_085474.1"
> /db_xref="GI:13449291"
> /db_xref="SPTR
> /translation="MSLLFQQTVPLSHLHRSLDPPLCFRTHILLILLLLSRHLPGFTG
> SDCESADPSIVSAIAPGTATTSERDCPVRTAGSDPVPIGDSGTFFDVGTAAPELLSPN
> RHHMITRAKDGIRKPNPRYNLFTQKYTPSEPKTITSASQDGDKLCKKRCRH"
> No further qualifiers will be added for this feature
> ---------------------------------------------------
> 2602
> ###
>
> I get a sequence of length 2602. Am I using this correctly?
>
>
> It appears that Bioperl is grabbing the sequence from '154744..157345'
> (whose length is 157345 - 154744 + 1), rather than what I had expected
> from 'join(154744..155660,157213..157345)' of length 1051.
>
>
> Thank you for any help you can give about this.
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
>
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------