[Bioperl-l] Odd problem with get_tag_values

Adlai Burman adlai at refenestration.com
Fri Feb 24 23:07:37 UTC 2012


I apologize, Chris. That was a terrible example I sent. I accidentally sent the only record for which there actually are no regular 'gene' tags. Other gb records that failed before Jason's tag check include NC_005086 and they all have regular 'gene' tags. I should have referenced that one.
I appreciate your checking into that and sorry about the red herring. Boy is my face blushed.

A.


On Feb 24, 2012, at 11:38 PM, Fields, Christopher J wrote:

> There is possibly a slight disconnect here.  The primary tag you are looking for is 'CDS'.  Here is an example of both the primary tag for 'CDS' and 'gene' from NC_012927:
> 
>     gene            complement(join(91716..92516,69640..69753))
>                     /locus_tag="BaolC_p001"
>                     /trans_splicing
>                     /db_xref="GeneID:8223103"
>     CDS             complement(join(91716..91744,92285..92516,69640..69753))
>                     /locus_tag="BaolC_p001"
>                     /trans_splicing
>                     /codon_start=1
>                     /transl_table=11
>                     /product="ribosomal protein S12"
>                     /protein_id="YP_003029720.1"
>                     /db_xref="GI:253729537"
>                     /db_xref="GeneID:8223103"
>                     /translation="MPTVKQLIRNARQPIRNARKSAALKGCPQRRGTCARVYTINPKK
>                     PNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYRIIRGTL
>                     DAVAVKNRQQGRSKYGVKKPKK"
> 
> This 'CDS' example has no 'gene' regular tag, in fact none that I looked at did.  But there is another feature (above it) that does have the *primary* tag 'gene' and same locus_tag and dbxref GeneID (it does have tags such as 'locus_tag', 'trans_splicing', etc).  Is that what you mean?
> 
> The way that BioPerl deals with this is to return two different independent features, one with the 'CDS' primary tag and one with the 'gene' primary tag.  Past attempts to somehow combine these (and then disambiguate them again later if needed for output) can be problematic, or at least they once were.
> 
> chris
> 
> 
> On Feb 24, 2012, at 3:55 PM, Adlai Burman wrote:
> 
>> Jason,
>> Your first solution, indeed, did the trick (though I'm not sure why). There was no need to for checking "else." I'm not sure why some records with a full set of "gene" tags would not parse without the check, but everything parsed with it.
>> 
>> Brian, you were right.
>> 
>> Thanks again,
>> 
>> Adlai
>> On Feb 24, 2012, at 10:21 PM, Jason Stajich wrote:
>> 
>>> not all CDS will be annotated with a 'gene' tag, this is due to variation in how annotation is done and that there is not a requirement that there be a gene tag for all CDS features.
>>> 
>>> You can protect your query - we often do this when dealing with data from the wild by testing for has_tag first.
>>> 
>>> my %strands;
>>> for my $cds ( grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures ) {
>>> if( $cds->has_tag('gene') ) {
>>> 	my ($gene) = $cds->get_tag_values('gene'); # get the 1st one, this returns a list
>>> 	$strands{$gene} = $cds->strand; 
>>> } else { # look in alternative places for a name, e.g. locus, 
>>> ...
>>> }
>>> }
>>> 
>>> An alternative is to loop through your list of tags in order of preference
>>> 
>>> my %strands;
>>> for my $cds ( grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures ) {
>>> for my $tag ( qw(gene locus name product accession note) ) {  
>>> if( $cds->has_tag($tag) ) {
>>> 	my ($name) = $cds->get_tag_values($tag); # get the 1st one, this returns a list
>>> 	$strands{$name} = $cds->strand;
>>>      $seen = 1;
>>>      last;
>>> }
>>> if( ! $seen ) { 
>>> 	warn("not tag found for feature at ", $cds->location->to_FTstring, "\n");
>>> }
>>> }
>>> 
>>> On Feb 24, 2012, at 12:43 PM, Adlai Burman wrote:
>>> 
>>>> I have come across a perplexing problem with trying to parse sequence features into hashes from gb records. This is the minimal code which shows my problem:
>>>> 
>>>> #!/usr/bin/perl                                                                                                     
>>>> use strict;
>>>> use warnings;
>>>> use IO::String;
>>>> use Bio::Perl;
>>>> use Bio::SeqIO;
>>>> use IO::String;
>>>> 
>>>> my @files = </Users/adlai/Dropbox/atrsh/*>;
>>>> foreach my $file(@files){
>>>> 
>>>> 
>>>> my @cds_features = grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures;
>>>> my %strands = map {$_->get_tag_values('gene'), $_->strand} @cds_features; ##This Is The Culprit. 
>>>> .
>>>> .
>>>> .
>>>> #do nifty stuff
>>>> }
>>>> 
>>>> For some files this approach works just fine.
>>>> For others the script dies immediately with the error message:
>>>> 
>>>> ------------- EXCEPTION -------------
>>>> MSG: asking for tag value that does not exist gene
>>>> STACK Bio::SeqFeature::Generic::get_tag_values /Users/adlai/Downloads/BioPerl-1.6.1/Bio/SeqFeature/Generic.pm:517
>>>> STACK toplevel tosend.pl:16
>>>> -------------------------------------
>>>> 
>>>> The difference in the files that parse and those that don't seems to be that the files that crash have "intron" and "exon" tags. They ALL have "gene" tags.
>>>> Does anyone know why this is a problem and what can be done to circumvent it?
>>>> 
>>>> Thanks,
>>>> Adlai
>>>> 
>>>> 
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>> 
>>> Jason Stajich
>>> jason.stajich at gmail.com
>>> jason at bioperl.org
>>> 
>>> 
>> 
>> 
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 
> 





More information about the Bioperl-l mailing list