[Bioperl-l] Parsing features

Barry Moore barry.moore at genetics.utah.edu
Fri Mar 12 12:24:26 EST 2004


Jonathon,

I looks like what you are trying to do is print your @features array as 
if it were a tidy list of features.  In fact it is a complex data 
structure that holds the values of the Generic SeqFeature object.  You 
will need to step through it and collect the goodies yourself.  (At 
least I don't think there is a prepackaged way to just "get it all" - 
although the list can correct me if I'm wrong on that.)  If you are 
after certain values of the GenBank features table, the following 
document may be of some help in getting at it.  (If the pasted document 
doesn't come to you well formatted let me know, and I can mail it 
directly to you as an attachement.)

Barry

---------------------------------------------------------------------------


                 How to get values in a GenBank file with BioPerl
 
This is a guide to accessing the value of each item in a GenBank file 
with BioPerl.  Underneath each value in the sample GenBank file there is 
a number and a 3-letter code.  The 3-letter code represents the object 
where the associated value is stored.  Table 1 at the end of this file 
gives the object for each code.  By looking up the numbers in Table 2 at 
the end of this file you will find the object where this information can 
be called from, and the method to get (and sometimes set) the value.  
Some values may be stored, and accessible by methods in more than one 
object.  There is perl code at the end that you can use as a starting 
point to get at all of the objects and their methods.
----------------------------------------------------------------------------------- 

LOCUS       NM_079543               1821 bp    mRNA    linear   INV 
10-DEC-2003
            1-BSR                   2-BPS      3-BSR   4-BSR  5-BSR 6-BSR
DEFINITION  Drosophila melanogaster alpha-Esterase-3 CG1257-PA (alpha-Est3)
            mRNA, complete cds.
            7-BPS
ACCESSION   NM_079543 XX_123456
            8-BPS     9-BSR
VERSION     NM_079543.2  GI:24644853
                      |     |
                    10-BPS  11-BSR
KEYWORDS    Esterases
            12-BSR
SOURCE      Drosophila melanogaster (fruit fly)
            13-BSC
  ORGANISM  Drosophila melanogaster
            14-BSC     15-BSC
            Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota;
            Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha;
            Ephydroidea; Drosophilidae; Drosophila.
            16-BSC
REFERENCE   1  (bases 1 to 1821)
                    17-BAR 18-BAR
  AUTHORS   Adams,M.D. et al.(additional authors deleted for brevity)
            19-BAR
  TITLE     The genome sequence of Drosophila melanogaster
            20-BAR
  JOURNAL   Science 287 (5461), 2185-2195 (2000)
            21-BAR
  MEDLINE   20196006
            22-BAR
   PUBMED   10731132
            23-BAR
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to 
final
            NCBI review. This record is derived from an annotated genomic
            sequence (NT_033777). The reference sequence was derived from
            CG1257-RA.
            On Nov 6, 2002 this sequence version replaced gi:17737826.
            24-BAC
FEATURES             Location/Qualifiers
     source          1..1821
                25-BLS  26-BLS
                     /organism="Drosophila melanogaster"
                     27-BSG
                     /mol_type="mRNA"
                     /db_xref="taxon:7227"
                     /chromosome="3R"
                     /note="genotype: y[1]; cn[1] bw[1] sp[1]; Rh6[1]"
     gene            1..1821
                     /gene="alpha-Est3"
                     /locus_tag="CG1257"
                     /note="alpha-Esterase-3; synonyms: B, aE3, CG1257,
                     alphaE3, alpha-Ests, fragment B;
                     go_function: carboxylesterase activity [goid 0004091]
                     [evidence NAS]"
                     /map="84D9-84D9"
                     /db_xref="FLYBASE:FBgn0015571"
                     /db_xref="GeneID:40907"
                     /db_xref="LocusID:40907"
     CDS             119..1750
                     /gene="alpha-Est3"
                     /locus_tag="CG1257"
                     /EC_number="3.1.1.1"
                     /note="alpha-Est3 gene product"
                     /codon_start=1
                     /product="alpha-Esterase-3 CG1257-PA"
                     /protein_id="NP_524267.2"
                     /db_xref="GI:24644854"
                     /db_xref="FLYBASE:FBgn0015571"
                     /db_xref="GeneID:40907"
                     /db_xref="LocusID:40907"
                     
/translation="MESLQVNTTSGPVLGKQCTGVYGDEYVSFERIPYAQPPVGHLRF
                     
MAPLPVEPWSQPLDCTKPGQKPLQFNHYSKQLEGVEDCLYLNVYAKELDSPRPLPLIV
                   
  FFFGGGFEKGDPTKELHSPDYFMMRDVVVVTVSYRVGPLGFLSLNDPAVGVPGNAGLK
                     
DQLLAMEWIKENAERFNGDPKNVTAFGESAGAASVHYLMLNPKAEGLFHKAILQSGNV
                     
LCSWALCTIKNLPHRLAVNLGMESAEHVTDAMVLDFLQKLPGEKLVRPYLLSAEEHLD
                     
DCVFQFGPMVEPYKTEHCALPNHPQELLDKAWGNRIPVLMSGTSFEGLLMYARVQMAP
                     
YLLTSLKKEPEHMLPLDVKRNLPQALARHLGQRLQETHFGGNDPSAMSPESLKAYCEY
                     
ASYKVFWHPILKTLRSRVKSSSASTYLYRFDFDSPTFNHQRLKYCGDKLRGVAHVDDH
                     
SYLWYGDFSWKLDKHTPEFLTIERMIDMLTSFARTSNPNCKLIQDQLPRAKEWKPLNS
                     KSALECLNISENIKMMELPELQKLRVWESVCQSTG"
ORIGIN     
        1 gtactatggc aaatggtact atgggcagtc gcgataataa taataaatga ggctttaaat
       61 gtttttccac tacaagataa aaagttacga gtgcgcaccg agcatttagt agacgaacat
          28-BPS
... (additional sequence deleted for brevity)
//
----------------------------------------------------------------------------------- 

Table 1. Object Key
BPS - Bio::PrimarySeq
BSR - Bio::Seq::RichSeq
BSC - Bio::Species
BAC - Bio::Annotation::Comment
BAR - Bio::Annotation::Reference
BLS ? Bio::Location::Simple
BSG - Bio::SeqFeature::Generic
----------------------------------------------------------------------------------- 

Table 2. Usage key
1.   Bio::Seq::RichSeq           $scalar = $self->display_name() or 
->display_id()
2.   Bio::PrimarySeq             $scalar = $self->length()
3.   Bio::Seq::RichSeq           $scalar = $self->molecule()
4.   Bio::PrimarySeq             $scalar = $self->is_circular
5.   Bio::Seq::RichSeq           $scalar = $self->division()
6.   Bio::seq::Richseq           @array  = $self->get_dates
7.   Bio::PrimarySeq             $scalar = $self->description()
8.   Bio::PrimarySeq             $scalar = $self->accession_number()
     Bio::Seq::RichSeq           $scalar = $self->accession()
9.   Bio::Seq::RichSeq           @array  = $self->get_secondary_accessions
10.  Bio::PrimarySeq             $scalar = $self->version()
     Bio::Seq::RichSeq           $scalar = $self->seq_version()
11.  Bio::PrimarySeq             $scalar = $self->primary_id()
12.  Bio::Seq::RichSeq           @array  = $self->get_keywords
13.  Bio::Species                $scalar = $self->common_name() 
14.  Bio::Species                $scalar = $self->genus()
15.  Bio::Species                $scalar = $self->species()
16.  Bio::Species                @array  = $self->classification()
17.  Bio::Annotation::Reference  $scalar = $self->start()
18.  Bio::Annotation::Reference  $scalar = $self->end()
19.  Bio::Annotation::Reference  $scalar = $self->authors()
20.  Bio::Annotation::Reference  $scalar = $self->title()
21.  Bio::Annotation::Reference  $scalar = $self->location()
22.  Bio::Annotation::Reference  $scalar = $self->medline()
23.  Bio::Annotation::Reference  $scalar = $self->pubmed()
24.  Bio::Annotation::Comment    $scalar = $self->text() or ->as_text
25.  Bio::SeqFeature::Generic    $scalar = $self->start()
26.  Bio::SeqFeature::Generic    $scalar = $self->end()
27.  Bio::SeqFeature::Generic
 
Get all features from the features table using a variation on the code 
below.  You will need to vary the primary_tag (e.g. source, gene, CDS) 
and the get_tag_values (e.g. organism, db_xref, chromosome, note, gene, 
locus_tag, go_function, map, EC_number, codon_start, protein_id, etc.).  
If there is more than one primary tag associated with a give value (e.g. 
more than one gene) then you will have to step through the @source_feats 
array to find what you want rather than simply shifting the first scalar 
off the top.
 
my @source_feats = grep { $_->primary_tag eq 'source' } 
$seq->get_SeqFeatures();
my $source_feat = shift @source_feats;
my @mol_type = $source_feat->get_tag_values('mol_type');
 
28.   Bio::PrimarySeq   $scalar =   $self->seq()
----------------------------------------------------------------------------------- 

#!/usr/bin/perl
 
use strict;
use warnings;
use Bio::SeqIO;
use Bio::DB::GenBank; #use Bio::DB::GenPept or Bio::DB::RefSeq if needed
 
#Get some sequence IDs either like below, or read in from a file.  Note 
that
#this sample script works with the accession numbers below (at least at 
the time
#it was written).  If you add different accession numbers, and you get 
errors,
#you may be calling for something that the sequence doesn't have.  
You'll have
#to add your own error trapping code to handle that.
my @ids = ('U59228', 'AB039327', 'BC035972');
 
#Create the GenBank database object to read from the database.
my $gb = new Bio::DB::GenBank();
 
#Create a sequence stream to pass the sequences from the database to the 
program.
my $seqio = $gb->get_Stream_by_id(\@ids);
 
#Loop over all of the sequences that you requested.
while (my $seq = $seqio->next_seq) {
 
  #Here is how you get methods directly from the RichSeq object.  Replace
  #'display_name' with any other method in Table 2. that can be called on
  #either the RichSeq object directly, or the PrimarySeq object which it 
has
  #inherited.
  print $seq->display_name,"\n";
 
  #Here is how to access the classification data from the species object.
  my $species = $seq->species;
  print $species->common_name,"\n";
  my @class = $species->classification;
  print "@class\n";
 
  #Here is a general way to call things that are stored as a 
Bio::SeqFeature::
  #Generic object.  Replace 'source' with any other of the "major" 
headings in
  #the feature table (e.g gene, CDS, etc.) and replace 'organism' with 
any of
  #the tag values found under that heading (mol_type, locus_tag, gene, 
etc.)
  my @source_feats = grep { $_->primary_tag eq 'source' } 
$seq->get_SeqFeatures();
  my $source_feat = shift @source_feats;
  my @mol_type = $source_feat->get_tag_values('mol_type');
  print "@mol_type\n";
 
  #Here is a general way to call things that are stored as some type of a
  #Bio::Annotation oject.  This includes reference information, and 
comments.
  #Replace reference with 'comment' to get the comment, and replace
  #$ref->authors with $ref->title (or location, medline, etc.) to get other
  #reference categories
  my $ann = $seq->annotation();
  my @references = ($ann->get_Annotations('reference'));
  my $ref = shift @references;
  my ($title, $authors, $location, $pubmed, $reference);
  if (defined $ref) {
    $authors = $ref->authors;
    print "$authors\n";
  }
  print "\n";
}
---------------------------------------------------------------------
This file is without copyright.  It may be reproduced, edited, and 
redistributed at will without notice to the author, however I would 
appreciate it if any corrections or improvements were sent to 
barry.moore at genetics.utah.edu



More information about the Bioperl-l mailing list