[Bioperl-l] Getting values from a GenBank File
Barry Moore
barry.moore at genetics.utah.edu
Fri Mar 5 09:30:06 EST 2004
O.K. Sounds good.
Barry
Brian Osborne wrote:
>Barry,
>
>This is good. I've also been talking to another Bioperl-er who's writing a
>script that creates an output very much like your Tables, where the input is
>any type of format (Genbank, Locuslink, and so on). So you could write your
>own HOWTO, yes, but in my opinion the better idea is to marry the existing
>HOWTO, your tables, and this incoming script somehow.
>
>I'm convinced that having too much on a single topic ("you could read the
>FAQ 2.6 or the HOWTO or the module documentation for Bio::That or Bio::This
>or the bptutorial section 3.4...") is not good. Bioperl is under-documented,
>generally, but you'll also find places where there's documentation scattered
>all over the place on a given topic. The advice is inconsistent, one section
>gets outdated, the authors move on, these sorts of things happen.
>
>My suggestion would be to wait a moment until my colleague's script appears,
>then let's write something coherent that's easy to maintain.
>
>Brian O.
>
>-----Original Message-----
>From: bioperl-l-bounces at portal.open-bio.org
>[mailto:bioperl-l-bounces at portal.open-bio.org]On Behalf Of Barry Moore
>Sent: Thursday, March 04, 2004 3:31 PM
>To: bioperl
>Subject: [Bioperl-l] Getting values from a GenBank File
>
>While learning BioPerl, I've spent a good bit of time trying to figure
>out where some bit of information in a GenBank file might be tucked away
>in that RichSeq object, and how the hell to get at it. I finally
>decided it was time to sift through that object and find out what's in
>there and where it is. Along the way I wrote down what I found, and
>then added a bit to it to try and turn it into something that I would
>like to have found when I first started learning Bioperl. The result is
>a "GeneBank-file-centric" crib sheet for Bioperl. The document is
>centered around a "standard" GenBank file where each seperate item of
>information is marked-up with numbers and codes that refer to
>information later in the document about how to get at that particular
>value in using BioPerl. It also includes a script at the end that can
>be modified easily to get at all the major stuff in the GenBank file.
>This document is aimed at beginners to BioPerl, but assumes a working
>knowledge of Perl. The file is, of course, free to be used by anyone in
>anyway that they wish. I'd appreciate any corrections or improvements
>if you find errors or shortcomings.
>
>Barry
>--
>
>Barry Moore
>Dept. of Human Genetics
>University of Utah
>Salt Lake City, UT
>
>----------------------------------------------------------------------------
>-------
>
> How to get values in a GenBank file with BioPerl
>
>This is a guide to accessing the value of each item in a GenBank file
>with BioPerl.
>Underneath each value in the sample GenBank file there is a number and a
>3-letter
>code. The 3-letter code represents the object where the associated
>value is stored.
>Table 1 at the end of this file gives the object for each code. By
>looking up the
>numbers in Table 2 at the end of this file you will find the object
>where this
>information can be called from, and the method to get (and usually set)
>the value.
>Some values may be stored by, and accessible by methods in more than one
>object.
>There is perl code at the end that you can use as a starting point to
>get at all of
>the objects and their methods.
>----------------------------------------------------------------------------
>-------
>LOCUS NM_079543 1821 bp mRNA linear INV
>10-DEC-2003
> 1-BSR 2-BPS 3-BSR 4-BSR 5-BSR 6-BSR
>DEFINITION Drosophila melanogaster alpha-Esterase-3 CG1257-PA (alpha-Est3)
> mRNA, complete cds.
> 7-BPS
>ACCESSION NM_079543 XX_123456
> 8-BPS 9-BSR
>VERSION NM_079543.2 GI:24644853
> | |
> 10-BPS 11-BSR
>KEYWORDS Esterases
> 12-BSR
>SOURCE Drosophila melanogaster (fruit fly)
> 13-BSC
> ORGANISM Drosophila melanogaster
> 14-BSC 15-BSC
> Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota;
> Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha;
> Ephydroidea; Drosophilidae; Drosophila.
> 16-BSC
>REFERENCE 1 (bases 1 to 1821)
> 17-BAR 18-BAR
> AUTHORS Adams,M.D. et al.(additional authors deleted for brevity)
> 19-BAR
> TITLE The genome sequence of Drosophila melanogaster
> 20-BAR
> JOURNAL Science 287 (5461), 2185-2195 (2000)
> 21-BAR
> MEDLINE 20196006
> 22-BAR
> PUBMED 10731132
> 23-BAR
>COMMENT PROVISIONAL REFSEQ: This record has not yet been subject to
>final
> NCBI review. This record is derived from an annotated genomic
> sequence (NT_033777). The reference sequence was derived from
> CG1257-RA.
> On Nov 6, 2002 this sequence version replaced gi:17737826.
> 24-BAC
>FEATURES Location/Qualifiers
> source 1..1821
> 25-BLS 26-BLS
> /organism="Drosophila melanogaster"
> 27-BSG
> /mol_type="mRNA"
> /db_xref="taxon:7227"
> /chromosome="3R"
> /note="genotype: y[1]; cn[1] bw[1] sp[1]; Rh6[1]"
> gene 1..1821
> /gene="alpha-Est3"
> /locus_tag="CG1257"
> /note="alpha-Esterase-3; synonyms: B, aE3, CG1257,
> alphaE3, alpha-Ests, fragment B;
> go_function: carboxylesterase activity [goid 0004091]
> [evidence NAS]"
> /map="84D9-84D9"
> /db_xref="FLYBASE:FBgn0015571"
> /db_xref="GeneID:40907"
> /db_xref="LocusID:40907"
> CDS 119..1750
> /gene="alpha-Est3"
> /locus_tag="CG1257"
> /EC_number="3.1.1.1"
> /note="alpha-Est3 gene product"
> /codon_start=1
> /product="alpha-Esterase-3 CG1257-PA"
> /protein_id="NP_524267.2"
> /db_xref="GI:24644854"
> /db_xref="FLYBASE:FBgn0015571"
> /db_xref="GeneID:40907"
> /db_xref="LocusID:40907"
>
>/translation="MESLQVNTTSGPVLGKQCTGVYGDEYVSFERIPYAQPPVGHLRF
>
>MAPLPVEPWSQPLDCTKPGQKPLQFNHYSKQLEGVEDCLYLNVYAKELDSPRPLPLIV
>
>FFFGGGFEKGDPTKELHSPDYFMMRDVVVVTVSYRVGPLGFLSLNDPAVGVPGNAGLK
>
>DQLLAMEWIKENAERFNGDPKNVTAFGESAGAASVHYLMLNPKAEGLFHKAILQSGNV
>
>LCSWALCTIKNLPHRLAVNLGMESAEHVTDAMVLDFLQKLPGEKLVRPYLLSAEEHLD
>
>DCVFQFGPMVEPYKTEHCALPNHPQELLDKAWGNRIPVLMSGTSFEGLLMYARVQMAP
>
>YLLTSLKKEPEHMLPLDVKRNLPQALARHLGQRLQETHFGGNDPSAMSPESLKAYCEY
>
>ASYKVFWHPILKTLRSRVKSSSASTYLYRFDFDSPTFNHQRLKYCGDKLRGVAHVDDH
>
>SYLWYGDFSWKLDKHTPEFLTIERMIDMLTSFARTSNPNCKLIQDQLPRAKEWKPLNS
> KSALECLNISENIKMMELPELQKLRVWESVCQSTG"
>ORIGIN
> 1 gtactatggc aaatggtact atgggcagtc gcgataataa taataaatga ggctttaaat
> 61 gtttttccac tacaagataa aaagttacga gtgcgcaccg agcatttagt agacgaacat
> 28-BPS
>... (additional sequence deleted for brevity)
>//
>----------------------------------------------------------------------------
>-------
>Table 1. Object Key
>BPS - Bio::PrimarySeq
>BSR - Bio::Seq::RichSeq
>BSC - Bio::Species
>BAC - Bio::Annotation::Comment
>BAR - Bio::Annotation::Reference
>BLS - Bio::Location::Simple
>BSG - Bio::SeqFeature::Generic
>----------------------------------------------------------------------------
>-------
>Table 2. Usage key
>1. Bio::Seq::RichSeq $scalar = $self->display_name() or
>->display_id()
>2. Bio::PrimarySeq $scalar = $self->length()
>3. Bio::Seq::RichSeq $scalar = $self->molecule()
>4. Bio::PrimarySeq $scalar = $self->is_circular
>5. Bio::Seq::RichSeq $scalar = $self->division()
>6. Bio::seq::Richseq @array = $self->get_dates
>7. Bio::PrimarySeq $scalar = $self->description()
>8. Bio::PrimarySeq $scalar = $self->accession_number()
> Bio::Seq::RichSeq $scalar = $self->accession()
>9. Bio::Seq::RichSeq @array = $self->get_secondary_accessions
>10. Bio::PrimarySeq $scalar = $self->version()
> Bio::Seq::RichSeq $scalar = $self->seq_version()
>11. Bio::PrimarySeq $scalar = $self->primary_id()
>12. Bio::Seq::RichSeq @array = $self->get_keywords
>13. Bio::Species $scalar = $self->common_name()
>14. Bio::Species $scalar = $self->genus()
>15. Bio::Species $scalar = $self->species()
>16. Bio::Species @array = $self->classification()
>17. Bio::Annotation::Reference $scalar = $self->start()
>18. Bio::Annotation::Reference $scalar = $self->end()
>19. Bio::Annotation::Reference $scalar = $self->authors()
>20. Bio::Annotation::Reference $scalar = $self->title()
>21. Bio::Annotation::Reference $scalar = $self->location()
>22. Bio::Annotation::Reference $scalar = $self->medline()
>23. Bio::Annotation::Reference $scalar = $self->pubmed()
>24. Bio::Annotation::Comment $scalar = $self->text() or ->as_text
>25. Bio::SeqFeature::Generic $scalar = $self->start()
>26. Bio::SeqFeature::Generic $scalar = $self->end()
>27. Bio::SeqFeature::Generic
>
>Get all features from the features table using a variation on the code
>below.
>You will need to vary the primary_tag (e.g. source, gene, CDS) and the
>get_tag_values (e.g. organism, db_xref, chromosome, note, gene, locus_tag,
>go_function, map, EC_number, codon_start, protein_id, etc.). If there
>is more
>than one primary tag associated with a give value (e.g. more than one
>gene) then
>you will have to step through the @source_feats array to find what you
>want rather
>than simply shifting the first scalar off the top.
>
>my @source_feats = grep { $_->primary_tag eq 'source' }
>$seq->get_SeqFeatures();
>my $source_feat = shift @source_feats;
>my @mol_type = $source_feat->get_tag_values('mol_type');
>
>28. Bio::PrimarySeq $scalar = $self->seq()
>----------------------------------------------------------------------------
>-------
>#!/usr/bin/perl
>
>use strict;
>use warnings;
>use Bio::SeqIO;
>use Bio::DB::GenBank; #use Bio::DB::GenPept or Bio::DB::RefSeq if needed
>
>#Get some sequence IDs either like below, or read in from a file. Note that
>#this sample script works with the accession numbers below (at least at
>the time
>#it was written). If you add different accession numbers, and you get
>errors,
>#you may be calling for something that the sequence doesn't have.
>You'll have
>#to add your own error trapping code to handle that.
>my @ids = ('U59228', 'AB039327', 'BC035972');
>
>#Create the GenBank database object to read from the database.
>my $gb = new Bio::DB::GenBank();
>
>#Create a sequence stream to pass the sequences from the database to the
>program.
>my $seqio = $gb->get_Stream_by_id(\@ids);
>
>#Loop over all of the sequences that you requested.
>while (my $seq = $seqio->next_seq) {
>
> #Here is how you get methods directly from the RichSeq object. Replace
> #'display_name' with any other method in Table 2. that can be called on
> #either the RichSeq object directly, or the PrimarySeq object which it has
> #inherited.
> print $seq->display_name,"\n";
>
> #Here is how to access the classification data from the species object.
> my $species = $seq->species;
> print $species->common_name,"\n";
> my @class = $species->classification;
> print "@class\n";
>
> #Here is a general way to call things that are stored as a
>Bio::SeqFeature::
> #Generic object. Replace 'source' with any other of the "major"
>headings in
> #the feature table (e.g gene, CDS, etc.) and replace 'organism' with
>any of
> #the tag values found under that heading (mol_type, locus_tag, gene, etc.)
> my @source_feats = grep { $_->primary_tag eq 'source' }
>$seq->get_SeqFeatures();
> my $source_feat = shift @source_feats;
> my @mol_type = $source_feat->get_tag_values('mol_type');
> print "@mol_type\n";
>
> #Here is a general way to call things that are stored as some type of a
> #Bio::Annotation oject. This includes reference information, and
>comments.
> #Replace reference with 'comment' to get the comment, and replace
> #$ref->authors with $ref->title (or location, medline, etc.) to get other
> #reference categories
> my $ann = $seq->annotation();
> my @references = ($ann->get_Annotations('reference'));
> my $ref = shift @references;
> my ($title, $authors, $location, $pubmed, $reference);
> if (defined $ref) {
> $authors = $ref->authors;
> print "$authors\n";
> }
> print "\n";
>}
>----------------------------------------------------------------------------
>-------
>This file is without copyright. It may be reproduced, edited, and
>redistributed at
>will without notice to the author, however I would appreciate it if any
>corrections
>or improvements were sent to barry.moore at genetics.utah.edu
>
>
>
>_______________________________________________
>Bioperl-l mailing list
>Bioperl-l at portal.open-bio.org
>http://portal.open-bio.org/mailman/listinfo/bioperl-l
>
>
>
>
--
Barry Moore
Dept. of Human Genetics
University of Utah
Salt Lake City, UT
More information about the Bioperl-l
mailing list