[Bioperl-l] Starting from scratch, a trivial problem?

Barry Moore barry.moore at genetics.utah.edu
Fri May 7 18:32:57 EDT 2004


James-

The SeqIO and the Features/Annotations HOWTO on the bioperl web site 
(http://www.bioperl.org/HOWTOs/) should be helpful. Below is a document 
that will fill in some of the gaps in the HOWTOs. Specifically it has 
some boilerplate code at the bottom that you can use to get sequences 
from NCBI, and then loop through them looking at the features and 
annotations of each one. Getting the gene name/locations and determining 
that your genes are sequential will be the part that you will have to 
add. It's Friday afternoon, and I was supposed to be home an hour ago, 
so hope this can at least get you started over the weekend. Let me know 
if you need more help.

Barry

--------------------------------------------------------------------------------------------

How to get values in a GenBank file with BioPerl

This is a guide to accessing the value of each item in a GenBank file 
with BioPerl.
Underneath each value in the sample GenBank file there is a number and a 
3-letter
code. The 3-letter code represents the object where the associated value is
stored. Table 1 at the end of this file gives the object for each code. By
looking up the numbers in Table 2 at the end of this file you will find 
the object
where this information can be called from, and the method to get (and 
usually set)
the value. Some values may be stored by, and accessible from methods in 
more than
one object. There is perl code at the end that you can use as a starting 
point to
get at all of the objects and their methods.
-----------------------------------------------------------------------------------
LOCUS NM_079543 1821 bp mRNA linear INV 10-DEC-2003
1-BSR 2-BPS 3-BSR 4-BSR 5-BSR 6-BSR
DEFINITION Drosophila melanogaster alpha-Esterase-3 CG1257-PA (alpha-Est3)
mRNA, complete cds.
7-BPS
ACCESSION NM_079543 XX_123456
8-BPS 9-BSR
VERSION NM_079543.2 GI:24644853
| |
10-BPS 11-BSR
KEYWORDS Esterases
12-BSR
SOURCE Drosophila melanogaster (fruit fly)
13-BSC
ORGANISM Drosophila melanogaster
14-BSC 15-BSC
Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota;
Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha;
Ephydroidea; Drosophilidae; Drosophila.
16-BSC
REFERENCE 1 (bases 1 to 1821)
17-BAR 18-BAR
AUTHORS Adams,M.D. et al.(additional authors deleted for brevity)
19-BAR
TITLE The genome sequence of Drosophila melanogaster
20-BAR
JOURNAL Science 287 (5461), 2185-2195 (2000)
21-BAR
MEDLINE 20196006
22-BAR
PUBMED 10731132
23-BAR
COMMENT PROVISIONAL REFSEQ: This record has not yet been subject to final
NCBI review. This record is derived from an annotated genomic
sequence (NT_033777). The reference sequence was derived from
CG1257-RA.
On Nov 6, 2002 this sequence version replaced gi:17737826.
24-BAC
FEATURES Location/Qualifiers
source 1..1821
25-BLS 26-BLS
/organism="Drosophila melanogaster"
27-BSG
/mol_type="mRNA"
/db_xref="taxon:7227"
/chromosome="3R"
/note="genotype: y[1]; cn[1] bw[1] sp[1]; Rh6[1]"
gene 1..1821
/gene="alpha-Est3"
/locus_tag="CG1257"
/note="alpha-Esterase-3; synonyms: B, aE3, CG1257,
alphaE3, alpha-Ests, fragment B;
go_function: carboxylesterase activity [goid 0004091]
[evidence NAS]"
/map="84D9-84D9"
/db_xref="FLYBASE:FBgn0015571"
/db_xref="GeneID:40907"
/db_xref="LocusID:40907"
CDS 119..1750
/gene="alpha-Est3"
/locus_tag="CG1257"
/EC_number="3.1.1.1"
/note="alpha-Est3 gene product"
/codon_start=1
/product="alpha-Esterase-3 CG1257-PA"
/protein_id="NP_524267.2"
/db_xref="GI:24644854"
/db_xref="FLYBASE:FBgn0015571"
/db_xref="GeneID:40907"
/db_xref="LocusID:40907"
/translation="MESLQVNTTSGPVLGKQCTGVYGDEYVSFERIPYAQPPVGHLRF
MAPLPVEPWSQPLDCTKPGQKPLQFNHYSKQLEGVEDCLYLNVYAKELDSPRPLPLIV
FFFGGGFEKGDPTKELHSPDYFMMRDVVVVTVSYRVGPLGFLSLNDPAVGVPGNAGLK
DQLLAMEWIKENAERFNGDPKNVTAFGESAGAASVHYLMLNPKAEGLFHKAILQSGNV
LCSWALCTIKNLPHRLAVNLGMESAEHVTDAMVLDFLQKLPGEKLVRPYLLSAEEHLD
DCVFQFGPMVEPYKTEHCALPNHPQELLDKAWGNRIPVLMSGTSFEGLLMYARVQMAP
YLLTSLKKEPEHMLPLDVKRNLPQALARHLGQRLQETHFGGNDPSAMSPESLKAYCEY
ASYKVFWHPILKTLRSRVKSSSASTYLYRFDFDSPTFNHQRLKYCGDKLRGVAHVDDH
SYLWYGDFSWKLDKHTPEFLTIERMIDMLTSFARTSNPNCKLIQDQLPRAKEWKPLNS
KSALECLNISENIKMMELPELQKLRVWESVCQSTG"
ORIGIN
1 gtactatggc aaatggtact atgggcagtc gcgataataa taataaatga ggctttaaat
61 gtttttccac tacaagataa aaagttacga gtgcgcaccg agcatttagt agacgaacat
28-BPS
... (additional sequence deleted for brevity)
//
-----------------------------------------------------------------------------------
Table 1. Object Key
BPS - Bio::PrimarySeq
BSR - Bio::Seq::RichSeq
BSC - Bio::Species
BAC - Bio::Annotation::Comment
BAR - Bio::Annotation::Reference
BLS – Bio::Location::Simple
BSG - Bio::SeqFeature::Generic
-----------------------------------------------------------------------------------
Table 2. Usage key
1. Bio::Seq::RichSeq $scalar = $self->display_name() or ->display_id()
2. Bio::PrimarySeq $scalar = $self->length()
3. Bio::Seq::RichSeq $scalar = $self->molecule()
4. Bio::PrimarySeq $scalar = $self->is_circular
5. Bio::Seq::RichSeq $scalar = $self->division()
6. Bio::seq::Richseq @array = $self->get_dates
7. Bio::PrimarySeq $scalar = $self->description()
8. Bio::PrimarySeq $scalar = $self->accession_number()
Bio::Seq::RichSeq $scalar = $self->accession()
9. Bio::Seq::RichSeq @array = $self->get_secondary_accessions
10. Bio::PrimarySeq $scalar = $self->version()
Bio::Seq::RichSeq $scalar = $self->seq_version()
11. Bio::PrimarySeq $scalar = $self->primary_id()
12. Bio::Seq::RichSeq @array = $self->get_keywords
13. Bio::Species $scalar = $self->common_name()
14. Bio::Species $scalar = $self->genus()
15. Bio::Species $scalar = $self->species()
16. Bio::Species @array = $self->classification()
17. Bio::Annotation::Reference $scalar = $self->start()
18. Bio::Annotation::Reference $scalar = $self->end()
19. Bio::Annotation::Reference $scalar = $self->authors()
20. Bio::Annotation::Reference $scalar = $self->title()
21. Bio::Annotation::Reference $scalar = $self->location()
22. Bio::Annotation::Reference $scalar = $self->medline()
23. Bio::Annotation::Reference $scalar = $self->pubmed()
24. Bio::Annotation::Comment $scalar = $self->text() or ->as_text
25. Bio::SeqFeature::Generic $scalar = $self->start()
26. Bio::SeqFeature::Generic $scalar = $self->end()
27. Bio::SeqFeature::Generic

Get all features from the features table using a variation on the code 
below. You
will need to vary the primary_tag (e.g. source, gene, CDS) and the 
get_tag_values
(e.g. organism, db_xref, chromosome, note, gene, locus_tag, go_function, 
map,
EC_number, codon_start, protein_id, etc.). If there is more than one 
primary tag
associated with a give value (e.g. more than one gene) then you will 
have to step
through the @source_feats array to find what you want rather than simply 
shifting
the first scalar off the top.

my @source_feats = grep { $_->primary_tag eq 'source' } 
$seq->get_SeqFeatures();
my $source_feat = shift @source_feats;
my @mol_type = $source_feat->get_tag_values('mol_type');

28. Bio::PrimarySeq $scalar = $self->seq()
-----------------------------------------------------------------------------------
#!/usr/bin/perl

use strict;
use warnings;
use Bio::SeqIO;
use Bio::DB::GenBank; #use Bio::DB::GenPept or Bio::DB::RefSeq if needed

#Get some sequence IDs either like below, or read in from a file. Note that
#this sample script works with the accession numbers below (at least at 
the time
#it was written). If you add different accession numbers, and you get 
errors,
#you may be calling for something that the sequence doesn't have. You'll 
have
#to add your own error trapping code to handle that.
my @ids = ('U59228', 'AB039327', 'BC035972');

#Create the GenBank database object to read from the database.
my $gb = new Bio::DB::GenBank();

#Create a sequence stream to pass the sequences from the database to the 
program.
my $seqio = $gb->get_Stream_by_id(\@ids);

#Loop over all of the sequences that you requested.
while (my $seq = $seqio->next_seq) {

#Here is how you get methods directly from the RichSeq object. Replace
#'display_name' with any other method in Table 2. that can be called on
#either the RichSeq object directly, or the PrimarySeq object which it has
#inherited.
print $seq->display_name,"\n";

#Here is how to access the classification data from the species object.
my $species = $seq->species;
print $species->common_name,"\n";
my @class = $species->classification;
print "@class\n";

#Here is a general way to call things that are stored as a Bio::SeqFeature::
#Generic object. Replace 'source' with any other of the "major" headings in
#the feature table (e.g gene, CDS, etc.) and replace 'organism' with any of
#the tag values found under that heading (mol_type, locus_tag, gene, etc.)
my @source_feats = grep { $_->primary_tag eq 'source' } 
$seq->get_SeqFeatures();
my $source_feat = shift @source_feats;
my @mol_type = $source_feat->get_tag_values('mol_type');
print "@mol_type\n";

#Here is a general way to call things that are stored as some type of a
#Bio::Annotation oject. This includes reference information, and comments.
#Replace reference with 'comment' to get the comment, and replace
#$ref->authors with $ref->title (or location, medline, etc.) to get other
#reference categories
my $ann = $seq->annotation();
my @references = ($ann->get_Annotations('reference'));
my $ref = shift @references;
my ($title, $authors, $location, $pubmed, $reference);
if (defined $ref) {
$authors = $ref->authors;
print "$authors\n";
}
print "\n";
}
---------------------------------------------------------------------
This file is without copyright. It may be reproduced, edited, and 
redistributed at
will without notice to the author, however I would appreciate it if any 
corrections
or improvements were sent to barry.moore at genetics.utah.edu

JAMES IBEN wrote:

> Hello. I'm relatively new to programming in perl to begin with, so I 
> apologize for the foolish question.
> I would like to write a program to pull genomic sequences and clip out 
> a specific intergenic region in species where it exists, but I am 
> running into difficulty trying to incorporate the bioperl modules into 
> my script to accomplish this. This coupled with a grasp of only the 
> the ability to script the most rudimentary programs has left me at a 
> bit of a loss.
> I believe the program should run with this rough outline unless 
> someone has a more informed opinion:
>
> -Obtain sequence from online databank
> -Search annotations for sequential occurrence of gene X and gene Y (or 
> gene Y and gene X)
> -Print to output file Sequence ID and sequence occurring between 
> addresses of X and Y
> -loop to next sequence
>
> I would be very grateful if someone could please point me in the 
> direction of perhaps a similar example script or a lower-level 
> resource. It seems like it would be a fairly trivial problem.
>
> Thanks for your time,
> James
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l


-- 
Barry Moore
Dept. of Human Genetics
University of Utah
Salt Lake City, UT




More information about the Bioperl-l mailing list