[Bioperl-l] Re: Entrez gene parser code
Barry Moore
barry.moore at genetics.utah.edu
Thu Apr 14 16:32:48 EDT 2005
Colin-
Did you want to install a new ppm repository on your Windows box so you
can install bioperl with ppm? If so, this is not done via CVS. You want to
run the following commands from your ppm prompt.
rep add Bioperl http://bioperl.org/DIST/
rep add Kobes http://theoryx5.uwinnipeg.ca/ppms/
rep add Bribes http://www.bribes.org/perl/ppm
You can then search for bioperl and install the version you want.
Nathan Haigh (usually on this list) was preparing a bioperl 1.5 ppm.
Not sure if it made it onto the website yet, but 1.4 is there and that
install works as expected. BTW, this will only install bioperl core.
-------------------------------------------------------------------------------------------------------------------------------------------------
Installing Bioperl on Windows
=============================
1) Quick Instructions for the Impatient
2) Bioperl
3) Perl on Windows
4) BioPerl on Windows
5) Beyond the Core
6) BioPerl and Cygwin
7) Cygwin Tips
8) Example Script
This installation guide was written by Barry Moore, Nathan Haigh and
other Bioperl authors based on
the original work of Paul Boutros. Please report problems and/or fixes
to the bioperl mailing list,
bioperl-l at bioperl.org
1) Quick instructions for the impatient, lucky, or experienced user.
=====================================================================
Download the ActivePerl MSI from
http://www.activestate.com/Products/ActivePerl/
Run the ActivePerl Installer (accepting all defaults is fine).
Open a command prompt (Menus Start->Run and type cmd) and run the PPM
shell (C:\>ppm).
Add three new PPM repositories with the following commands:
ppm> rep add Bioperl http://bioperl.org/DIST
ppm> rep add Kobes http://theoryx5.uwinnipeg.ca/ppms
ppm> rep add Bribes http://www.Bribes.org/perl/ppm
Install Bioperl with the following commands:
ppm> search Bioperl
This returns a numbered list of packages with corresponding version
numbers etc. with "Bioperl" in
their name.
ppm> install <number>
Where <number> corresponds to the relevant package and version from the
numbered list obtained
above.
Go to http://www.bioperl.org and start reading documentation or try the
example script at the end of
this file.
2) Bioperl
======================
Bioperl is a large collection of Perl modules (extensions to the Perl
language) that aid in the task
of writing Perl code to deal with sequence data in a myriad of ways.
Bioperl provides objects for
various types of sequence data and their associated features and
annotations. It provides
interfaces for analysis of these sequences with a wide variety of
external programs (BLAST, fasta,
clustalw and EMBOSS to name just a few). It provides interfaces to
various types of databases both
remote (GenBank, EMBL etc) and local (MySQL, flat files, GFF etc.) for
storage and retrieval of
sequences. And finally with its associated documentation and mailing
list Bioperl represents a
community of bioinformatics professionals working in Perl who are
committed to supporting both
development of Bioperl and the new users who are drawn to the project.
While most bioinformatics and computational biology applications are
developed in Unix/Linux
environments, more and more programs are being ported to other operating
systems like Windows, and
many users (often biologists with little background in programming) are
looking for ways to
automate bioinformatics analyses in the Windows environment. Perl and
Bioperl can be installed
natively on Windows NT/2000/XP. Most of the functionality of Bioperl is
available with this type
of install. Much of the heavy lifting in bioinformatics is done by
programs originally developed in
lower level languages like C and Pascal (e.g. BLAST, clustalw, Staden
etc). Bioperl simply acts
as a wrapper for running and parsing output from these external
programs. Some of those programs
(BLAST for example) are ported to Windows. These can be installed and
work quite happily with BioPerl
in the native Windows environment. Some external programs such as
Staden and the EMBOSS
suite of programs can not be installed on Windows at all, and therefore
any part of Bioperl that
interacts with these packages either won't work or can't be installed at
all.
If you have a fairly simple project in mind, want to start using Bioperl
quickly, only have access
to a computer running Windows, and/or don't mind bumping up against some
limitations then Bioperl on
Windows may be a good place for you to start. For example, downloading
a bunch of sequences from
GenBank and sorting out the ones that have a particular annotation or
feature works great. Running
a bunch of your sequences against remote or local BLAST, parsing the
output and storing it in a
MySQL database would be fine also. Be aware that most if not all of the
Bioperl developers are
working in some type of a UNIX environment (Linux, OSX, Cygwin). If you
have problems with Bioperl
that are specific to the Windows environment, you may be blazing new
ground and your pleas for help
on the Bioperl mailing list may get few responses - simply because no
one knows the answer to your
Windows specific problem. If this is or becomes a problem for you then
you are better off working
in some type of UNIX-like environment. One solution to this problem
that will keep you working on a
Windows machine is to install Cygwin, a UNIX emulation environment for
Windows. A number of Bioperl
users are using this approach successfully and it is discussed in more
detail below.
3) Perl on Windows
===================
There are a couple of ways of installing Perl on a Windows machine. The
most common and easiest is
to get the most recent build from ActiveState. ActiveState is a
software company
(http://www.activestate.com) that provides free builds of Perl for
Windows users. The current
(December 2004) build is ActivePerl 5.8.4.810 (ActivePerl 5.6.1.638 is
also available and should
work just fine). To install ActivePerl on Windows:
Download the ActivePerl MSI from
http://www.activestate.com/Products/ActivePerl/
Run the ActivePerl Installer (accepting all defaults is fine).
You can also build Perl yourself (which requires a C compiler) or
download one of the other binary
distributions. The Perl source for building it yourself is available from
CPAN (http://www.cpan.org), as are a few other binary distributions that
are alternatives to
ActiveState. This approach is not recommended unless you have specific
reasons for doing so and
know what you're doing. If that's the case you probably don't need to
be reading this guide.
Cygwin is a UNIX emulation environment for Windows and comes with its
own copy of Perl.
Information on Cygwin and Bioperl is found below.
4) BioPerl on Windows
======================
Perl is a programming language that has been extended a lot by the
addition of external modules.
These modules work with the core language to extend the functionality of
Perl.
Bioperl is one such extension to Perl. These modular extensions to Perl
sometimes depend on the
functionality of other Perl modules and this creates a dependency. You
can't install module X
unless you have already installed module Y. Some Perl modules are so
fundamentally useful that the
Perl developers have included them in the core distribution of Perl - if
you've installed Perl then
these modules are already installed. Other modules are freely available
from CPAN, but you'll have
to install them yourself if you want to use them. BioPerl has such
dependencies.
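To make the idea of a dependency concrete, here is a minimal Perl sketch
that works out an install order in which every module comes after the
modules it depends on. The module names and the %deps table are made up
for illustration, and this is not a description of how PPM itself is
implemented; it only shows the general idea.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical dependency table: each module lists what it needs first.
my %deps = (
    'Module::X' => ['Module::Y', 'Module::Z'],
    'Module::Y' => ['Module::Z'],
    'Module::Z' => [],
);

# Return the wanted modules and everything they need, ordered so that
# dependencies come first (depth-first walk; dies on a circular dependency).
sub install_order {
    my ($deps, @wanted) = @_;
    my (%state, @order);
    my $visit;
    $visit = sub {
        my ($mod) = @_;
        my $seen = $state{$mod} || '';
        return if $seen eq 'done';
        die "Circular dependency at $mod\n" if $seen eq 'busy';
        $state{$mod} = 'busy';
        $visit->($_) for @{ $deps->{$mod} || [] };
        $state{$mod} = 'done';
        push @order, $mod;
    };
    $visit->($_) for @wanted;
    return @order;
}

print join(' -> ', install_order(\%deps, 'Module::X')), "\n";
```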
Bioperl is actually a large collection of Perl modules (over 1000
currently) and these modules are
split into six groups. These six groups are:
Bioperl Group        Functions
-----------------------------------------------------------------
bioperl (the core)   Most of the main functionality of Bioperl.
bioperl-run          Wrappers to a lot of external programs.
bioperl-ext          Interaction with some alignment functions
                     and the Staden package.
bioperl-db           Using bioperl with BioSQL and local
                     relational databases.
bioperl-microarray   Microarray specific functions.
bioperl-gui          Some preliminary work on a graphical user
                     interface to some Bioperl functions.
The Bioperl core is what most new users will want to start with.
Bioperl (the core) and the Perl
modules that it depends on can be easily installed with PPM. PPM
(Programmer's Package Manager, formerly known as the Perl Package
Manager) is an ActivePerl utility
for installing Perl modules on systems using ActivePerl. The PPM
commands shown in this document
are for PPM version 3; if you use PPM version 2 the commands you require
will be different. PPM
will look online (you have to be connected to the internet of course)
for files (these files end
with .ppd) that tell it how to install the modules you want and what
other modules your new modules
depend on. It will then download and install your modules and all
dependent modules for you.
These .ppd files are stored online in PPM repositories. ActiveState
maintains the largest PPM
repository and when you installed ActivePerl PPM was installed with
directions for using the
ActiveState repositories. Unfortunately the ActiveState repositories
are far from complete and
other ActivePerl users maintain their own PPM repositories to fill in
the gaps. Installing will
require you to direct PPM to look in three new repositories.
You do this by opening a Windows command prompt, typing ppm to start the
PPM shell and then typing
the following three commands:
ppm> rep add Bioperl http://bioperl.org/DIST
ppm> rep add Kobes http://theoryx5.uwinnipeg.ca/ppms
ppm> rep add Bribes http://www.Bribes.org/perl/ppm
Once PPM knows where to look for Bioperl and its dependencies you
simply tell PPM to search for
packages with Bioperl in their name, and then which of these to
install. This is done with the
following commands:
ppm> search Bioperl
This returns a numbered list of packages with corresponding version
numbers etc. with "Bioperl" in
their name.
ppm> install <number>
Where <number> corresponds to the relevant package and version from the
numbered list obtained
above.
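Once the install has finished, a quick way to confirm that Perl can see
the new modules is to try loading them. The sketch below uses only core
Perl; the helper name module_available is ours, and Bio::SeqIO and
Bio::DB::GenBank were picked simply because the example script at the end
of this guide uses them.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return 1 if the named module can be loaded by this Perl, 0 otherwise.
sub module_available {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;    # Bio::SeqIO -> Bio/SeqIO.pm
    return eval { require $file; 1 } ? 1 : 0;
}

# Report on the modules used by the example script.
for my $module (qw(Bio::SeqIO Bio::DB::GenBank)) {
    printf "%-20s %s\n", $module,
        module_available($module) ? 'found' : 'NOT found';
}
```

A module reported as "NOT found" may be missing itself or may be failing
to load because one of its own dependencies is missing.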
5) Beyond the Core
===================
You may find that you want some of the features of other Bioperl groups
like bioperl-run or
bioperl-db. There are currently no PPM packages for installing these
parts of
Bioperl (but check this by doing a Bioperl search at the PPM shell):
ppm> search bioperl
If they are not present, you will have to install these manually from
source. For this you will
need a Windows version of the program make called nmake
(http://download.microsoft.com/download/vc15/Patch/1.52/W95/EN-US/Nmake15.exe).
You will also want
to have a willingness to experiment. You'll have to read the
installation documents for each
component that you want to install, and use nmake where the instructions
call for make. You will
have to determine from the installation documents what dependencies are
required and you will have
to get them, read their documentation and install them first. The
details of this are beyond the
scope of this guide. Read the documentation. Search Google. Try your
best, and if you get stuck
consult with others on the bioperl mailing list.
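As a rough guide, a Perl package that ships a Makefile.PL is normally
built with the four steps printed by the sketch below. This assumes the
package you are installing follows that convention (check its own
installation documents), and the helper name build_steps is ours. The
script only prints the commands; it does not run them.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Config;

# Build the usual Makefile.PL command sequence for a given make program.
sub build_steps {
    my ($make) = @_;
    return (
        'perl Makefile.PL',
        $make,
        "$make test",
        "$make install",
    );
}

# $Config{make} is the make program this Perl was built to use
# (nmake on many Windows builds); fall back to plain 'make'.
my $make = $Config{make} || 'make';

# Print the steps rather than running them, so nothing changes by accident.
print "$_\n" for build_steps($make);
```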
6) BioPerl and Cygwin
=====================
Cygwin is a UNIX emulator and shell environment available free at
www.cygwin.com. BioPerl runs well
within Cygwin. Some users claim that installation of Bioperl is easier
within
Cygwin than within Windows, but these may be users with UNIX backgrounds.
One advantage of using Bioperl in Cygwin is that all the external
modules are available through
CPAN and most if not all external programs can be installed and run, so
many of the limitations of
Bioperl on Windows are circumvented.
To get Bioperl running first install the basic Cygwin package as well as
the Cygwin Perl, make, and
gcc packages. Clicking the "View" button in the upper right of the
installer enables you to see
details on the various packages. Then follow the BioPerl installation
instructions for UNIX in
BioPerl's INSTALL file.
Note that expat comes with Cygwin (it's used by the module XML::Parser).
One known issue is that DBD::mysql can be tricky to install in
Cygwin and this module is required for the bioperl-db, Biosql, and
bioperl-pipeline external
packages. Fortunately there are some good instructions online:
http://search.cpan.org/src/JWIED/DBD-mysql-2.1025/INSTALL.html#windows/cygwin.
Also, set the environment variable TMPDIR, because programs like BLAST
and clustalw need a place to create
temporary files, e.g.:
setenv TMPDIR e:/cygwin/tmp # csh, tcsh
export TMPDIR=e:/cygwin/tmp # sh, bash
Note that e:/cygwin/tmp is not Cygwin's own path syntax, which would be
something like
"/cygdrive/e/cygwin/tmp". It is the syntax that a Perl module expects
on Windows.
If this variable is not set correctly you'll see errors like this when
you run
Bio::Tools::Run::StandAloneBlast:
------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Could not open /tmp/gXkwEbrL0a: No such file or directory
STACK: Error::throw
..........
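If you see an error like the one above, it can help to check what TMPDIR
is actually set to as Perl sees it. This is a minimal sketch using only
core Perl; the helper name tmpdir_status and its four status strings are
ours, not part of Bioperl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Describe a TMPDIR setting: returns 'unset', 'missing',
# 'not writable' or 'ok'.
sub tmpdir_status {
    my ($dir) = @_;
    return 'unset'        unless defined $dir && length $dir;
    return 'missing'      unless -d $dir;
    return 'not writable' unless -w $dir;
    return 'ok';
}

my $shown = defined $ENV{TMPDIR} ? $ENV{TMPDIR} : '(not set)';
print "TMPDIR is $shown: ", tmpdir_status($ENV{TMPDIR}), "\n";
```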
7) Cygwin Tips
===============
The easiest way to install MySQL is to use the Windows binaries
available at www.mysql.com. Note
that Windows does not have UNIX domain sockets, so you need to force the
MySQL connections to use TCP/IP
instead. Do this by using the "-h" option from the command-line:
>mysql -h 127.0.0.1 -u blip -pblop biosql
Or, alias the mysql command in your .tcshrc, .cshrc, or .bashrc so it
uses a host. For example, if
your databases are installed locally:
alias mysql 'mysql -h 127.0.0.1'
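The same applies when you connect from a Perl script: name the host
explicitly in the data source string. The sketch below only builds and
prints the string, so it runs without DBI installed; the helper name
mysql_dsn and the default port 3306 are our assumptions, and the commented
connect line reuses the placeholder user and password from the command
above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a DBD::mysql data source name that names the host explicitly,
# so that the connection goes over TCP/IP.
sub mysql_dsn {
    my ($database, $host, $port) = @_;
    $host = '127.0.0.1' unless defined $host;
    $port = 3306        unless defined $port;
    return "DBI:mysql:database=$database;host=$host;port=$port";
}

my $dsn = mysql_dsn('biosql');
print "$dsn\n";

# With DBI and DBD::mysql installed, the connection would look like this:
#   use DBI;
#   my $dbh = DBI->connect($dsn, 'blip', 'blop', { RaiseError => 1 });
```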
If you're trying to use some application or resource "outside" of Cygwin
and you're having a problem
remember that Cygwin's path syntax may not be the correct one. Cygwin
understands '/home/jacky' or
'/cygdrive/e/cygwin/home/jacky' (when referring to the E: drive) but the
external resource may want
'E:/cygwin/home/jacky'. So your *rc files may end up with paths written
in these different syntaxes,
depending.
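If you need to switch between the two syntaxes inside a script, the
conversion is a simple string rewrite. The two helper functions below are
our own illustration and only handle the /cygdrive/<letter> form described
above. Cygwin also ships a cygpath utility that does this kind of
conversion from the shell.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert '/cygdrive/e/cygwin/tmp' to 'E:/cygwin/tmp'.
# Paths that do not start with /cygdrive/<letter> are returned unchanged.
sub cygdrive_to_windows {
    my ($path) = @_;
    $path =~ s{^/cygdrive/([A-Za-z])(?=/|\z)}{\U$1\E:};
    return $path;
}

# Convert 'E:/cygwin/tmp' (or 'E:\cygwin\tmp') to '/cygdrive/e/cygwin/tmp'.
# Paths without a leading drive letter only have backslashes flipped.
sub windows_to_cygdrive {
    my ($path) = @_;
    $path =~ tr{\\}{/};
    $path =~ s{^([A-Za-z]):}{/cygdrive/\L$1\E};
    return $path;
}

print cygdrive_to_windows('/cygdrive/e/cygwin/home/jacky'), "\n";
print windows_to_cygdrive('E:/cygwin/home/jacky'), "\n";
```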
If you can, install Cygwin on a drive or partition that's
NTFS-formatted, not FAT32-formatted. When
you install Cygwin on a FAT32 partition you will not be able to set
permissions and ownership
correctly. In most situations this probably won't make any difference
but there may be occasions
where this is a problem.
If you want to use BLAST we recommend that the Windows binary be
obtained from NCBI
(ftp://ftp.ncbi.nih.gov/blast/executables/LATEST-BLAST - the file will
be named something like
blast-2.2.6-ia32-win32.exe). Then follow the Windows instructions in
README.bls.
Although we've recommended using the BLAST and MySQL binaries you should
be able to compile just
about everything else from source code using Cygwin's gcc. You'll notice
when you're installing
Cygwin that many different libraries are also available (gd, jpeg, etc.).
8) Example Script
=================
#!/usr/bin/perl
# A short script to demonstrate how to download sequences from GenBank and
# access the sequence and some associated annotations using Bioperl.
use strict;
use warnings;
use Bio::SeqIO;
use Bio::DB::GenBank;    # use Bio::DB::GenPept or Bio::DB::RefSeq if needed

# Get some sequence IDs either like below, or read in from a file. Note that
# this sample script works with the accession numbers below (at least at the
# time it was written). If you add different accession numbers, and you get
# errors, you may be calling for something that the sequence doesn't have.
# You'll have to add your own error trapping code to handle that.
my @ids = ('K03160', 'AB039327', 'BC035972');

# Create the GenBank database object to read from the database.
my $gb = Bio::DB::GenBank->new();

# Create a sequence stream to pass the sequences from the database to the
# program.
my $seqio = $gb->get_Stream_by_id(\@ids);

# Loop over all of the sequences that you requested.
while (my $seq = $seqio->next_seq) {

    # Here is how you get methods directly from the RichSeq object. Replace
    # 'display_name' with any other method in Table 2. that can be called on
    # either the RichSeq object directly, or the PrimarySeq object which it
    # has inherited.
    print "Display Name: ", $seq->display_name, "\n";
    print "Sequence Date: ", $seq->get_dates, "\n";

    # Here is how to access the classification data from the species object.
    my $species = $seq->species;
    print "Species: ", $species->common_name, "\n";
    my @class = $species->classification;
    print "Classification: @class\n";

    # Here is a general way to call things that are stored as a
    # Bio::SeqFeature::Generic object. Replace 'source' with any other of
    # the "major" headings in the feature table (e.g. gene, CDS, etc.) and
    # replace 'mol_type' with any of the tag values found under that heading
    # (organism, locus_tag, gene, etc.)
    my @source_feats = grep { $_->primary_tag eq 'source' }
        $seq->get_SeqFeatures();
    my $source_feat = shift @source_feats;

    # Skip quietly if the record has no source feature or no mol_type tag.
    if (defined $source_feat && $source_feat->has_tag('mol_type')) {
        my @mol_type = $source_feat->get_tag_values('mol_type');
        print "Molecule Type: @mol_type\n";
    }

    # Here is a general way to call things that are stored as some type of a
    # Bio::Annotation object. This includes reference information, and
    # comments. Replace 'reference' with 'comment' to get the comment, and
    # replace $ref->authors with $ref->title (or location, medline, etc.) to
    # get other reference categories.
    my $ann        = $seq->annotation();
    my @references = $ann->get_Annotations('reference');
    my $ref        = shift @references;
    if (defined $ref) {
        my $authors = $ref->authors;
        print "Authors: $authors\n";
    }

    print "Sequence: \n", $seq->seq, "\n\n";
}
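The comments in the script mention adding your own error trapping. One
simple approach, sketched here with core Perl only, is to wrap each call
that might fail in eval and fall back to a default value. The helper name
try_or is ours, and the two calls at the bottom are stand-ins for real
Bioperl method calls.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run a code block; if it dies or returns undef, return the fallback instead.
sub try_or {
    my ($code, $fallback) = @_;
    my $value = eval { $code->() };
    return defined $value ? $value : $fallback;
}

# Stand-ins for method calls that may fail on some records.
print "Common name: ",   try_or(sub { 'human' },              'n/a'), "\n";
print "Molecule Type: ", try_or(sub { die "no such tag\n" },  'n/a'), "\n";
```

In the script above you could then write, for example,
try_or(sub { $species->common_name }, 'n/a') so that a record without a
common name prints "n/a" instead of a warning.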
Brian Osborne wrote:
>Colin,
>
>If you'd like a command-line environment like some sort of Unix install
>Cygwin (www.cygwin.com). No need to install everything, just click the
>"View" button in the main installation window and select and install the
>minimum, something like gcc, binutils, cvs, openssh, make, Perl.
>
>Brian O.
>
>-----Original Message-----
>From: bioperl-l-bounces at portal.open-bio.org
>[mailto:bioperl-l-bounces at portal.open-bio.org]On Behalf Of Colin Erdman
>Sent: Tuesday, April 12, 2005 2:23 PM
>To: 'Mingyi Liu'; 'Stefan Kirov'
>Cc: 'Bioperl list'
>Subject: RE: [Bioperl-l] Re: Entrez gene parser code
>
>
>I am between Linux installs right now and actually running win32 with the
>ActiveState Perl install... How does one add the cvs.open-bio.org repository
>to the PPM console list to search through it and install the bioperl-live
>packages etc? I don't see a comparable cvs command within it.
>
>This is all new to me and I appreciate the help!
>Thanks,
>Colin
>
>-----Original Message-----
>From: Mingyi Liu [mailto:mingyi.liu at gpc-biotech.com]
>Sent: Tuesday, April 12, 2005 10:56 AM
>To: Stefan Kirov
>Cc: Colin Erdman; Bioperl list
>Subject: Re: [Bioperl-l] Re: Entrez gene parser code
>
>Stefan Kirov wrote:
>
>
>
>>In order for this parser to work you need to get
>>GI::Parser::Entrezgene from sourceforge. You can get the address for
>>this module from the perl doc of entrezgene: perldoc
>>Bio::SeqIO::entrezgene
>>Stefan
>>
>>
>>
>I just want to add that I will be adding GI::Parser::EntrezGene to cpan
>in a few days, and most likely the name space will switch to Bio::ASN1
>(therefore it'd be Bio::ASN1::EntrezGene) based on PAUSE admin suggestion.
>
>Thanks,
>
>Mingyi
>
>
>
>_______________________________________________
>Bioperl-l mailing list
>Bioperl-l at portal.open-bio.org
>http://portal.open-bio.org/mailman/listinfo/bioperl-l
>
--
Barry Moore
Dept. of Human Genetics
University of Utah
Salt Lake City, UT