[Open-Bioinformatics-Foundation] Open Bioinformatics Foundation Summer 2002 Newsletter

chris dagdigian dag@sonsorol.org
Tue, 30 Jul 2002 14:23:08 -0400


This is a multi-part message in MIME format.
--------------030700040206040903030600
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit

This email is being sent to every person who is subscribed to one of
the discussion or announce lists that we host. Rather than cross-post
to 30+ individual mailing lists, we merged all the subscribers into a
single de-duplicated list. This keeps our active volunteers from getting
many copies of the same message. Today's mailing is going out to
2788 unique email addresses. If you receive this message and are
not sure why, please contact Chris Dagdigian <dag@sonsorol.org>.

The text version of the newsletter is enclosed as an attachment.

Formatted PDF and HTML versions that include the group picture taken
at the Arizona portion of the Hackathon are available at the following
URLs:

(HTML) http://open-bio.org/newsletters/2002-08-newsletter.html
(PDF)  http://open-bio.org/newsletters/2002-08-newsletter.pdf

Feedback is encouraged. The O|B|F board can be emailed at
'board@open-bio.org'. We look forward to seeing many of you this
week at the BOSC'02 and ISMB'02 conferences in Edmonton, Canada.

On behalf of the board of directors,
Chris Dagdigian

--------------030700040206040903030600
Content-Type: text/plain;
 name="2002-08-newsletter.txt"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="2002-08-newsletter.txt"


Open Bioinformatics Foundation Newsletter

                                  Abstract

         The Summer 2002 Open Bioinformatics Foundation newsletter
            ____________________________________________________

   Table of Contents
   [1]Open Bioinformatics Foundation Report
   [2]2002 BioHackathon Report
   [3]O|B|F Project Reports

Open Bioinformatics Foundation Report

Board Mission Statement

The Open Bioinformatics Foundation is a non profit, volunteer run
organization focused on supporting open source programming in
bioinformatics. The foundation grew out of the volunteer projects Bioperl,
BioJava and Biopython and was incorporated to handle our modest requirements
of hardware ownership, domain name management and funding for conferences
and workshops. The Foundation does not participate directly in the
development or structure of the open source work, but as the members of the
foundation are drawn from the member projects, there is clear commonality of
direction and purpose. Occasionally the Open-Bio board may make
announcements about our direction or purpose (a recent one was on the
licensing of academic software) when the board feels there is a need to
clarify matters, but in general we prefer to remain simply the support
organization for our member projects.

Currently the foundation has a board of 5 people running it with Ewan Birney
as President, Chris Dagdigian as Treasurer, Andrew Dalke as Secretary, and
Hilmar Lapp and Steve Brenner as board members. Our main activities are:

    a. Underwriting and organizing the BOSC conference
    b. Underwriting and organizing the hackathon conference
    c. Management of O|B|F servers and other assets

   We have an application pending with the US Internal Revenue Service
   (IRS) for tax-exempt status as a 501(c)(3) non-profit foundation.
   Included in this newsletter is a basic overview of our financial
   status and year 2002 activity. Official numbers that include the
   financial outcome of the BOSC'2002 conference will be available in our
   annual report which will be produced at the end of our fiscal year.

   The next O|B|F board of directors meeting will occur in Edmonton,
   Canada at the site of our BOSC'2002 conference. Our meetings are open
   to the public. Details concerning time and venue will be posted at
   BOSC'2002. The email contact address for the board members is
   'board@open-bio.org'.
     _________________________________________________________________

Financial Overview

Bank Balance as of July 21, 2002: $16,322.13

   Table 1. OBF Financial Transactions (since Jan 1, 2002)
   Date       Payee                   Amount  Description
   2002-01-26 Chris Dagdigian         $128.84 Reimburse Chris for paying for 2002
                                              Hackathon lunch (1st day)
   2002-01-27 Chris Dagdigian         $124.80 Reimburse Chris for paying for 2002
                                              Hackathon lunch (2nd day)
   2002-02-13 Heller Ehrman Attorneys $357.00 Reimburse our lawyers for
                                              foundation incorporation fees
   2002-02-13 Chris Dagdigian         $192.00 Reimburse Chris for 1 year rental
                                              of O|B|F post office box
   2002-04-17 MAPS Inc.               $150.00 1 year subscription fee to
                                              mail-abuse.org anti-spam blackhole
                                              list(s)

   Upcoming expenses we foresee

   Domain name renewal fees (minor < $200)
   Potential BOSC 2002 conference financial losses (unlikely)

   We do not operate our BOSC conferences with the goal of making lots of
   money. Traditionally we aim to either break even on expenses or make a
   small profit. Attendance for BOSC'2002 is looking very good but given
   some unforeseen travel and housing expenses for the 2002 conference
   there is a small chance that our expenses will exceed what we take in
   as registration fees.
     _________________________________________________________________

Website and mailing list statistics

   Table 2. Website statistics Year 2002 to date (through July 21, 2002)
   Site          Unique Visitors Page Views Hits    HTTP Traffic
   bioperl.org   75,659          475,493    667,491 19.28 GB
   biojava.org   54,057          411,691    566,718 25.14 GB
   biopython.org 24,331          33,156     154,156 3.61 GB
   open-bio.org  20,820          45,633     131,891 818.16 MB

   Table 3. Website statistics Month of July 2002 (through July 21, 2002)
   Site          Unique Visitors Page Views Hits   HTTP Traffic
   bioperl.org   6,770           45,886     61,161 2.21 GB
   biojava.org   6,687           34,934     48,222 1.86 GB
   biopython.org 2,714           3,821      22,339 423.0 MB
   open-bio.org  1,905           4,402      16,128 90.87 MB

   Table 4. Mailing list statistics (as of July 21, 2002)
          list         Subscribers
        Bioperl-l          944
   Bioperl-announce-l      827
        Biojava-l          583
        BioPython          252
           DAS             238
      Bosc-announce        215
     Bioperl-guts-l        205
       BioXML-dev          188
   BioPython-announce      177
     BioXML-announce       111
      Biopython-dev        96
         BioBiz            90
      I3C-techarch         73
       Open-Bio-l          62
         moby-l            50
       Biocorba-l          45
         Authors           41
      I3c-pathways         32
   Biocorba-announce-l     29
        Biograph           26
     Infrastructure        22
         Root-l            16
       Ontologies          15
       I3C-roadmap         11
        MOBY-guts           9
        Naming-l            9
        Volunteer           7
         Webteam            7
        Biosoap-l           7
        Dynamite            5
        Technical           5
        Mailteam            3
      DAS-announce          1
     _________________________________________________________________

2002 BioHackathon Report

Spanning 6 weeks and 2 continents the first ever "biohackathon" was a great
success and one that the O|B|F would like to see established as an annual
event. The invitation-only gathering of open source bioinformatics
developers was split over two sessions, the first one being at the O'Reilly
Bioinformatics Technology Conference in Arizona, USA and the second in Cape
Town, South Africa organized by Electric Genetics. The hackathon was
additionally supported by AstraZeneca and Dalke Scientific Software. All the
code generated was immediately committed to the publicly accessible cvs
system on open-bio (instructions at http://cvs.open-bio.org/).

The hackathon drew together 20+ developers across a number of different open
source projects. The aim was to develop an infrastructure for accessing
sequence databases transparently that scales from a small single computer in
a molecular biology lab to a large scale pipeline project. This
infrastructure can be transparently shared between the different language
projects - e.g., building a sequence database in Bioperl but accessing it from
BioJava. The hope is that one can both reduce the time it takes to build and
test applications in different languages and, at the same time, reduce the
overhead in managing and deploying sequence databases in bioinformatics
installations. Aware of the need for snazzy acronyms for standards to allow
people to dazzle their managers/sales force/bosses the participants named
this the "Open Bioinformatics Database Access" scheme (OBDA for short).

Attendees settled on a standard set of 6 implementations to retrieve
sequences, differing in their complexity, network requirements and
throughput. In all cases they were taking an existing system from an open
source project and wherever possible following existing standards. Having
discussed the specifications of these methods participants then implemented
the system in 5 languages - Perl, Java, Python, Ruby and C (not all
languages got all implementations due to limitations in programming time,
but Perl, Java and Python had a full suite). The implementations were then
tested between different languages to ensure programmatic and data transfer
capabilities. Finally the different methods were performance tested and a
number of performance bottlenecks removed.

At the same time a number of other projects were advanced. A framework for
Bibliographic objects was discussed and Perl and Java code provided. The
Genquire Perl GUI was adapted to work on top of aspects of the OBDA system.
Bio::Graphics, a GIF drawing system for Perl, was integrated into Bioperl.
The OmniGene project became more plug-and-play with BioJava.

One important corollary of the biohackathon was strengthening the common
conceptual view of our data. For the last five years all the projects have
by and large been sticking to a common core of EMBL/GenBank format
information in their data model. It was unclear how to extend this model
into other areas without losing cross-project interoperability. The
requirement of all projects to read and write to a relational database
(BioSQL) forced us to re-examine our common data model away from the
perspective of a data format. The result was in fact closer cooperation and
a clearer understanding of how to extend our data models in a cross-project
compatible manner. In particular it was decided to make ontology integration
an explicit option for information, allowing more flexibility and richness
in describing the additional data attached to sequences.

Finally, we had fun. South Africa was a real eye opener for us, with
incredible scenery, lovely people and real attention to detail from our
hosts, Electric Genetics. But we are also hackers, and all of us got a kick
out of simply being able to work together with few distractions and an open
802.11b wireless network. Having a turnaround time of minutes in a Q/A
session, rather than potentially days when people are working via email in
different time zones, was sensational.

All the Open-Bio.org projects and the O|B|F community in general were
strengthened immeasurably by the hackathon. We would like to take this space
to sincerely thank the hackathon organizers (Electric Genetics and O'Reilly)
and sponsors (AstraZeneca and Dalke Scientific).

   [hackathon-arizona.jpg]

   A group picture of the Arizona hackathon attendees (minus Andrew Dalke
   and Martin Senger). A picture gallery from the biohackathon can be
   found online at http://technophage.com/gallery/
     _________________________________________________________________

O|B|F Project Reports

This has been an important year for the O|B|F projects. Bioperl released its
1.0 stable release after 7 years of development, BioJava and Biopython have
continued to produce new iterations of their software, and the cross-project
collaboration fostered by the formal creation of O|B|F and by the
Biohackathons has encouraged the projects to grow together towards the
collective goal of easy-to-use software tools for bioinformatics. A number
of projects have also joined the O|B|F family, including BioMOBY, BioDAS,
and BioRuby (hosted in Japan).
     _________________________________________________________________

Bioperl

The Bioperl project has been very active over the past 9 months. We released
our major 1.0 release in March of 2002 and 2 subsequent bugfix point
releases in June and July. The most recent release contained over 400
modules and 160k lines of code. The project team has seen an influx of new
ideas addressing new (for us) domains in life sciences programming including
phylogenetic trees, sequence clustering, sequence rendering, fast and
lightweight databases for sequence features, generalized parsers for
sequence database search results (like BLAST and FastA), structure, and
improvements all around the design of the system. We expect to be expanding
the toolkit's horizon from sequence analysis to tasks surrounding gene
expression data, biological ontologies, and comparative genomics.

The BioHackathons held in Arizona and South Africa in January and February
2002 allowed many of the Bioperl core developers to meet and muse on future
areas of the toolkit as well as coordinate collaborative projects with other
OBF developers. These joint projects include the Open Bioinformatics
Database Access standard for sequence database access, which all OBF
projects are planning to implement. This standard, along with the associated
BioSQL project, will let developers rely on a defined data access model and
focus on the implementation of their client libraries.

A few new sub-projects have been initiated in the past 6 months.

    a. bioperl-pipeline, managed mostly by the Fugu genome research group
       in Singapore, which is designed to assist centers building
       analysis pipelines of small to medium size and complexity.
    b. bioperl-run, a collection of modules intended to wrap local and
       remote execution of analysis programs. This includes wrappers
       around the EMBOSS package, PAML, PHYLIP, BLAST, and remote
       execution on the NCBI BLAST Queue and Pasteur's Pise system.
     _________________________________________________________________

BioJava

BioJava is a set of open source libraries for bioinformatics developers and
researchers, with a current emphasis on handling, analysis, and
visualization of biological sequence data. With the project now in its third
year, we have released the 1.2x stable series, which includes a wide range
of incremental improvements and bug-fixes, plus more graphical components
and support for the BioSQL sequence database technology.

More recent developments include support for OBDA, a suite of data-exchange
technologies agreed at the O'Reilly and Electric Genetics hackathon
meetings, and support for additional file formats. A companion project,
biojava-lims, has been started to provide support for scientific workflow
management.

BioJava is an open source project (LGPL). All contributions -- code,
documentation, or ideas -- are welcome. For more information see
http://www.biojava.org/
     _________________________________________________________________

Biopython

The BioPython project was started in August 1999 to create a general open
source toolkit in Python to help manage and analyze biomedical data. It
provides modules that can help in every step of typical bioinformatics
tasks: retrieving information from databases (local or over a network),
parsing the data into general Python objects, analyzing the data with
general algorithms, and writing the data back out into common formats.
Currently, BioPython can handle nearly 30 databases and applications.

Because of the growth in the capabilities of BioPython, we are currently
working on more general code to help manage the different databases and
formats. For example, in the current development version, we have code that
can autodetect a data format and then automatically parse the data into the
correct data structure. Similarly, we are unifying the APIs to retrieve,
manage, and analyze data. We are excited about these developments and
believe they will make BioPython more accessible and powerful. Stay tuned...
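
The autodetect-then-parse idea can be sketched roughly as follows. This is
not the Biopython development code; the function names are hypothetical, only
a FASTA parser is registered, and the detection relies on the usual first-line
conventions of each format ('>' for FASTA, 'LOCUS' for GenBank, 'ID   ' for
EMBL).

```python
def detect_format(text):
    """Guess a flat-file format from the first non-blank line."""
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith(">"):
            return "fasta"
        if line.startswith("LOCUS"):
            return "genbank"
        if line.startswith("ID   "):
            # Swiss-Prot entries share this prefix; a real detector
            # would need to look further into the record.
            return "embl"
        break
    return None


def parse_fasta(text):
    """Return a list of (identifier, sequence) pairs.

    Assumes every header line carries an identifier after the '>'.
    """
    records = []
    for block in text.split(">")[1:]:
        header, _, body = block.partition("\n")
        records.append((header.split()[0], "".join(body.split())))
    return records


# Hypothetical parser table; other formats would register here.
PARSERS = {"fasta": parse_fasta}


def parse_any(text):
    """Detect the format, then dispatch to the matching parser."""
    fmt = detect_format(text)
    if fmt not in PARSERS:
        raise ValueError(f"unsupported or unrecognized format: {fmt}")
    return PARSERS[fmt](text)


print(parse_any(">seq1 toy record\nACGT\nTTGA\n"))
# prints [('seq1', 'ACGTTTGA')]
```

Keeping detection separate from parsing means a new format only needs one
extra test in detect_format and one entry in the parser table.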
     _________________________________________________________________

The BioMOBY Project

BioMOBY is an Open Source (OS) research project which aims to explore
architectures for the discovery and distribution of biological data using
web-services; data and services are decentralized, but the availability of
these resources, and the instructions for interacting with them, are
registered in a central location. In the current architecture, the central
registry ("MOBY-Central") breaks with the web-services paradigm, as
exemplified by Universal Description, Discovery and Integration (UDDI), by
having a lighter, object-driven registry query system. This allows users to
traverse expansive and disparate datasets where each possible next step is
presented based on the data currently in-hand. Moreover, the path from
in-hand to desired end-point data can be automatically discovered using the
registry. In addition, the registry itself is capable of creating service
description (WSDL) documents in response to specific client requests. This
greatly simplifies service deployment, with the aim of encouraging the
participation of service providers.
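
As a rough illustration of how a path from in-hand data to a desired end
point might be discovered, the sketch below treats registry entries as edges
between data types and runs a breadth-first search. The service names, type
names, and the flat tuple layout are all made up for this example and are not
taken from the MOBY-Central API.

```python
from collections import deque

# Hypothetical registry entries: (service name, input type, output type).
REGISTRY = [
    ("GeneToSequence", "GeneID", "DNASequence"),
    ("Translate", "DNASequence", "ProteinSequence"),
    ("BlastP", "ProteinSequence", "SimilarityReport"),
    ("GeneToCitations", "GeneID", "CitationList"),
]


def next_steps(registry, in_hand):
    """Services that can consume the data type currently in hand."""
    return [entry for entry in registry if entry[1] == in_hand]


def find_path(registry, in_hand, wanted):
    """Breadth-first search for the shortest chain of services."""
    queue = deque([(in_hand, [])])
    seen = {in_hand}
    while queue:
        data_type, path = queue.popleft()
        if data_type == wanted:
            return path
        for name, _, produced in next_steps(registry, data_type):
            if produced not in seen:
                seen.add(produced)
                queue.append((produced, path + [name]))
    return None  # no chain of registered services reaches the target


print(find_path(REGISTRY, "GeneID", "SimilarityReport"))
# prints ['GeneToSequence', 'Translate', 'BlastP']
```

The next_steps helper corresponds to presenting "each possible next step"
for the data in hand, and find_path simply repeats that step until the target
type appears.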

Data in BioMOBY is passed in the form of MOBY-Objects, which are (generally)
lightweight XML and make up both the query and the response of a SOAP
transaction. Object-types are organized in a hierarchy. The Object
hierarchy, with both IS A and HAS A relationships, provides several powerful
opportunities: discovery of 'base' Objects within more complex Objects,
allowing complex Objects to be used as input to a broader range of services;
backwards compatibility with old clients as new Objects are defined;
Server-generated Objects need be only as complex as the Server is capable
of, increasing the number of Objects and Services that a service provider
can host. Important to note is that the 'base' MOBY-Object can be used as a
shell around objects from any other object model system, allowing BioMOBY to
transport Objects defined by, for example, the OMG, with no modifications.
Finally, cross-links may be included by the service provider in any output
Object, enabling the client to branch into related data sources to retrieve
supplementary information.
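
The benefit of the IS A relationship can be shown with a small sketch: a
complex Object is accepted by any service that declares one of its 'base'
types as input. The type names, service names, and dictionary layout below
are assumptions made for illustration, and HAS A relationships are not
modelled.

```python
# Hypothetical IS A relationships: child type -> parent type.
IS_A = {
    "AnnotatedDNASequence": "DNASequence",
    "DNASequence": "GenericSequence",
    "ProteinSequence": "GenericSequence",
    "GenericSequence": "Object",
}

# Hypothetical services and the input type each one declares.
SERVICE_INPUTS = {
    "SequenceLength": "GenericSequence",
    "Translate": "DNASequence",
    "ShowAnnotation": "AnnotatedDNASequence",
}


def ancestors(obj_type):
    """Yield the type itself, then every type it inherits from."""
    while obj_type is not None:
        yield obj_type
        obj_type = IS_A.get(obj_type)


def usable_services(obj_type):
    """Services whose declared input is the type or one of its ancestors."""
    accepted = set(ancestors(obj_type))
    return sorted(name for name, needed in SERVICE_INPUTS.items()
                  if needed in accepted)


print(usable_services("AnnotatedDNASequence"))
# prints ['SequenceLength', 'ShowAnnotation', 'Translate']
print(usable_services("ProteinSequence"))
# prints ['SequenceLength']
```

Because matching walks up the hierarchy, a client holding a newly defined,
more specific Object can still call services written against the older base
types.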

Service-types are also organized into a hierarchy. This allows automated
discovery of new instances of service types through querying for a 'base'
type, and enhances the human-readable descriptive capabilities of the
Service vocabulary (e.g. Blast is both an "alignment" Service and a
"sequence similarity" Service type, depending on what you were searching
for).

A prototype MOBY-Central is currently publicly available, and is regularly
being enhanced as the requirements of the BioMOBY system become clear.
Services are being deployed increasingly rapidly, though most are currently
developed with the aim of solving several use-case biological queries.
Currently MOBY-Services are available at TAIR, FlyBase, and PBI-NRC, and
these can be discovered by querying MOBY-Central.

Details about the project, including the MOBY-Central API and all code, are
available at http://www.biomoby.org.
     _________________________________________________________________

BioRuby

The BioRuby project was started in late 2000, and the first year was mainly
spent building basic frameworks. We knew that other open Bio* projects
existed and were leading the scene, but we wanted yet another toolkit in our
favorite language, Ruby. Ruby, the object-oriented scripting language born
in Japan, has many features well suited to bioinformatics, and we love its
simple and powerful syntax.

In the 6 months since the BioHackathon, BioRuby has gained some OBDA
capabilities, including BioRegistry, BioFetch, and BioSQL. We also provide a
BioFetch server at biofetch.bioruby.org. Among other features, remote
Fasta/Blast with common APIs against the server in Japan looks mature now.

We will implement the rest of the OBDA specs, such as flatfile indexing,
XEMBL, and BioCorba, in the near future. We are also working on supporting
external applications like EMBOSS and HMMER, and are taking on the challenge
of handling pathway data in the KEGG database.
     _________________________________________________________________

Open Bioinformatics Database Access standard - OBDA

Even in the relatively small world of bioinformatics different people prefer
different languages and that's not going to change. Some love the
expressiveness of Perl, others the simple power of Python, and others the
static typing of Java. Flexibility can be good, but it may mean the tools
you want are not available in your language of choice.

There are many ways to let programs written in different languages work with
each other. Two programs could exchange files in a well-defined format, or
send XML over an HTTP connection, or talk to a common database using SQL, or
use an integration tool like CORBA. The implementation choice depends on the
requirements.

Twice during this year members of the different Bio* projects met together
for a Biohackathon, with the explicit goal of identifying, defining, and
implementing standard interfaces and protocols for information exchange
between the projects. Here is a short summary of each project. For more
information, see http://obda.open-bio.org.
     _________________________________________________________________

BioSQL

The sequence database is a core part of almost every bioinformatics project.
Many people store sequence data in a relational database system like MySQL,
PostgreSQL, or Oracle. BioSQL is a schema definition for the sequences,
features, cross-references, and other data types found in GenBank/EMBL and
related databases. The different language projects then provide bindings on
top of the SQL to simplify database searches and convert the remote database
information into local objects.

Supported in: Bioperl, Biopython, BioJava, BioRuby
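
To make the idea of a language binding concrete, here is a minimal sketch
that stores an entry in a relational database and converts the rows back into
a local object. The two tables are a simplified, hypothetical stand-in and
not the actual BioSQL schema; the SeqRecord class, the accession 'X00001',
and the sequence are toy values invented for the example, and SQLite is used
only because it needs no server.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class SeqRecord:
    accession: str
    description: str
    sequence: str
    features: list = field(default_factory=list)


# Simplified, hypothetical tables for illustration only;
# the real BioSQL schema is larger and uses different names.
SCHEMA = """
CREATE TABLE entry (
    entry_id INTEGER PRIMARY KEY,
    accession TEXT UNIQUE NOT NULL,
    description TEXT,
    sequence TEXT
);
CREATE TABLE feature (
    entry_id INTEGER REFERENCES entry(entry_id),
    kind TEXT, start_pos INTEGER, end_pos INTEGER
);
"""


def load_record(conn, accession):
    """Turn database rows into a local object, as a binding would."""
    row = conn.execute(
        "SELECT entry_id, accession, description, sequence "
        "FROM entry WHERE accession = ?", (accession,)).fetchone()
    if row is None:
        raise KeyError(accession)
    entry_id, acc, desc, seq = row
    feats = conn.execute(
        "SELECT kind, start_pos, end_pos FROM feature "
        "WHERE entry_id = ? ORDER BY start_pos", (entry_id,)).fetchall()
    return SeqRecord(acc, desc, seq, feats)


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO entry VALUES (1, 'X00001', 'toy entry', 'ATGGCCTAA')")
conn.execute("INSERT INTO feature VALUES (1, 'CDS', 1, 9)")

record = load_record(conn, "X00001")
print(record.accession, len(record.sequence), record.features)
# prints X00001 9 [('CDS', 1, 9)]
```

Since every language project reads the same tables, a database loaded by one
toolkit can be queried by another with only the object-mapping layer
differing.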
     _________________________________________________________________

Flatfile indexing

On the other hand, some labs don't need the complexity of running a full
database system and simply need a way to retrieve a flat-file record quickly
from a set of files given an identifier or other attribute. The OBF flatfile
indexing specification supports this sort of indexing and record lookup. As
a result, you could use the BioJava implementation to build an index of all
of GenBank, then when your Perl-based web application needs record
'AI129902' use the Bioperl implementation to get that record and pull out
the fields you need.

Supported in: Biopython, BioJava, Bioperl, C
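
The principle behind flat-file indexing can be sketched in a few lines:
record the byte offset and length of every record once, then seek directly to
a record on lookup. The sketch keeps the index in memory and assumes
FASTA-style records; it does not reproduce the on-disk index layout defined
by the OBF specification, and the function names and toy records are
hypothetical.

```python
import os
import tempfile


def build_index(path):
    """Map each record identifier to (byte offset, byte length).

    Assumes FASTA-style records whose header line is '>ID description'.
    """
    index = {}
    current_id, start, offset = None, 0, 0
    with open(path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                if current_id is not None:
                    index[current_id] = (start, offset - start)
                current_id = line[1:].split()[0].decode("ascii")
                start = offset
            offset += len(line)
    if current_id is not None:
        index[current_id] = (start, offset - start)
    return index


def fetch_record(path, index, record_id):
    """Seek straight to one record instead of scanning the whole file."""
    start, length = index[record_id]
    with open(path, "rb") as handle:
        handle.seek(start)
        return handle.read(length).decode("ascii")


# Small demonstration on a temporary two-record file.
with tempfile.NamedTemporaryFile("wb", suffix=".fa", delete=False) as tmp:
    tmp.write(b">A1 first toy record\nACGT\n>B2 second toy record\nTTTT\n")
demo_index = build_index(tmp.name)
print(fetch_record(tmp.name, demo_index, "B2"), end="")
os.remove(tmp.name)
```

Writing the index to disk in an agreed format is what would let one language
build it and another use it, which is the point of having a shared
specification.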
     _________________________________________________________________

BioFetch

At other times, the easiest way to get a sequence record is over http
through the standard CGI interface. BioFetch defines how to compose the CGI
request, including the database name, record identifier, and output format.
Clients send the GET string to the server, which returns the record in the
requested format.

Client support: BioRuby, BioJava, Bioperl, Biopython
Server support: BioRuby, Bioperl
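
Composing such a request might look like the sketch below. The server address
is a placeholder, and the query parameter names 'db', 'id', and 'format' are
assumptions made for this example; consult the BioFetch specification for the
exact names and allowed values.

```python
from urllib.parse import urlencode

# Placeholder address; substitute a real BioFetch endpoint.
BIOFETCH_URL = "http://biofetch.example.org/cgi-bin/biofetch"


def biofetch_url(db, record_id, fmt="default", base=BIOFETCH_URL):
    """Compose the GET request for one record.

    The parameter names 'db', 'id' and 'format' are assumed for this
    sketch and should be checked against the BioFetch specification.
    """
    query = urlencode({"db": db, "id": record_id, "format": fmt})
    return f"{base}?{query}"


print(biofetch_url("genbank", "AI129902", "fasta"))
# prints http://biofetch.example.org/cgi-bin/biofetch?db=genbank&id=AI129902&format=fasta
```

A client would then pass the resulting URL to an HTTP library such as
urllib.request.urlopen and read the record from the response body.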
     _________________________________________________________________

XEMBL

SOAP is starting to replace CGI as a way for two programs to communicate
over HTTP. XEMBL defines a SOAP protocol to ask for an EMBL record and get
the data in an XML format like BSML or AGAVE. The EBI has set up a server
which serves up XEMBL as SOAP or as static XML through a simple CGI.

Supported by: (clients can read one of these formats and connect to the
website) Bioperl, BioJava
     _________________________________________________________________

BioCORBA

CORBA is a middleware layer through which clients and servers written in
different languages, such as Perl, Java, C, and Python, can communicate and
share objects. The BioCORBA project started in 1999 when a specification
(using the Interface Definition Language, or IDL) was proposed. The
specification has been merged with the OMG's Life Sciences Research group
(LSR) and describes sequences, features, annotations, databases, and
alignments. This specification was implemented by BioJava, Bioperl, and
Biopython using our client libraries to support the various object
definitions. A CORBA server implementing the BioCORBA spec can, for example,
serve as a sequence database server. Such a server can be represented by an
object in a program, giving programmatic access to the database; that object
can be, for example, a local instance of an in-house sequence database or a
remote database serving up the complete EMBL dataset.

The BioHackathon allowed us to solidify the specification, work out some
bugs, and test cross-platform, cross-language compatibilities to ensure that
all objects created by, for example, Biopython servers, would behave as
expected when used by a Bioperl client. The language bindings are still
being finished and tested but we expect to release the complete set of
packages as part of OBDA by the end of the year.

Supported by: Bioperl, BioJava, Biopython
     _________________________________________________________________

Registry

Unfortunately, we've just defined five new ways to retrieve a record given
an identifier. Mostly, though, you don't care where the data came from;
you just want to get the data. The BioDirectory Registry is a simple
system to specify the different ways to get a database record. Suppose you
want GenBank record 'AI129902'. The Registry knows which services provide
access to GenBank and can try each in turn to get the sequence.

Supported in: Biopython, Bioperl, BioJava, BioRuby
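
The "try each in turn" behaviour can be sketched as a simple fallback loop.
The service names, the dictionary-backed stand-ins, and the function below
are hypothetical and do not reflect the actual Registry configuration format;
the record text is a placeholder rather than real GenBank data.

```python
class RecordNotFound(Exception):
    """Raised when no configured service can supply the record."""


def fetch_with_fallback(services, record_id):
    """Try each (name, fetcher) pair in order and return the first hit."""
    errors = []
    for name, fetcher in services:
        try:
            return name, fetcher(record_id)
        except (KeyError, OSError) as exc:
            errors.append(f"{name}: {exc!r}")
    raise RecordNotFound(f"{record_id} not found; tried " + "; ".join(errors))


# Two stand-in services backed by dictionaries. In practice these would be
# a flat-file index lookup, a BioSQL query, a BioFetch request, and so on.
local_index = {}
remote_mirror = {"AI129902": "<record text for AI129902>"}

genbank_services = [
    ("local-flatfile", local_index.__getitem__),
    ("remote-biofetch", remote_mirror.__getitem__),
]

source, record = fetch_with_fallback(genbank_services, "AI129902")
print(source)
# prints remote-biofetch
```

Listing the cheapest service first (for example a local index) means the
slower network services are only contacted when the local lookup fails.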
     _________________________________________________________________

GMOD - GBrowse

The Generic Model Organism Database project http://www.gmod.org/ is a
collection of software, database schemas, and operating procedures designed
to ease the task of building a model organism genome database. Ultimately
it aims to be a "MOD-in-a-box", a set of off-the-shelf components that will
snap together to create a complete model organism database. Currently GMOD
includes the Apollo genome editor, the web-based GBrowse genomic annotation
browser, the Bio::Graphics library for generic feature rendering, a literature
search and curation system, and a generic lab protocol documentation
toolset. GBrowse recently released version 1.46, which supports semantic
zooming, reading frame analysis, third-party annotation support, and a
number of useful glyphs.

Apollo was developed as a collaboration between the Berkeley Drosophila
Genome Project (part of the FlyBase consortium) and The Sanger Institute in
Cambridge, UK. It allows researchers to explore genomic annotations at many
levels of detail, and to perform expert annotation curation, all in a
graphical environment. Apollo is being used by the FlyBase biologists to
make the final annotations on the finished Drosophila melanogaster genome,
and will also be the primary vehicle for sharing these annotations with the
community. Because of Apollo's modular, flexible framework, many research
groups are using it as a starting point for customizing their own annotation
visualization tools.

Apollo and GBrowse are available at SourceForge:
http://sourceforge.net/projects/gmod/. Like all GMOD components, they are
distributed under the terms of the Artistic License.
     _________________________________________________________________

Bibliographic Query Service - BQS

Bibliographic search and citation are central to all scholarly and research
activities. Within the domain of life sciences research, bibliographic
citation is of particular importance for annotating large bodies of
experimentally developed and computationally derived data, and the rapidly
growing corpus of research literature makes efficient and effective
bibliographic searches increasingly critical. This was the motivation for
adding bibliographic modules to Bioperl. The Bioperl bibliographic service
provides client-side modules allowing standardized access to repositories
such as MEDLINE.

The core module Bio::Biblio is a central gateway for querying bibliographic
repositories and retrieving citations from them. The default access method
is based on SOAP (a Web Service approach), but the Bioperl architecture
makes it easy to plug in other technologies (another one, biofetch, a
traditional HTTP-based access method, is also available).

By default, the Bio::Biblio module queries the MEDLINE repository available
as an experimental service from the European Bioinformatics Institute (EBI).

Additionally, there are several modules (grouped around Bio::Biblio::IO) for
parsing and converting retrieved citations. These modules are independent of
the access method to the repository and can be used separately, for example
to parse PubMed citations stored in local files.

Finally, there are modules that represent individual citations as Perl
objects. The object representation promotes approved standards for
bibliographic data, such as the Dublin Core Elements Metadata.

The main URL for the Bibliographic Query service is
http://industry.ebi.ac.uk/openBQS/. The Perl modules are described in
detail at http://industry.ebi.ac.uk/openBQS/Client_perl.html.
     _________________________________________________________________

Pise

Pise (http://www.pasteur.fr/~letondal/Pise/) is an interface generator for
programs running under Unix. More precisely, it is a software system which,
given an XML description of a program's parameters, generates source code
for a user interface, as a component of a system where the user can easily
chain programs by pull-down menus. Two GUI generators already exist: a Web
interface generator (composed of rather basic HTML and CGI scripts), and a
Tcl/Tk interface generator, which is currently used in a prototype tool,
biok, in our laboratory. Recently, a Perl/Bioperl API generator has also
been developed (a Python API is planned for the end of the year).
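
The generator idea, turning an XML description of a program's parameters into
interface source code, can be sketched as follows. The XML vocabulary here is
a much-simplified, hypothetical one and not the real Pise description format;
the program name 'toyalign', its parameters, and the form action are invented
for the example.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical, much-simplified program description; the real Pise XML
# format is richer and uses its own element names.
DESCRIPTION = """
<program name="toyalign">
  <parameter name="infile" type="file" prompt="Sequence file"/>
  <parameter name="gap_open" type="float" prompt="Gap opening penalty" default="10.0"/>
  <parameter name="protein" type="switch" prompt="Protein sequences"/>
</program>
"""

# Assumed mapping from parameter types to HTML input controls.
INPUT_TYPES = {"file": "file", "float": "text", "switch": "checkbox"}


def generate_form(xml_text, action="/cgi-bin/run"):
    """Emit the source of an HTML form with one control per parameter."""
    program = ET.fromstring(xml_text)
    lines = [f'<form method="post" action="{escape(action)}">',
             f'  <h1>{escape(program.get("name"))}</h1>']
    for param in program.findall("parameter"):
        name = escape(param.get("name"))
        kind = INPUT_TYPES[param.get("type")]
        value = param.get("default")
        value_attr = f' value="{escape(value)}"' if value is not None else ""
        lines.append(f'  <label>{escape(param.get("prompt"))} '
                     f'<input type="{kind}" name="{name}"{value_attr}></label>')
    lines.append('  <input type="submit" value="Run">')
    lines.append("</form>")
    return "\n".join(lines)


print(generate_form(DESCRIPTION))
```

The same description could feed a second generator that emits, say, a
command-line wrapper, which is how one XML file can serve several interface
styles.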

About 300 molecular biology programs have been defined under Pise, including
various sequence analysis, phylogeny, alignment, structural analysis (RNA,
secondary and tertiary structure) and gene prediction programs. Pise has
been in production for more than 4 years at the Pasteur Institute (about
1000 submitted jobs a day during the last year) (http://bioweb.pasteur.fr/).
The whole system, i.e. the generators and the complete set of already
defined interfaces, is also installed at several other sites, notably for
interfacing EMBOSS programs. Other users have developed interfaces for new
programs (in genetic analysis, primer design, and imaging analysis). We are
also aware of projects to build a new GUI generator.

--------------030700040206040903030600--