[Bioperl-l] [RFC] Interolog::Walk

Thu Aug 19 10:45:36 UTC 2010

Hi Dave,

thank you very much for your helpful comments.

Regarding the module name: I will follow your advice and avoid 
proposing a new root namespace when I register the module. As for the 
second level, I haven't been able to find any existing namespace related 
to homology/orthology, so I'm not sure whether I should go for

Bio::Orthology::InterologMap
or
Bio::Homology::InterologMap

The first one being maybe a bit more specific. I might also expand 
further as in

Bio::Orthology::Interolog::Map,

just in case somebody else finds other interesting applications for the 
Interolog concept and would like to "plug in" their own contribution. 
Would this make any sense?

I also appreciate your comments on the documentation. The one I provided 
is actually not the full pod I was planning to include, but rather an 
extract. What I have at the moment is a description, for each method, in 
the following form:

=====================================
    remove_duplicate_rows
      Usage     : $RC = InterologMap::remove_duplicate_rows(
                      input_handle  => $dbh,
                      output_handle => $out_data,
                      header        => 'standard',
                  );
      Purpose   : Cleans up a TSV data file by removing duplicate entries.
                  Occasionally, Intact can return duplicate entries. This
                  routine makes sure no such duplicates are kept. A new
                  data file is built and the number of unique data rows
                  is updated.
      Returns   : success/error
      Argument  : database handle to the input file, filehandle to the
                  output file, header type. Header type is one of:
                  - "standard": when the routine is used to clean up an
                    interolog walk file (the header will be longer)
                  - "direct":   when the routine is used to clean up a
                    file of real db interactions (the header is shorter)
                  - no field provided: default is "standard"
      Throws    : -
      Comment   : Sample

      See Also  :
=======================================
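
To illustrate the idea behind this routine, here is a minimal sketch in 
plain Perl. It is not the module's code: the real routine takes a DBI 
handle, whereas this sketch works on ordinary filehandles, and the name 
dedupe_tsv is made up for illustration. It assumes the first line is the 
header and that two rows count as duplicates only if they are identical 
strings.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the module: copy a TSV stream to
# $out, keeping the header and the first occurrence of each data row.
# Returns the number of unique data rows written.
sub dedupe_tsv {
    my ($in, $out) = @_;

    my $header = <$in>;
    return 0 unless defined $header;
    print {$out} $header;

    my %seen;
    my $unique = 0;
    while (my $row = <$in>) {
        chomp $row;
        next if $row eq '';        # skip blank lines
        next if $seen{$row}++;     # this row was already written once
        print {$out} "$row\n";
        $unique++;
    }
    return $unique;
}

# Small in-memory demonstration with made-up ids.
my $input = join "\n",
    "id_a\tid_b\tsource",
    "P1\tP2\tintact",
    "P3\tP4\tintact",
    "P1\tP2\tintact",
    "";
my $output = '';
open my $in,  '<', \$input  or die "cannot read input: $!";
open my $out, '>', \$output or die "cannot write output: $!";
my $n = dedupe_tsv($in, $out);
close $out or die "cannot close output: $!";
print "kept $n unique rows\n";
```

Keeping only the first occurrence preserves the original row order, 
which is convenient when the file is later compared with the input.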

On top of that, there is a DESCRIPTION, USAGE, and SYNOPSIS. The 
synopsis has some code with an example of typical usage of the module. 
Please take a look at this (attached below) and tell me what you think.

You mention that the description contains a lot of background 
information. Would you recommend reducing it, or placing it elsewhere?
I was considering writing a short tutorial in LaTeX as soon as 
possible anyway, to provide a "centralised" source of information for 
getting familiar with the module. Does this respect the CPAN regulations?

As for your question on the structure of the module: you are indeed 
right, the idea when running the "orthology walk" is to create a 
pipeline of subroutines: there's a core set of subroutines meant to be 
called in strict sequence.
Each of these subroutines expects, as input, the output of the previous 
one. The input/output dataset is currently in the form of a TSV text 
file, which I process with the help of the DBI module (to be more 
specific, I use DBD::CSV).

While there's a certain flexibility regarding how to use the module, one 
core idea remains: in order to get the set of putative interactors, the 
user would have to call at least three basic routines:

(A)
=================
1) get_forward_orthologies(): this queries the initial gene list against 
one or more Ensembl dbs (using the Ensembl Perl API) and retrieves their 
orthologues, plus a number of ancillary data fields (mainly conservation 
data, e.g. dn/ds ratio, distance from ancestor, orthology type, etc.)

2) get_interactions(): this queries the orthology list built in the 
previous stage against a PSICQUIC-enabled PPI db using REST (at the 
moment I only query the EBI Intact DB, but it should be easy to expand 
this and query all PSICQUIC-compatible PPI dbs transparently). This step 
will "fatten" the dataset built in (1) with the interactors of those 
orthologues, plus ancillary data (including lots of parameters 
describing the quality, nature and origin of the annotated interaction)

3) get_backward_orthologies(): this queries the interactor list built in 
the previous stage against one or more Ensembl dbs to find orthologues 
*back* in the original species. It also adds a number of supplementary 
data fields, just like in (1).
==================
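
To show what "each routine consumes the output of the previous one" 
means in practice, here is a minimal sketch with toy data. Everything in 
it is an assumption made for illustration: extend_tsv is a hypothetical 
stand-in for the three routines, the hard-coded hashes replace the 
Ensembl and PSICQUIC queries, and plain file I/O replaces DBI.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical stage: read a TSV file with a header line, look up the
# value found in column key_col in the lookup hash, and write one
# output row per hit with the hit appended as a new column new_col.
# Rows with no hit are dropped. Returns the number of rows written.
sub extend_tsv {
    my (%arg) = @_;
    open my $in,  '<', $arg{input_path}  or die "$arg{input_path}: $!";
    open my $out, '>', $arg{output_path} or die "$arg{output_path}: $!";

    chomp(my $header = <$in>);
    my @cols = split /\t/, $header;
    my ($key_idx) = grep { $cols[$_] eq $arg{key_col} } 0 .. $#cols;
    die "no column $arg{key_col}" unless defined $key_idx;
    print {$out} join("\t", @cols, $arg{new_col}), "\n";

    my $written = 0;
    while (my $line = <$in>) {
        chomp $line;
        next if $line eq '';
        my @fields = split /\t/, $line, -1;
        for my $hit (@{ $arg{lookup}{ $fields[$key_idx] } || [] }) {
            print {$out} join("\t", @fields, $hit), "\n";
            $written++;
        }
    }
    close $out or die "$arg{output_path}: $!";
    return $written;
}

# One file per stage: stage0 is the initial gene list.
my $dir  = tempdir(CLEANUP => 1);
my @path = map { File::Spec->catfile($dir, "stage$_.tsv") } 0 .. 3;
open my $fh, '>', $path[0] or die "$path[0]: $!";
print {$fh} "init_id\n", "flyA\n", "flyC\n";
close $fh or die "$path[0]: $!";

# Toy lookup tables standing in for the remote queries.
my %fwd_orth = (flyA   => ['mouseA'], flyC => ['mouseC']);
my %interact = (mouseA => ['mouseB']);
my %bwd_orth = (mouseB => ['flyB']);

my @stages = (
    [ init_id       => orth_id       => \%fwd_orth ],
    [ orth_id       => interactor_id => \%interact ],
    [ interactor_id => putative_id   => \%bwd_orth ],
);
my @counts;
for my $i (0 .. $#stages) {
    my ($key, $new, $lookup) = @{ $stages[$i] };
    push @counts, extend_tsv(
        input_path  => $path[$i],
        output_path => $path[ $i + 1 ],
        key_col     => $key,
        new_col     => $new,
        lookup      => $lookup,
    );
}

open $fh, '<', $path[3] or die "$path[3]: $!";
my @result = <$fh>;
close $fh;
print @result;
```

Each stage only appends columns, so a row of the final file still 
carries the whole chain that produced it (initial id, orthologue, 
interactor, putative interactor).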

At the end of this procedure the user will have a TSV file where each 
row contains a binary putative interaction plus (currently) 37 
supplementary data fields.

One can then scan these results to check for duplicates, to compute 
counts, to see if we have discovered new gene ids that were not present 
in the original dataset (hopefully we have :) ).
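
The check for new gene ids boils down to a set difference. Below is a 
minimal sketch of that idea; the helper name unseen_ids and the ids are 
invented for the example, and the module's own extract_unseen_ids() 
works on the TSV files rather than on in-memory lists.

```perl
use strict;
use warnings;

# Hypothetical helper: return the sorted, de-duplicated list of ids
# that appear in @$found but not in @$initial.
sub unseen_ids {
    my ($initial, $found) = @_;
    my %known = map { ($_ => 1) } @$initial;
    my %new   = map { ($_ => 1) } grep { !$known{$_} } @$found;
    my @ids   = sort keys %new;
    return @ids;
}

my @initial  = qw(gene1 gene2 gene3);
my @putative = qw(gene2 gene7 gene9 gene7);
my @unseen   = unseen_ids(\@initial, \@putative);
printf "%d new id(s): %s\n", scalar @unseen, join(', ', @unseen);
```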

Most importantly, one can then further process these results to do one 
or more of the following:
(B) compute a global confidence score to assess the reliability of 
each binary putative interaction
(C) extract the binary putative PPIs from the dataset and save them in a 
format compatible with Cytoscape: this makes it possible to visualise 
the result and then apply network analysis tools to discover motifs, 
clusters, etc. The format I use is currently .SIF + attributes, as 
detailed in
http://cytoscape.wodaklab.org/wiki/Cytoscape_User_Manual/Network_Formats
(D) given the same initial gene list, one can also build a dataset of 
REAL, experimentally obtained PPIs (without mapping through orthologies 
in other species). One can then compare this dataset with the putative 
dataset to see if and where the two overlap, what the intersection and 
the differences are, etc.
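
For (C), a SIF file is just one edge per line: source node, edge type, 
target node. Here is a minimal sketch of writing binary interactions in 
that form. The helper name pairs_to_sif and the ids are invented for the 
example, and I am assuming tab-delimited lines with "pp" as the edge 
type for protein-protein interactions.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a list of [id_a, id_b] pairs into SIF lines
# of the form "id_a<TAB>pp<TAB>id_b". The two ids of each pair are
# sorted first, so that A-B and B-A are written only once.
sub pairs_to_sif {
    my (@pairs) = @_;
    my (%seen, @lines);
    for my $pair (@pairs) {
        my ($first, $second) = sort @$pair;
        next if $seen{"$first\t$second"}++;
        push @lines, join("\t", $first, 'pp', $second);
    }
    return @lines;
}

my @sif = pairs_to_sif(
    [ 'geneA', 'geneB' ],
    [ 'geneB', 'geneA' ],    # same interaction, reversed
    [ 'geneA', 'geneC' ],
);
print "$_\n" for @sif;
```

Node and edge attributes (for example the confidence score from (B)) 
would go in separate attribute files keyed by the same ids.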

In order to suggest ways of using the module I have written 4 sample 
scripts and I will include them in the module. Each script utilises the 
module and uses/reuses subroutines in a pipeline fashion, and does the 
following:

1) doInterologWalk.pl: runs the basic pipeline in (A)
2) doScores.pl: computes and adds confidence scores as explained in (B)
3) doNetworks.pl: computes SIF network + attributes as in (C)
4) getRealInteractions.pl: runs a pipeline to obtain real PPIs from the 
initial gene set.
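
Once both the putative and the real datasets exist, the comparison 
described in (D) can be done on unordered id pairs. This is a minimal 
sketch only: compare_pairs and pair_key are hypothetical helpers, the 
ids are made up, and a real comparison would read the pairs from the 
two TSV files.

```perl
use strict;
use warnings;

# Normalise a pair so that A-B and B-A map to the same key.
sub pair_key { return join "\t", sort @_ }

# Hypothetical helper: given two lists of [id_a, id_b] pairs, report
# which pairs occur in both lists and which occur in only one of them.
sub compare_pairs {
    my ($put_pairs, $real_pairs) = @_;
    my %put  = map { ( pair_key(@$_) => 1 ) } @$put_pairs;
    my %real = map { ( pair_key(@$_) => 1 ) } @$real_pairs;
    return {
        both          => [ sort grep {  $real{$_} } keys %put  ],
        putative_only => [ sort grep { !$real{$_} } keys %put  ],
        real_only     => [ sort grep { !$put{$_}  } keys %real ],
    };
}

my $cmp = compare_pairs(
    [ [ 'geneA', 'geneB' ], [ 'geneA', 'geneC' ] ],    # putative
    [ [ 'geneB', 'geneA' ], [ 'geneD', 'geneE' ] ],    # real
);
printf "%-14s %d\n", $_, scalar @{ $cmp->{$_} } for sort keys %$cmp;
```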

Hope I didn't make this too confusing. I would love to hear back from 
you and from anybody else that would like to provide feedback.

Cheers
Giuseppe

On 18/08/10 17:52, Dave Messina wrote:
> Hi Giuseppe,
>
> Sounds really interesting — thanks for posting this.
>
>> Bio::Orthology::InterologWalk
>
> I vote for this name, or in any case something with Bio:: as the top-level namespace since it's a biology-related package.
>
> I like that you're providing a lot of background and information about the project in the documentation. However, the USAGE section should give information about how to use the module, with example code. You can look at other modules on CPAN (or in BioPerl) to see the conventions for writing documentation.
>
> Also, from what you wrote, it sounds like this might be a pipeline or a script rather than a module per se, or perhaps a script and a set of modules. It would be helpful to clarify in your documentation (if you haven't already) how exactly things are organized (and of course example code will help with that, too).
>
>
> Hope that's helpful, and let us know when you've got it up on CPAN so we can try it out!
>
>
> Dave
>
>

NAME
     Interolog::Walk - Retrieve, score and visualize putative Protein-Protein
     Interactions through the orthology-walk method

SYNOPSIS
       use Interolog::Walk;

     First, obtain Intact Interactions for the dataset (see example in
     "getDirectInteractions.pl"):

       #get a registry from Ensembl
       my $registry = InterologMap::setup_ensembl_adaptor(
           connect_to_db  => $ensembl_db,
           source_species => $sourceorg,
           verbose        => 1,
       );

       #query actual interactions
       $RC = InterologMap::Direct::get_direct_interactions(
           registry       => $registry,
           source_species => $sourceorg,
           input_path     => $in_path,
           output_path    => $out_path,
           url            => $url,
       );

     do some postprocessing (see "do_counts()" and "extract_unseen_ids()" )
     and then do the actual interolog walk on the dataset with the following
     sequence of three methods.

     get orthologues of starting set:

       $RC = InterologMap::get_forward_orthologies(
           registry    => $registry,
           ensembl_db  => $ensembl_db,
           input_path  => $in_path,
           output_path => $out_path,
           source_org  => $sourceorg,
           dest_org    => $destorg,
       );

     add interactors of orthologues found by "get_forward_orthologies()":

       $RC = InterologMap::get_interactions(
           input_path  => $in_path,
           output_path => $out_path,
           url         => $url,
           url_global  => $url_global,
       );

     add orthologues of interactors found by "get_interactions()":

       $RC = InterologMap::get_backward_orthologies(
           registry    => $registry,
           ensembl_db  => $ensembl_db,
           input_path  => $in_path,
           output_path => $out_path,
           error_path  => $err_path,
           source_org  => $sourceorg,
       );

     do some postprocessing (see "remove_duplicate_rows()", "do_counts()",
     "extract_unseen_ids()") and then optionally compute a composite score
     for the putative interactions obtained:

       $RC = InterologMap::Scores::compute_scores(
           input_path      => $in_path,
           score_path      => $score_path,
           output_path     => $out_path,
           term_graph      => $onto_graph,
           M_IT_SCORE      => $M_IT,
           M_DM_SCORE      => $M_DM,
           M_ME_DM_SCORE   => $M_MDM,
           M_ME_TAXA_SCORE => $M_MTAXA,
       );

     get some networks and network attributes which you can then visualise
     with Cytoscape:

       $RC = InterologMap::Networks::do_network(
           registry       => $registry,
           db             => $ensembl_db,
           input_path     => $in_path,
           output_path    => $out_path,
           source_org     => $sourceorg,
           orthology_type => $orthtype,
       );

       $RC = InterologMap::Networks::do_attributes(
           registry    => $registry,
           input_path  => $in_path,
           output_path => $out_path,
           source_org  => $sourceorg,
           label_type  => 'external name',
       );

     *The synopsis above only lists the major methods and parameters.*

DESCRIPTION
     A common activity in computational biology is to mine protein-protein
     interactions from publicly available databases to build
     *Protein-Protein Interaction* (PPI) datasets. In many instances,
     however, the number of experimentally obtained, annotated PPIs is very
     small, and it would be helpful to enrich the experimental dataset with
     high-quality, computationally inferred PPIs. Such computationally
     obtained datasets can extend, support or enrich experimental PPI
     datasets, and are of crucial importance in high-throughput gene
     prioritization studies, i.e. to drive hypotheses and restrict the
     dimensionality of functional discovery problems.

     This Perl module, Interolog::Walk, is aimed at building putative PPI
     datasets on the basis of a number of comparative biology paradigms: the
     module implements a collection of computational biology algorithms
     based on the concept of "orthology projection". If interacting proteins
     A and B in organism X have orthologs A' and B' in organism Y, under
     certain conditions one can assume that the interaction will be
     conserved in organism Y, i.e. the A-B interaction can be "projected
     through the orthologies" to obtain a putative A'-B' interaction. The
     pair of interactions (A-B) and (A'-B') are named "Interologs".

     Interolog::Walk collects, analyses and collates gene orthology data
     provided by the Ensembl Consortium as well as PPI data provided by EBI
     Intact. It provides the user with the possibility of rating the quality
     and reliability of the putative interactions collected, by means of
     confidence scores, and optionally outputs network representations of
     the datasets, compatible with the biological network representation
     standard, Cytoscape.
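
     The orthology projection described above can be sketched in a few
     lines of Perl. This is only an illustration with toy in-memory data:
     the function name project_interactions and the ids are invented, the
     "certain conditions" under which a projection is trusted are ignored,
     and the module itself obtains orthologies and interactions from
     Ensembl and Intact rather than from hashes.

```perl
use strict;
use warnings;

# Hypothetical helper: project interactions observed in organism X onto
# organism Y. %$orth maps each X id to an array ref of its Y orthologs;
# @$ppi_x is a list of [A, B] pairs known to interact in X.
# Returns the list of putative [A', B'] pairs in Y.
sub project_interactions {
    my ($ppi_x, $orth) = @_;
    my @putative;
    for my $pair (@$ppi_x) {
        my ($id_a, $id_b) = @$pair;
        # Every ortholog of A is paired with every ortholog of B.
        for my $a_prime (@{ $orth->{$id_a} || [] }) {
            for my $b_prime (@{ $orth->{$id_b} || [] }) {
                push @putative, [ $a_prime, $b_prime ];
            }
        }
    }
    return @putative;
}

my @ppi_mouse    = ( [ 'mA', 'mB' ], [ 'mA', 'mC' ] );
my %mouse_to_fly = ( mA => ['fA'], mB => [ 'fB1', 'fB2' ] );   # mC: none
my @putative     = project_interactions(\@ppi_mouse, \%mouse_to_fly);
print join('-', @$_), "\n" for @putative;
```

     An interaction is projected only when both partners have at least one
     ortholog, and one-to-many orthologies yield several putative pairs.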

BASIC USAGE
   Rationale behind "Interolog::Walk".
                                   \EBI Intact API/
              .--------------.            |             .-------------.
          (2) | A(e.g. mouse)|<------------------------>|   B(mouse)  |  (3)
              `--------------'          <PPI>           `-------------'
                     ^                                         |
        /Ensembl\    | <Orthology>                 <Orthology> | \ Ensembl /
       / Compara \   |                                         |  \Compara/
      /    Api    \  |                                         |   \ Api /
                     |                                         |
              .--------------.                           .-------------.
          (1) | A'(e.g. fly) |. . . . . . . . . . . . .  |   B'(fly)   | (4)
              `--------------'     [SCORED]PUTATIVE PPI  `-------------'
                              (Output of Interolog::Walk)

     In order to carry out an interolog walk we start with a set of gene
     identifiers in one organism of interest (1). We query those ids against
     a number of comparative biology databases to retrieve a list of
     orthologues for the gene id of interest, in one or more species (2). In
     the next step we rely instead on PPI databases to retrieve the list of
     available interactors for the protein ids obtained in (2). The output
     at this stage consists of a list of interactors of the orthologues of
     the initial gene set, plus several fields of ancillary data (whose
     importance will be explained later) (3). In the last step of this
     process we will need to project the interactions in (3) - again using
     orthology data - back to the original species of interest. The output
     of the process is a list of PUTATIVE INTERACTORS of the initial gene
     set, plus several fields of ancillary data.

     "Interolog::Walk" provides three main functions to carry out the basic
     walk, "get_forward_orthologies()", "get_interactions()" and
     "get_backward_orthologies()". These functions must be called strictly
     sequentially in your script, as they process, analyse and attach data
     to the output in a pipeline-like fashion, i.e. each one processes the
     output of the preceding function.

     get_forward_orthologies
     get_interactions
     get_backward_orthologies

SCORING THE PUTATIVE INTERACTIONS
BUILDING PUTATIVE INTERACTION NETWORKS
BUGS
     Please report any you find

SUPPORT
     TODO

AUTHOR
     Giuseppe Gallone <ggallone at cpan.org>

     CPAN ID: GGALLONE

     University of Edinburgh

COPYRIGHT
     The Interolog::Walk module is Copyright (c) 2010 Giuseppe Gallone. All
     rights reserved.

     You may distribute under the terms of either the GNU General Public
     License or the Artistic License, as specified in the Perl 5.10.0 README
     file.

SEE ALSO
