[Bioperl-l] [RFC] Interolog::Walk
Giuseppe Gallone
G.Gallone at sms.ed.ac.uk
Thu Aug 19 10:45:36 UTC 2010
Hi Dave,
thank you very much for your helpful comments.
Regarding the module name: I will follow your advice and avoid proposing
a new root namespace when registering the module. As for the second
level, I haven't been able to find anything related to
homology/orthology, therefore I'm not sure whether I should go for
Bio::Orthology::InterologMap
or
Bio::Homology::InterologMap
The first is perhaps a bit more specific. I might also expand
further as in
Bio::Orthology::Interolog::Map,
just in case somebody else finds other interesting applications for the
Interolog concept and would like to "plug in" their own contribution.
Would this make any sense?
I also appreciate your comments on the documentation. The one I provided
is actually not the full pod I was planning to include, but rather an
extract. What I have at the moment is a description, for each method, in
the following form:
=====================================
remove_duplicate_rows
Usage : $RC = InterologMap::remove_duplicate_rows(
            input_handle  => $dbh,
            output_handle => $out_data,
            header        => 'standard',
        );
Purpose : This is used to clean up a TSV data file by removing duplicate
entries. Occasionally, Intact can return duplicate
entries. This routine makes sure no such
duplicates are kept. A new data file is built.
The number of unique data rows is updated.
Returns : success/error
Argument : database handle to the input file, filehandle to the
output file, header type. Header type is one of the following:
- "standard": when the routine is used to clean up an
interolog walk file (the header will be longer)
- "direct": when the routine is used to clean up a
file of real db interactions (the header is shorter)
- no field provided: default is "standard"
Throws : -
Comment : Sample
See Also :
=======================================
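To give a rough idea of the clean-up this routine documents, here is a minimal sketch in plain Perl. It is not the module's code (which goes through DBI): the helper name dedupe_tsv_rows and the toy rows are assumptions made only for this example, and the data is held in memory rather than in real files.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): copy tab-separated rows from
# $in to $out, keeping the header line and the first occurrence of each data
# row. Returns the number of unique data rows written.
sub dedupe_tsv_rows {
    my ($in, $out) = @_;
    my $header = <$in>;
    return 0 unless defined $header;
    print {$out} $header;

    my %seen;
    my $unique = 0;
    while (my $row = <$in>) {
        chomp $row;
        next if $row eq '' || $seen{$row}++;    # skip blanks and repeats
        print {$out} "$row\n";
        $unique++;
    }
    return $unique;
}

# Toy data: the second data row duplicates the first.
my $input = join "\n",
    "gene_a\tgene_b\tsource",
    "G1\tG2\tintact",
    "G1\tG2\tintact",
    "G3\tG4\tintact",
    '';
my $output = '';
open my $in,  '<', \$input  or die "cannot read input: $!";
open my $out, '>', \$output or die "cannot write output: $!";
my $unique_rows = dedupe_tsv_rows($in, $out);
close $in;
close $out;
print "unique data rows: $unique_rows\n";
```

Keeping the first occurrence preserves the original row order, which matters if later steps rely on it.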
On top of that, there is a DESCRIPTION, USAGE, and SYNOPSIS. The
synopsis has some code with an example of typical usage of the module.
Please take a look at this (attached below) and tell me what you think.
You mention that the description contains a lot of background
information. Would you recommend reducing it, or placing it elsewhere?
I was considering writing a short tutorial in LaTeX as soon as
possible anyway, to provide a "centralised" source of information for
getting familiar with the module. Would this comply with CPAN guidelines?
As for your question on the structure of the module: you are indeed
right, the idea when running the "orthology walk" is to create a
pipeline of subroutines: there's a core set of subroutines meant to be
called in strict sequence.
Each of these subroutines expects, as input, the output of the previous
one. The input/output dataset is currently in the form of a TSV text
file, which I process with the help of the DBI module (to be more
specific, I use DBD::CSV).
While there's a certain flexibility regarding how to use the module, one
core idea remains: in order to get the set of putative interactors, the
user would have to call at least three basic routines:
(A)
=================
1) get_forward_orthologies(): this queries the initial gene list against
one or more Ensembl dbs (using the Ensembl Perl API) and retrieves their
orthologues, plus a number of ancillary data fields (mainly conservation
data, e.g. dN/dS ratio, distance from ancestor, orthology type, etc.)
2) get_interactions(): this queries the orthology list built in the
previous stage against a PSICQUIC-enabled PPI db using REST (at the
moment I only query the EBI Intact DB, but it should be easy to expand
this and query all PSICQUIC-compatible PPI dbs transparently). This step
will "fatten" the dataset built in (1) with the interactors of those
orthologues, plus ancillary data (including lots of parameters
describing the quality, nature and origin of the annotated interaction)
3) get_backward_orthologies(): this queries the interactor list built in
the previous stage against one or more Ensembl dbs to find orthologues
*back* in the original species. It also adds a number of supplementary
data fields, just like in (1).
==================
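The chaining described above, where each routine reads the file written by the previous one, can be sketched roughly as follows. This is only an illustration: make_stage and run_pipeline are hypothetical names, and the three stub stages simply append a made-up column where the real routines would query Ensembl or a PSICQUIC service.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical stand-in for one pipeline stage: read a TSV file, append one
# column to every row and write the result to a new file.
sub make_stage {
    my ($column_name, $value_for) = @_;
    return sub {
        my (%args) = @_;
        open my $in,  '<', $args{input_path}  or return 0;
        open my $out, '>', $args{output_path} or return 0;
        my $header = <$in>;
        return 0 unless defined $header;
        chomp $header;
        print {$out} "$header\t$column_name\n";
        while (my $row = <$in>) {
            chomp $row;
            my @fields = split /\t/, $row;
            print {$out} join("\t", @fields, $value_for->(@fields)), "\n";
        }
        close $out or return 0;
        return 1;    # success
    };
}

# Run the stages strictly in order; each one reads what the previous one wrote.
sub run_pipeline {
    my ($first_input, $work_dir, @stages) = @_;
    my $current = $first_input;
    for my $i (0 .. $#stages) {
        my $next = File::Spec->catfile($work_dir, 'stage_' . ($i + 1) . '.tsv');
        $stages[$i]->(input_path => $current, output_path => $next)
            or die 'stage ' . ($i + 1) . " failed\n";
        $current = $next;
    }
    return $current;    # path of the final dataset
}

# Toy input: a header line plus two gene ids.
my $dir   = tempdir(CLEANUP => 1);
my $start = File::Spec->catfile($dir, 'genes.tsv');
open my $fh, '>', $start or die "cannot write $start: $!";
print {$fh} "gene\nG1\nG2\n";
close $fh;

my $final = run_pipeline(
    $start, $dir,
    make_stage('orthologue', sub { $_[0] . '_orth' }),
    make_stage('interactor', sub { $_[1] . '_int' }),
    make_stage('back_orth',  sub { $_[2] . '_back' }),
);
open my $res, '<', $final or die "cannot read $final: $!";
my @lines = <$res>;
close $res;
print @lines;
```

Stopping at the first failing stage mirrors the strict ordering: a later stage has nothing valid to read if an earlier one did not complete.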
At the end of this procedure the user will have a TSV file where each
row contains a binary putative interaction plus (currently) 37
supplementary data fields.
One can then scan these results to check for duplicates, to compute
counts, to see if we have discovered new gene ids that were not present
in the original dataset (hopefully we have :) ).
Most importantly, one can then further process these results to do one
or more of the following:
(B) compute a global confidence score to assess the reliability of
each binary putative interaction
(C) extract the binary putative PPIs from the dataset and save them in a
format compatible with Cytoscape: this makes it possible to visualise
the result and then apply network analysis tools to discover
motifs, clusters, etc. The format I use is currently .SIF + attributes,
as detailed in
http://cytoscape.wodaklab.org/wiki/Cytoscape_User_Manual/Network_Formats
(D) given the same initial gene list, one can also build a dataset of
REAL, experimentally obtained PPIs (without mapping through orthologies
in other species). One can then compare this dataset with the Putative
dataset to see if/where the two overlap, what's the intersection or the
differences, etc.
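The comparison in (D) boils down to set operations on unordered protein pairs. A minimal sketch, assuming each dataset has already been reduced to a list of [id_a, id_b] pairs (pair_key and compare_pair_sets are hypothetical helper names, and the ids are toy values):

```perl
use strict;
use warnings;

# Order-independent key, so that A-B and B-A count as the same interaction.
sub pair_key { return join "\t", sort @_ }

# Split two lists of pairs into shared, putative-only and real-only keys.
sub compare_pair_sets {
    my ($putative, $real) = @_;
    my (%putative, %real);
    $putative{ pair_key(@$_) } = 1 for @$putative;
    $real{ pair_key(@$_) }     = 1 for @$real;

    my @both          = grep {  $real{$_} }     sort keys %putative;
    my @putative_only = grep { !$real{$_} }     sort keys %putative;
    my @real_only     = grep { !$putative{$_} } sort keys %real;
    return (
        both          => \@both,
        putative_only => \@putative_only,
        real_only     => \@real_only,
    );
}

my %cmp = compare_pair_sets(
    [ ['G1', 'G2'], ['G3', 'G1'] ],    # putative pairs (toy data)
    [ ['G2', 'G1'], ['G4', 'G5'] ],    # real pairs (toy data)
);
printf "%-14s %d\n", $_, scalar @{ $cmp{$_} } for sort keys %cmp;
```

Sorting the two ids before building the key is what makes the comparison insensitive to the order in which a database lists the interactors.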
In order to suggest ways of using the module I have written 4 sample
scripts and I will include them in the module. Each script utilises the
module and uses/reuses subroutines in a pipeline fashion, and does the
following:
1) doInterologWalk.pl: runs the basic pipeline in (A)
2) doScores.pl: computes and adds confidence scores as explained in (B)
3) doNetworks.pl: computes SIF network + attributes as in (C)
4) getRealInteractions.pl: runs a pipeline to obtain real PPIs from the
initial gene set, as in (D).
Hope I didn't make this too confusing. I would love to hear back from
you and from anybody else that would like to provide feedback.
Cheers
Giuseppe
On 18/08/10 17:52, Dave Messina wrote:
> Hi Giuseppe,
>
> Sounds really interesting — thanks for posting this.
>
>> Bio::Orthology::InterologWalk
>
> I vote for this name, or in any case something with Bio:: as the top-level namespace since it's a biology-related package.
>
> I like that you're providing a lot of background and information about the project in the documentation. However, the USAGE section should give information about how to use the module, with example code. You can look at other modules on CPAN (or in BioPerl) to see the conventions for writing documentation.
>
> Also, from what you wrote, it sounds like this might be a pipeline or a script rather than a module per se, or perhaps a script and a set of modules. It would be helpful to clarify in your documentation (if you haven't already) how exactly things are organized (and of course example code will help with that, too).
>
>
> Hope that's helpful, and let us know when you've got it up on CPAN so we can try it out!
>
>
> Dave
>
>
NAME
Interolog::Walk - Retrieve, score and visualize putative Protein-Protein
Interactions through the orthology-walk method
SYNOPSIS
use Interolog::Walk;
First, obtain Intact Interactions for the dataset (see example in
"getDirectInteractions.pl"):
    # get a registry from Ensembl
    my $registry = InterologMap::setup_ensembl_adaptor(
        connect_to_db  => $ensembl_db,
        source_species => $sourceorg,
        verbose        => 1
    );

    # query actual interactions
    $RC = InterologMap::Direct::get_direct_interactions(
        registry       => $registry,
        source_species => $sourceorg,
        input_path     => $in_path,
        output_path    => $out_path,
        url            => $url,
    );
do some postprocessing (see "do_counts()" and "extract_unseen_ids()" )
and then do the actual interolog walk on the dataset with the following
sequence of three methods.
get orthologues of starting set:
    $RC = InterologMap::get_forward_orthologies(
        registry    => $registry,
        ensembl_db  => $ensembl_db,
        input_path  => $in_path,
        output_path => $out_path,
        source_org  => $sourceorg,
        dest_org    => $destorg,
    );
add interactors of orthologues found by "get_forward_orthologies()":
    $RC = InterologMap::get_interactions(
        input_path  => $in_path,
        output_path => $out_path,
        url         => $url,
        url_global  => $url_global,
    );
add orthologues of interactors found by "get_interactions()":
    $RC = InterologMap::get_backward_orthologies(
        registry    => $registry,
        ensembl_db  => $ensembl_db,
        input_path  => $in_path,
        output_path => $out_path,
        error_path  => $err_path,
        source_org  => $sourceorg,
    );
do some postprocessing (see "remove_duplicate_rows()", "do_counts()",
"extract_unseen_ids()") and then optionally compute a composite score
for the putative interactions obtained:
    $RC = InterologMap::Scores::compute_scores(
        input_path      => $in_path,
        score_path      => $score_path,
        output_path     => $out_path,
        term_graph      => $onto_graph,
        M_IT_SCORE      => $M_IT,
        M_DM_SCORE      => $M_DM,
        M_ME_DM_SCORE   => $M_MDM,
        M_ME_TAXA_SCORE => $M_MTAXA
    );
get some networks and network attributes which you can then visualise
with Cytoscape:
    $RC = InterologMap::Networks::do_network(
        registry       => $registry,
        db             => $ensembl_db,
        input_path     => $in_path,
        output_path    => $out_path,
        source_org     => $sourceorg,
        orthology_type => $orthtype,
    );

    $RC = InterologMap::Networks::do_attributes(
        registry    => $registry,
        input_path  => $in_path,
        output_path => $out_path,
        source_org  => $sourceorg,
        label_type  => 'external name'
    );
*The synopsis above only lists the major methods and parameters.*
DESCRIPTION
A common activity in computational biology is to mine protein-protein
interactions from publicly available databases to build *Protein-Protein
Interaction* (PPI) datasets. In many instances, however, the number of
experimentally obtained, annotated PPIs is very small, and it would be
helpful to enrich the experimental dataset with high-quality,
computationally inferred PPIs. Such computationally obtained datasets can
extend, support or enrich experimental PPI datasets, and are of crucial
importance in high-throughput gene prioritization studies, i.e. to drive
hypotheses and restrict the dimensionality of functional discovery
problems. This Perl module, Interolog::Walk, is aimed at building
putative PPI datasets on the basis of a number of comparative biology
paradigms: the module implements a collection of computational biology
algorithms based on the concept of "orthology projection". If
interacting proteins A and B in organism X have orthologs A' and B' in
organism Y, under certain conditions one can assume that the interaction
will be conserved in organism Y, i.e. the A-B interaction can be
"projected through the orthologies" to obtain a putative A'-B'
interaction. The pair of interactions (A-B) and (A'-B') are named
"Interologs".
Interolog::Walk collects, analyses and collates gene orthology data
provided by the Ensembl Consortium as well as PPI data provided by EBI
Intact. It lets the user rate the quality and reliability of the
putative interactions collected, by means of confidence scores, and
optionally outputs network representations of the datasets, compatible
with Cytoscape, a widely used tool for visualising biological networks.
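As an illustration of the network output, the sketch below turns a list of protein pairs into lines of Cytoscape's simple interaction format (SIF), where each line reads "nodeA relationship nodeB". It is not the module's own code: pairs_to_sif is a hypothetical helper, the ids are toy values, and the relationship label 'pp' is an assumed choice for protein-protein edges.

```perl
use strict;
use warnings;

# Hypothetical helper: build tab-delimited SIF lines from [id_a, id_b] pairs,
# writing each undirected edge only once.
sub pairs_to_sif {
    my ($pairs, $relationship) = @_;
    $relationship = 'pp' unless defined $relationship;    # assumed edge label
    my (%seen, @lines);
    for my $pair (@$pairs) {
        my ($first, $second) = sort @$pair;    # A-B and B-A are the same edge
        next if $seen{"$first\t$second"}++;
        push @lines, join "\t", $first, $relationship, $second;
    }
    return @lines;
}

# Toy putative interactions; the second pair repeats the first in reverse.
my @sif = pairs_to_sif([ ['P2', 'P1'], ['P1', 'P2'], ['P2', 'P3'] ]);
print "$_\n" for @sif;
```

Node and edge attributes would go into separate attribute files, which are not covered here.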
BASIC USAGE
Rationale behind "Interolog::Walk".
                                   \EBI Intact API/
              .--------------.            |             .-------------.
         (2)  | A(e.g. mouse)|<------------------------>| B(mouse)    | (3)
              `--------------'          <PPI>           `-------------'
                     ^                                         |
   /Ensembl\         | <Orthology>                 <Orthology> |        \ Ensembl /
  / Compara \        |                                         |         \Compara/
 /   Api   \       |                                         |          \ Api /
                     |                                         |
              .--------------.                          .-------------.
         (1)  | A'(e.g. fly) |. . . . . . . . . . . . . | B'(fly)     | (4)
              `--------------'   [SCORED]PUTATIVE PPI   `-------------'
                             (Output of Interolog::Walk)
In order to carry out an interolog walk we start with a set of gene
identifiers in one organism of interest (1). We query those ids against
a number of comparative biology databases to retrieve a list of
orthologues for the gene id of interest, in one or more species (2). In
the next step we rely instead on PPI databases to retrieve the list of
available interactors for the protein ids obtained in (2). The output at
this stage consists of a list of interactors of the orthologues of the
initial gene set, plus several fields of ancillary data (whose
importance will be explained later) (3). In the last step of this
process we will need to project the interactions in (3) - again using
orthology data - back to the original species of interest. The output of
the process is a list of PUTATIVE INTERACTORS of the initial gene set,
plus several fields of ancillary data.
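The walk just described can be mimicked on toy data held in memory. The following is only a sketch of the idea, not the module's implementation: the three hashes stand in for the orthology and PPI lookups (which the module performs against Ensembl and Intact), and interolog_walk and all ids are made up for the example.

```perl
use strict;
use warnings;

# Toy lookup tables standing in for the orthology and PPI databases.
my %forward_orth  = (flyA   => ['mouseA']);              # fly gene   -> mouse orthologues
my %interactors   = (mouseA => ['mouseB', 'mouseC']);    # mouse gene -> mouse interactors
my %backward_orth = (mouseB => ['flyB']);                # mouse gene -> fly orthologues

# Hypothetical helper: follow gene -> orthologue -> interactor -> orthologue
# and return one row per complete path.
sub interolog_walk {
    my ($genes, $fwd, $ppi, $back) = @_;
    my @putative;
    for my $gene (@$genes) {
        for my $orth (@{ $fwd->{$gene} || [] }) {
            for my $partner (@{ $ppi->{$orth} || [] }) {
                for my $back_orth (@{ $back->{$partner} || [] }) {
                    # putative pair first, then the supporting interolog
                    push @putative, [ $gene, $back_orth, $orth, $partner ];
                }
            }
        }
    }
    return @putative;
}

my @rows = interolog_walk(['flyA'], \%forward_orth, \%interactors, \%backward_orth);
print join("\t", @$_), "\n" for @rows;
```

In this toy case mouseC has no orthologue back in fly, so only the path through mouseB yields a putative fly interaction.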
"Interolog::Walk" provides three main functions to carry out the basic
walk, "get_forward_orthologies()", "get_interactions()" and
"get_backward_orthologies()". These functions must be called strictly
sequentially in your script, as they process, analyse and attach data to
the output in a pipeline-like fashion, i.e. each one processes the output
of the preceding function.
get_forward_orthologies
get_interactions
get_backward_orthologies
SCORING THE PUTATIVE INTERACTIONS
BUILDING PUTATIVE INTERACTION NETWORKS
BUGS
Please report any you find
SUPPORT
TODO
AUTHOR
Giuseppe Gallone <ggallone at cpan.org>
CPAN ID: GGALLONE
University of Edinburgh
COPYRIGHT
The Interolog::Walk module is Copyright (c) 2010 Giuseppe Gallone. All
rights reserved.
You may distribute under the terms of either the GNU General Public
License or the Artistic License, as specified in the Perl 5.10.0 README
file.
SEE ALSO
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.