[Bioperl-l] Relations

Holland, Richard Richard.Holland at agresearch.co.nz
Tue Jan 6 20:30:15 EST 2004


Hi,

I have written a script for loading term-term or sequence-term relations.
It comes in three parts, all now committed to CVS under
bioperl-live/scripts/terms. The main script (importrelation.pl) is the
core and should not need to be touched to work: just run it and it will
read all its options from the configuration file, conf.xml. There are
also helper scripts, of which I have so far written just one. These
scripts expect data on STDIN and write comma-separated pairs to STDOUT.
For instance, the interpro2go.pl parser supplied (which parses
InterPro->GO relations from the interpro2go files) outputs pairs like
IPR002291,GO:0008606. All helper scripts should output in this format:
source term or sequence accession.version, comma, target term.
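
To make the helper contract concrete, here is a minimal sketch of such a filter in Python. It is not the interpro2go.pl script from CVS (which is Perl and may work differently). The input line layout (`InterPro:<id> <text> > GO:<name> ; GO:<id>`, with `!` marking comment lines) is an assumption about the interpro2go file, and the description text in the sample is a placeholder. Only the output pair IPR002291,GO:0008606 comes from the post.

```python
import re

# Assumed shape of one interpro2go mapping line (not taken from the post):
#   InterPro:IPR002291 <description> > GO:<term name> ; GO:0008606
LINE_RE = re.compile(r"^InterPro:(IPR\d+)\s.*;\s*(GO:\d+)\s*$")


def to_pairs(lines):
    """Yield 'source,target' strings for every line that matches LINE_RE."""
    for line in lines:
        if line.startswith("!"):  # assumed comment marker
            continue
        match = LINE_RE.match(line)
        if match:
            yield f"{match.group(1)},{match.group(2)}"


# Placeholder input; the description and term name are invented filler.
sample = [
    "! comment line\n",
    "InterPro:IPR002291 placeholder description > GO:placeholder term ; GO:0008606\n",
]
for pair in to_pairs(sample):
    print(pair)  # prints IPR002291,GO:0008606
```

A real helper would pass `sys.stdin` to `to_pairs` instead of the in-memory sample, so that the main script can pipe the downloaded file through it.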

The relations are stored as term_relationship entries for TERM2TERM
relations, and bioentry_qualifier_value entries for SEQ2TERM relations,
the former being Bio::Ontology::TermRelationship objects and the latter
Bio::Annotation::OntologyTerm objects.

Sample output from a config file using the existing interpro2go.pl
helper script:

---START OUTPUT---
[hollandr at bifo1 biosql-go]$ ./importrelation.pl
Running parsers...
Parser interpro2go (type TERM2TERM) lives at ./interpro2go.pl
(links InterPro to Gene Ontology)
Downloading data from:
http://www.geneontology.org/external2go/interpro2go
...downloaded to /tmp/LRbPjjDpYs. Parser starting...
...1000 records processed...
...2000 records processed...
...3000 records processed...
...4000 records processed...
...could not find target object GO:0008606 (for source IPR002291) -
skipping...
...5000 records processed...
...6000 records processed...
...7000 records processed...
...could not find target object GO:0008567 (for source IPR004273) -
skipping...
...8000 records processed...
...9000 records processed...
...10000 records processed...
...11000 records processed...
...parser finished with 11395 records.
All parsers run.
---END OUTPUT---

The config file is a simple XML format. The parsers section defines all
the parsers and where they get their data files from (scriptdir is where
the helper scripts live). The optional httpproxy section defines your
proxy server if you need one, and the database section holds the
settings for connecting to your BioSQL database.

Each parser in the parsers section has a name and a type (TERM2TERM for
relations such as InterPro to GO, or SEQ2TERM for relations such as
SwissProt to GO). The sourcenamespace is either the ontology name of the
source terms, or the biodatabase name of the source accessions. The
targetnamespace should be the ontology name of the target terms. The
script setting names the helper script to run, and the server section
says where to download the data file that the script parses.
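
As an illustration only, the sketch below reads a config of this general shape with Python's standard library. The real conf.xml schema is not shown in this post, so the XML layout here is hypothetical: which names are elements and which are attributes, the `config` root, the `description` element, the `dbname` child, the proxy address, and the namespace values are all assumptions. Only the section and setting names (parsers, scriptdir, httpproxy, database, name, type, sourcenamespace, targetnamespace, script, server) and the interpro2go URL come from the text above.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout; the actual conf.xml in CVS may be structured differently.
SAMPLE_CONF = """\
<config>
  <httpproxy>http://proxy.example.org:8080</httpproxy>
  <database>
    <dbname>biosql</dbname>
  </database>
  <parsers scriptdir=".">
    <parser name="interpro2go" type="TERM2TERM">
      <description>links InterPro to Gene Ontology</description>
      <sourcenamespace>InterPro</sourcenamespace>
      <targetnamespace>Gene Ontology</targetnamespace>
      <script>interpro2go.pl</script>
      <server>http://www.geneontology.org/external2go/interpro2go</server>
    </parser>
  </parsers>
</config>
"""


def load_parsers(xml_text):
    """Return one dict per <parser> element in the assumed layout."""
    root = ET.fromstring(xml_text)
    parsers_node = root.find("parsers")
    scriptdir = parsers_node.get("scriptdir")
    result = []
    for node in parsers_node.iter("parser"):
        result.append({
            "name": node.get("name"),
            "type": node.get("type"),
            "source": node.findtext("sourcenamespace"),
            "target": node.findtext("targetnamespace"),
            "script": f"{scriptdir}/{node.findtext('script')}",
            "url": node.findtext("server"),
        })
    return result


for parser in load_parsers(SAMPLE_CONF):
    print(parser["name"], parser["type"], parser["script"])
```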

TERM2TERM helper scripts output the first term name, a comma, then the
second term name.

SEQ2TERM helper scripts output the accession and version separated by a
full stop, a comma, then the term name.
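
The two line formats can be read back with a few lines of code. The sketch below is in Python and is not how importrelation.pl does it; the `Relation` type, the `parse_pair` function, and the accession P12345.2 are made up for illustration.

```python
from typing import NamedTuple, Optional


class Relation(NamedTuple):
    source: str             # term name, or sequence accession
    version: Optional[str]  # sequence version; None for TERM2TERM
    target: str             # target term name


def parse_pair(line, relation_type):
    """Split one helper output line according to its relation type."""
    source, target = line.strip().split(",", 1)
    if relation_type == "TERM2TERM":
        return Relation(source, None, target)
    if relation_type == "SEQ2TERM":
        # The version follows the last full stop in the source field.
        accession, _, version = source.rpartition(".")
        if not accession:
            raise ValueError(f"no accession.version in {source!r}")
        return Relation(accession, version, target)
    raise ValueError(f"unknown relation type {relation_type!r}")


print(parse_pair("IPR002291,GO:0008606", "TERM2TERM"))
print(parse_pair("P12345.2,GO:0008606", "SEQ2TERM"))  # hypothetical accession
```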

So... now all I need is some more helper scripts and the appropriate
datasources to feed them with! It would be very easy to write some more
once the datasources have been tracked down. Any volunteers? I'll keep
working on my own ones for the time being until I hear some more.

cheers,
Richard

PS. The code is not very well commented and has no error checking in
it, but it works if you're sensible. I might tidy it up one day...

---
Richard Holland
Bioinformatics Database Developer
ITS, Agresearch Invermay x3279



-----Original Message-----
From: Hilmar Lapp [mailto:hlapp at gmx.net] 
Sent: Wednesday, 7 January 2004 5:51 a.m.
To: Holland, Richard
Cc: bioperl-l at bioperl.org
Subject: Re: [Bioperl-l] Relations


I've heard someone wanted to code this up half a year ago, but AFAIK 
the script never made it into the repository. I think this would be 
very useful to have.

The GO associations for RefSeq are present in RefSeq most of the time 
but obfuscated in the feature table. I wrote a SeqProcessor that 
promotes them to annotations, but RefSeq keeps changing the format of 
that on me, so even though possible it's not an approach that warms 
your heart. Note that LocusLink (LL) also comes with GO links in the
records, which are extracted fairly well by the parser.

	-hilmar

On Monday, January 5, 2004, at 04:15  PM, Holland, Richard wrote:

> Me again! Sorry about this heap of messages from me today, I'm having 
> a couple of BioPerl days to hammer out all the things my boss wants 
> done before the end of the month...
>
> Has anyone got a script to load GO term associations into BioSQL 
> bioentry_qualifier_value with? If not I'll put one together and post 
> it when it's done, but I don't want to write one if one already 
> exists. I'm interested in linking GO terms to RefSeqs (not sure where 
> to get the associations from there though), GO to InterPro, and GO to 
> SwissProt. Also InterPro to SwissProt if possible.
>
> cheers,
> Richard
>
> ---
> Richard Holland
> Bioinformatics Database Developer
> ITS, Agresearch Invermay x3279
>
>
> =======================================================================
> Attention: The information contained in this message and/or attachments
> from AgResearch Limited is intended only for the persons or entities
> to which it is addressed and may contain confidential and/or privileged
> material. Any review, retransmission, dissemination or other use of, or
> taking of any action in reliance upon, this information by persons or
> entities other than the intended recipients is prohibited by AgResearch
> Limited. If you have received this message in error, please notify the
> sender immediately.
> =======================================================================
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org 
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------




