[Open-bio-l] OBDA redux?

Thu Jan 12 16:49:25 UTC 2012

Hi Peter,

On 16 December 2011 12:11, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Thu, Dec 15, 2011 at 10:01 PM, Hamish McWilliam
> <hamish.mcwilliam at bioinfo-user.org.uk> wrote:
>>> Just a quick update on this: the old OBDA specs were still in CVS in
>>> the obda-specs module (the old obda site had the module wrong).
>>> I ran git cvsimport on that after I copied the CVS repo to my laptop,
>>> so it's now on github:
>>>
>>> https://github.com/OBF/OBDA
>>>
>>> We could probably work on updates from there.
>>
>> At the risk of derailing the current thread... a few comments on the
>> "modules" in the old OBDA:
>
> Well, given the broad title of OBDA redux, why not?

Exactly :-)

>> - BioCorba: while CORBA may live on in some embedded applications it
>> has mostly been replaced by SOAP and REST web services. I suspect
>> there are few copies of the BioCorba IDLs surviving today. Possibly of
>> historic interest, but since it doesn't actually include the IDLs it
>> is not really of any use.
>
> As far as I know, BioCorba is defunct.
>
>> - biofetch: originally implemented in EBI's dbfetch, also implemented
>> by BioRuby as biofetch which had a few extensions. EBI's dbfetch has
>> since been reimplemented and attempts to be compatible but only
>> provides partial support along with various extensions, including
>> those from BioRuby. See http://www.ebi.ac.uk/Tools/dbfetch/syntax.jsp.
>> I'm aware of client support in BioPerl, BioRuby and EMBOSS, not sure
>> of the current status for BioJava and BioPython.
>
> Current Biopython doesn't have anything for this, but I would probably
> want to implement this as a client not a server.

While there is an example implementation of a biofetch server in
BioPerl (http://search.cpan.org/~cjfields/BioPerl/examples/db/dbfetch),
it is the client implementations that have been the main focus in the
various projects. In BioPerl: Bio::Biblio, Bio::DB::BioFetch,
Bio::DB::EMBL, Bio::DB::RefSeq and Bio::DB::SwissProt use either
dbfetch or biofetch; in BioRuby: Bio::Fetch provides an interface to
biofetch servers, including the EBI's dbfetch.

>> - BioSQL: as you all know over at http://www.biosql.org/. The document
>> should probably be updated to point there.
>
> Agreed, done:
> https://github.com/OBF/OBDA/commit/5798f0b4a0e3b7fd0595e0ab3017d3afdda53549
>
>> - bioindex: the flat-file and BDB indexing formats. To which the
>> SQLite option will be added?
>
> Basically yes.
>
>> - naming: obsolete URN scheme. Various ontologies (e.g. EDAM) provide
>> possible replacements when required.
>
> This also has implications for the bioindex code as we need to
> specify the file format being indexed (e.g. FASTA or GenBank).

And possibly a layer of semantics for the database and data in the database.

>> - bioregistry: database discovery and meta-data. From having tried to
>> implement this, the bioregistry is too limited in the available
>> meta-data to be very useful, especially when it comes to data format
>> handling. Compare with the database definitions in EMBOSS
>> (http://emboss.sourceforge.net/docs/themes/Databases.html) and the
>> dbfetch meta-data
>> (http://www.ebi.ac.uk/Tools/dbfetch/syntax.jsp#Meta-information).

For the current EMBOSS documentation for the database definitions see
http://emboss.open-bio.org/html/adm/ch04s01.html.

> There was some partial code for this in Biopython, but it was
> deprecated and removed some time ago.

The bioregistry idea is conceptually quite useful: a common format in
which data services advertise the data they provide and the interfaces
available for accessing it has obvious benefits for client software.
A site describing its own services in a standardized way lets clients
and crawlers discover the available data sources at runtime, without
the problems inherent in centralized repositories. But the current
specification is too limited, since it does not allow for the
specification of data formats, or of database and data semantics. Use
of a richer format and convergence with the equivalent configuration
files in EMBOSS could revive the concept, and make implementing the
client support worthwhile again.

>> - XEmbl: REST and SOAP access to EMBL-Bank entries in XML.
>> The EBI's XEmbl service was replaced by the dbfetch
>> (http://www.ebi.ac.uk/Tools/dbfetch/) and WSDbfetch
>> (http://www.ebi.ac.uk/Tools/webservices/services/dbfetch) services,
>> since these provide roughly the same functionality with wider data
>> format support.
>
> Presumably the XML format for EMBL is now one of the INSDC
> formats also used at the NCBI? In any case, that whole folder
> is purely describing an (obsolete) EBI service, so can we just
> delete it?

The XML formats were not described as part of the XEmbl specification,
but instead were external XML formats (BSML and Agave XML) which have
not been adopted. The current XML formats for the INSDC member
databases are in two categories:
1. INSD XML (http://insdc.org/xmlstatus.html)
2. Member database specific formats, for example ENA EMBL-Bank XML
(see http://www.ebi.ac.uk/ena/about/embl_bank_format).

The XEmbl service specification itself is obsolete and can be removed.

>> Since I've been attempting to get dbfetch to support the biofetch and
>> bioregistry specifications, my interest is much more at the web
>> service end of things. I can certainly see options for using the
>> current alternatives in dbfetch and EMBOSS to revise the
>> specifications for biofetch and bioregistry.
>>
>> Hamish
>
> How does biofetch/bioregistry compare to DAS?

biofetch specifies an HTTP GET-based interface to data resources. The
databases and data formats available depend on the specific
implementation, and will generally include the main distribution
formats for the database and commonly used formats for the specific
type of data involved. For example, EBI's dbfetch provides EMBL-Bank
data in:
- EMBL flatfile format
- EMBL XML
- INSD XML
- Fasta sequence format
- SeqXML
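
As a rough illustration of the GET interface, here is a minimal Python
sketch that only builds a request URL (it does not contact the server).
The parameter names db, id, format and style are my assumption of the
usual biofetch/dbfetch query parameters, and the accession J00231 and
the biofetch_url helper are purely illustrative; check the dbfetch
syntax page before relying on any of this.

```python
from urllib.parse import urlencode

# Base URL as given for dbfetch elsewhere in this message.
DBFETCH_URL = "http://www.ebi.ac.uk/Tools/dbfetch/dbfetch"


def biofetch_url(db, ids, fmt="default", style="raw", base=DBFETCH_URL):
    """Build a biofetch-style GET URL.

    The parameter names (db, id, format, style) are assumed, not taken
    from this thread; this helper is hypothetical.
    """
    if isinstance(ids, str):
        ids = [ids]
    params = {"db": db, "id": ",".join(ids), "format": fmt, "style": style}
    # Keep commas literal so multiple identifiers stay readable.
    return base + "?" + urlencode(params, safe=",")


url = biofetch_url("embl", "J00231", fmt="fasta")
print(url)
```

The resulting URL could then be passed to any HTTP client; which
format names a given server accepts depends on the implementation.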

bioregistry describes the databases available at a site, providing
details of how to talk to the data source and the parameters required
to access a specific database. For example, for EMBL-Bank via dbfetch:

[embl]
protocol=biofetch
location=http://www.ebi.ac.uk/Tools/dbfetch/dbfetch
dbname=embl
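
Since the entry above is INI-like, a client could read it with a
standard INI parser. The following Python sketch parses just the
snippet as quoted here; the load_registry helper is hypothetical, and
a complete registry file may contain additional lines (such as a
version header) that this simple approach would not accept.

```python
import configparser

# The registry entry exactly as quoted in this message.
REGISTRY_TEXT = """\
[embl]
protocol=biofetch
location=http://www.ebi.ac.uk/Tools/dbfetch/dbfetch
dbname=embl
"""


def load_registry(text):
    """Return {section name: {key: value}} for an INI-style registry.

    Hypothetical helper: handles only plain [section] / key=value text.
    """
    # Disable interpolation so '%' in a URL is never treated specially.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {name: dict(parser[name]) for name in parser.sections()}


registry = load_registry(REGISTRY_TEXT)
entry = registry["embl"]
print(entry["protocol"], entry["location"], entry["dbname"])
```

A client would then pick the access code matching the protocol value
and hand it the location and dbname.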

DAS is a protocol and set of data formats focused around delivery of
sequence and sequence feature data. A DAS server provides meta-data
about its capabilities and the data available through it, but knows
nothing about other DAS servers. The DAS Registry
(http://www.dasregistry.org/) provides information about registered
DAS servers and addresses this limitation, but is centralized and DAS
specific. Alternative registries (see
http://www.ebi.ac.uk/Tools/webservices/tutorials/05_registries)
address the service type limitation, but are still centralized
resources.

DAS and biofetch are complementary: DAS provides granularity and
mash-up capabilities, while biofetch provides original and common data
formats.

bioregistry appears to be unused currently, but aims to provide a
format for sharing information about data services. Convergence of
this format with the database configurations in EMBOSS and with
service meta-data such as that provided by dbfetch would simplify
both client development and the maintenance of database configurations
in supporting systems.

> Separately, I suggest we rename the OBDA/preamble.txt
> file to README (or README.*) so it gets shown in GitHub,
> and then update it following this discussion with some
> context (like dates, current status of the different parts).

Sounds good to me.

> We should probably make the old OBDA CVS read only now.

I assume a pointer has been added to the contents of the OBDA CVS to
point to the new location on github, in which case making it read only
would be sensible.

Hamish