[DAS2] Fwd: thoughts on DAS/2 validation

Tue Dec 14 01:35:33 UTC 2004

Begin forwarded message:

From: Andrew Dalke <dalke at dalkescientific.com>
Date: September 26, 2004 11:57:46 PM PDT
Subject: thoughts on DAS/2 validation

>   writeback from DAS/2 clients to DAS/2 servers (there will be a
> meeting Monday on this)
>   implementation details of DAS/2 clients, servers, validation suite
> (I want to keep focus on spec and logistics)

I don't think I need to attend that writeback meeting.  I
contributed my ideas during the locking/transaction meeting
from a couple of days ago, along with my followup email.

I want to mention a bit about my tentative plans for testing
and validation.

I've mentioned that I wanted to investigate using RelaxNG (RNG)
for the validation.  (www.relaxng.org).  I did that this evening.

Short history.  XML originally used DTDs for schema validation.
DTDs have various problems.  They don't handle data types (e.g.,
'start' must be a number) or unordered collections.  XML Schema
was supposed to improve on DTDs but it's a complicated standard
with lots of bits and pieces added to it in committee.

James Clark and a small group of other people worked on a few
alternatives, which merged to become RNG.  RNG is definitely
easier to understand and use and (IMNSHO) more elegant than
XML Schema.

RNG has two representations, one in XML and the other using
a "compact" syntax which is easier for humans to work with.
The first one uses the filename suffix ".rng" and the
latter ".rnc".

There is a program called "trang" to convert between these
two formats.  It can also read and write DTDs and write
XML Schema.  Because conversion is so easy, I withdraw my
complaint about using DTDs.

To make things even more fun, there's a program called
"Examplotron" which is an XSLT (!) program to convert an
example XML document into an RNG for it.

I tested it out with the SOURCES example.  Here's the XML
from the spec, but without the DTD:

<?xml version="1.0" standalone="no"?>
<SOURCES
     xmlns="http://www.biodas.org/ns/das-genome/2.00"
     xmlns:xlink="http://www.w3.org/1999/xlink"
     xml:base="http://dev.wormbase.org/das-genome/">
   <SOURCE id="volvox" description="Volvox Example Database"
           taxon="http://www.ncbi.nlm.nih.gov/taxon-browser?id=29118">
      <VERSION id="volvox/1" description="Build 1, October 2002" />
      <VERSION id="volvox/2" description="Build 2, January 2003" />
   ... truncated ...

Here's the RNG for it.  The "ega:example" is used to keep the
original data around.  This makes the RNG invertible back to
the example XML.  An RNG processor ignores elements and attributes
in the schema that are outside the RNG namespace.  This lets people
add new fields to a schema, including documentation.
There's even software to turn an RNG plus documentation into
DocBook format.

<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         xmlns:ega="http://examplotron.org/annotations/"
         xmlns:sch="http://www.ascc.net/xml/schematron"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
    <start>
       <element name="SOURCES"
                ns="http://www.biodas.org/ns/das-genome/2.00">
          <optional>
             <attribute name="base"
                        ns="http://www.w3.org/XML/1998/namespace">
                <ega:example
                    xml:base="http://dev.wormbase.org/das-genome/"/>
             </attribute>
          </optional>
          <oneOrMore>
             <element name="SOURCE">
                <optional>
                   <attribute name="id">
                      <ega:example id="volvox"/>
                   </attribute>
                </optional>
   ... truncated ...

The compact form:

namespace ega = "http://examplotron.org/annotations/"
default namespace ns1 = "http://www.biodas.org/ns/das-genome/2.00"
namespace sch = "http://www.ascc.net/xml/schematron"
namespace xlink = "http://www.w3.org/1999/xlink"

start =
   element SOURCES {
     [ ega:example [ xml:base = "http://dev.wormbase.org/das-genome/" ] ]
     attribute base { text }?,
     (element SOURCE {
        [ ega:example [ id = "volvox" ] ] attribute id { text }?,
        [ ega:example [ description = "Volvox Example Database" ] ]
   ... truncated ...

and the DTD:

<!ELEMENT ns1:SOURCES (SOURCE)+>
<!ATTLIST ns1:SOURCES
   xmlns:ns1 CDATA #FIXED 'http://www.biodas.org/ns/das-genome/2.00'
   xml:base CDATA #IMPLIED>

<!ELEMENT ns1:SOURCE (VERSION)+>
<!ATTLIST ns1:SOURCE
   xmlns:ns1 CDATA #FIXED 'http://www.biodas.org/ns/das-genome/2.00'
   id CDATA #IMPLIED
   description CDATA #IMPLIED
   taxon CDATA #IMPLIED>
   ... truncated ...

What this means is that I should be able to get a working schema
up rather quickly.  This would then be used by the validator.
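
As a rough illustration of the kind of rule such a schema would enforce,
here is a minimal sketch in Python.  It is not an RNG validator; it
hand-codes only the content model visible in the DTD above (SOURCES holds
one or more SOURCE, each SOURCE holds one or more VERSION) using the
standard library.  The function names and the cut-down EXAMPLE document,
which is closed by hand because the spec example above is truncated, are
assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

DAS_NS = "http://www.biodas.org/ns/das-genome/2.00"


def q(tag):
    """Return the namespace-qualified tag name that ElementTree uses."""
    return "{%s}%s" % (DAS_NS, tag)


def check_sources(xml_text):
    """Return a list of problems; an empty list means the checks passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return ["not well-formed: %s" % err]
    if root.tag != q("SOURCES"):
        return ["root element is %r, expected SOURCES" % root.tag]
    problems = []
    sources = root.findall(q("SOURCE"))
    if not sources:
        problems.append("SOURCES has no SOURCE children")
    for source in sources:
        if not source.findall(q("VERSION")):
            problems.append(
                "SOURCE %r has no VERSION children" % source.get("id"))
    return problems


# Hypothetical cut-down document built from the lines shown in the spec
# example; the closing tags are added by hand.
EXAMPLE = """<?xml version="1.0" standalone="no"?>
<SOURCES xmlns="http://www.biodas.org/ns/das-genome/2.00">
  <SOURCE id="volvox" description="Volvox Example Database">
    <VERSION id="volvox/1" description="Build 1, October 2002" />
  </SOURCE>
</SOURCES>
"""

print(check_sources(EXAMPLE))
```

A real validator would hand the document and the .rng file to an RNG
implementation instead of repeating the rules in code.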

         ==========================

Here's my plan.  I'll continue to help with the spec by
finding places that need to be clarified or tweaked.  The
DAS/1 spec contained typos in the examples, so I'll make sure
there's some way to extract all examples from the spec
so they can be tested.

I'll develop an RNG schema and come up with a set of test
documents, some legal and some illegal, to make sure the validator
is able to catch and report errors properly.

These sample documents will likely consist of two files, one for
the XML and one for the HTTP headers.  This will let me run the
regression tests without a server.
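
A minimal sketch of how such a two-file sample could be loaded and
checked offline, in Python.  The ".xml" / ".headers" file naming, the
helper names, and the expected "text/xml" content type are all
assumptions for illustration; nothing here is taken from the spec.

```python
import email
import os
import tempfile
import xml.etree.ElementTree as ET


def write_sample(directory, name, header_text, xml_text):
    """Store one canned response as NAME.headers plus NAME.xml."""
    with open(os.path.join(directory, name + ".headers"), "w") as f:
        f.write(header_text)
    with open(os.path.join(directory, name + ".xml"), "w") as f:
        f.write(xml_text)


def load_sample(directory, name):
    """Return (headers, root element) for one stored response."""
    with open(os.path.join(directory, name + ".headers")) as f:
        # HTTP header lines use the same "Name: value" layout as mail
        # headers, so the email parser can read them.
        headers = email.message_from_string(f.read())
    root = ET.parse(os.path.join(directory, name + ".xml")).getroot()
    return headers, root


def check_sample(directory, name, expected_type="text/xml"):
    """Return a list of problems found in one stored response."""
    headers, root = load_sample(directory, name)
    problems = []
    if headers.get("Content-Type") is None:
        problems.append("missing Content-Type header")
    elif headers.get_content_type() != expected_type:
        problems.append(
            "unexpected Content-Type %r" % headers.get_content_type())
    return problems


with tempfile.TemporaryDirectory() as tmp:
    write_sample(tmp, "sources", "Content-Type: text/xml\n", "<SOURCES/>")
    print(check_sample(tmp, "sources"))
```

Because the headers and body come from disk, the same checks can later
be pointed at a live server response without changing the test logic.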

I'll come up with a set of tests for data integrity.  For example,
get a sequence from a range and make sure the FASTA length is of
the right size.  Get all of the features in a range and test that
the server's filters work correctly.
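
The FASTA length check could look something like the following Python
sketch.  It assumes the requested range is 1-based and inclusive, so a
request for 1..10 should return 10 residues; that coordinate convention
and the function names are assumptions for illustration and would need
to match whatever the spec settles on.

```python
def fasta_sequence_length(fasta_text):
    """Count the residues in the first record of a FASTA string."""
    lines = fasta_text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("missing FASTA header line")
    length = 0
    for line in lines[1:]:
        if line.startswith(">"):
            break  # start of a second record; stop counting
        length += len(line.strip())
    return length


def range_length(start, end):
    # ASSUMPTION: 1-based, inclusive coordinates.
    return end - start + 1


def check_fasta_range(fasta_text, start, end):
    """True if the returned sequence has the length the range implies."""
    return fasta_sequence_length(fasta_text) == range_length(start, end)


# Hypothetical server response for a request covering positions 1..10.
RESPONSE = ">example 1-10\nACGTA\nCGTAC\n"
print(check_fasta_range(RESPONSE, 1, 10))
```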

The writeback tests will be harder.  I'll do a set of adds,
deletes, and modifies.  If we have transaction support on the
server I'll test that.  I'll test for proper lock semantics.

This is harder because it requires that I be able to scribble on
a server's database, and that the test code be given a
username/password with write permissions.  More likely two
accounts, to make sure that locking works correctly.

What I would really like is to be able to provide a known
data set that can be imported by a server specifically for
testing.  Then I could include tests that would know what
the answer is supposed to be, rather than simply testing
for internal integrity.  Some tests would include being able
to search for a feature with an attribute that includes a
& or / in it, and making sure that range searches don't
have off-by-one errors.
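
Two of those known-answer checks, sketched in Python.  The query
parameter name "name", the feature coordinates, and the inclusive
overlap rule are assumptions for illustration only; the spec's query
syntax and coordinate convention would replace them.

```python
from urllib.parse import parse_qs, urlencode


def build_query(params):
    """Encode query parameters so & and / survive inside a value."""
    return urlencode(params)


def overlaps(feature_start, feature_end, query_start, query_end):
    # ASSUMPTION: both ranges are inclusive at both ends.
    return feature_start <= query_end and query_start <= feature_end


# A value containing both troublesome characters must round-trip.
query = build_query({"name": "a&b/c"})
print(query)
print(parse_qs(query))

# Boundary cases around a hypothetical feature at 10..20.  The ranges
# that touch the feature by exactly one position are the ones an
# off-by-one bug would get wrong.
CASES = [
    ((1, 9), False),    # ends one position before the feature
    ((1, 10), True),    # touches the first position
    ((20, 30), True),   # touches the last position
    ((21, 30), False),  # starts one position after the feature
]
for (start, end), expected in CASES:
    assert overlaps(10, 20, start, end) == expected
```

With a known data set loaded on the server, the same table of cases
could be turned into range queries and compared against the features
actually returned.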

This requires that there be data files in a format easily
imported into a server.  Any thoughts?  I know Lincoln's
DAS/1 Perl server had something like that but the file
format wasn't well defined (at least for my picky tastes ;)

As I recall, writeback support wasn't planned until
year 2 of the grant, so the testing for this year should
only cover format validation and basic integrity checks.

					Andrew
					dalke at dalkescientific.com