[Biopython-dev] [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL

Sun May 17 12:40:47 UTC 2009

On 5/17/09, Hilmar Lapp <hlapp at gmx.net> wrote:
>
>  On May 16, 2009, at 7:28 PM, Peter wrote:
> > > That could be changed to an XML string:
> > >
> > > <?xml version="1.0" encoding="UTF-8"?>
> > > <gene_names>
> > >  <gene_name>
> > >   <Name>GC1QBP</Name>
> > >   <Synonyms>HABP1</Synonyms>
> > >   <Synonyms>SF2P32</Synonyms>
> > >   <Synonyms>C1QBP</Synonyms>
> > >  </gene_name>
> > > </gene_names>
> > >
> > > Thinking about this we should attempt to coalesce around a standard
> > > instead of forcing the other Bio*  to a specific format.

Absolutely - some common standard should be agreed.

Would you envision doing this for other structured fields, inventing a
new mini XML format each time?  That seems open ended and likely to
cause a lot of work keeping all the Bio* projects synchronised.

Here you have mapped RecName and AltName fields in the DE lines to
Name and Synonyms (shouldn't that be Synonym singular?).  I also don't
get why you have used a gene_name entry inside a gene_names list.
Would you hold the Contains information and the Flags information from
the DE lines in separate XML entries?

I would have gone for something much closer to the original DE line
markup i.e. using the field names UniProt use, RecName and AltName,
rather than mapping these to Name and Synonym.

> > How would you record this in BioSQL?  As an XML string for an annotation
> > value?
>
> Yes. A TagTree object can be serialized to XML, and the XML can be stored
> as the annotation value in BioSQL. As the XML can be read back in, it allows
> full round-tripping.

Assuming you stored all the DE markup, then yes, a round trip back to
the SwissProt file could be possible.  And, depending on the details
of the XML structure used, it would be possible to represent this in a
python structure too.
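
As a rough sketch of that last point, here is how the quoted XML could be
read into a plain Python structure and written back out, using only the
standard library. The function names parse_gene_names and to_xml are made
up for illustration (they are not Biopython or BioPerl API), and the XML
declaration line is left out for brevity.

```python
import xml.etree.ElementTree as ET

# The gene_names example quoted above, minus the XML declaration.
XML = """<gene_names>
 <gene_name>
  <Name>GC1QBP</Name>
  <Synonyms>HABP1</Synonyms>
  <Synonyms>SF2P32</Synonyms>
  <Synonyms>C1QBP</Synonyms>
 </gene_name>
</gene_names>"""


def parse_gene_names(xml_text):
    """Hypothetical helper: XML string -> list of dicts."""
    root = ET.fromstring(xml_text)
    return [
        {
            "Name": gene.findtext("Name"),
            "Synonyms": [syn.text for syn in gene.findall("Synonyms")],
        }
        for gene in root.findall("gene_name")
    ]


def to_xml(genes):
    """Hypothetical helper: list of dicts -> XML string (no pretty printing)."""
    root = ET.Element("gene_names")
    for gene in genes:
        node = ET.SubElement(root, "gene_name")
        ET.SubElement(node, "Name").text = gene["Name"]
        for syn in gene["Synonyms"]:
            ET.SubElement(node, "Synonyms").text = syn
    return ET.tostring(root, encoding="unicode")


genes = parse_gene_names(XML)
# Round trip at the level of the Python structure (whitespace is not preserved).
round_tripped = parse_gene_names(to_xml(genes))
```

Only the parsed content survives the round trip here; the exact whitespace
of the original XML string does not.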

> > Brad has suggested JSON might be useful for this kind of thing (see
> > also per-letter-annotation discussion).
>
> JSON could be another serialization format, but XML is equally or better
> supported in all languages except JavaScript. Furthermore, you could just
> send the XML to the browser and have an XSLT (either directly, or indirectly
> through JavaScript doing the transformation) do the rendering.

I have no strong preference for either XML or JSON (but would rather
avoid them if they are not really needed).  For other types of
annotation there may be a clearer advantage for one over the other,
e.g. per letter annotation like the secondary structure of a protein
sequence, or the quality scores of a nucleotide contig.

On 5/17/09, Hilmar Lapp <hlapp at gmx.net> wrote:
> Not necessarily. If you have a flat serialization (such as XML) the nested
> structure isn't needed. Of course that's not a fully normalized relational
> representation, but if you had one, how often would it be used, how
> efficient would those queries be (SQL is poor at nested or recursive data
> structures), and how much pain would it be to write the object-relational
> mappings?

In this example, searching the database using one of the SwissProt
AltNames (synonyms), or filtering on the Flags sounds like a
reasonable request - but this would be very difficult if the data is
stored inside XML strings.
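
To make the concern concrete, here is a small sketch using an in-memory
SQLite database. The single "annotation" table and the <DE> XML layout are
invented for illustration only; they are not the real BioSQL schema or any
agreed XML format.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, heavily simplified table (NOT the BioSQL schema).
conn.execute("CREATE TABLE annotation (entry TEXT, tag TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO annotation VALUES (?, ?, ?)",
    [
        # One entry stored as plain tag/value rows.
        ("flat_entry", "AltName", "Full=Alpha-globulin"),
        ("flat_entry", "Flags", "Precursor"),
        # Another entry with the same data inside one (invented) XML string.
        ("xml_entry", "DE_xml",
         "<DE><AltName>Full=Alpha-globulin</AltName>"
         "<Flags>Precursor</Flags></DE>"),
    ],
)

# Plain rows: an exact match on tag and value.
flat_hits = [row[0] for row in conn.execute(
    "SELECT entry FROM annotation WHERE tag = ? AND value = ?",
    ("AltName", "Full=Alpha-globulin"),
)]

# XML string: a substring match that must know the serialized form.
xml_hits = [row[0] for row in conn.execute(
    "SELECT entry FROM annotation WHERE tag = ? AND value LIKE ?",
    ("DE_xml", "%<AltName>Full=Alpha-globulin</AltName>%"),
)]
```

The LIKE query only works while the XML is serialized exactly as the
pattern expects (no extra whitespace, attributes or escaping), and a
pattern with a leading wildcard cannot use an ordinary index on the value
column.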

Of course, because the RecName and AltName entries are top level, we
could just record them as normal - simple strings in the annotations
table.  This seems much nicer.  Likewise the "Flags: Precursor;" line.
That gives the following tag/value pairs, which could be stored in the
bioentry_qualifier_value table:

AltName = "Full=11S globulin seed storage protein II"
AltName = "Full=Alpha-globulin"
Flags = "Precursor"

(the RecName field, "Full=11S globulin seed storage protein 2", could
be used for the bioentry.description instead)
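
A minimal sketch of how the top level DE lines might be split into those
tag/value pairs. The DE lines are reconstructed from the values above, and
top_level_pairs and split_description are hypothetical helpers, not
existing Biopython functions. It assumes one field per line and ignores
Contains blocks entirely.

```python
# Top level DE lines reconstructed from the values discussed above.
DE_LINES = """\
DE   RecName: Full=11S globulin seed storage protein 2;
DE   AltName: Full=11S globulin seed storage protein II;
DE   AltName: Full=Alpha-globulin;
DE   Flags: Precursor;
"""


def top_level_pairs(de_text):
    """Hypothetical helper: DE lines -> list of (tag, value) tuples."""
    pairs = []
    for line in de_text.splitlines():
        body = line[5:].strip()  # drop the "DE   " line prefix
        if not body:
            continue
        tag, _, value = body.partition(": ")
        pairs.append((tag, value.rstrip(";")))
    return pairs


def split_description(pairs):
    """Use the first RecName as the description, keep the rest as qualifiers."""
    description = None
    qualifiers = []
    for tag, value in pairs:
        if tag == "RecName" and description is None:
            description = value
        else:
            qualifiers.append((tag, value))
    return description, qualifiers


description, qualifiers = split_description(top_level_pairs(DE_LINES))
```

Here description would go to bioentry.description and each item in
qualifiers would become one tag/value row.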

The above are all pretty easy.  We only need to consider nesting (or
something like XML or JSON) for some of the DE information; in the
example discussed, that means the Contains lines.  Even this could be
done by storing each Contains entry as a single long string (holding
both the name and synonyms) taken directly from the DE line itself,
something like this:

Contains = "RecName: Full=11S globulin seed storage protein 2 acidic
chain;\nAltName: Full=11S globulin seed storage protein II acidic
chain;"
Contains = "RecName: Full=11S globulin seed storage protein 2 basic
chain;\nAltName: Full=11S globulin seed storage protein II basic
chain;"
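
A rough sketch of how those Contains strings could be built from the DE
lines and split apart again later. The DE lines are reconstructed from the
values in this email, and contains_values and split_contains are
hypothetical helpers (not Biopython API). It assumes the nested lines are
indented further than the "Contains:" line, as in the SwissProt flat file.

```python
# DE lines reconstructed from the values discussed in this email.
DE_LINES = """\
DE   RecName: Full=11S globulin seed storage protein 2;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
DE     AltName: Full=11S globulin seed storage protein II acidic chain;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
DE     AltName: Full=11S globulin seed storage protein II basic chain;
DE   Flags: Precursor;
"""


def contains_values(de_text):
    """Hypothetical helper: one newline-joined string per Contains block."""
    blocks = []
    in_block = False
    for line in de_text.splitlines():
        body = line[5:]  # drop the "DE   " line prefix
        if body.strip() == "Contains:":
            blocks.append([])
            in_block = True
        elif in_block and body.startswith(" "):
            blocks[-1].append(body.strip())  # indented line inside the block
        else:
            in_block = False  # back at the top level
    return ["\n".join(block) for block in blocks]


def split_contains(value):
    """Hypothetical helper: stored string -> list of (tag, value) tuples."""
    return [
        tuple(part.rstrip(";").split(": ", 1))
        for part in value.split("\n")
    ]


values = contains_values(DE_LINES)
acidic = split_contains(values[0])
```

Each string in values would be stored as one Contains qualifier value, so
no XML or JSON is needed, at the price of a little parsing when the nested
names are wanted individually.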

Peter