[Bioperl-l] [Biopython-dev] SwissProt DE lines and bioentry.description field in BioSQL

Peter biopython at maubp.freeserve.co.uk
Mon May 18 13:38:03 UTC 2009


On Sun, May 17, 2009 at 4:21 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
>
> On May 17, 2009, at 8:40 AM, Peter wrote:
>>
>> [...] Here you have mapped RecName and AltName fields in the DE lines to
>> Name and Synonyms (shouldn't that be Synonym singular?).
>
> The example is for the GN lines in SwissProt, not the DE lines.

Ah, that probably explains some of my confusion.

>> In this example, searching the database using one of the SwissProt
>> AltNames (synonyms), or filtering on the Flags sounds like a
>> reasonable request - but this would be very difficult if the data is
>> stored inside XML strings.
>
> Actually no. Modern full-text indexers (inside or outside the database) can
> index XML text columns right away and very well. In fact, for the last
> project that I built a full-text search for (on top of a BioSQL database) I
> did that by writing custom XML documents to a separate table for each
> record I wanted indexed. Oracle's full text indexer did the rest. I also built a
> separate identifier/name/accession index that pulled all the gene names,
> symbols, accession numbers, identifiers etc into a single table for
> indexing.

OK, when I said searching "would be very difficult if the data is
stored inside XML strings", maybe it wasn't so difficult for you - but
that still sounds complicated!

Sticking with the GN lines and their synonyms: if each synonym were
stored as a simple tag/value pair, as is usual in BioSQL, I would write
an SQL statement to search the annotation table for rows where the term
id is the one associated with a GN synonym and the annotation value is
"HABP1".  Simple.
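
To make that concrete, here is a minimal sketch of the tag/value
lookup using Python's sqlite3 module. The two tables are a cut-down
stand-in for the BioSQL term and bioentry_qualifier_value tables (the
real schema has more columns), and the term names, ids and rows are
assumptions made up for illustration, not what any loader actually
writes.

```python
import sqlite3

# Simplified, hypothetical stand-in for two BioSQL tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE term (
    term_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE bioentry_qualifier_value (
    bioentry_id INTEGER NOT NULL,
    term_id     INTEGER NOT NULL REFERENCES term(term_id),
    value       TEXT,
    rank        INTEGER NOT NULL DEFAULT 0
);
""")

# Term names here are assumptions, chosen only for this example.
conn.executemany(
    "INSERT INTO term VALUES (?, ?)",
    [(1, "gene_name"), (2, "gene_synonym")],
)

# Illustrative rows: one value per row, each tagged with a term.
conn.executemany(
    "INSERT INTO bioentry_qualifier_value VALUES (?, ?, ?, ?)",
    [
        (101, 1, "C1QBP", 0),
        (101, 2, "GC1QBP", 0),
        (101, 2, "HABP1", 1),
        (102, 1, "OTHER1", 0),
    ],
)


def find_by_synonym(conn, synonym):
    """Return ids of entries with a gene_synonym annotation equal to synonym."""
    rows = conn.execute(
        """
        SELECT q.bioentry_id
        FROM bioentry_qualifier_value AS q
        JOIN term AS t ON t.term_id = q.term_id
        WHERE t.name = ? AND q.value = ?
        ORDER BY q.bioentry_id
        """,
        ("gene_synonym", synonym),
    ).fetchall()
    return [row[0] for row in rows]


hits = find_by_synonym(conn, "HABP1")
```

Because the match is an exact comparison on (term_id, value), an
ordinary index on those two columns would be enough to make it fast.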

Using the XML approach, are you suggesting a full text search on the
annotation value field, looking for any rows where the field contains
"<Synonyms>HABP1</Synonyms>" and the term id is the one used for the
GN lines' XML string? This sounds simplistic and probably rather
slow - presumably why you resorted to the more complicated indexing
scheme described above?
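
For comparison, a rough sketch of the naive substring search against
an XML-valued annotation, again with sqlite3. The table name, the term
id and the XML layout (one <Synonyms> element per synonym) are all
assumptions for illustration; I don't know the actual XML format being
proposed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical single annotation table holding one XML string per GN line.
conn.execute(
    "CREATE TABLE annotation (bioentry_id INTEGER, term_id INTEGER, value TEXT)"
)

GN_XML_TERM = 7  # made-up term id standing for "GN line stored as XML"

conn.executemany(
    "INSERT INTO annotation VALUES (?, ?, ?)",
    [
        (101, GN_XML_TERM,
         "<GN><Name>C1QBP</Name>"
         "<Synonyms>GC1QBP</Synonyms><Synonyms>HABP1</Synonyms></GN>"),
        (102, GN_XML_TERM, "<GN><Name>OTHER1</Name></GN>"),
        # Same text under a different term id; must not be returned.
        (103, 8, "<Synonyms>HABP1</Synonyms>"),
    ],
)


def find_by_xml_synonym(conn, synonym):
    """Substring search for <Synonyms>synonym</Synonyms> in GN XML values.

    Note: LIKE wildcards (% and _) inside synonym are not escaped here.
    """
    pattern = "%<Synonyms>" + synonym + "</Synonyms>%"
    rows = conn.execute(
        """
        SELECT bioentry_id
        FROM annotation
        WHERE term_id = ? AND value LIKE ?
        ORDER BY bioentry_id
        """,
        (GN_XML_TERM, pattern),
    ).fetchall()
    return [row[0] for row in rows]


xml_hits = find_by_xml_synonym(conn, "HABP1")
```

The pattern starts with a wildcard, so a plain index on the value
column cannot narrow the search and every row for that term has to be
scanned. It also depends on the exact XML serialisation: attributes or
whitespace inside the element would make the match fail.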

> What I mean is, a fully normalized relational representation, especially if
> nested, is often not the most efficient data structure for efficient
> searching and filtering.

OK.  But do we really need to worry about complex nested structures
for the SwissProt annotation (or in general)?

Peter


