[Biopython-dev] SwissProt DE lines and bioentry.description field in BioSQL

Hilmar Lapp hlapp at gmx.net
Sun May 17 15:21:59 UTC 2009

On May 17, 2009, at 8:40 AM, Peter wrote:

> On 5/17/09, Hilmar Lapp <hlapp at gmx.net> wrote:
>> On May 16, 2009, at 7:28 PM, Peter wrote:
>>>> That could be changed to an XML string:
>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>> <gene_names>
>>>> <gene_name>
>>>>  <Name>GC1QBP</Name>
>>>>  <Synonyms>HABP1</Synonyms>
>>>>  <Synonyms>SF2P32</Synonyms>
>>>>  <Synonyms>C1QBP</Synonyms>
>>>> </gene_name>
>>>> </gene_names>
>>>> Thinking about this we should attempt to coalesce around a standard
>>>> instead of forcing the other Bio* to a specific format.
> [...] Here you have mapped RecName and AltName fields in the DE lines to
> Name and Synonyms (shouldn't that be Synonym singular?).

The example is for the GN lines in SwissProt, not the DE lines.

> [...]
> On 5/17/09, Hilmar Lapp <hlapp at gmx.net> wrote:
>> Not necessarily. If you have a flat serialization (such as XML) the
>> nested structure isn't needed. Of course that's not a fully normalized
>> relational representation, but if you had one, how often would it be
>> used, how efficient would those queries be (SQL is poor at nested or
>> recursive data structures), and how much pain would it be to write the
>> object-relational mappings?
> In this example, searching the database using one of the SwissProt
> AltNames (synonyms), or filtering on the Flags sounds like a
> reasonable request - but this would be very difficult if the data is
> stored inside XML strings.

Actually no. Modern full-text indexers (inside or outside the
database) can index XML text columns right away and very well. In
fact, for the last project that I built a full-text search for (on top
of a BioSQL database) I did that by writing custom XML documents to a
separate table for each record I wanted indexed. Oracle's full-text
indexer did the rest. I also built a separate identifier/name/accession
index that pulled all the gene names, symbols, accession numbers,
identifiers, etc. into a single table for indexing.

What I mean is, a fully normalized relational representation,
especially if nested, is often not the most efficient data structure
for searching and filtering.
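
A minimal sketch of the kind of flat name index described above, to make
the idea concrete. It is not the code from the project mentioned: it
assumes SQLite in place of Oracle, and the table name_index, its columns,
and the helpers extract_names, build_index and find_entries are all
hypothetical names chosen for illustration. The XML is the GN-line
example quoted earlier in this thread.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Gene-name XML in the shape quoted earlier in the thread.
GENE_NAMES_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<gene_names>
<gene_name>
 <Name>GC1QBP</Name>
 <Synonyms>HABP1</Synonyms>
 <Synonyms>SF2P32</Synonyms>
 <Synonyms>C1QBP</Synonyms>
</gene_name>
</gene_names>
"""


def extract_names(xml_bytes):
    """Return (name, kind) pairs for every Name and Synonyms element."""
    root = ET.fromstring(xml_bytes)
    pairs = []
    for gene in root.iter("gene_name"):
        for child in gene:
            if child.tag in ("Name", "Synonyms") and child.text:
                kind = "name" if child.tag == "Name" else "synonym"
                pairs.append((child.text.strip(), kind))
    return pairs


def build_index(conn, bioentry_id, xml_bytes):
    """Flatten one record's names into a single lookup table.

    name_index is a hypothetical table, not part of the BioSQL schema.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS name_index ("
        "bioentry_id INTEGER NOT NULL, "
        "name TEXT NOT NULL, "
        "kind TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS name_index_name ON name_index (name)"
    )
    conn.executemany(
        "INSERT INTO name_index (bioentry_id, name, kind) VALUES (?, ?, ?)",
        [(bioentry_id, name, kind) for name, kind in extract_names(xml_bytes)],
    )


def find_entries(conn, name):
    """Return the ids of all records that carry the given name or synonym."""
    rows = conn.execute(
        "SELECT DISTINCT bioentry_id FROM name_index "
        "WHERE name = ? ORDER BY bioentry_id",
        (name,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
build_index(conn, 1, GENE_NAMES_XML)
print(find_entries(conn, "HABP1"))
```

The XML string stays the stored representation; the lookup table is a
derived index that can be rebuilt from it at any time, so searching by
synonym does not require a normalized schema for the names themselves.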

: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
