[Biopython-dev] SwissProt DE lines and bioentry.description field in BioSQL

Sun May 24 04:10:28 UTC 2009

I suggest that for the short term, we store the DE lines as one string in the same way as Bioperl 1.5 and 1.6, until we decide on a more advanced way to treat these lines. Currently Bio.SeqIO and Bio.SwissProt use different ways to handle the DE lines, and neither of them agrees with Bioperl.

--Michiel.
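
For illustration only, the "one string" storage proposed above could be sketched as follows. The join-with-a-single-space rule, the function name, and the sample DE lines are assumptions made for this sketch; it has not been checked against what BioPerl 1.5/1.6 actually writes to bioentry.description.

```python
def de_lines_to_description(lines):
    """Collapse SwissProt DE lines into a single description string.

    Assumption: the two-letter line code is dropped and the remaining
    text of each DE line is joined with one space.
    """
    parts = []
    for line in lines:
        if not line.startswith("DE"):
            continue  # ignore any non-DE lines
        text = line[2:].strip()  # drop the "DE" code and padding
        if text:
            parts.append(text)
    return " ".join(parts)


# Hypothetical DE lines in the RecName/AltName/Flags layout.
example_lines = [
    "DE   RecName: Full=Hypothetical example protein;",
    "DE   AltName: Full=Example synonym;",
    "DE   Flags: Precursor;",
]
description = de_lines_to_description(example_lines)
```

The resulting string keeps the RecName/AltName/Flags labels inline, so nothing is lost, but it is not structured for searching.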

--- On Mon, 5/18/09, Peter <biopython at maubp.freeserve.co.uk> wrote:

> From: Peter <biopython at maubp.freeserve.co.uk>
> Subject: Re: [Biopython-dev] SwissProt DE lines and bioentry.description field in BioSQL
> To: "Hilmar Lapp" <hlapp at gmx.net>
> Cc: "Chris Fields" <cjfields at illinois.edu>, "BioPerl List" <bioperl-l at lists.open-bio.org>, "biosql-l" <biosql-l at lists.open-bio.org>, biopython-dev at biopython.org
> Date: Monday, May 18, 2009, 9:38 AM
> On Sun, May 17, 2009 at 4:21 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
> >
> > On May 17, 2009, at 8:40 AM, Peter wrote:
> >>
> >> [...] Here you have mapped RecName and AltName fields in the DE lines to
> >> Name and Synonyms (shouldn't that be Synonym singular?).
> >
> > The example is for the GN lines in SwissProt, not the DE lines.
> 
> Ah, that probably explains some of my confusion.
> 
> >> In this example, searching the database using one of the SwissProt
> >> AltNames (synonyms), or filtering on the Flags sounds like a
> >> reasonable request - but this would be very difficult if the data is
> >> stored inside XML strings.
> >
> > Actually no. Modern full-text indexers (inside or outside the database) can
> > index XML text columns right away and very well. In fact, for the last
> > project that I built a full-text search for (on top of a BioSQL database) I
> > did that by writing custom XML documents to a separate table for each
> > record I wanted indexed. Oracle's full text indexer did the rest. I also built a
> > separate identifier/name/accession index that pulled all the gene names,
> > symbols, accession numbers, identifiers etc into a single table for
> > indexing.
> 
> OK, when I said searching "would be very difficult if the data is
> stored inside XML strings", maybe it wasn't so difficult for you - but
> that still sounds complicated!
> 
> Sticking with the GN lines and the synonym, if this was stored as a
> simple tag/value as usual in BioSQL, I would write my SQL statement to
> search the annotation table where the term id was that associated with
> a GN synonym, and the annotation value was "HABP1". Simple.
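
The tag/value lookup described above might look roughly like this. It is a minimal sketch against an in-memory SQLite database: the two tables are cut-down versions of BioSQL's term and bioentry_qualifier_value tables, and the term name "gene_synonym", the bioentry ids, and every value except "HABP1" are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Cut-down stand-ins for the BioSQL tables (assumption: most
    -- columns and all constraints of the real schema are omitted).
    CREATE TABLE term (
        term_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL
    );
    CREATE TABLE bioentry_qualifier_value (
        bioentry_id INTEGER NOT NULL,
        term_id     INTEGER NOT NULL REFERENCES term(term_id),
        value       TEXT,
        rank        INTEGER NOT NULL DEFAULT 0
    );
""")

# Hypothetical term name for a GN-line synonym.
conn.execute("INSERT INTO term (term_id, name) VALUES (1, 'gene_synonym')")
conn.executemany(
    "INSERT INTO bioentry_qualifier_value (bioentry_id, term_id, value, rank) "
    "VALUES (?, ?, ?, ?)",
    [(101, 1, "HABP1", 1), (101, 1, "SYN2", 2), (102, 1, "SYN3", 1)],
)


def find_by_synonym(conn, synonym, term_name="gene_synonym"):
    """Return ids of bioentries annotated with exactly this synonym."""
    rows = conn.execute(
        """
        SELECT DISTINCT v.bioentry_id
        FROM bioentry_qualifier_value AS v
        JOIN term AS t ON t.term_id = v.term_id
        WHERE t.name = ? AND v.value = ?
        ORDER BY v.bioentry_id
        """,
        (term_name, synonym),
    )
    return [row[0] for row in rows]


matches = find_by_synonym(conn, "HABP1")
```

Because the comparison is an exact match on the value column, an ordinary index on (term_id, value) would be enough to serve it.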
> 
> Using the XML approach, are you suggesting you could do a full text
> search on the annotation value field, looking for any rows where the
> field contains "<Synonyms>HABP1</Synonyms>", where the term id matches
> the GN lines' XML string? This sounds simplistic and probably rather
> slow - presumably why you resorted to the more complicated indexing
> scheme described above?
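
The naive substring search being questioned here could be sketched as below, again with an in-memory SQLite database. The single annotation table, the term ids, the <GN> wrapper element, and the gene names are all hypothetical; only the "<Synonyms>HABP1</Synonyms>" fragment comes from the discussion. This is plain LIKE matching, not a real full-text index such as the Oracle one mentioned earlier.

```python
import sqlite3

GN_XML_TERM = 2   # hypothetical term id for "GN lines stored as XML"
OTHER_TERM = 3    # hypothetical unrelated term id

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE annotation (bioentry_id INTEGER, term_id INTEGER, value TEXT)"
)
conn.executemany(
    "INSERT INTO annotation (bioentry_id, term_id, value) VALUES (?, ?, ?)",
    [
        (101, GN_XML_TERM,
         "<GN><Name>GENE1</Name><Synonyms>HABP1</Synonyms></GN>"),
        (102, GN_XML_TERM,
         "<GN><Name>GENE2</Name><Synonyms>HABP10</Synonyms></GN>"),
        (103, OTHER_TERM, "<Synonyms>HABP1</Synonyms>"),
    ],
)


def find_by_xml_synonym(conn, synonym):
    """Substring search for a <Synonyms> element inside stored XML text.

    Caveats: the pattern starts with a wildcard, so a normal B-tree
    index on the value column cannot be used, and '%' or '_' inside
    the synonym would act as wildcards unless escaped.
    """
    pattern = "%<Synonyms>" + synonym + "</Synonyms>%"
    rows = conn.execute(
        "SELECT bioentry_id FROM annotation "
        "WHERE term_id = ? AND value LIKE ? ORDER BY bioentry_id",
        (GN_XML_TERM, pattern),
    )
    return [row[0] for row in rows]


xml_matches = find_by_xml_synonym(conn, "HABP1")
```

Including the closing tag in the pattern keeps "HABP1" from matching "HABP10", but the query still depends on the exact serialisation of the XML (attribute order, whitespace, element naming).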
> 
> > What I mean is, a fully normalized relational representation, especially if
> > nested, is often not the most efficient data structure for efficient
> > searching and filtering.
> 
> OK. But do we really need to worry about complex nested structures
> for the SwissProt annotation (or in general)?
> 
> Peter