[BioSQL-l] dates and terms

Hilmar Lapp hlapp at gnf.org
Fri Apr 1 15:25:55 EST 2005


I agree but I'm also a reluctant to expand the 'official' schema, 
because 1) it would necessarily introduce redundancy, 2) it is solely 
needed for query optimization, and 3) for most RDBMSs you can probably 
come up with a non-invasive optimization on top of the official schema 
that is easy enough to implement, like I showed for Oracle.

Actually, depending on the level of individual expertise argument 3) 
may border on arrogance so I'll retract it immediately. So maybe what 
could efficiently help people solve this and similar problems is 
supplementary SQL code in the repository, for instance organized by 
problem or use case. So for the date problem, there would be SQL 
scripts for each supported platform that implement a solution without 
altering the core schema, like the one I suggested for Oracle and 
augmented with a function-based index DDL so that even if you didn't 
know PL/SQL and your DBA were yourself you could just run the script 
and be on your way.

BTW doing something like

	SQL> ALTER TABLE bioentry_qualifier_value ADD (date_value DATE);

followed by properly populating the column from the VARCHAR-type value 
column in my opinion still counts as a non-invasive optimization, 
because unless someone used SELECT * (which is always a bad idea 
anyway) this won't break anything. If the RDBMS supports triggers, you 
can write a trigger that automatically creates and maintains the value 
of the additional column depending on the value of the value column.

And to make the separation tidy and obvious, you could also create a 
table

CREATE TABLE bioentry_date (
	bioentry_id INTEGER NOT NULL,
	term_id	INTEGER NOT NULL,
	rank NUMBER(3,0) NOT NULL,
	date_value DATE)
)

and then use the same method as before to populate and maintain the 
table automatically. (for brevity I obviously left out UK and FK 
constraints, but they'd be analogous to bioentry_qualifier_value.)

So, bottom line of what I'm saying is that usually if a problem 
pertains to optimization rather than a model deficiency, there'll be an 
array of options to solve the problem without altering the model, and 
so I'll be reluctant to alter the model.

If people disagree (or agree) with this view please let me know, it'd 
be good to know where people generally stand on such questions, and 
what poses a problem and what doesn't.

	-hilmar

On Mar 31, 2005, at 5:00 PM, mark.schreiber at novartis.com wrote:

> Hello -
>
> I guess this is the nearest approximation to a date field. It might be
> something worth considering for a later version of bioSQL as pretty 
> much
> all records have one or more dates attached to them.
>
> - Mark
>
>
>
>
>
> Hilmar Lapp <hlapp at gnf.org>
> 04/01/2005 04:39 AM
>
>
>         To:     Mark Schreiber/GP/Novartis at PH
>         cc:     biosql-l at open-bio.org
>         Subject:        Re: [BioSQL-l] dates and terms
>
>
> Bioperl-db stores these similarly, but the term is 'date_changed' which
> basically comes from Bioperl's Bio::Seq::RichSeq.
>
> You can compare these dates but it's hard to do so universally for a
> search against the database. There is a scriptlet
> scripts/biosql/update-on-new-date.pl in the bioperl-db repository that
> shows a pretty straightforward approach for comparison. It uses
> Date::Parse which does a nice job of detecting most date formats
> automatically.
>
> The formats being used are actually I believe not dramatically
> different. In UniProt, they look like the following:
>
> DT   01-NOV-1995 (Rel. 32, Created)
> DT   01-OCT-1996 (Rel. 34, Last sequence update)
> DT   28-FEB-2003 (Rel. 41, Last annotation update)
>
> and these get stored as an array with the following elements
>
> 01-NOV-1995 (Rel. 32, Created)
> 01-OCT-1996 (Rel. 34, Last sequence update)
> 28-FEB-2003 (Rel. 41, Last annotation update)
>
> Date::Parse will just ignore the non-date stuff in parentheses. I don't
> know whether there's a similarly convenient library in Java.
>
> In Oracle you can specify the date format when converting. So, the
> following would take everything up to the first space character and
> convert it assuming the format used above:
>
>    1  select to_date(decode(instr('01-NOV-1995 (Rel. 32, Created)',' 
> '),
>    2                        0,
>    3                        '01-NOV-1995 (Rel. 32, Created)',
>    4                        substr('01-NOV-1995 (Rel. 32, Created)',
>    5                               1,
>    6                               instr('01-NOV-1995 (Rel. 32,
> Created)',
>    7                                     ' ')
>    8                              )
>    9                       ),
>   10                 'dd-mon-yyyy')
>   11* from dual
> SQL> /
>
> TO_DATE(DECODE(IN
> -----------------
> 11/01/95 00:00:00
>
> 1 row selected.
>
> The DECODE() protects from cases when there is nothing following the
> date.
>
> If this looks too messy, hide it in a function:
>
>    1  CREATE OR REPLACE
>    2  FUNCTION biosql_to_date(qual_value IN VARCHAR2,
>    3                          date_format IN VARCHAR2 DEFAULT
> 'dd-mon-yyyy')
>    4  RETURN DATE
>    5  IS
>    6     spacepos INTEGER;
>    7  BEGIN
>    8     spacepos := INSTR(qual_value,' ');
>    9     IF spacepos = 0 THEN
>   10             RETURN TO_DATE(qual_value,date_format);
>   11     END IF;
>   12     RETURN TO_DATE(SUBSTR(qual_value,1,spacepos),
>   13                    date_format);
>   14* END;
> SQL> /
>
> Function created.
>
> Elapsed: 00:00:00.60
> SQL> select biosql_to_date('01-NOV-1995 (Rel. 32, Created)') from dual;
>
> BIOSQL_TO_DATE('0
> -----------------
> 11/01/95 00:00:00
>
> 1 row selected.
>
> Elapsed: 00:00:00.01
> SQL> select biosql_to_date('01-NOV-1995') from dual;
>
> BIOSQL_TO_DATE('0
> -----------------
> 11/01/95 00:00:00
>
> 1 row selected.
>
> Elapsed: 00:00:00.01
>
> Obviously, if you query using this or a similar function, the query
> optimizer will do a full table scan and not use an index on
> bioentry_qualifier_value. However, you can create a function index on
> bioentry_qualifier_value.value using the above function, and queries
> using the same function will then be indexed. In that case you would
> need to make a small amendment to the function above by catching the
> exception that results from parsing strings that aren't dates and then
> return NULL instead. (Oracle does not index NULLs. Unlike in
> PostgreSQL, you cannot have partial indexes in Oracle AFAIK, i.e., in
> Oracle the index creation statement cannot contain a WHERE clause.)
>
> Does this help?
>
>                  -hilmar
>
> On Mar 31, 2005, at 1:07 AM, mark.schreiber at novartis.com wrote:
>
>> Hello -
>>
>> Many records that might be stored in BioSQL have associated date
>> fields.
>> Biojava stores these as value in bioentry_qualifier_value with the
>> term_id
>> pointing to the Term for date.
>>
>> This seems to place a serious limitation on searching by date. I would
>> like to be able to search for sequences entered between X and Y or
>> before
>> X etc. Has anyone come up with a workaround for date operations on
>> VarChar2 or Strings?
>>
>> Thanks
>>
>> Mark Schreiber
>> Principal Scientist (Bioinformatics)
>>
>> Novartis Institute for Tropical Diseases (NITD)
>> 10 Biopolis Road
>> #05-01 Chromos
>> Singapore 138670
>> www.nitd.novartis.com
>>
>> phone +65 6722 2973
>> fax  +65 6722 2910
>>
>>
>> ______________________________________________________________________
>> The Novartis email address format has changed to
>> firstname.lastname at novartis.com.  Please update your address book
>> accordingly.
>> ______________________________________________________________________
>> _______________________________________________
>> BioSQL-l mailing list
>> BioSQL-l at open-bio.org
>> http://open-bio.org/mailman/listinfo/biosql-l
>>
> -- 
> -------------------------------------------------------------
> Hilmar Lapp                            email: lapp at gnf.org
> GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
> -------------------------------------------------------------
>
>
>
>
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------



More information about the BioSQL-l mailing list