[BioSQL-l] dates and terms

Hilmar Lapp hlapp at gnf.org
Thu Mar 31 15:39:36 EST 2005


Bioperl-db stores these similarly, but the term is 'date_changed' which 
basically comes from Bioperl's Bio::Seq::RichSeq.

You can compare these dates but it's hard to do so universally for a 
search against the database. There is a scriptlet 
scripts/biosql/update-on-new-date.pl in the bioperl-db repository that 
shows a pretty straightforward approach for comparison. It uses 
Date::Parse which does a nice job of detecting most date formats 
automatically.

The formats being used are actually I believe not dramatically 
different. In UniProt, they look like the following:

DT   01-NOV-1995 (Rel. 32, Created)
DT   01-OCT-1996 (Rel. 34, Last sequence update)
DT   28-FEB-2003 (Rel. 41, Last annotation update)

and these get stored as an array with the following elements

01-NOV-1995 (Rel. 32, Created)
01-OCT-1996 (Rel. 34, Last sequence update)
28-FEB-2003 (Rel. 41, Last annotation update)

Date::Parse will just ignore the non-date stuff in parentheses. I don't 
know whether there's a similarly convenient library in Java.

In Oracle you can specify the date format when converting. So, the 
following would take everything up to the first space character and 
convert it assuming the format used above:

   1  select to_date(decode(instr('01-NOV-1995 (Rel. 32, Created)',' '),
   2                        0,
   3                        '01-NOV-1995 (Rel. 32, Created)',
   4                        substr('01-NOV-1995 (Rel. 32, Created)',
   5                               1,
   6                               instr('01-NOV-1995 (Rel. 32, 
Created)',
   7                                     ' ')
   8                              )
   9                       ),
  10                 'dd-mon-yyyy')
  11* from dual
SQL> /

TO_DATE(DECODE(IN
-----------------
11/01/95 00:00:00

1 row selected.

The DECODE() protects from cases when there is nothing following the 
date.

If this looks too messy, hide it in a function:

   1  CREATE OR REPLACE
   2  FUNCTION biosql_to_date(qual_value IN VARCHAR2,
   3                          date_format IN VARCHAR2 DEFAULT 
'dd-mon-yyyy')
   4  RETURN DATE
   5  IS
   6     spacepos INTEGER;
   7  BEGIN
   8     spacepos := INSTR(qual_value,' ');
   9     IF spacepos = 0 THEN
  10             RETURN TO_DATE(qual_value,date_format);
  11     END IF;
  12     RETURN TO_DATE(SUBSTR(qual_value,1,spacepos),
  13                    date_format);
  14* END;
SQL> /

Function created.

Elapsed: 00:00:00.60
SQL> select biosql_to_date('01-NOV-1995 (Rel. 32, Created)') from dual;

BIOSQL_TO_DATE('0
-----------------
11/01/95 00:00:00

1 row selected.

Elapsed: 00:00:00.01
SQL> select biosql_to_date('01-NOV-1995') from dual;

BIOSQL_TO_DATE('0
-----------------
11/01/95 00:00:00

1 row selected.

Elapsed: 00:00:00.01

Obviously, if you query using this or a similar function, the query 
optimizer will do a full table scan and not use an index on 
bioentry_qualifier_value. However, you can create a function index on 
bioentry_qualifier_value.value using the above function, and queries 
using the same function will then be indexed. In that case you would 
need to make a small amendment to the function above by catching the 
exception that results from parsing strings that aren't dates and then 
return NULL instead. (Oracle does not index NULLs. Unlike in 
PostgreSQL, you cannot have partial indexes in Oracle AFAIK, i.e., in 
Oracle the index creation statement cannot contain a WHERE clause.)

Does this help?

	-hilmar

On Mar 31, 2005, at 1:07 AM, mark.schreiber at novartis.com wrote:

> Hello -
>
> Many records that might be stored in BioSQL have associated date 
> fields.
> Biojava stores these as value in bioentry_qualifier_value with the 
> term_id
> pointing to the Term for date.
>
> This seems to place a serious limitation on searching by date. I would
> like to be able to search for sequences entered between X and Y or 
> before
> X etc. Has anyone come up with a workaround for date operations on
> VarChar2 or Strings?
>
> Thanks
>
> Mark Schreiber
> Principal Scientist (Bioinformatics)
>
> Novartis Institute for Tropical Diseases (NITD)
> 10 Biopolis Road
> #05-01 Chromos
> Singapore 138670
> www.nitd.novartis.com
>
> phone +65 6722 2973
> fax  +65 6722 2910
>
>
> ______________________________________________________________________
> The Novartis email address format has changed to
> firstname.lastname at novartis.com.  Please update your address book
> accordingly.
> ______________________________________________________________________
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at open-bio.org
> http://open-bio.org/mailman/listinfo/biosql-l
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------



More information about the BioSQL-l mailing list