[Biojava-l] Re: extra seqDB things to add

Aaron Kitzmiller AKitzmiller@genetics.com
Thu, 13 Jul 2000 15:14:02 -0400


Couple of items about the database discussion...

1. Lion has an object loading API (Java, Perl, and I believe Python are supported) for their v.6 SRS product that allows you to do queries (in Icarus) and retrieve 'objects'.  They aren't really objects; they're essentially attribute hash tables, but it is Java.

2. If the sequence database code is to allow querying, it should be structured as an interface that defines the queries, with specific implementations for the different types of persistence.  Flat files and SRS are only two of the options.  Viable systems also exist that use relational databases (via JDBC) and Enterprise JavaBeans.  The interface should be independent of the indexing system and the query language.
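
A rough sketch of what point 2 could look like in Java. All names here (SequenceQuery, findIdsByAttribute, InMemorySequenceQuery) are hypothetical and are not part of BioJava; the in-memory backend only stands in for a flat-file, SRS, JDBC, or EJB implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical query contract: callers see only this interface,
// never the index format or query language behind it.
interface SequenceQuery {
    /** Returns the IDs of all entries whose named attribute equals the given value. */
    List<String> findIdsByAttribute(String attribute, String value);
}

// One possible backend (an assumption for illustration): entries held in
// memory as attribute maps. A JDBC or SRS backend would implement the same
// interface and translate the call into SQL or Icarus respectively.
final class InMemorySequenceQuery implements SequenceQuery {
    private final Map<String, Map<String, String>> entries = new LinkedHashMap<>();

    void put(String id, Map<String, String> attributes) {
        entries.put(id, attributes);
    }

    @Override
    public List<String> findIdsByAttribute(String attribute, String value) {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> e : entries.entrySet()) {
            // value.equals(null) is false, so entries lacking the attribute are skipped
            if (value.equals(e.getValue().get(attribute))) {
                ids.add(e.getKey());
            }
        }
        return ids;
    }
}
```

Client code would hold only a SequenceQuery reference, so swapping the persistence layer would not touch the callers.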

ajk

Aaron Kitzmiller
Manager Systems Development -Cambridge
Bioinformatics Department
35 Cambridge Park Dr.
Cambridge, MA 02140
Phone: (617) 665-6831
Fax: (617) 665-8870
Email: akitzmiller@genetics.com 


>>> <biojava-l-request@biojava.org> 07/13 12:00 PM >>>

Today's Topics:

  1. pair-wise alignment (Matthew Pocock)
  2. extra seqDB things to add (Matthew Pocock)
  3. Re: extra seqDB things to add (Gerald Loeffler)
  4. Re: extra seqDB things to add (Matthew Pocock)
  5. Re: extra seqDB things to add (Gerald Loeffler)
  6. Re: extra seqDB things to add (Ewan Birney)

--__--__--

Message: 1
Date: Wed, 12 Jul 2000 21:21:50 +0100
From: Matthew Pocock <mrp@sanger.ac.uk>
Organization: The Sanger Center
To: "biojava-l@biojava.org" <biojava-l@biojava.org>
Subject: [Biojava-l] pair-wise alignment

Hi

I have spent today optimizing pairwise alignment. I have got my test
case down from 148 secs to 84 secs, which is good but not great. It
spends as long in file IO as in alignment, and I can't see many more
tricks to speed things up. If you are interested in alignments, could
you run dp.PairwiseAlignment, and tell me if the performance is
acceptable?

In a similar vein, we should add the standard substitution matrices
(pam, blosum and background frequencies in swissprot, embl, trembl etc.)
into the distribution package. Any volunteers?

Roll on 1.0!

Matthew


--__--__--

Message: 2
Date: Thu, 13 Jul 2000 15:46:55 +0100
From: Matthew Pocock <mrp@sanger.ac.uk>
Organization: The Sanger Center
To: "biojava-l@biojava.org" <biojava-l@biojava.org>
Subject: [Biojava-l] extra seqDB things to add

Dear all,

It is great to see the seq.db package up and running. I think that it
needs a couple more things to be the basis for really useful work:

CachingSequenceDB
- wraps a parent seqDB
- ensures that sequences are fetched once only
- should use weak references (or whatever it is) to be memory-sensitive
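
A minimal sketch of the caching wrapper described above, under some loud assumptions: SimpleSequenceDB is a hypothetical one-method stand-in for the real seq.db interface, and sequences are plain Strings. It uses java.lang.ref.SoftReference rather than WeakReference, since soft references are the ones the JVM clears in response to memory demand, which seems to be the memory-sensitive behaviour asked for.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal stand-in for the seq.db lookup contract.
interface SimpleSequenceDB {
    String getSequence(String id);
}

// Wraps a parent DB and remembers what it has already fetched.
final class CachingSequenceDB implements SimpleSequenceDB {
    private final SimpleSequenceDB parent;
    // Soft references let the garbage collector reclaim cached
    // sequences when memory runs short.
    private final Map<String, SoftReference<String>> cache = new HashMap<>();

    CachingSequenceDB(SimpleSequenceDB parent) {
        this.parent = parent;
    }

    @Override
    public synchronized String getSequence(String id) {
        SoftReference<String> ref = cache.get(id);
        String seq = (ref == null) ? null : ref.get();
        if (seq == null) {
            // Either never fetched, or the collector cleared the entry.
            seq = parent.getSequence(id);
            if (seq != null) {
                cache.put(id, new SoftReference<>(seq));
            }
        }
        return seq;
    }
}
```

Note the trade-off: with soft references a sequence is fetched once only for as long as the collector leaves the entry alone; after that it is silently fetched again from the parent.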

FileIndexerSequenceDB
- indexes a list of files
- uses a normal seq.io object to specify the format
- creates a file bla.index with the indexing info
- possibly auto-manages updates using file dates/times
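
One way the file indexer could work for FASTA, as a sketch only: record the byte offset of each header line, then seek straight to it on lookup. FastaOffsetIndex is a hypothetical name; the format is hard-wired here instead of being supplied by a seq.io object, the index lives in memory rather than in a bla.index file, and IDs are assumed to be the header text up to the first space.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

final class FastaOffsetIndex {
    private final String path;
    // Sequence ID -> byte offset of its '>' header line.
    private final Map<String, Long> offsets = new HashMap<>();

    FastaOffsetIndex(String path) throws IOException {
        this.path = path;
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long pos = raf.getFilePointer();
            String line;
            // RandomAccessFile.readLine is unbuffered and byte-oriented:
            // slow on big files, but the offsets it yields are exact.
            while ((line = raf.readLine()) != null) {
                if (line.startsWith(">")) {
                    offsets.put(idOf(line), pos);
                }
                pos = raf.getFilePointer();
            }
        }
    }

    // Assumption: the ID is the header text up to the first space.
    private static String idOf(String header) {
        String rest = header.substring(1).trim();
        int space = rest.indexOf(' ');
        return space < 0 ? rest : rest.substring(0, space);
    }

    /** Returns the residues for the given ID, or null if it is not indexed. */
    String getSequence(String id) throws IOException {
        Long offset = offsets.get(id);
        if (offset == null) {
            return null;
        }
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(offset);
            raf.readLine(); // skip the header line
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = raf.readLine()) != null && !line.startsWith(">")) {
                sb.append(line.trim());
            }
            return sb.toString();
        }
    }
}
```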

Any thoughts?

Matthew



--__--__--

Message: 3
Date: Thu, 13 Jul 2000 17:33:00 +0200
From: Gerald Loeffler <Gerald.Loeffler@vienna.at>
Reply-To: Gerald.Loeffler@vienna.at 
To: Matthew Pocock <mrp@sanger.ac.uk>, Biojava-l@biojava.org 
Subject: Re: [Biojava-l] extra seqDB things to add



Matthew Pocock wrote:
> FileIndexerSequenceDB
> - indexes a list of files
> - uses a normal seq.io object to specify the format
> - creates a file bla.index with the indexing info
> - possibly auto-manages updates using file dates/times

could use Berkeley DB (which has a Java API) for indexing so as not to
reinvent the wheel...

On the other hand i'm not sure whether it's wise to introduce yet
another indexing mechanism - we already have NCBI-BLAST, WU-BLAST, SRS
which all index the (huge) sequence databases in incompatible ways.
Wouldn't it be better to write a SRSSequenceDB which would be a
SequenceDB that
	o either knows how to decipher the SRS index files and create Sequence
objects from that
	o or (alternatively) knows how to load a sequence file (in e.g. EMBL
format) from the command-line (getz) or web-version of SRS and construct
a Sequence object based on that,
	o or (alternatively) knows how to load a sequence file (in GenBank
format) from Entrez and construct a Sequence object based on that.

	cheers,
	gerald

> 
> Any thoughts?
> 
> Matthew
> 
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l@biojava.org 
> http://biojava.org/mailman/listinfo/biojava-l 

-- 
   Gerald.Loeffler@vienna.at _________________ Software Architect
   http://www.imp.univie.ac.at ____ http://www.daemonstration.com 
   OOA&D, Java, J2EE, JSP, Servlets, JavaBeans, ODBMS, RDBMS, XML

--__--__--

Message: 4
Date: Thu, 13 Jul 2000 16:59:19 +0100
From: Matthew Pocock <mrp@sanger.ac.uk>
Organization: The Sanger Center
To: Gerald.Loeffler@vienna.at 
CC: Biojava-l@biojava.org 
Subject: Re: [Biojava-l] extra seqDB things to add

Hi.

Gerald Loeffler wrote:

> Matthew Pocock wrote:
> > FileIndexerSequenceDB
> > - indexes a list of files
> > - uses a normal seq.io object to specify the format
> > - creates a file bla.index with the indexing info
> > - possibly auto-manages updates using file dates/times
>
> could use Berkeley DB (which has a Java API) for indexing so as not to
> reinvent the wheel...

In practice it's a fairly simple-stupid wheel.

> On the other hand i'm not sure whether it's wise to introduce yet
> another indexing mechanism - we already have NCBI-BLAST, WU-BLAST, SRS
> which all index the (huge) sequence databases in incompatible ways.
> Wouldn't it be better to write a SRSSequenceDB which would be a
> SequenceDB that
>         o either knows how to decipher the SRS index files and create Sequence
> objects from that
>         o or (alternatively) knows how to load a sequence file (in e.g. EMBL
> format) from the command-line (getz) or web-version of SRS and construct
> a Sequence object based on that,
>         o or (alternatively) knows how to load a sequence file (in GenBank
> format) from Entrez and construct a Sequence object based on that.
>
>         cheers,
>         gerald
>

SRSSequenceDB would be great. I like the idea of reading the SRS index files. Are
they intelligible? A FetcherSequenceDB that you parameterize with a little fetch
method and sequence format would also be good to have around (we could provide
getz & wgetz, efetch etc. implementations).
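
As a sketch of the FetcherSequenceDB idea, assuming the fetch method and the sequence format can each be modelled as a plain function (both signatures are hypothetical, and no real getz, wgetz, or efetch call is made here):

```java
import java.util.function.Function;

// Parameterized by how to fetch the raw record and how to parse it.
final class FetcherSequenceDB<S> {
    private final Function<String, String> fetch; // id -> raw record text, or null
    private final Function<String, S> parse;      // raw record text -> sequence

    FetcherSequenceDB(Function<String, String> fetch, Function<String, S> parse) {
        this.fetch = fetch;
        this.parse = parse;
    }

    /** Returns the parsed sequence, or null if the fetcher found nothing. */
    S getSequence(String id) {
        String raw = fetch.apply(id);
        return raw == null ? null : parse.apply(raw);
    }
}

// A toy format function: concatenates the residue lines of one FASTA record.
final class FastaText {
    private FastaText() {}

    static String residues(String record) {
        StringBuilder sb = new StringBuilder();
        for (String line : record.split("\\r?\\n")) {
            if (!line.startsWith(">")) {
                sb.append(line.trim());
            }
        }
        return sb.toString();
    }
}
```

A getz-backed or efetch-backed variant would then differ only in the fetch function passed to the constructor; the parse function would be chosen to match the format that source returns.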

The indexer is really aimed at the relatively common case where you have 3
fasta-files with your interesting sequences spread among them (exons between 150
and 230 nt long from sachDB), and need random access to them. The files are not
integrated into SRS, as only you think that they are interesting, and SRS is scary.
It then allows you to do a getSequence(id), and efficiently pull out the
appropriate chunk of the file. Next week, you blow these files away, and forget all
about them (you now are interested in introns containing repeat elements from
mouse).

Am I trying to create a solution for which there is no problem?

Matthew


--__--__--

Message: 5
Date: Thu, 13 Jul 2000 18:06:11 +0200
From: Gerald Loeffler <Gerald.Loeffler@vienna.at>
Reply-To: Gerald.Loeffler@vienna.at 
To: Matthew Pocock <mrp@sanger.ac.uk>
CC: Biojava-l@biojava.org 
Subject: Re: [Biojava-l] extra seqDB things to add



Matthew Pocock wrote:
> The indexer is really aimed at the relatively common case where you have 3
> fasta-files with your interesting sequences spread among them (exons between 150
> and 230 nt long from sachDB), and need random access to them. The files are not
> integrated into SRS, as only you think that they are interesting, and SRS is scary.
> It then allows you to do a getSequence(id), and efficiently pull out the
> appropriate chunk of the file. Next week, you blow these files away, and forget all
> about them (you now are interested in introns containing repeat elements from
> mouse).

okay - this makes sense. But then i'd make the indexing process
(probably) completely transparent, i.e. i'd index on-the-fly when e.g.
constructing the IndexedSequenceDB from a FASTA-file. The index-files
could be kept in an "invisible" directory and the user would never ever
have to index manually...
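
A sketch of how the transparent part could be decided, assuming a hidden ".seqindex" directory next to the data file and a rebuild whenever the index is missing or older than the data (the directory name, the ".index" suffix, and the IndexLocator class are all made up for illustration):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class IndexLocator {
    private IndexLocator() {}

    // Hypothetical layout: <data dir>/.seqindex/<data file name>.index
    static Path indexPathFor(Path dataFile) {
        Path dir = dataFile.toAbsolutePath().getParent();
        return dir.resolve(".seqindex").resolve(dataFile.getFileName() + ".index");
    }

    /** True if there is no index yet, or the data file was modified after it. */
    static boolean needsRebuild(Path dataFile) throws IOException {
        Path index = indexPathFor(dataFile);
        if (!Files.exists(index)) {
            return true;
        }
        return Files.getLastModifiedTime(index)
                .compareTo(Files.getLastModifiedTime(dataFile)) < 0;
    }
}
```

The constructor of an IndexedSequenceDB could call needsRebuild and re-index silently when it returns true, so the user never sees the index files.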

and yes, SRS is scary!

> 
> Am I trying to create a solution for which there is no problem?

you have to ask a biologist about that (-:

	gerald
> 
> Matthew

-- 
   Gerald.Loeffler@vienna.at _________________ Software Architect
   http://www.imp.univie.ac.at ____ http://www.daemonstration.com 
   OOA&D, Java, J2EE, JSP, Servlets, JavaBeans, ODBMS, RDBMS, XML

--__--__--

Message: 6
Date: Thu, 13 Jul 2000 17:36:39 +0100 (BST)
From: Ewan Birney <birney@ebi.ac.uk>
To: Gerald Loeffler <Gerald.Loeffler@vienna.at>
cc: Matthew Pocock <mrp@sanger.ac.uk>, Biojava-l@biojava.org 
Subject: Re: [Biojava-l] extra seqDB things to add

On Thu, 13 Jul 2000, Gerald Loeffler wrote:

> 
> 
> Matthew Pocock wrote:
> > FileIndexerSequenceDB
> > - indexes a list of files
> > - uses a normal seq.io object to specify the format
> > - creates a file bla.index with the indexing info
> > - possibly auto-manages updates using file dates/times
> 
> could use Berkeley DB (which has a Java API) for indexing so as not to
> reinvent the wheel...

I'd go with the Berkeley DB stuff *and* the SRS/BLAST indexing (this is
what we do in bioperl. Sort of.)

Berkeley DB - good for a clean-room environment where you do not want to
install SRS (anyone who has installed SRS will know what I mean)

SRS - good when you have SRS around.

Talk to the SRS folks as well - they are planning to implement - I
think/hope - the biocorba interfaces. Then you will get that "for free".


> 
> On the other hand i'm not sure whether it's wise to introduce yet
> another indexing mechanism - we already have NCBI-BLAST, WU-BLAST, SRS
> which all index the (huge) sequence databases in incompatible ways.
> Wouldn't it be better to write a SRSSequenceDB which would be a
> SequenceDB that
> 	o either knows how to decipher the SRS index files and create Sequence
> objects from that
> 	o or (alternatively) knows how to load a sequence file (in e.g. EMBL
> format) from the command-line (getz) or web-version of SRS and construct
> a Sequence object based on that,
> 	o or (alternatively) knows how to load a sequence file (in GenBank
> format) from Entrez and construct a Sequence object based on that.
> 
> 	cheers,
> 	gerald
> 
> > 
> > Any thoughts?
> > 
> > Matthew

-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
<birney@ebi.ac.uk>. 
-----------------------------------------------------------------



--__--__--


End of Biojava-l Digest
_______________________________________________
Biojava-l mailing list  -  Biojava-l@biojava.org 
http://biojava.org/mailman/listinfo/biojava-l