[Bioperl-l] Bio::Seq -> Solr (Lucene) ?

Wed Aug 29 22:11:55 UTC 2007

Please slap me if I'm hysterical.

I'm seeking a broad bioinformatics search engine platform. I want to 
take gobs of data in gobs of formats and allow people to search it on 
the web.

- Entrez is awesome. Unfortunately I don't see anything in the NCBI 
toolkit that helps me run my own version of it. Even a tiny one. After 
an initial "check out our toolkit" response from NCBI I don't seem to be 
getting anywhere. Maybe I'm not communicating enough or well enough.

- EB-eye Search is slick. But I don't see a developer kit or source code 
of any kind, and I've gotten no response to my emails to them.

- LuceGene is very cool. But it looks like no one has touched it in 2.5 
years and I've gotten no response from their contact email address. I'm 
especially intrigued by their

  src/LuceGene/src/org/eugenes/index/LuceneReadseqIndexer.java

which seems to use the rather popular(?) Java Readseq to populate Lucene 
with source data in all sorts of different formats.

I don't know Java.

- Solr is really neat. It's easy to install and gives you a simple but 
powerful XML API for populating a Lucene index.

... so ...

I'm thinking BioPerl knows how to parse lots of formats into a Bio::Seq.

I'm thinking I could write Perl that takes a Bio::Seq object and 
converts it to an XML document, which Solr would happily inject into 
Lucene for me.

If I could do that I'm thinking that any of the many formats that 
Bio::SeqIO can slurp could magically be sent into a Lucene index for 
searching.
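
To make the idea concrete, here is a rough sketch of the XML step, written in Python purely for illustration (the real thing would be Perl on top of Bio::SeqIO). It assumes the sequence has already been flattened into a plain record; the field names (id, description, species, keyword) and the sample values are made up, and would have to match whatever the Solr schema declares. The only part taken from Solr itself is the <add><doc><field name="..."> shape of an update message.

```python
import xml.etree.ElementTree as ET


def record_to_solr_add(record):
    """Build a Solr <add><doc>...</doc></add> update message from a flat record.

    `record` maps field names to a single value or a list of values.
    Field names are hypothetical; they must exist in the Solr schema.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in record.items():
        # A multi-valued field is sent as one <field> element per value.
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            field = ET.SubElement(doc, "field", {"name": name})
            field.text = str(v)  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


# Made-up record standing in for what a parsed sequence entry might yield.
record = {
    "id": "AB000001",
    "description": "Example sequence <hypothetical>",
    "species": "Homo sapiens",
    "keyword": ["kinase", "membrane"],
}

xml_message = record_to_solr_add(record)
print(xml_message)
```

The resulting string is what would get POSTed to Solr's update handler, followed by a commit. In the Perl version the record values would presumably come from Bio::Seq accessors such as display_id and desc.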

I'm thinking that would be really cool and I'm going to write it.

Now's your chance to slap me.

Since I haven't started yet, what would I call this thing? 
Bio::SeqIO::Solr?  (And I wouldn't implement the "I" part, only the "O"?)

Thanks,

Jay Hannah
http://clab.ist.unomaha.edu/CLAB/index.php/User:Jhannah

More notes:
http://clab.ist.unomaha.edu/CLAB/index.php/RT11