[Biojava-dev] The future of BioJava

Andy Yates ayates at ebi.ac.uk
Mon Sep 24 08:26:27 UTC 2007


Hi Mark,

> 
>    2. Andy brought up the point of people who create non-standard
>    variations of EMBL-formatted files.  I was wondering if these files were
>    created in programming languages other than Java?  If so, would those users
>    be willing to use Jython, JRuby, or a Perl-like scripting language like
>    Sleep?  This would allow them to use biojava as a library, and still use a
>    scripting language whose syntax they were familiar with.  They would also be
>    producing files in a more standardized format.  This might cut down on the
>    number of parsing mistakes caused by "unsupported" file variations.  You can
>    go to http://scripting.dev.java.net for more information on the
>    scripting languages that the Java VM supports.
> 
>    3. Was there any reason why non-standard files were being created?
>    Perhaps some use-case not being covered?

These files are not being created by accident; the groups producing
them simply have different requirements for the data they release. So
they want to produce flat files which have the look/markup of an EMBL
record yet do not follow the same rules as EMBL. A good example (if
memory serves me correctly) is UniProtKB. The specification of
UniProtKB records is different from EMBL's, yet both output files with
similar markup. So it's not so much BioJava failing to support a
use-case, or a group producing flat files with a custom writer; they
just have a different requirement. Getting BioJava to support them all
is a non-starter considering the number of projects out there. A
better way is to let people plug into the grammar/objects for parsing
these file formats & then groups can choose whether or not to release
their parsing code.
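
To make the plug-in idea a bit more concrete, here is a minimal sketch.
It assumes only the general EMBL-style layout (a two-letter line code
at the start of each line and "//" as the end-of-record marker). Every
class and method name below is hypothetical and is not part of BioJava;
it only illustrates how a group could register its own handlers for the
line codes it interprets differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch; none of these names come from BioJava. */
final class PluggableFlatFileParser {
    // Maps a two-letter line code (e.g. "ID", "AC") to a handler that
    // turns the remainder of the line into a value.
    private final Map<String, Function<String, String>> handlers = new HashMap<>();

    /** Registers (or overrides) the handler for one line code. */
    PluggableFlatFileParser register(String lineCode, Function<String, String> handler) {
        handlers.put(lineCode, handler);
        return this;
    }

    /** Parses one record; lines without a registered handler are skipped. */
    Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> record = new HashMap<>();
        for (String line : lines) {
            if (line.startsWith("//")) {
                break; // end-of-record marker
            }
            if (line.length() < 2) {
                continue; // too short to carry a line code
            }
            String code = line.substring(0, 2);
            Function<String, String> handler = handlers.get(code);
            if (handler == null) {
                continue; // nobody plugged in a handler for this code
            }
            String payload = line.substring(2).trim();
            record.computeIfAbsent(code, k -> new ArrayList<>())
                  .add(handler.apply(payload));
        }
        return record;
    }
}
```

A group with its own rules for, say, the AC line would call
register("AC", ...) with its own function and leave the rest alone.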


>    4. If BioJava is split up into a variety of smaller JARs, how would
>    you ensure that the users had all of the JARs that they needed?  Would an
>    installer be provided to allow users to select groups of JARs?  There are a
>    number of open source installers that would make this process easier.  Using
>    Maven is suitable if you're a developer, if you're a scripter it's a little
>    more difficult to deal with.

Yes, that's very true. We're encountering similar problems in our group,
where we have a set of people working on new Maven projects & older
projects still using Ant. Our solution at the moment is to produce Maven
assemblies which cover different use cases; end users choose whichever
one suits their needs best.

If we're talking about scripters though, then it's probably easier to
put a 'first steps' guide on the wiki for each of the major JVM
scripting languages (I'm thinking Groovy, JavaScript & JRuby should
cover the bases).


>    5. Are there any thoughts about using a templating system like
>    Velocity, FreeMarker or JST?  This would make it easier to ensure that files
>    were produced in a standard fashion.  It would also make it easier to
>    maintain support for writing files in different file formats.

I'd prefer to use StringTemplate (just because it's a push-based
templating system, not a pull-based one like Velocity) but yeah, I can
see it being very useful.
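
Roughly, "push" means the calling code computes every value up front
and hands it to the template, which can only substitute what it was
given; in a "pull" system the template can call back into the objects
it was handed to fetch data itself. The sketch below illustrates the
push style in plain Java. It does not use the StringTemplate API; the
class name, the render method and the $name$ placeholder syntax are
assumptions made for the example.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical push-style renderer; not the StringTemplate API. */
final class PushTemplateSketch {
    // Placeholders look like $name$ in this sketch.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$(\\w+)\\$");

    /**
     * Substitutes each placeholder with the value pushed in by the caller.
     * The template cannot compute or fetch anything itself.
     */
    static String render(String template, Map<String, String> attributes) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = attributes.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException(
                        "No value pushed for attribute: " + m.group(1));
            }
            // quoteReplacement keeps '$' and '\' in values literal
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Because the template only ever sees plain values, the formatting rules
for a file format stay in one place and are easy to check.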

>    6. When it comes to unit testing and continuous building, is the
>    bio*.org server going to handle that automated build & burn, or is someone
>    in the group going to have to do it?  I think the inability to have the
>    build setup on the server had us stymied before.

I think Andreas is in a better position to answer this one, but I'm
guessing we can schedule builds at regular intervals as well as
building on each commit into the repository.

>    7. Now that Java also includes the Derby database, and the Java
>    Persistence API (JPA), has anyone considered migrating the BioSQL support
>    from Hibernate to JPA, and using Derby as the default database?  This would
>    make it a little easier to maintain and would minimize the setup work that a
>    new user would have to do.

Hibernate supports JPA, so the switch shouldn't be hard to do if needed.
That said, Hibernate is still the 'market leader' when it comes to
Java-based persistence, so I'm not too worried about this.

> 
>    8. Richard, you mention in the "Reasoning" section that "users have
>    moved on".  What types of use-cases beyond basic sequence analysis, should
>    BioJava support?  Would support for more of lab-related processes expand the
>    user base and number of committers?  Would support for parsing different
>    types of instrument files be a useful addition? I could imagine use cases
>    where users would like to be able to parse an Affy file and fetch probe
>    information, gene information, and perhaps pathway data.

I'm already aware of people doing the Affy parsing themselves (I was 
involved with writing the parsers for their XDA data format ... bloody 
unsigned big endian ints) but the code was never incorporated into 
biojava because the group wasn't 100% comfortable about releasing the 
code. But yes there are a lot of other use cases out there that I'm sure 
we're unaware of. Our only choice is to see if we can get people to 
contribute ideas to this stage of development & give people the 
opportunity to contribute code as & when it's required.
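
For anyone wondering why unsigned big-endian ints are a nuisance: Java
has no unsigned 32-bit primitive, so each value has to be read as a
signed int and widened to a long. A minimal sketch follows; the class
and method names are made up for illustration, and nothing here
reflects the actual layout of any Affymetrix file.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper; not taken from any existing parser. */
final class UnsignedIntSketch {
    /**
     * Reads 4 bytes starting at offset as a big-endian unsigned 32-bit
     * integer and returns it as a long so values above Integer.MAX_VALUE
     * are not reported as negative.
     */
    static long readUnsignedInt(byte[] bytes, int offset) {
        int signed = ByteBuffer.wrap(bytes, offset, 4)
                               .order(ByteOrder.BIG_ENDIAN)
                               .getInt();
        // Mask off the sign extension that the int-to-long widening adds.
        return signed & 0xFFFFFFFFL;
    }
}
```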

> 
>    9. Are there any thoughts about using annotations (perhaps in
>    combination with ontologies) to handle semantic validation of arguments?
>    For example, you might have an annotation like
> 
> @id {ontologyURI="http://www.mygrid.org.uk/ontology#LocusLink_record_id"}
> 
> indicating that the attribute or method argument is a LocusLink id.
> 

That's quite an interesting idea. I'm not sure where else they would
need to be introduced, but it's a good idea :)
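
As a rough sketch of how such an annotation could look in Java: the
annotation name, the example class and the lookup helper below are all
hypothetical, and only the ontology URI is taken from the example
above. It shows the annotation being declared, attached to a method
argument, and read back via reflection, which is where a validator
could hook in.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

/** Hypothetical annotation tying an argument or field to an ontology term. */
@Retention(RetentionPolicy.RUNTIME) // must be visible to reflection
@Target({ElementType.PARAMETER, ElementType.FIELD})
@interface OntologyId {
    String ontologyURI();
}

/** Hypothetical example class; not part of BioJava. */
final class GeneLookupSketch {
    static final String LOCUSLINK_ID =
            "http://www.mygrid.org.uk/ontology#LocusLink_record_id";

    /** The annotation documents that the argument is a LocusLink id. */
    static String describe(@OntologyId(ontologyURI = LOCUSLINK_ID) String locusLinkId) {
        return "LocusLink:" + locusLinkId;
    }

    /**
     * Returns the ontology URI declared on the first parameter of the
     * named method, or null if the method or annotation is absent.
     */
    static String ontologyOfFirstParameter(Class<?> type, String methodName) {
        for (Method m : type.getDeclaredMethods()) {
            if (m.getName().equals(methodName) && m.getParameterCount() > 0) {
                OntologyId id = m.getParameters()[0].getAnnotation(OntologyId.class);
                return id == null ? null : id.ontologyURI();
            }
        }
        return null;
    }
}
```

The annotation by itself validates nothing; a separate checker would
still have to read the URI and decide whether a given value is
acceptable for that term.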

Andy


