[Biojava-l] BioJava 3 Begins - Volunteers please!

Mon Oct 20 00:18:29 UTC 2008

Hi all,

I've just committed some new code to the biojava3 branch of the biojava-live
subversion repository. It's the foundations of a brand new alphabet+symbol
set of classes, and an example of how to use them to represent DNA. You'll
notice that the new code is very lightweight and allows for a lot more
flexibility than the old code - for instance, the concept of Alphabet has
changed radically. It also makes much more extensive use of the Collections
API.

I haven't got any test cases or usage examples yet but give me a shout if
you don't understand the code and I'll explain how it works. (Hint:
SymbolFormat is there to convert Strings into SymbolList objects, and vice
versa).

So, now we want some volunteers! We're starting from scratch here so there's
a lot of work to do. The whole of BioJava needs 'translating' into BJ3,
whether by copying and pasting existing classes and modifying them to suit
the new style, or by writing completely new ones to provide equivalent
functionality.

I'll post an example of how to do file parsing soon, probably starting with
FASTA. In the meantime, a good place to start would be for people to design
object models to represent their favourite data types (e.g. Genbank, or
microarray data). Utility classes to manipulate those objects would be great
too.

The object models need to be normalised as much as possible - e.g. if your
data has a lot of comments, and the order of those comments is important,
then give your object model a collection of comment objects. The object
model for each data type should be completely independent and use basic data
types wherever possible (e.g. store sequences as strings, don't attempt to
parse them into anything fancy like SymbolLists). The closer the object
model is to the original data format, the better. There are going to be
clever tricks when it comes to converting data between different object models
(e.g. Genbank to INSDSeq), which I will explain later when I put the file
parsing examples up.
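
As a rough illustration of the guidance above, here is a minimal sketch of
what such an object model might look like for a Genbank-style record. The
class and method names (GenbankRecord, Comment, addComment and so on) are
made up for this example and are not taken from the biojava3 branch. The
only points it tries to show are that comments live in an ordered collection
of comment objects, and that the sequence stays a plain String.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical comment object: one per comment in the original record.
final class Comment {
    private final String text;

    Comment(String text) {
        this.text = text;
    }

    String getText() {
        return text;
    }
}

// Hypothetical object model for a Genbank-style record. It uses only basic
// data types and knows nothing about any other data format.
final class GenbankRecord {
    private final String accession;
    // The sequence is kept as a plain String, not parsed into symbols.
    private final String sequence;
    // A List preserves the order in which comments appeared in the source.
    private final List<Comment> comments = new ArrayList<Comment>();

    GenbankRecord(String accession, String sequence) {
        this.accession = accession;
        this.sequence = sequence;
    }

    String getAccession() {
        return accession;
    }

    String getSequence() {
        return sequence;
    }

    void addComment(String text) {
        comments.add(new Comment(text));
    }

    // Read-only view, so callers cannot reorder or remove comments.
    List<Comment> getComments() {
        return Collections.unmodifiableList(comments);
    }
}

class ObjectModelDemo {
    public static void main(String[] args) {
        GenbankRecord record = new GenbankRecord("EXAMPLE01", "acgtacgt");
        record.addComment("First comment in the file");
        record.addComment("Second comment in the file");

        // Comments come back in the order they were added.
        for (Comment c : record.getComments()) {
            System.out.println(c.getText());
        }
        System.out.println(record.getSequence().length());
    }
}
```

Keeping the model this plain means a parser only has to copy fields across,
and any conversion to another format can be handled separately.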

You'll notice how the biojava3 branch uses Maven instead of Ant. This is
because we want to make it as modular as possible, so if you want to write
microarray stuff, create a new microarray sub-project (as per the dna
example that's already there). This way if someone only wants the microarray
bit of BJ3, they only need to install the appropriate JAR file and can ignore
the rest. (The 'core' module is for stuff that is so generic it could be
used anywhere, or is used in every single other module.)

If coding isn't your cup of tea, then we would very much welcome testers
(particularly those who enjoy writing test cases!), documenters
(particularly code commenters), translators (for internationalisation of the
code), and of course all those who wish to contribute ideas and suggestions
no matter how off-the-wall they might be. In particular if you'd like to
take charge of an area of the development process, e.g. Documentation Chief,
or Protein Champion, then that would be much appreciated.

I'm very much looking forward to working with everyone on this. Good luck,
and happy coding!

cheers,
Richard

PS. Please don't forget to attach the appropriate licence to your code. You
can copy-and-paste it from the existing classes I just committed this
evening.

PPS. For those who are worried about backwards compatibility - this was
discussed on the lists a while back and it was made clear that BJ3 is a
clean break. However, the existing code will continue to be maintained and
bugfixed for a couple of years so you don't have to upgrade if you don't
want to - it just won't have any new features developed for it. This is
largely because it'll probably take just that long to write all the new BJ3
code. When we do decide to drop support for the existing BJ code, plenty of notice
will be given (i.e. years as opposed to months).

-- 
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/