[Biojava-dev] The future of BioJava

Andy Yates ayates at ebi.ac.uk
Thu Sep 20 10:54:31 UTC 2007


I think my EMBL point was more about groups like mine which distribute 
data in an EMBL format but we do not follow the EMBL rules 100% about 
what elements can follow other elements. Customization is very important 
to us which at the moment means there is a biojava src checkout here 
which gets edited accordingly. Not the most useful/nice solution but it 
works & is something I've had to do before when I was working with 
chromatograms.

Most of the work I've done with Biojava sequences was just to push in 
a DNA sequence, rev comp it and push it back out. Even then that got 
dropped as someone in-house made their own version which kept it all in 
Strings. That said it should have been used more since it was a DNA 
alignment/sequencing project & all positions work WRT index 1 (you don't 
want to know how many times I typed in -1 in that project ... and the 
number of bugs it caused).
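
For what it's worth, here is a minimal sketch of the String-only approach 
described above: reverse complement plus a helper that hides the -1 
arithmetic behind 1-based, inclusive coordinates. It is plain Java, not 
BioJava code, and the class and method names (StringDna, reverseComplement, 
subSequence) are made up for illustration. Anything that is not A/C/G/T is 
simply mapped to N, which is an assumption and not how a real alphabet 
would treat ambiguity codes.

```java
/** Hypothetical helper: DNA kept entirely in Strings (not part of BioJava). */
final class StringDna {
    private StringDna() {}

    /** Complement of one base; case is preserved, anything else becomes N/n. */
    static char complement(char base) {
        switch (base) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            case 'a': return 't';
            case 't': return 'a';
            case 'c': return 'g';
            case 'g': return 'c';
            default:  return Character.isLowerCase(base) ? 'n' : 'N';
        }
    }

    /** Walks the input backwards and complements each base. */
    static String reverseComplement(String dna) {
        StringBuilder out = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            out.append(complement(dna.charAt(i)));
        }
        return out.toString();
    }

    /** Sub-sequence using 1-based, inclusive coordinates (biological convention). */
    static String subSequence(String dna, int start, int end) {
        if (start < 1 || end > dna.length() || start > end) {
            throw new IndexOutOfBoundsException(
                "start=" + start + ", end=" + end + ", length=" + dna.length());
        }
        // The only place where the 1-based to 0-based shift happens.
        return dna.substring(start - 1, end);
    }
}
```

Keeping the coordinate shift in one method is the whole point: callers 
pass positions exactly as they appear in the alignment and never write 
-1 themselves.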

Anyway I guess what I'm getting round to saying in a very bad way is 
that there are places where I should have used the sequence 
representations from biojava but the initial hump/learning curve of what 
they are, how to use them & why to use them was too large and I had too 
little time. I'm sure there are many other people in the community 
who have this same problem and I'm sure they'll be hurting because of 
it as much as I did (and if anyone from that group is reading this email 
I do apologize ... again).

Andy

Mark Schreiber wrote:
> The main value of the Symbol representation comes in when you do
> Distributions and DP which is really why Matthew and Thomas developed
> it. Quite probably why they developed biojava at all.  If you are just
> pushing data around which seems to be most applications then Strings
> are better.
> 
> I have previously proposed separating the Symbol, Alphabet, DP and
> Dist from the rest of the packages because they have value well beyond
> biology but an equal argument would be that most bio stuff doesn't
> need this level of analysis. If you only want to convert EMBL to Fasta
> or read a BLAST result you don't need it.
> 
> For those who want to read in EMBL and compute some Distribution or
> run a Hidden Markov Model then I would propose the conversion of
> Stringy sequences to SymbolLists at the point when it is needed not at
> the point when you read them in.  Given that almost all I/O of
> sequence starts and ends as a String the point where you convert to
> Symbols doesn't matter much. The only question is do you need to
> convert to Symbols for the analysis you are doing?
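
To make the "convert when needed" idea concrete, here is a rough sketch 
in plain Java. It is not BioJava code: LazyDna and its methods are 
hypothetical names, and a byte code per base stands in for a real 
SymbolList (in BioJava the conversion step would be a call along the 
lines of DNATools.createDNA on the String). The sequence stays a String 
for I/O and is only converted the first time an analysis asks for the 
symbolic form.

```java
/** Hypothetical wrapper: keeps the String, builds symbol codes on demand. */
final class LazyDna {
    private final String text;   // exactly what was read from the file
    private byte[] codes;        // 0=A, 1=C, 2=G, 3=T; null until first needed

    LazyDna(String text) {
        this.text = text;
    }

    /** Cheap path for format conversion and other "pushing data around". */
    String asString() {
        return text;
    }

    /** True once the symbolic form has been built. */
    boolean isConverted() {
        return codes != null;
    }

    /** Converts on first call only; unknown characters are rejected here. */
    byte[] asCodes() {
        if (codes == null) {
            byte[] tmp = new byte[text.length()];
            for (int i = 0; i < tmp.length; i++) {
                int code = "ACGT".indexOf(Character.toUpperCase(text.charAt(i)));
                if (code < 0) {
                    throw new IllegalArgumentException(
                        "Not a DNA base at position " + (i + 1) + ": " + text.charAt(i));
                }
                tmp[i] = (byte) code;
            }
            codes = tmp;
        }
        return codes;
    }

    /** Example analysis that needs symbols: counts of A, C, G, T in that order. */
    int[] composition() {
        int[] counts = new int[4];
        for (byte c : asCodes()) {
            counts[c]++;
        }
        return counts;
    }
}
```

One side effect of this design is that validation errors surface at 
analysis time rather than at read time, which may or may not be what 
people want.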
> 
> (Sorry for not putting this on the wiki, I'll do it later).
> 
> - Mark
> 
> On 9/20/07, Richard Holland <holland at ebi.ac.uk> wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> This is one of my main bugbears too. I've never quite understood why we
>> can't just use Strings, and resort to SymbolLists only when more
>> advanced manipulation is required (e.g. quality scores for each base).
>> After all, a String is a memory word overhead (32- or 64-bits) plus
>> 16-bits (unicode) per character, but most SymbolList implementations are
>> a memory word overhead plus an additional entire memory word per Symbol,
>> each word being a pointer to the memory location where the Symbol
>> singleton lives. So SymbolLists actually use more memory than Strings,
>> not less.
>>
>> (This is not true for CompressedSymbolList which represents sequences as
>> a sequence of bits, grouped into groups large enough to uniquely
>> identify any single symbol in the alphabet - e.g. 2 bits for DNA).
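
As a rough illustration of the 2-bits-per-base idea, the sketch below 
packs unambiguous DNA into long words, 32 bases per word. This is not 
the actual implementation in BioJava; TwoBitDna and its methods are 
hypothetical names, and it assumes only A, C, G and T occur (no N or 
other ambiguity codes, which would need more bits per symbol).

```java
/** Hypothetical 2-bit packed DNA store (illustration only, not BioJava code). */
final class TwoBitDna {
    private static final String BASES = "ACGT"; // index in this string = 2-bit code
    private final long[] words;                 // 32 bases per 64-bit word
    private final int length;

    TwoBitDna(String dna) {
        length = dna.length();
        words = new long[(length + 31) / 32];
        for (int i = 0; i < length; i++) {
            int code = BASES.indexOf(Character.toUpperCase(dna.charAt(i)));
            if (code < 0) {
                throw new IllegalArgumentException(
                    "Cannot pack '" + dna.charAt(i) + "' into 2 bits");
            }
            // Place the 2-bit code at its slot inside the right word.
            words[i / 32] |= ((long) code) << ((i % 32) * 2);
        }
    }

    int length() {
        return length;
    }

    /** Base at a 0-based index. */
    char baseAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index=" + index + ", length=" + length);
        }
        int code = (int) ((words[index / 32] >>> ((index % 32) * 2)) & 3L);
        return BASES.charAt(code);
    }

    /** Bytes used by the packed payload, excluding object and array headers. */
    int payloadBytes() {
        return words.length * 8;
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(baseAt(i));
        }
        return sb.toString();
    }
}
```

For 33 bases this uses 16 bytes of payload, against 66 bytes for the 
same sequence at 16 bits per character, so the saving is real but only 
applies when the alphabet is this small.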
>>
>> As you say, most users just want to read a sequence, sublist it, maybe
>> reverse comp it or run some simple search over it. This can all easily
>> be achieved straight from String format.
>>
>> The other 'category A' problems are equally important. Could you add a
>> section to the Wiki about these and the 'category B' problems?  Then we
>> can use this as a priority use-case list when it comes to actual
>> development.
>>
>> cheers,
>> Richard
>>
>>
>> Andy Yates wrote:
>>> Hi,
>>>
>>> I would say yes to this as well. It is very important to know what green
>>> people are attempting to do with BioJava rather than us assuming that we
>>> know :). There are parts in BioJava where the flexibility of the code is
>>> not sufficient for other people who want to use the code base & in other
>>> areas too flexible.
>>>
>>> I've talked to quite a few people over the years who have used biojava
>>> for simple & complex applications and they all seem to come back round
>>> to a few key problems:
>>>
>>> * Sequence & SymbolLists are strange and why can't I use a String - All
>>> of this makes a lot more sense if you know about the flyweight pattern;
>>> if not it just seems very strange.
>>>
>>> * I have a format that's EMBL like. Can I parse it using Biojava?
>>>
>>> * How do I read in a FASTA file?
>>>
>>> * How can I get X from this chromatogram & can I parse my specific trace
>>> format into a BioJava object?
>>>
>>> As Andreas said it's the occurrence of the category A problems that are
>>> the most worrying. In terms of sequences I think I can see why people
>>> have a problem with it.
>>>
>>> Just if we take this as an example:
>>>
>>> I have my DNA sequence in a String I can substring it, perform a regular
>>> expression over it, replace sections, pad it out, format it & so on. If
>>> I have a Sequence object I can perform most of these actions but the
>>> interface to them seems unintuitive. Things like calling seqString() to
>>> get the String back out from a sequence rather than calling toString().
>>> Also let's say I want to use a sequence as a key in a hash map or ask if
>>> two sequences are equal (using the old sequence objects) ... at the
>>> moment I'd have to convert Sequence -> String to perform the comparison
>>> (and that doesn't include checking a Sequence for alphabet equality).
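
To show what I mean about hash map keys and equality, here is a small 
sketch of a value type that compares both the residues and the alphabet. 
It is plain Java and not a proposal for the real API: SequenceKey is a 
hypothetical name, the alphabet is reduced to a simple name String, and 
folding residues to upper case is an assumption about what "equal" 
should mean.

```java
import java.util.Locale;

/** Hypothetical value type: equal only if alphabet and residues both match. */
final class SequenceKey {
    private final String alphabet;  // e.g. "DNA" or "PROTEIN" (stand-in for an Alphabet)
    private final String residues;  // stored upper-case so "acgt" equals "ACGT"

    SequenceKey(String alphabet, String residues) {
        if (alphabet == null || residues == null) {
            throw new NullPointerException("alphabet and residues are required");
        }
        this.alphabet = alphabet;
        this.residues = residues.toUpperCase(Locale.ROOT);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof SequenceKey)) {
            return false;
        }
        SequenceKey other = (SequenceKey) o;
        return alphabet.equals(other.alphabet) && residues.equals(other.residues);
    }

    @Override
    public int hashCode() {
        // Must agree with equals: built from the same two fields.
        return 31 * alphabet.hashCode() + residues.hashCode();
    }

    /** Returns the residues, which is what most callers expect from toString(). */
    @Override
    public String toString() {
        return residues;
    }
}
```

With something like this, a sequence can go straight into a HashMap as 
a key, and the same letters in a DNA and a protein alphabet do not 
collide as equal.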
>>>
>>> I know this sounds like nit-picking & for people who have used biojava
>>> extensively a lot of this makes sense. For someone new to the project it
>>> seems like we've done something just for the sake of it and we need to
>>> get rid of that feeling which I'm sure will happen if we address the
>>> category A problem. The rest will fall into place :)
>>>
>>> Andy
>>>
>>> Richard Holland wrote:
>>> I totally agree.
>>>
>>> Can you post a short summary of this to the Wiki page?
>>>
>>> Not all aspects of BioJava are documented, leading people either to give
>>> up, consult the JavaDocs online, or post a message to biojava-l or
>>> biojava-dev.
>>>
>>> Is it possible to get similar stats to the ones you have calculated for
>>> the JavaDoc pages on our website?
>>>
>>> Also, is it possible to build some kind of index over the mailing list
>>> archives to pull out the most frequently used terms?
>>>
>>> cheers,
>>> Richard
>>>
>>> Andreas Prlic wrote:
>>>>>> Hi,
>>>>>>
>>>>>> A question related to the discussion of how to design a future BioJava
>>>>>> is to have a look
>>>>>> at which parts of BioJava are being actively used and how to improve
>>>>>> these.
>>>>>>
>>>>>> So what are the most frequently used bits of BioJava? One way to look at
>>>>>> this is to go to the
>>>>>> web-stats and see how many hits we have got on our documentation web
>>>>>> pages.
>>>>>>
>>>>>> In an ideal world BioJava would be so simple to use, that nobody needs
>>>>>> to read any docu.
>>>>>> Unfortunately we are far away from this, so actually looking at these
>>>>>> stats gives an impression
>>>>>> on
>>>>>>
>>>>>> * topics / functionality which are of particular interest to the
>>>>>> community
>>>>>> * topics / functionality which might not be straightforward to use,
>>>>>> therefore there are many hits on these pages.
>>>>>>
>>>>>> A look at the webstats from the last couple of months gives these top 10
>>>>>> Cookbook pages that
>>>>>> have been accessed frequently. This list is ordered by nr. of  pageviews
>>>>>>
>>>>>> 1. /wiki/BioJava:Cookbook:Alphabets
>>>>>> 2. /wiki/BioJava:CookBook:Blast:Parser
>>>>>> 3. /wiki/BioJava:Cookbook:SeqIO:ReadFasta
>>>>>> 4. /wiki/BioJava:Cookbook:SeqIO:ReadGES
>>>>>> 5. /wiki/BioJava:CookBook:DP:PairWise2
>>>>>> 6. /wiki/BioJava:CookBook:PDB:read
>>>>>> 7. /wiki/BioJava:Cookbook:Sequence
>>>>>> 8. /wiki/BioJava:Cookbook:SeqIO:WriteInFasta
>>>>>> 9. /wiki/BioJava:CookBook:Interfaces:ViewInGUI
>>>>>> 10. /wiki/BioJava:CookBook:Fasta:Parse
>>>>>>
>>>>>> I would group these pages into 2 groups.
>>>>>> A) How to work with core concepts of BioJava
>>>>>> B) How to use a functionality of BioJava to achieve a certain goal
>>>>>>
>>>>>> The "conceptual" pages (A) I would identify as
>>>>>> * How to get an Alphabet
>>>>>> * How to make a Sequence Object from a String or make a Sequence Object
>>>>>> back into a String
>>>>>>
>>>>>> The "functionality"  pages (B) I would summarize as
>>>>>> * How to parse a Blast output
>>>>>> * How to read sequences from a Fasta file
>>>>>> * How to read a GenBank, SwissProt or EMBL file
>>>>>> * How to generate a global or local alignment with the Needleman-Wunsch-
>>>>>> or the Smith-Waterman-algorithm
>>>>>> * How to read a protein structure - PDB file
>>>>>> * How to export a sequence to fasta
>>>>>> * How to view a sequence in a gui
>>>>>> * How to parse a Fasta database search output file
>>>>>>
>>>>>>
>>>>>> As a conclusion I would suggest that BioJava should have the goal to
>>>>>> provide easy access to the
>>>>>> core "functionalities" (group B).  I believe that we should try to keep
>>>>>> the "concepts" that are being used to
>>>>>> achieve these functionalities as simple as possible. In this sense, I
>>>>>> feel that we have too many hits on the group A pages.
>>>>>>
>>>>>> Andreas
>>>>>>
>>>>>> -----------------------------------------------------------------------
>>>>>>
>>>>>> Andreas Prlic      Wellcome Trust Sanger Institute
>>>>>>                               Hinxton, Cambridge CB10 1SA, UK
>>>>>>              +44 (0) 1223 49 6891
>>>>>>
>>>>>> -----------------------------------------------------------------------
>>>>>>
>>>>>>
>>>>>>
>>>>>> -- The Wellcome Trust Sanger Institute is operated by Genome
>>>>>> Research Limited, a charity registered in England with number 1021457 and
>>>>>> a company registered in England with number 2742969, whose
>>>>>> registered office is 215 Euston Road, London, NW1 2BE.
>> _______________________________________________
>> biojava-dev mailing list
>> biojava-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v1.4.2.2 (GNU/Linux)
>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>>
>> iD8DBQFG8je04C5LeMEKA/QRAn9qAJoD8pm6gf66bUemweX15IGGwrLXowCgkJcB
>> 8RPZSfbrr9Nfbk3AlqqAet8=
>> =K3qH
>> -----END PGP SIGNATURE-----


