[Biojava-dev] The future of BioJava

Andy Yates ayates at ebi.ac.uk
Thu Sep 20 10:55:13 UTC 2007


Ok I'll add them in. Can you remember if I've actually got a wiki account?

Richard Holland wrote:
> This is one of my main bugbears too. I've never quite understood why we
> can't just use Strings, and resort to SymbolLists only when more
> advanced manipulation is required (e.g. quality scores for each base).
> After all, a String costs one memory word of overhead (32 or 64 bits) plus
> 16 bits (Unicode) per character, whereas most SymbolList implementations
> cost one memory word of overhead plus an entire additional memory word per
> Symbol, each word being a pointer to the memory location where the Symbol
> singleton lives. So SymbolLists actually use more memory than Strings,
> not less.
> 
> (This is not true for CompressedSymbolList, which stores a sequence as
> a run of bits, in groups just large enough to uniquely identify any
> single symbol in the alphabet, e.g. 2 bits for DNA.)
> 
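To make the 2-bits-per-base idea concrete, here is a minimal sketch in plain Java. PackedDna and its methods are hypothetical names made up for illustration; this is not BioJava API and not necessarily how CompressedSymbolList is implemented. It assumes unambiguous DNA only (A, C, G, T) and rejects anything else.

```java
// Hypothetical helper, not part of BioJava: packs four bases into each byte.
final class PackedDna {
    // The index of a base in this string is its 2-bit code: A=0, C=1, G=2, T=3.
    private static final String BASES = "ACGT";

    private PackedDna() {
    }

    // Packs an unambiguous DNA string at 2 bits per base.
    static byte[] pack(String dna) {
        byte[] out = new byte[(dna.length() + 3) / 4];
        for (int i = 0; i < dna.length(); i++) {
            char base = Character.toUpperCase(dna.charAt(i));
            int code = BASES.indexOf(base);
            if (code < 0) {
                throw new IllegalArgumentException("Not an unambiguous DNA base: " + base);
            }
            // Base i lives in byte i / 4, at bit offset (i % 4) * 2.
            out[i / 4] |= (byte) (code << ((i % 4) * 2));
        }
        return out;
    }

    // Unpacks the first 'length' bases; the caller must remember the length
    // because the last byte may be only partly used.
    static String unpack(byte[] packed, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            int code = (packed[i / 4] >> ((i % 4) * 2)) & 3;
            sb.append(BASES.charAt(code));
        }
        return sb.toString();
    }
}
```

With this, PackedDna.pack("ACGTAC") needs 2 bytes where the String needs 12 bytes of character data, and PackedDna.unpack(packed, 6) gives the original sequence back. The price is that ambiguity codes such as N no longer fit, which is where a wider grouping per symbol would be needed.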
> As you say, most users just want to read a sequence, sublist it, maybe
> reverse comp it or run some simple search over it. This can all easily
> be achieved straight from String format.
> 
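As a rough illustration of how far plain Strings go for these simple cases, here is a sketch of a reverse complement done directly on a String. DnaStrings is a hypothetical name, not a BioJava class, and the sketch assumes the sequence contains only A, C, G, T and N (upper or lower case).

```java
// Hypothetical helper, not part of BioJava.
final class DnaStrings {
    private DnaStrings() {
    }

    // Returns the complementary base, preserving case; N stays N.
    static char complement(char base) {
        switch (base) {
            case 'A': return 'T';
            case 'C': return 'G';
            case 'G': return 'C';
            case 'T': return 'A';
            case 'N': return 'N';
            case 'a': return 't';
            case 'c': return 'g';
            case 'g': return 'c';
            case 't': return 'a';
            case 'n': return 'n';
            default:
                throw new IllegalArgumentException("Unexpected base: " + base);
        }
    }

    // Walks the string backwards, complementing each base as it goes.
    static String reverseComplement(String dna) {
        StringBuilder sb = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            sb.append(complement(dna.charAt(i)));
        }
        return sb.toString();
    }
}
```

Sublisting and simple searching need no helper at all: dna.substring(2, 10) and dna.indexOf("GAATTC") are already part of java.lang.String. One thing to keep in mind is that substring is 0-based with an exclusive end, whereas biological coordinates are usually 1-based and inclusive.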
> The other 'category A' problems are equally important. Could you add a
> section to the Wiki about these and the 'category B' problems?  Then we
> can use this as a priority use-case list when it comes to actual
> development.
> 
> cheers,
> Richard
> 
> 
> Andy Yates wrote:
>> Hi,
>>
>> I would say yes to this as well. It is very important to know what new
>> users are attempting to do with BioJava rather than us assuming that we
>> know :). There are parts of BioJava where the code is not flexible enough
>> for people who want to build on the code base, and other areas where it
>> is too flexible.
>>
>> I've talked to quite a few people over the years who have used biojava
>> for simple & complex applications and they all seem to come back round
>> to a few key problems:
>>
>> * Sequence & SymbolLists are strange; why can't I use a String? All
>> of this makes a lot more sense if you know about the flyweight pattern;
>> if not it just seems very strange.
>>
>> * I have a format that's EMBL like. Can I parse it using Biojava?
>>
>> * How do I read in a FASTA file?
>>
>> * How can I get X from this chromatogram & can I parse my specific trace
>> format into a BioJava object?
>>
>> As Andreas said, it's the occurrence of the category A problems that is
>> the most worrying. In terms of sequences I think I can see why people
>> have a problem with them.
>>
>> Take this as an example:
>>
>> If I have my DNA sequence in a String, I can substring it, run a regular
>> expression over it, replace sections, pad it out, format it & so on. If
>> I have a Sequence object I can perform most of these actions, but the
>> interface to them seems unintuitive: for example, calling seqString() to
>> get the String back out of a sequence rather than calling toString().
>> Also, let's say I want to use a sequence as a key in a hash map or ask if
>> two sequences are equal (using the old sequence objects). At the
>> moment I'd have to convert Sequence -> String to perform the comparison
>> (and that doesn't include checking the Sequences for alphabet equality).
>>
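To show what "usable as a hash map key" would involve, here is a minimal sketch of a value-style sequence class. SequenceKey is a hypothetical name, not an existing BioJava type, and representing the alphabet by a plain name String is an assumption made only to keep the example self-contained.

```java
// Hypothetical value class, not part of BioJava.
final class SequenceKey {
    private final String alphabetName; // e.g. "DNA" or "PROTEIN" (assumed naming)
    private final String residues;

    SequenceKey(String alphabetName, String residues) {
        if (alphabetName == null || residues == null) {
            throw new IllegalArgumentException("alphabetName and residues must not be null");
        }
        this.alphabetName = alphabetName;
        this.residues = residues;
    }

    // Two keys are equal only if both the alphabet and the residues match.
    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof SequenceKey)) {
            return false;
        }
        SequenceKey that = (SequenceKey) other;
        return alphabetName.equals(that.alphabetName) && residues.equals(that.residues);
    }

    // Must agree with equals so that equal keys land in the same hash bucket.
    @Override
    public int hashCode() {
        return 31 * alphabetName.hashCode() + residues.hashCode();
    }

    // Returns the residues directly, which is what most callers expect.
    @Override
    public String toString() {
        return residues;
    }
}
```

With equals and hashCode defined like this, new SequenceKey("DNA", "ACGT") can be used directly as a HashMap key, and the same residues under a different alphabet name compare as unequal, which covers the alphabet check mentioned above.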
>> I know this sounds like nit-picking, & for people who have used biojava
>> extensively a lot of this makes sense. For someone new to the project it
>> seems like we've done something just for the sake of it, and we need to
>> get rid of that feeling, which I'm sure will happen if we address the
>> category A problems. The rest will fall into place :)
>>
>> Andy
>>
>> Richard Holland wrote:
>> I totally agree.
>>
>> Can you post a short summary of this to the Wiki page?
>>
>> Not all aspects of BioJava are documented, leading people either to give
>> up, consult the JavaDocs online, or post a message to biojava-l or
>> biojava-dev.
>>
>> Is it possible to get stats for the JavaDoc pages on our website,
>> similar to the ones you have calculated?
>>
>> Also, is it possible to build some kind of index over the mailing list
>> archives to pull out the most frequently used terms?
>>
>> cheers,
>> Richard
>>
>> Andreas Prlic wrote:
>>>>> Hi,
>>>>>
>>>>> A question related to the discussion of how to design a future BioJava
>>>>> is which parts of BioJava are being actively used and how to improve
>>>>> these.
>>>>>
>>>>> So what are the most frequently used bits of BioJava? One way to look at
>>>>> this is to go to the web stats and see how many hits we have got on our
>>>>> documentation web pages.
>>>>>
>>>>> In an ideal world BioJava would be so simple to use that nobody would
>>>>> need to read any documentation. Unfortunately we are far away from this,
>>>>> so looking at these stats gives an impression of
>>>>>
>>>>> * topics / functionality which are of particular interest to the
>>>>> community
>>>>> * topics / functionality which might not be straightforward to use,
>>>>> and which therefore get many hits on these pages.
>>>>>
>>>>> A look at the web stats from the last couple of months gives these top 10
>>>>> most frequently accessed Cookbook pages. The list is ordered by number of
>>>>> page views:
>>>>>
>>>>> 1. /wiki/BioJava:Cookbook:Alphabets
>>>>> 2. /wiki/BioJava:CookBook:Blast:Parser
>>>>> 3. /wiki/BioJava:Cookbook:SeqIO:ReadFasta
>>>>> 4. /wiki/BioJava:Cookbook:SeqIO:ReadGES
>>>>> 5. /wiki/BioJava:CookBook:DP:PairWise2
>>>>> 6. /wiki/BioJava:CookBook:PDB:read
>>>>> 7. /wiki/BioJava:Cookbook:Sequence
>>>>> 8. /wiki/BioJava:Cookbook:SeqIO:WriteInFasta
>>>>> 9. /wiki/BioJava:CookBook:Interfaces:ViewInGUI
>>>>> 10. /wiki/BioJava:CookBook:Fasta:Parse
>>>>>
>>>>> I would group these pages into 2 groups.
>>>>> A) How to work with core concepts of BioJava
>>>>> B) How to use a functionality of BioJava to achieve a certain goal
>>>>>
>>>>> The "conceptual" pages (A) I would identify as
>>>>> * How to get an Alphabet
>>>>> * How to make a Sequence Object from a String or make a Sequence Object
>>>>> back into a String
>>>>>
>>>>> The "functionality"  pages (B) I would summarize as
>>>>> * How to parse a Blast output
>>>>> * How to read sequences from a Fasta file
>>>>> * How to read a GenBank, SwissProt or EMBL file
>>>>> * How to generate a global or local alignment with the Needleman-Wunsch
>>>>> or the Smith-Waterman algorithm
>>>>> * How to read a protein structure - PDB file
>>>>> * How to export a sequence to fasta
>>>>> * How to view a sequence in a gui
>>>>> * How to parse a Fasta database search output file
>>>>>
>>>>>
>>>>> In conclusion, I would suggest that BioJava should aim to provide easy
>>>>> access to the core "functionalities" (group B). I believe that we should
>>>>> keep the "concepts" that are used to achieve these functionalities as
>>>>> simple as possible. In this sense, I feel that we have too many hits on
>>>>> the group A pages.
>>>>>
>>>>> Andreas
>>>>>
>>>>> -----------------------------------------------------------------------
>>>>>
>>>>> Andreas Prlic      Wellcome Trust Sanger Institute
>>>>>                               Hinxton, Cambridge CB10 1SA, UK
>>>>>              +44 (0) 1223 49 6891
>>>>>
>>>>> -----------------------------------------------------------------------
>>>>>


