[Biojava-dev] The future of BioJava

Richard Holland holland at ebi.ac.uk
Thu Sep 20 09:04:53 UTC 2007


This is one of my main bugbears too. I've never quite understood why we
can't just use Strings, and resort to SymbolLists only when more
advanced manipulation is required (e.g. quality scores for each base).
After all, a String costs one memory word of overhead (32 or 64 bits)
plus 16 bits (Unicode) per character, whereas most SymbolList
implementations cost one memory word of overhead plus an additional
entire memory word per Symbol, each word being a pointer to the memory
location where the Symbol singleton lives. So SymbolLists actually use
more memory than Strings, not less.
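
The comparison above can be put into rough numbers. The following is a
minimal sketch, not a measurement: the class and method names are
hypothetical, object and array headers are ignored, and the sizes (2 bytes
per char, 4 or 8 bytes per reference) are assumptions matching the figures
quoted in the paragraph.

```java
// Back-of-envelope estimate only. Per-object headers, array headers and
// alignment padding are ignored; all sizes are assumptions, not measurements.
final class SequenceMemoryEstimate {

    // Payload of a String that stores one 16-bit char per symbol.
    static long stringPayloadBytes(long symbols) {
        return symbols * 2L;
    }

    // Payload of a list that stores one reference (pointer) per symbol,
    // each pointing at a shared Symbol singleton.
    static long referenceListPayloadBytes(long symbols, int pointerBytes) {
        return symbols * (long) pointerBytes;
    }

    public static void main(String[] args) {
        long n = 1_000_000L; // a one-megabase sequence
        System.out.println("String (16-bit chars): " + stringPayloadBytes(n) + " bytes");
        System.out.println("References, 32-bit:    " + referenceListPayloadBytes(n, 4) + " bytes");
        System.out.println("References, 64-bit:    " + referenceListPayloadBytes(n, 8) + " bytes");
    }
}
```

Under these assumptions the reference-per-symbol layout needs two to four
times the payload of the String, which is the point being made above.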

(This is not true for CompressedSymbolList, which represents sequences as
a sequence of bits, packed into groups just large enough to uniquely
identify any single symbol in the alphabet - e.g. 2 bits for DNA).
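
To illustrate the idea of 2 bits per DNA base, here is a small sketch of
bit packing. It is not BioJava's implementation: the class TwoBitDna and
all of its methods are hypothetical, and it assumes only the four
unambiguous bases A, C, G and T.

```java
// Hypothetical illustration of 2-bit packing; not BioJava code.
final class TwoBitDna {
    private static final char[] BASES = {'A', 'C', 'G', 'T'};

    private final byte[] packed; // four bases per byte
    private final int length;    // number of bases stored

    TwoBitDna(String dna) {
        length = dna.length();
        packed = new byte[(length + 3) / 4];
        for (int i = 0; i < length; i++) {
            int code = codeOf(dna.charAt(i));
            // Base i lives in byte i/4, at bit offset (i mod 4) * 2.
            packed[i >> 2] |= (byte) (code << ((i & 3) * 2));
        }
    }

    private static int codeOf(char base) {
        switch (Character.toUpperCase(base)) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default:
                throw new IllegalArgumentException("Not an unambiguous DNA base: " + base);
        }
    }

    char charAt(int i) {
        if (i < 0 || i >= length) {
            throw new IndexOutOfBoundsException("index " + i);
        }
        int code = (packed[i >> 2] >> ((i & 3) * 2)) & 3;
        return BASES[code];
    }

    int length() {
        return length;
    }

    int packedSizeInBytes() {
        return packed.length;
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(charAt(i));
        }
        return sb.toString();
    }
}
```

With this layout nine bases fit into three bytes, at the price of a shift
and a mask on every access and no room for ambiguity symbols.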

As you say, most users just want to read a sequence, sublist it, maybe
reverse comp it or run some simple search over it. This can all easily
be achieved straight from String format.
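
As a sketch of what "straight from String format" could look like, the
following uses only java.lang.String. The class StringSequenceOps and its
methods are hypothetical helpers written for this illustration, and it
assumes upper- or lower-case A, C, G, T and N only.

```java
// Hypothetical helpers operating directly on Strings; not BioJava code.
final class StringSequenceOps {

    // Complement of a single base; N stays N. Anything else is rejected.
    static char complement(char base) {
        switch (Character.toUpperCase(base)) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            case 'N': return 'N';
            default:
                throw new IllegalArgumentException("Unsupported base: " + base);
        }
    }

    // Walk the sequence backwards, complementing each base.
    static String reverseComplement(String dna) {
        StringBuilder sb = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            sb.append(complement(dna.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String seq = "ATGCCGTA";
        System.out.println(seq.substring(2, 5));    // sublist (0-based, end exclusive)
        System.out.println(reverseComplement(seq)); // reverse complement
        System.out.println(seq.indexOf("CCG"));     // simple search
    }
}
```

Note that String.substring is 0-based with an exclusive end, so any
wrapper that wants 1-based biological coordinates has to convert them.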

The other 'category A' problems are equally important. Could you add a
section to the Wiki about these and the 'category B' problems?  Then we
can use this as a priority use-case list when it comes to actual
development.

cheers,
Richard


Andy Yates wrote:
> Hi,
> 
> I would say yes to this as well. It is very important to know what new
> users are attempting to do with BioJava rather than us assuming that we
> know :). In some parts of BioJava the code is not flexible enough for
> people who want to build on the code base, & in other areas it is too
> flexible.
> 
> I've talked to quite a few people over the years who have used biojava
> for simple & complex applications and they all seem to come back round
> to a few key problems:
> 
> * Sequence & SymbolLists are strange and why can't I use a String - All
> of this makes a lot more sense if you know about the flyweight pattern;
> if not it just seems very strange.
> 
> * I have a format that's EMBL like. Can I parse it using Biojava?
> 
> * How do I read in a FASTA file?
> 
> * How can I get X from this chromatogram & can I parse my specific trace
> format into a BioJava object?
> 
> As Andreas said, it's the occurrence of the category A problems that is
> the most worrying. In terms of sequences I think I can see why people
> have a problem with it.
> 
> To take this as an example:
> 
> If I have my DNA sequence in a String, I can substring it, run a regular
> expression over it, replace sections, pad it out, format it & so on. If
> I have a Sequence object I can perform most of these actions, but the
> interface to them seems unintuitive: for example, calling seqString() to
> get the String back out of a sequence rather than calling toString().
> Also, let's say I want to use a sequence as a key in a hash map or ask if
> two sequences are equal (using the old sequence objects) ... at the
> moment I'd have to convert Sequence -> String to perform the comparison
> (and that doesn't include checking a Sequence for alphabet equality).
> 
> I know this sounds like nit-picking & for people who have used biojava
> extensively a lot of this makes sense. For someone new to the project it
> seems like we've done something just for the sake of it and we need to
> get rid of that feeling which I'm sure will happen if we address the
> category A problem. The rest will fall into place :)
> 
> Andy
> 
> Richard Holland wrote:
> I totally agree.
> 
> Can you post a short summary of this to the Wiki page?
> 
> Not all aspects of BioJava are documented, leading people either to give
> up, consult the JavaDocs online, or post a message to biojava-l or
> biojava-dev.
> 
> Is it possible to get similar stats to the ones you have calculated for
> the JavaDoc pages on our website?
> 
> Also, is it possible to build some kind of index over the mailing list
> archives to pull out the most frequently used terms?
> 
> cheers,
> Richard
> 
> Andreas Prlic wrote:
>>>> Hi,
>>>>
>>>> A question related to the discussion of how to design a future BioJava
>>>> is which parts of BioJava are being actively used and how to improve
>>>> these.
>>>>
>>>> So what are the most frequently used bits of BioJava? One way to look at
>>>> this is to go to the web-stats and see how many hits we have got on our
>>>> documentation web pages.
>>>>
>>>> In an ideal world BioJava would be so simple to use that nobody would
>>>> need to read any documentation. Unfortunately we are far away from
>>>> this, so looking at these stats gives an impression of
>>>>
>>>> * topics / functionality which are of particular interest to the
>>>> community
>>>> * topics / functionality which might not be straightforward to use,
>>>> therefore there are many hits on these pages.
>>>>
>>>> A look at the webstats from the last couple of months gives these top 10
>>>> Cookbook pages that have been accessed most frequently. The list is
>>>> ordered by number of page views.
>>>>
>>>> 1. /wiki/BioJava:Cookbook:Alphabets
>>>> 2. /wiki/BioJava:CookBook:Blast:Parser
>>>> 3. /wiki/BioJava:Cookbook:SeqIO:ReadFasta
>>>> 4. /wiki/BioJava:Cookbook:SeqIO:ReadGES
>>>> 5. /wiki/BioJava:CookBook:DP:PairWise2
>>>> 6. /wiki/BioJava:CookBook:PDB:read
>>>> 7. /wiki/BioJava:Cookbook:Sequence
>>>> 8. /wiki/BioJava:Cookbook:SeqIO:WriteInFasta
>>>> 9. /wiki/BioJava:CookBook:Interfaces:ViewInGUI
>>>> 10. /wiki/BioJava:CookBook:Fasta:Parse
>>>>
>>>> I would group these pages into 2 groups.
>>>> A) How to work with core concepts of BioJava
>>>> B) How to use a functionality of BioJava to achieve a certain goal
>>>>
>>>> The "conceptual" pages (A) I would identify as
>>>> * How to get an Alphabet
>>>> * How to make a Sequence Object from a String or make a Sequence Object
>>>> back into a String
>>>>
>>>> The "functionality"  pages (B) I would summarize as
>>>> * How to parse a Blast output
>>>> * How to read sequences from a Fasta file
>>>> * How to read a GenBank, SwissProt or EMBL file
>>>> * How to generate a global or local alignment with the Needleman-Wunsch
>>>> or the Smith-Waterman algorithm
>>>> * How to read a protein structure - PDB file
>>>> * How to export a sequence to fasta
>>>> * How to view a sequence in a gui
>>>> * How to parse a Fasta database search output file
>>>>
>>>>
>>>> As a conclusion I would suggest that BioJava should have the goal to
>>>> provide easy access to the
>>>> core "functionalities" (group B).  I believe that we should try to keep
>>>> the "concepts" that are being used to
>>>> achieve these functionalities as simple as possible. In this sense, I
>>>> feel that we have too many hits on the group A pages.
>>>>
>>>> Andreas
>>>>
>>>> -----------------------------------------------------------------------
>>>>
>>>> Andreas Prlic      Wellcome Trust Sanger Institute
>>>>                               Hinxton, Cambridge CB10 1SA, UK
>>>>              +44 (0) 1223 49 6891
>>>>
>>>> -----------------------------------------------------------------------
>>>>
>>>>
>>>>
>>>> -- The Wellcome Trust Sanger Institute is operated by Genome
>>>> Research Limited, a charity registered in England with number 1021457 and
>>>> a company registered in England with number 2742969, whose
>>>> registered office is 215 Euston Road, London, NW1 2BE.