[Biojava-dev] The future of BioJava
Mark Schreiber
markjschreiber at gmail.com
Thu Sep 20 09:28:14 UTC 2007
The main value of the Symbol representation comes when you do
Distributions and DP, which is really why Matthew and Thomas developed
it, and quite probably why they developed BioJava at all. If you are just
pushing data around, which seems to be what most applications do, then
Strings are better.
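
To make the "just pushing data around" case concrete, here is a minimal sketch that uses nothing but java.lang.String. The class and method names are made up for illustration and are not BioJava API, and it assumes unambiguous DNA (A, C, G, T only).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of BioJava: everyday sequence handling on plain Strings.
class StringSequenceDemo {

    // Reverse-complements unambiguous DNA; input is case-insensitive, output is upper case.
    static String reverseComplement(String dna) {
        StringBuilder out = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            char c = Character.toUpperCase(dna.charAt(i));
            switch (c) {
                case 'A': out.append('T'); break;
                case 'C': out.append('G'); break;
                case 'G': out.append('C'); break;
                case 'T': out.append('A'); break;
                default:
                    throw new IllegalArgumentException("Not an unambiguous DNA base: " + c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String dna = "ATGCCGTA";
        System.out.println(dna.substring(2, 6));     // take a sub-sequence
        System.out.println(dna.indexOf("CCG"));      // simple motif search
        System.out.println(reverseComplement(dna));  // reverse complement

        // Strings have value-based equals() and hashCode(), so they work as map keys as-is.
        Map<String, String> names = new HashMap<>();
        names.put(dna, "example");
        System.out.println(names.get("ATGCCGTA"));
    }
}
```

None of this needs an Alphabet or a Symbol, which is the point.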
I have previously proposed separating the Symbol, Alphabet, DP and
Dist packages from the rest because they have value well beyond
biology, but an equally strong argument is that most bio stuff doesn't
need this level of analysis. If you only want to convert EMBL to Fasta
or read a BLAST result you don't need it.
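
As a rough illustration of the output half of such a conversion, the sketch below writes one FASTA record straight from a String. The names (FastaFromString, toFasta) and the caller-chosen line width are assumptions for the example only; parsing the EMBL side is not shown.

```java
// Hypothetical helper, not part of BioJava: formats a FASTA record from plain Strings.
class FastaFromString {

    // Builds a '>' header line followed by the sequence wrapped at lineWidth characters.
    static String toFasta(String id, String sequence, int lineWidth) {
        if (lineWidth <= 0) {
            throw new IllegalArgumentException("lineWidth must be positive");
        }
        StringBuilder sb = new StringBuilder();
        sb.append('>').append(id).append('\n');
        for (int i = 0; i < sequence.length(); i += lineWidth) {
            int end = Math.min(i + lineWidth, sequence.length());
            sb.append(sequence, i, end).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(toFasta("seq1", "ACGTACGTAC", 4));
    }
}
```

The sequence never leaves String form between reading and writing.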
For those who want to read in EMBL and compute some Distribution or
run a Hidden Markov Model, I would propose converting Stringy
sequences to SymbolLists at the point when it is needed, not at
the point when you read them in. Given that almost all sequence I/O
starts and ends as a String, the point where you convert to
Symbols doesn't matter much. The only question is: do you need to
convert to Symbols for the analysis you are doing?
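
A minimal sketch of what converting at the point of need could look like. The Base enum is only a stand-in for a DNA alphabet, and none of these names are BioJava types. The sequence stays a String until an analysis (here, a simple base-frequency count) actually asks for symbols.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not BioJava API: convert String -> symbols only when an analysis needs them.
class LazySymbolConversion {

    // Stand-in for a DNA alphabet; each constant is a shared singleton, much like a Symbol.
    enum Base { A, C, G, T }

    // Called by analyses only; I/O code never needs to call this.
    static List<Base> toSymbols(String dna) {
        List<Base> symbols = new ArrayList<>(dna.length());
        for (int i = 0; i < dna.length(); i++) {
            switch (Character.toUpperCase(dna.charAt(i))) {
                case 'A': symbols.add(Base.A); break;
                case 'C': symbols.add(Base.C); break;
                case 'G': symbols.add(Base.G); break;
                case 'T': symbols.add(Base.T); break;
                default:
                    throw new IllegalArgumentException("Not a DNA symbol at position " + i);
            }
        }
        return symbols;
    }

    // Relative frequency of each base, indexed by Base.ordinal(); all zeros for an empty list.
    static double[] frequencies(List<Base> symbols) {
        double[] freq = new double[Base.values().length];
        if (symbols.isEmpty()) {
            return freq;
        }
        for (Base b : symbols) {
            freq[b.ordinal()]++;
        }
        for (int i = 0; i < freq.length; i++) {
            freq[i] /= symbols.size();
        }
        return freq;
    }

    public static void main(String[] args) {
        String raw = "AACG";                          // as read from a file: still a String
        double[] freq = frequencies(toSymbols(raw));  // conversion happens here, on demand
        for (Base b : Base.values()) {
            System.out.println(b + " " + freq[b.ordinal()]);
        }
    }
}
```

Code that only reads, slices or writes the sequence never pays for the conversion.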
(Sorry for not putting this on the wiki, I'll do it later).
- Mark
On 9/20/07, Richard Holland <holland at ebi.ac.uk> wrote:
> This is one of my main bugbears too. I've never quite understood why we
> can't just use Strings, and resort to SymbolLists only when more
> advanced manipulation is required (e.g. quality scores for each base).
> After all, a String is a memory word overhead (32 or 64 bits) plus
> 16 bits (Unicode) per character, but most SymbolList implementations are
> a memory word overhead plus an additional entire memory word per Symbol,
> each word being a pointer to the memory location where the Symbol
> singleton lives. So SymbolLists actually use more memory than Strings,
> not less.
>
> (This is not true for CompressedSymbolList which represents sequences as
> a sequence of bits, grouped into groups large enough to uniquely
> identify any single symbol in the alphabet - e.g. 2 bits for DNA).
>
> As you say, most users just want to read a sequence, sublist it, maybe
> reverse comp it or run some simple search over it. This can all easily
> be achieved straight from String format.
>
> The other 'category A' problems are equally important. Could you add a
> section to the Wiki about these and the 'category B' problems? Then we
> can use this as a priority use-case list when it comes to actual
> development.
>
> cheers,
> Richard
>
>
> Andy Yates wrote:
> > Hi,
> >
> > I would say yes to this as well. It is very important to know what green
> > people are attempting to do with BioJava rather than us assuming that we
> > know :). There are parts of BioJava where the code is not flexible
> > enough for people who want to use the code base & other areas where it
> > is too flexible.
> >
> > I've talked to quite a few people over the years who have used biojava
> > for simple & complex applications and they all seem to come back round
> > to a few key problems:
> >
> > * Sequence & SymbolLists are strange and why can't I use a String - All
> > of this makes a lot more sense if you know about the flyweight pattern;
> > if not it just seems very strange.
> >
> > * I have a format that's EMBL like. Can I parse it using Biojava?
> >
> > * How do I read in a FASTA file?
> >
> > * How can I get X from this chromatogram & can I parse my specific trace
> > format into a BioJava object?
> >
> > As Andreas said, it's the occurrence of the category A problems that is
> > the most worrying. In terms of sequences I think I can see why people
> > have a problem with it.
> >
> > Just if we take this as an example:
> >
> > I have my DNA sequence in a String I can substring it, perform a regular
> > expression over it, replace sections, pad it out, format it & so on. If
> > I have a Sequence object I can perform most of these actions but the
> > interface to them seems unintuitive. Things like calling seqString() to
> > get the String back out from a sequence rather than calling toString().
> > Also, let's say I want to use a sequence as a key in a hash map or ask if
> > two sequences are equal (using the old sequence objects) ... at the
> > moment I'd have to convert Sequence -> String to perform the comparison
> > (and that doesn't include checking a Sequence for alphabet equality).
> >
> > I know this sounds like nit-picking & for people who have used biojava
> > extensively a lot of this makes sense. For someone new to the project it
> > seems like we've done something just for the sake of it and we need to
> > get rid of that feeling which I'm sure will happen if we address the
> > category A problem. The rest will fall into place :)
> >
> > Andy
> >
> > Richard Holland wrote:
> > I totally agree.
> >
> > Can you post a short summary of this to the Wiki page?
> >
> > Not all aspects of BioJava are documented, leading people either to give
> > up, consult the JavaDocs online, or post a message to biojava-l or
> > biojava-dev.
> >
> > Is it possible to get similar stats to the ones you have calculated for
> > the JavaDoc pages on our website?
> >
> > Also, is it possible to build some kind of index over the mailing list
> > archives to pull out the most frequently used terms?
> >
> > cheers,
> > Richard
> >
> > Andreas Prlic wrote:
> >>>> Hi,
> >>>>
> >>>> A question related to the discussion of how to design a future BioJava
> >>>> is which parts of BioJava are being actively used and how to improve
> >>>> these.
> >>>>
> >>>> So what are the most frequently used bits of BioJava? One way to look at
> >>>> this is to go to the web-stats and see how many hits we have got on our
> >>>> documentation web pages.
> >>>>
> >>>> In an ideal world BioJava would be so simple to use that nobody needs
> >>>> to read any documentation. Unfortunately we are far away from this, so
> >>>> actually looking at these stats gives an impression of
> >>>>
> >>>> * topics / functionality which are of particular interest to the
> >>>> community
> >>>> * topics / functionality which might not be straightforward to use,
> >>>> and which therefore get many hits on these pages.
> >>>>
> >>>> A look at the webstats from the last couple of months gives these top 10
> >>>> Cookbook pages that have been accessed frequently. This list is ordered
> >>>> by number of pageviews:
> >>>>
> >>>> 1. /wiki/BioJava:Cookbook:Alphabets
> >>>> 2. /wiki/BioJava:CookBook:Blast:Parser
> >>>> 3. /wiki/BioJava:Cookbook:SeqIO:ReadFasta
> >>>> 4. /wiki/BioJava:Cookbook:SeqIO:ReadGES
> >>>> 5. /wiki/BioJava:CookBook:DP:PairWise2
> >>>> 6. /wiki/BioJava:CookBook:PDB:read
> >>>> 7. /wiki/BioJava:Cookbook:Sequence
> >>>> 8. /wiki/BioJava:Cookbook:SeqIO:WriteInFasta
> >>>> 9. /wiki/BioJava:CookBook:Interfaces:ViewInGUI
> >>>> 10. /wiki/BioJava:CookBook:Fasta:Parse
> >>>>
> >>>> I would group these pages into 2 groups.
> >>>> A) How to work with core concepts of BioJava
> >>>> B) How to use a functionality of BioJava to achieve a certain goal
> >>>>
> >>>> The "conceptual" pages (A) I would identify as
> >>>> * How to get an Alphabet
> >>>> * How to make a Sequence Object from a String or make a Sequence Object
> >>>> back into a String
> >>>>
> >>>> The "functionality" pages (B) I would summarize as
> >>>> * How to parse a Blast output
> >>>> * How to read sequences from a Fasta file
> >>>> * How to read a GenBank, SwissProt or EMBL file
> >>>> * How to generate a global or local alignment with the Needleman-Wunsch
> >>>> or the Smith-Waterman algorithm
> >>>> * How to read a protein structure - PDB file
> >>>> * How to export a sequence to fasta
> >>>> * How to view a sequence in a gui
> >>>> * How to parse a Fasta database search output file
> >>>>
> >>>>
> >>>> As a conclusion I would suggest that BioJava should have the goal to
> >>>> provide easy access to the
> >>>> core "functionalities" (group B). I believe that we should try to keep
> >>>> the "concepts" that are being used to
> >>>> achieve these functionalities as simple as possible. In this sense, I
> >>>> feel that we have too many hits on the group A pages.
> >>>>
> >>>> Andreas
> >>>>
> >>>> -----------------------------------------------------------------------
> >>>>
> >>>> Andreas Prlic Wellcome Trust Sanger Institute
> >>>> Hinxton, Cambridge CB10 1SA, UK
> >>>> +44 (0) 1223 49 6891
> >>>>
> >>>> -----------------------------------------------------------------------
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> The Wellcome Trust Sanger Institute is operated by Genome Research
> >>>> Limited, a charity registered in England with number 1021457 and a
> >>>> company registered in England with number 2742969, whose registered
> >>>> office is 215 Euston Road, London, NW1 2BE.
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>