[Biojava-dev] The future of BioJava

Richard Holland holland at ebi.ac.uk
Fri Sep 21 08:47:55 UTC 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Also could we make SymbolList implement List? The iterator() method
would then do the cached conversion if required before returning an
Iterator<Symbol> over the symbols. That would make it very pluggable.
We'd need it to have a settable flag indicating whether the user wants
1-indexed or 0-indexed access (the default being 1-indexed as this is
the most common biological use).
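
A minimal sketch of the index flag, assuming a hypothetical class named IndexedSymbolList (not an existing BioJava class) that wraps an ArrayList. One caveat: java.util.List is 0-indexed by contract, so inherited methods that call get(int) internally (listIterator(), indexOf(), equals()) would misbehave in 1-indexed mode unless they are overridden as well.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: a List whose get(int) honours a settable index base.
class IndexedSymbolList<S> extends AbstractList<S> {
    private final List<S> backing = new ArrayList<S>();
    private boolean oneIndexed = true; // default is 1-indexed, as proposed

    void setOneIndexed(boolean oneIndexed) {
        this.oneIndexed = oneIndexed;
    }

    @Override
    public boolean add(S symbol) {
        // Append directly so the index base never affects insertion.
        return backing.add(symbol);
    }

    @Override
    public S get(int index) {
        return backing.get(oneIndexed ? index - 1 : index);
    }

    @Override
    public int size() {
        return backing.size();
    }

    @Override
    public Iterator<S> iterator() {
        // Iterate over the backing list so iteration ignores the index base.
        return backing.iterator();
    }
}
```

With this, get(1) returns the first symbol by default, and get(0) returns it after setOneIndexed(false).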

Only downside is that List uses generics and so SymbolList must too -
meaning that SymbolList must always be declared as SymbolList<Symbol>
(or some subclass of Symbol).

But that's also an upside - you could subclass Symbol into DNASymbol,
RNASymbol, etc. etc. - meaning that an alphabet is tied directly to the
symbol and need not be specified separately:

  SymbolList<DNASymbol> dna = new SymbolList<DNASymbol>();
  dna.add(RNAAlphabet.Q); // Rejected: not a DNASymbol (with generics,
                          // most likely a compile-time error)

  SymbolList<CompoundSymbol<DNASymbol,ScoreSymbol>> = new ....; // Cool!

Also cool is that you could do this:

  public SymbolList<RNASymbol> translate(SymbolList<DNASymbol> dna);
      // Also cool!
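
A rough sketch of how typed symbols could look, assuming hypothetical Symbol, DNASymbol, RNASymbol and SymbolList types that only borrow their names from this email (the real BioJava interfaces are richer). The method keeps the translate(...) signature from above, although biologically a DNA-to-RNA step is transcription.

```java
import java.util.ArrayList;

// Hypothetical marker interface standing in for BioJava's Symbol.
interface Symbol {}

// Each alphabet becomes its own type, so the alphabet travels with the symbol.
enum DNASymbol implements Symbol { A, C, G, T }
enum RNASymbol implements Symbol { A, C, G, U }

// Hypothetical generic list: the type parameter fixes the alphabet.
class SymbolList<S extends Symbol> extends ArrayList<S> {
    private static final long serialVersionUID = 1L;
}

class Transcriber {
    // Same shape as the translate(...) signature in the email:
    // the compiler checks that DNA goes in and RNA comes out.
    static SymbolList<RNASymbol> translate(SymbolList<DNASymbol> dna) {
        SymbolList<RNASymbol> rna = new SymbolList<RNASymbol>();
        for (DNASymbol d : dna) {
            // T maps to U; the other bases keep their names.
            rna.add(d == DNASymbol.T ? RNASymbol.U : RNASymbol.valueOf(d.name()));
        }
        return rna;
    }
}
```

In this sketch a call such as dna.add(RNASymbol.U) on a SymbolList<DNASymbol> does not compile, so the mismatch is caught before the program runs.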

cheers,
Richard

Richard Holland wrote:
> I like that idea of having SymbolLists backed by different things. I'd
> suggest that by default, all sequences read from file should be
> String-backed SymbolLists, and that they are not broken down into
> Symbols until first requested to do so by code that needs to know the
> actual Symbols (e.g. code that cares about ambiguity symbols). The same
> applies in reverse - lists constructed from symbols should not be
> converted to strings until needed.
> 
> Something like this:
> 
>   SymbolList sl = new SymbolList();
>   sl.setString("AGCGGACT");
>             // Changes the string, and clears any cached
>             // conversion of it.
>   String seq = sl.getString();
>             // Dumps the string. If not already converted
>             // to a string, does the conversion and
>             // caches it first.
>   char base = sl.charAt(5);
>             // 1-indexed single-base string. This would
>             // likely delegate to String.charAt() and only
>             // works for single-character alphabets. Not
>             // to be used in any other circumstances.
>   sl.set/getAlphabet()....
>             // Use these to set the alphabet before
>             // using set/getSymbols()/symbolAt().
>   sl.setSymbols(new ArrayList<Symbol>(....));
>             // Uses the list to update the cached symbols
>             // and clear the cached string.
>   List<Symbol> syms = sl.getSymbols();
>             // Converts if not already converted, caches
>             // the conversion, and returns it.
>   Symbol sym = sl.symbolAt(5);
>             // 1-indexed fully flexible symbol finder.
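
The lazy two-way caching described in the quoted outline could be sketched roughly as follows. LazySymbolList and the Base enum are hypothetical stand-ins (plain DNA only, no ambiguity symbols), and the conversions counter exists only so that the laziness can be observed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for real Symbol objects.
enum Base { A, C, G, T }

class LazySymbolList {
    private String cachedString = "";  // null while only the symbols are known
    private List<Base> cachedSymbols;  // null while only the string is known
    int conversions = 0;               // counts string<->symbol conversions

    void setString(String s) {
        cachedString = s;
        cachedSymbols = null; // clear the cached conversion
    }

    void setSymbols(List<Base> symbols) {
        cachedSymbols = new ArrayList<Base>(symbols);
        cachedString = null; // clear the cached string
    }

    String getString() {
        if (cachedString == null) { // convert on first request only
            StringBuilder sb = new StringBuilder();
            for (Base b : cachedSymbols) sb.append(b.name());
            cachedString = sb.toString();
            conversions++;
        }
        return cachedString;
    }

    List<Base> getSymbols() {
        if (cachedSymbols == null) { // convert on first request only
            List<Base> parsed = new ArrayList<Base>();
            for (char c : cachedString.toCharArray()) {
                parsed.add(Base.valueOf(String.valueOf(c)));
            }
            cachedSymbols = parsed;
            conversions++;
        }
        return cachedSymbols;
    }

    // Both accessors are 1-indexed, as in the outline.
    char charAt(int position) { return getString().charAt(position - 1); }
    Base symbolAt(int position) { return getSymbols().get(position - 1); }
}
```

Reading a sequence and writing it straight back out never touches getSymbols(), so no symbol parsing happens in that case.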
> 
> toString() would delegate to getString(), as would hashCode(), equals(),
> and compareTo(). We could provide additional equals()-style methods for
> testing equality whilst taking into account ambiguities.
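
As an illustration of an ambiguity-aware equals()-style method, here is a small sketch that works on plain strings of upper-case IUPAC nucleotide codes. AmbiguityMatcher and its method names are hypothetical; the code table itself is the standard IUPAC one (W = A or T, and so on).

```java
import java.util.HashMap;
import java.util.Map;

class AmbiguityMatcher {
    // Maps each IUPAC nucleotide code to the set of bases it can stand for.
    private static final Map<Character, String> IUPAC = new HashMap<Character, String>();
    static {
        String[] codes = {
            "A=A", "C=C", "G=G", "T=T",
            "R=AG", "Y=CT", "S=CG", "W=AT", "K=GT", "M=AC",
            "B=CGT", "D=AGT", "H=ACT", "V=ACG", "N=ACGT"
        };
        for (String code : codes) {
            IUPAC.put(code.charAt(0), code.substring(2));
        }
    }

    // Two codes are compatible if they share at least one possible base.
    static boolean compatible(char x, char y) {
        String xs = IUPAC.get(x);
        String ys = IUPAC.get(y);
        if (xs == null || ys == null) return false; // unknown code
        for (char c : xs.toCharArray()) {
            if (ys.indexOf(c) >= 0) return true;
        }
        return false;
    }

    // Position-by-position comparison that tolerates ambiguity codes.
    static boolean equalsAllowingAmbiguity(String s1, String s2) {
        if (s1.length() != s2.length()) return false;
        for (int i = 0; i < s1.length(); i++) {
            if (!compatible(s1.charAt(i), s2.charAt(i))) return false;
        }
        return true;
    }
}
```

Here "AWG" and "ATG" are not equal as strings but do match under ambiguity, whereas "AWG" and "ACG" match under neither test.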
> 
> cheers,
> Richard
> 
> 
> Mark Schreiber wrote:
>>> Hello -
>>>
>>> Just to clarify my opinion on Strings vs Symbols.
>>>
>>> I generally prefer Symbols and SymbolLists to Strings because
>>> SymbolLists are smart and Strings are dumb. Classic case is ambiguity
>>> symbols like 'W'. BioJava knows, in the context of DNA this is A or T.
>>> However, I think it would be vastly simpler if there were simpler
>>> getters and setters for SymbolLists that exposed Strings in a
>>> friendlier manner.
>>>
>>> I also think there is a case for SymbolLists that are backed by
>>> Strings (more likely a char[]) instead of Symbol arrays and only do
>>> the needed conversion when required (i.e. when the user calls
>>> symbolAt()). These would be ideal for the case where someone is
>>> converting GenBank to Fasta and there is no need to go through the
>>> Symbol parsing.
>>>
>>> Finally, I think SymbolLists (or whatever they get called) should
>>> implement more of the methods found in String to make them look more
>>> like Strings.  Ideally we should think about implementing some of the
>>> methods that Groovy likes to use for operator overloading. If we do
>>> this it would be possible to concatenate two sequences in Groovy by
>>> doing this (I may have the syntax wrong).
>>>
>>> Seq3 = Seq1 + Seq2
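
For what it is worth, Groovy resolves a + b to a.plus(b), so a Java class only needs a plus method for the line above to work. A minimal sketch, assuming a hypothetical immutable Seq class that also implements CharSequence so that it looks more like a String:

```java
// Hypothetical immutable sequence type that behaves like a String.
final class Seq implements CharSequence {
    private final String residues;

    Seq(String residues) {
        this.residues = residues;
    }

    // Groovy maps the + operator onto a method named plus.
    public Seq plus(Seq other) {
        return new Seq(residues + other.residues);
    }

    @Override
    public int length() {
        return residues.length();
    }

    @Override
    public char charAt(int index) {
        // 0-indexed, as the CharSequence contract requires.
        return residues.charAt(index);
    }

    @Override
    public Seq subSequence(int start, int end) {
        return new Seq(residues.substring(start, end));
    }

    @Override
    public String toString() {
        return residues;
    }
}
```

From Java the call is seq1.plus(seq2); from Groovy the same method should be reachable as seq1 + seq2.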
>>>
>>> The other issue with SymbolLists is that they are not intuitive to
>>> construct because they are not so bean like. This is not just a
>>> problem for newbies but also a major hindrance to the use of JEE,
>>> Spring, JAXB and other important frameworks. It should be possible to
>>> do this:
>>>
>>> SymbolList sl = new SymbolList();
>>> sl.setName("AB123456");
>>> sl.setSequence(seqString);
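
The class side of that usage might look like the sketch below: a public no-argument constructor plus getter/setter pairs, which is the shape that bean-based frameworks generally expect. SequenceBean is a hypothetical name, and the sequence is held as a plain String purely for illustration.

```java
import java.io.Serializable;

// Hypothetical bean-style sequence holder. In real code this would be a
// public class in its own file so that frameworks can instantiate it.
class SequenceBean implements Serializable {
    private static final long serialVersionUID = 1L;

    private String name;
    private String sequence;

    // Beans need a no-argument constructor.
    public SequenceBean() {
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getSequence() {
        return sequence;
    }

    public void setSequence(String sequence) {
        this.sequence = sequence;
    }
}
```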
>>>
>>> The final hindrance to the use of JEE is serialization. If we keep
>>> Symbols flyweight (singleton) we need to make this bullet proof from
>>> the start. It is also practically impossible to make something a bean
>>> and make it a Singleton, some careful thought is required.  If we keep
>>> symbols behind the scenes they may not need to be so bean like.
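
One common way to keep flyweights unique across serialization is a readResolve() method that swaps the freshly deserialized copy for the canonical instance. The sketch below assumes a hypothetical FlyweightSymbol class with a fixed pool of four symbols; RoundTrip is only a helper for the demonstration. Java enums provide the same guarantee without any extra code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical flyweight: exactly one instance per symbol name.
final class FlyweightSymbol implements Serializable {
    private static final long serialVersionUID = 1L;

    // Declared before the constants so it exists when they register.
    private static final Map<String, FlyweightSymbol> POOL =
            new HashMap<String, FlyweightSymbol>();

    static final FlyweightSymbol A = new FlyweightSymbol("A");
    static final FlyweightSymbol C = new FlyweightSymbol("C");
    static final FlyweightSymbol G = new FlyweightSymbol("G");
    static final FlyweightSymbol T = new FlyweightSymbol("T");

    private final String name;

    private FlyweightSymbol(String name) {
        this.name = name;
        POOL.put(name, this);
    }

    static FlyweightSymbol forName(String name) {
        FlyweightSymbol symbol = POOL.get(name);
        if (symbol == null) {
            throw new IllegalArgumentException("Unknown symbol: " + name);
        }
        return symbol;
    }

    // Called by Java serialization after reading: return the pooled instance.
    private Object readResolve() {
        return forName(name);
    }
}

class RoundTrip {
    // Serializes and deserializes an object in memory.
    static Object copy(Object original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bytes);
            out.writeObject(original);
            out.close();
            ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()));
            return in.readObject();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With readResolve() in place, a symbol that goes through RoundTrip.copy() comes back as the very same object, so identity comparisons with == keep working.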
>>>
>>> - Mark
>>>
>>> On 9/21/07, george waldon <gwaldon at geneinfinity.org> wrote:
>>>> Hello,
>>>>
>>>> All this is very exciting. I would certainly contribute to something like that. Here are a few remarks that came to mind while reading all these emails.
>>>>
>>>> I noticed that the tutorial has seriously improved - thanks for the work. I remember my initial steps towards understanding Symbol and cross-alphabets (...). Still, from time to time, I have difficulties with basic things that are not intuitive to me, such as "token", e.g. Alphabet.getTokenization("token") or SymbolTokenization.tokenizeSymbolList(SymbolList).
>>>>
>>>> I am surprised by all the requests to use String instead of SymbolList. The CookBook shows precisely, with code examples, how to perform most basic operations. Maybe someone could illustrate the new kind of code versus the old one? I bet many newbies (and older users) actually get their answers from the Cookbook.
>>>>
>>>> Richard wrote:
>>>>> It is suggested that development stops on the existing Biojava (...)
>>>> Well, I don't think the license can let you do that :-)
>>>> Writing new code might be easier, but making old code better will certainly improve the level of code abstraction. Therefore I favor improving the existing Biojava code over a hazardous rewrite. I can see some of the initial steps on the roadmap:
>>>> - Switch to a Subversion repository
>>>> - Change the build process so that it is compatible with the creation of modules
>>>> - Improve the testing framework (mentioned several times)
>>>> - Create white papers for coding practices, build releases, (others?)
>>>>
>>>> Then maybe the proper work of restructuring Biojava may start. We can either divide the existing mammoth into multiple modules at first or - my preference - build modules one by one by selectively picking classes. This way it will be easy to find classes that can be deprecated (for lack of users), and we can even have a deprecated module at the end. Some coupling may need to be loosened. We will also need a list of API changes for developers who will use the newer version. I am sure that the kind of data structures proposed by Richard could find their place, as well as some of the proposed patterns (beans, others?).
>>>>
>>>> Anyway, all these are simple ideas. I am not an expert in the build process, but I can help with improving javadocs, writing examples and test cases. I also have a fair knowledge of the molecular biology package.
>>>>
>>>> Hope it helps,
>>>> George
>>>>
>>>> _______________________________________________
>>>> biojava-dev mailing list
>>>> biojava-dev at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>>>

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFG84U64C5LeMEKA/QRAtyfAJ9PAsFu3+zjUhP3Xcs5imojL/cb/wCfRX8V
eOMOo3pCl71dPhZMyYlBBE4=
=NByU
-----END PGP SIGNATURE-----


