[Biojava-l] New idea for alignment parsing and Re: Parser for MrBayes output

Ben Stöver benstoever at uni-muenster.de
Wed Nov 5 16:22:34 UTC 2014


Hi Pola and Jose (and BioJava community),

this mail is primarily a reply on the MrBayes topic, but it also contains a new
idea for parsing multiple sequence alignments in BioJava, which is closely
linked in this case. I'm sorry that it is quite long, but I wanted to be clear
about what I mean.


MrBayes Topic:

I'm following this topic with interest, because I'm currently working
programmatically both with BioJava and MrBayes, although I don't know anything
about the concrete plans behind the BioJava-issue on MrBayes.

My first question would be: Do we want to parse just the consensus tree output
or also the Markov chain protocol files (*.p and *.t), which log the states of
the MCMC runs (where *.t contains the trees and *.p the other parameter
states)? Parsing such files might be interesting for analysing a run (e.g. to
address questions like "Was the stationary state reached early enough?" or
"Did the chain run long enough?"). Features like this would be similar to
those implemented in Tracer (http://tree.bio.ed.ac.uk/software/tracer/ ),
which comes with BEAST. It could also be considered that BEAST2 and other
programs output similar data, possibly in a similar format.
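
As a rough illustration of what reading a *.p file could involve, here is a
minimal Java sketch. It assumes that the file consists of optional bracketed
comment lines (such as an ID line), one tab-separated header line and then one
numeric row per sampled generation; that layout is an assumption and should be
checked against real MrBayes output. The class and method names are
hypothetical and not part of BioJava.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical reader for tab-separated MCMC parameter logs (not a BioJava class). */
class ParameterLogSketch {
    final List<String> columns = new ArrayList<>();
    final List<double[]> rows = new ArrayList<>();

    /** Assumed layout: optional "[...]" lines, one header line, then numeric rows. */
    static ParameterLogSketch read(Reader in) {
        ParameterLogSketch log = new ParameterLogSketch();
        try (BufferedReader reader = new BufferedReader(in)) {
            String raw;
            while ((raw = reader.readLine()) != null) {
                String line = raw.trim();
                if (line.isEmpty() || line.startsWith("[")) {
                    continue; // skip blank lines and bracketed comment/ID lines
                }
                String[] fields = line.split("\t");
                if (log.columns.isEmpty()) {
                    log.columns.addAll(Arrays.asList(fields)); // first content line is the header
                } else {
                    double[] values = new double[fields.length];
                    for (int i = 0; i < fields.length; i++) {
                        values[i] = Double.parseDouble(fields[i]);
                    }
                    log.rows.add(values);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return log;
    }

    /** Mean of one column after discarding the first burnIn samples. */
    double meanAfterBurnIn(String column, int burnIn) {
        int index = columns.indexOf(column);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown column: " + column);
        }
        if (burnIn < 0 || burnIn >= rows.size()) {
            throw new IllegalArgumentException("Burn-in must leave at least one sample");
        }
        double sum = 0;
        for (int i = burnIn; i < rows.size(); i++) {
            sum += rows.get(i)[index];
        }
        return sum / (rows.size() - burnIn);
    }
}
```

Comparing such per-column summaries for different burn-in values would be one
simple way to approach the stationarity question mentioned above; a real
implementation would probably offer more than a mean.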

The more interesting/urgent thing, though, might be parsing the consensus tree,
which is in Nexus format (or writing the input files for MrBayes). Although
the Nexus format is not really state of the art anymore and replacements like
NeXML (http://nexml.org/ ), which overcome its limitations, should be
preferred if you implement new software, the Nexus format is still widely
used, and supporting it in BioJava 3 (or 4) would surely be a good idea. There
was an extensible Nexus parser in BioJava 1.x
(http://www.biojava.org/docs/api1.9.1/org/biojavax/bio/phylo/io/nexus/package-summary.html
) which could be ported to BioJava 3 (4). (This has not been done so far,
has it?)

The thing about Nexus is that it can contain tree and sequence and meta data,
so a complete parser would need to have all these different functions and the
previous approach of having a set of plug-in-classes for each Nexus-block made
sense to me.
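
The plug-in idea could be sketched roughly as follows. This is only an
illustration under simplifying assumptions (every command ends with ';', and
Nexus comments in brackets or quoted semicolons are not handled); the
interface and class names are hypothetical and are not the ones used in
BioJava 1.x.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical plug-in interface: one handler per Nexus block type. */
interface NexusBlockHandlerSketch {
    void handle(String blockName, List<String> commands);
}

/** Splits Nexus text into blocks and passes each block to the handler registered for it. */
class NexusDispatcherSketch {
    private final Map<String, NexusBlockHandlerSketch> handlers = new HashMap<>();

    void register(String blockName, NexusBlockHandlerSketch handler) {
        handlers.put(blockName.toUpperCase(Locale.ROOT), handler);
    }

    /** Simplification: no support for [comments] or quoted semicolons. */
    void parse(String nexus) {
        String body = nexus.trim();
        if (body.toUpperCase(Locale.ROOT).startsWith("#NEXUS")) {
            body = body.substring("#NEXUS".length()); // the file marker has no ';'
        }
        String currentBlock = null;
        List<String> commands = new ArrayList<>();
        for (String raw : body.split(";")) {
            String command = raw.trim();
            String upper = command.toUpperCase(Locale.ROOT);
            if (currentBlock == null) {
                String[] words = upper.split("\\s+");
                if (words.length == 2 && words[0].equals("BEGIN")) {
                    currentBlock = words[1];
                    commands = new ArrayList<>();
                }
            } else if (upper.equals("END") || upper.equals("ENDBLOCK")) {
                NexusBlockHandlerSketch handler = handlers.get(currentBlock);
                if (handler != null) {
                    handler.handle(currentBlock, commands);
                } // blocks without a registered handler are skipped
                currentBlock = null;
            } else if (!command.isEmpty()) {
                commands.add(command);
            }
        }
    }
}
```

With a design like this, a tree handler, a sequence handler and a meta data
handler could be developed independently, and unknown blocks would simply be
ignored.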

If you are thinking about writing a whole new parser, you could also have a
look at the code I already wrote for the phylogenetic tree editor TreeGraph as
a starting point:
http://bioinfweb.info/Code/sventon/repos/TreeGraph2/list/trunk/main/src/info/bioinfweb/treegraph/document/io/nexus/?revision=HEAD
(Of course you would have to use a different tree model in BioJava, maybe
forester, if that is the current standard.)


Sequence parsers:

This already leads to another topic (I was planning to post to this list at
some point anyway): When talking about sequence parsers, I have another idea:
to implement a general parser framework for multiple sequence alignments in
BioJava, to which different parsers (implementing the corresponding
interfaces) can be added over time to support many formats, following an
abstract strategy pattern.

In contrast to the current parsers in BioJava (e.g.
http://www.biojava.org/docs/api/org/biojava3/core/sequence/io/FastaReader.html
), the ones I'm thinking of should not decide themselves which
implementation of the Sequence interface to use, but leave this decision to
the user of the class. To achieve this, I would propose an interface extending
the Sequence interface, called e.g. EditableSequence, which additionally
offers methods like setTokenAt(), insertTokenAt(), removeTokenAt(), ... .
Implementations of the parser classes would then just use these methods to
load the sequence into RAM instead of the current way. (And of course there
would have to be a general way of creating instances of EditableSequence
implementations, which would be no problem with corresponding factory method
definitions.)
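
A minimal sketch of how this could look is given here. It is not BioJava code:
SequenceSketch is a reduced stand-in for BioJava's Sequence interface so that
the example is self-contained, tokens are plain chars, indices are 0-based,
and all names apart from the three proposed method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Stand-in for BioJava's Sequence interface, reduced to what this sketch needs. */
interface SequenceSketch {
    int getLength();
    char getTokenAt(int index); // 0-based here for simplicity
}

/** Hypothetical editable extension using the method names proposed above. */
interface EditableSequenceSketch extends SequenceSketch {
    void setTokenAt(int index, char token);
    void insertTokenAt(int index, char token);
    void removeTokenAt(int index);
}

/** One possible storage: a plain in-memory list. Other storages could be swapped in. */
class ListSequenceSketch implements EditableSequenceSketch {
    private final List<Character> tokens = new ArrayList<>();
    public int getLength() { return tokens.size(); }
    public char getTokenAt(int index) { return tokens.get(index); }
    public void setTokenAt(int index, char token) { tokens.set(index, token); }
    public void insertTokenAt(int index, char token) { tokens.add(index, token); }
    public void removeTokenAt(int index) { tokens.remove(index); }
}

/** FASTA-like parser that leaves the choice of storage to the caller via a factory. */
class FastaParserSketch {
    static Map<String, EditableSequenceSketch> parse(Iterable<String> lines,
            Supplier<EditableSequenceSketch> factory) {
        Map<String, EditableSequenceSketch> result = new LinkedHashMap<>();
        EditableSequenceSketch current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.startsWith(">")) {
                current = factory.get(); // the caller decides on the implementation
                result.put(line.substring(1).trim(), current);
            } else if (current != null) {
                for (int i = 0; i < line.length(); i++) {
                    // tokens are appended directly; no String per sequence is built
                    current.insertTokenAt(current.getLength(), line.charAt(i));
                }
            }
        }
        return result;
    }
}
```

A caller would pass e.g. ListSequenceSketch::new as the factory; switching to a
compressed or cached implementation would then not require any change to the
parser.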

The benefit of this is that the storage method would be independent of the
parser class, which would allow using e.g. compressed sequence storage, as
http://www.biojava.org/docs/api/index.html?org/biojava3/core/sequence/storage/TwoBitSequenceReader.html
currently does, or a cached sequence for large data sets, ... . (If I haven't
missed something, the problem with the current implementation is that you
cannot benefit from the compression of such classes unless you implement
your own parser that does not first load all sequences into a string to pass
it to the constructor of a Sequence implementation.)
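
To illustrate the point about compressed storage, here is a small
self-contained sketch of a store that packs nucleotides into two bits each and
is filled token by token, so that no intermediate string is needed. It is a
hypothetical example, not the way TwoBitSequenceReader is implemented, and it
only supports A, C, G and T (no gaps or ambiguity codes).

```java
import java.util.Arrays;

/** Hypothetical 2-bit storage for A, C, G and T that is filled one token at a time. */
class TwoBitStoreSketch {
    private static final String ALPHABET = "ACGT";
    private long[] words = new long[1]; // 32 nucleotides fit into one 64-bit word
    private int length = 0;

    /** Appends one nucleotide without ever holding the sequence as a String. */
    void append(char nucleotide) {
        int code = ALPHABET.indexOf(Character.toUpperCase(nucleotide));
        if (code < 0) {
            throw new IllegalArgumentException("Unsupported token: " + nucleotide);
        }
        if (length / 32 == words.length) {
            words = Arrays.copyOf(words, words.length * 2); // grow when full
        }
        words[length / 32] |= ((long) code) << (2 * (length % 32));
        length++;
    }

    char get(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("Index: " + index);
        }
        int code = (int) ((words[index / 32] >>> (2 * (index % 32))) & 3L);
        return ALPHABET.charAt(code);
    }

    int getLength() { return length; }

    /** Bytes currently allocated for the packed tokens. */
    int allocatedBytes() { return words.length * 8; }
}
```

A parser that appends characters as it reads them could fill such a store
directly, which is exactly what is not possible if the sequence first has to
exist as a complete string.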

Of course new implementations would be needed for EditableSequence, but since
this interface extends Sequence, such new classes would be fully interoperable
with current code relying on the Sequence interface, while additionally
offering editability.

I already implemented a similar framework in one of my current projects,
LibrAlign (http://bioinfweb.info/LibrAlign/ ), which is a Java GUI library for
multiple sequence alignments and attached raw and meta data and is
compatible with BioJava. See
http://bioinfweb.info/Code/sventon/repos/LibrAlign/show/trunk/main/src/info/bioinfweb/libralign/sequenceprovider/SequenceDataProvider.java?revision=HEAD
and
http://bioinfweb.info/Code/sventon/repos/LibrAlign/show/trunk/main/src/info/bioinfweb/libralign/sequenceprovider/implementations/PackedSequenceDataProvider.java?revision=HEAD
. Implementations of similar parsers can be found here:
http://bioinfweb.info/Code/sventon/repos/Commons.Java/list/trunk/main/experimental/info/bioinfweb/commons/bio/biojava3/alignment/io/?revision=HEAD&bypassEmpty=true

Therefore I would offer to implement such functionality for BioJava, but
before making a pull request or anything, I wanted to ask for the opinion of
the community on that idea, and also whether I might have missed concepts in
BioJava that would already allow doing something similar.

I would be happy to get some feedback on that idea.

@Pola: If you have further questions on MrBayes, let me know. I could also
send you some illustrations on how the MCMC works from one of my lectures, if
needed.
http://www2.ieb.uni-muenster.de/EvolBiodivPlants/en/Teaching/WS2013_2014/MolecularPhylogenetics

Best
Ben

Dipl. Biologe Ben Stöver
Evolution and Biodiversity of Plants Group
Institute for Evolution and Biodiversity
University of Münster
Germany
Phone: +49 251 83 21647
Fax: +49 251 83 24668
http://www2.ieb.uni-muenster.de/EvolBiodivPlants/en/People/Stoever
BenStoever at uni-muenster.de



Jose Manuel Duarte wrote on 2014-11-05:
> Hi Pola

> Welcome and great that you want to plunge in! I don't know much about
> MrBayes myself, but the idea was to include a parser in the
> biojava3-phylo module. The module uses forester
> (https://code.google.com/p/forester/wiki/forester) as the underlying
> library to deal with phylogeny data. So the idea would be to parse
> the
> output into a forester data structure (most likely into
> org.forester.phylogeny.Phylogeny).

> Anyway hopefully someone with a bit more knowledge about this might
> be
> able to add something.

> Cheers

> Jose


> On 31/10/14 18:17, Pola Kyzioł wrote:
> >Hello,

> >my name is Pola and I'm currently a third year student in the
> >Theoretical Computer Science
> >at Jagiellonian University. I have also interest in biology,
> >especially in the field of genetics.
> >I've been searching a project connected with bioinformatics which I
> >could develop
> >and next use to writing my bachelor's thesis. I've found BioJava and
> >looked at its issues -
> >parser for MrBayes output seems for me to be interesting to code.
> >I would like to know some details about it:
> >- what data you want extracted from MrBayes' output files;
> >- how the created model should look like and if appropriate modules
> >already exist.

> >Thanks for your help,
> >Pola


> >_______________________________________________
> >Biojava-l mailing list  -  Biojava-l at mailman.open-bio.org
> >http://mailman.open-bio.org/mailman/listinfo/biojava-l



