[Biojava-l] New idea for alignment parsing and Re: Parser for MrBayes output

Ben Stöver benstoever at uni-muenster.de
Thu Nov 6 12:49:34 UTC 2014



Spencer Bliven wrote on 2014-11-06:
> Ben,

> This sounds like a great idea and a really useful addition to biojava! I
> would lean towards only parsing the consensus tree, as the other formats
> are pretty specific use cases. We're sure forester doesn't provide Nexus
> parsing, right? The documentation isn't particularly complete, but it's
> already a phylo dependency so we should avoid duplicating any features.

No, I'm personally not 100% sure whether any Nexus features are implemented
in forester, but I assumed they are not, because otherwise there would have
been no need for a Nexus parsing system in BioJava 1.x.


> As to your second suggestion, it sounds very similar to how FastaReader
> currently works, with the user providing a SequenceCreator which
> instantiates whatever Sequence implementation you want to use. Mutable
> sequences can lead to a host of additional problems, which is why the
> sequences are currently generated atomically. Or am I misunderstanding
> your suggestion?

I just looked at the code
(https://github.com/biojava/biojava/blob/master/biojava3-core/src/main/java/org/biojava3/core/sequence/io/FastaReader.java)
and SequenceCreator does not do exactly what I meant: in the process() method
of FastaReader, the whole sequence is first loaded into a StringBuilder and
only afterwards passed to sequenceCreator, which means there is no compression
during loading. So SequenceCreator does part of what I was thinking of, but it
would not work for very large sequences. (Although I can't find it now, I
think I read a similar statement somewhere in the JavaDocs of the compressed
Sequence implementation.)
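
To make the difference concrete, here is a minimal sketch of what I mean by
compressing during loading. It is plain Java and does not use any BioJava
classes; SequenceSink, TwoBitSink and StreamingFasta are hypothetical names I
made up for illustration. The sink receives each line as it is read and packs
A/C/G/T into two bits per base, so the full sequence never exists as one
uncompressed string. For brevity it keeps only the most recent record and
rejects any symbol other than A, C, G and T.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Arrays;

/** Hypothetical callback that receives sequence data piece by piece. */
interface SequenceSink {
    void startSequence(String header);
    void append(CharSequence residues);
}

/** Hypothetical sink that packs A/C/G/T into 2 bits each while loading. */
class TwoBitSink implements SequenceSink {
    private static final String ALPHABET = "ACGT";
    private long[] words = new long[1]; // 32 bases per long
    private int length = 0;
    private String header;

    @Override
    public void startSequence(String header) {
        // Start a fresh record; earlier data is discarded in this sketch.
        this.header = header;
        this.words = new long[1];
        this.length = 0;
    }

    @Override
    public void append(CharSequence residues) {
        for (int i = 0; i < residues.length(); i++) {
            char c = Character.toUpperCase(residues.charAt(i));
            int code = ALPHABET.indexOf(c);
            if (code < 0) {
                throw new IllegalArgumentException("Unsupported symbol: " + c);
            }
            if (length / 32 == words.length) {
                words = Arrays.copyOf(words, words.length * 2);
            }
            words[length / 32] |= ((long) code) << ((length % 32) * 2);
            length++;
        }
    }

    public char charAt(int index) {
        long bits = (words[index / 32] >>> ((index % 32) * 2)) & 3L;
        return ALPHABET.charAt((int) bits);
    }

    public int length() { return length; }

    public String header() { return header; }
}

/** Reads FASTA line by line and forwards each chunk to the sink. */
class StreamingFasta {
    static void read(Reader in, SequenceSink sink) {
        try {
            BufferedReader reader = new BufferedReader(in);
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue;
                if (line.startsWith(">")) {
                    sink.startSequence(line.substring(1));
                } else {
                    sink.append(line); // compressed immediately, line by line
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The point is only the direction of data flow: the reader pushes chunks into
whatever storage the caller supplies, instead of building the whole string
first and handing it over at the end.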

The main benefits I still see in the idea are, first, the abstract strategy
pattern for alignment parsers, which would allow writing code independent of
the format used (which is not possible, e.g., with the current FASTA reader),
and second, editable sequences, which would be usable in cases you cannot
really solve with the current sequence model (e.g. as the data backend for an
alignment editor or for the GUI components I have in LibrAlign).
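
As a rough illustration of the strategy pattern I have in mind, consider the
following sketch. None of these types exist in BioJava: AlignmentParser,
FastaAlignmentParser, TabAlignmentParser and AlignmentStats are hypothetical
names, and the tab-separated "name<TAB>sequence" format is a toy format I
invented just to have a second implementation. A real interface would
presumably return BioJava sequence objects rather than strings.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical common interface; not part of the current BioJava API. */
interface AlignmentParser {
    /** Returns the aligned rows keyed by sequence name, in file order. */
    Map<String, String> parse(Reader in);
}

/** Strategy 1: FASTA-formatted alignment. */
class FastaAlignmentParser implements AlignmentParser {
    @Override
    public Map<String, String> parse(Reader in) {
        Map<String, String> rows = new LinkedHashMap<>();
        String name = null;
        try {
            BufferedReader reader = new BufferedReader(in);
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue;
                if (line.startsWith(">")) {
                    name = line.substring(1).trim();
                    rows.put(name, "");
                } else if (name == null) {
                    throw new IllegalArgumentException("Data before first header");
                } else {
                    rows.merge(name, line, String::concat);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }
}

/** Strategy 2: toy format with one "name<TAB>sequence" pair per line. */
class TabAlignmentParser implements AlignmentParser {
    @Override
    public Map<String, String> parse(Reader in) {
        Map<String, String> rows = new LinkedHashMap<>();
        try {
            BufferedReader reader = new BufferedReader(in);
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) continue;
                String[] parts = line.split("\t", 2);
                if (parts.length != 2) {
                    throw new IllegalArgumentException("Expected name<TAB>sequence");
                }
                rows.put(parts[0].trim(), parts[1].trim());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }
}

/** Client code that depends only on the interface, not on the format. */
class AlignmentStats {
    static int columnCount(AlignmentParser parser, Reader in) {
        Map<String, String> rows = parser.parse(in);
        return rows.isEmpty() ? 0 : rows.values().iterator().next().length();
    }
}
```

AlignmentStats.columnCount() works unchanged with either parser, which is the
kind of format-independent code I cannot write against the current FASTA
reader alone.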

I'm not sure which problems you think would arise from having mutable
sequences (remember: the idea was not to replace the current implementations
of the Sequence interface, but to add additional mutable versions). Maybe you
could give some examples? (Are you thinking about the need for change
listeners or similar things?)
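
If change listeners are the concern, something along these lines is what I
would imagine. This is only a sketch under my own assumptions:
SequenceChangeListener and MutableCharSequence are hypothetical names, the
class is backed by a plain StringBuilder, and it implements
java.lang.CharSequence rather than BioJava's Sequence interface to keep the
example self-contained.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical listener that is told which region of a sequence changed. */
interface SequenceChangeListener {
    void sequenceChanged(int offset, int removedLength, int insertedLength);
}

/** Hypothetical editable sequence that notifies listeners on every edit. */
class MutableCharSequence implements CharSequence {
    private final StringBuilder residues;
    private final List<SequenceChangeListener> listeners = new ArrayList<>();

    MutableCharSequence(String initial) {
        this.residues = new StringBuilder(initial);
    }

    void addChangeListener(SequenceChangeListener listener) {
        listeners.add(listener);
    }

    void insert(int offset, String tokens) {
        residues.insert(offset, tokens);
        fire(offset, 0, tokens.length());
    }

    void delete(int start, int end) {
        int before = residues.length();
        residues.delete(start, end);
        // Report what was actually removed (StringBuilder clamps 'end').
        fire(start, before - residues.length(), 0);
    }

    private void fire(int offset, int removed, int inserted) {
        for (SequenceChangeListener listener : listeners) {
            listener.sequenceChanged(offset, removed, inserted);
        }
    }

    @Override
    public int length() { return residues.length(); }

    @Override
    public char charAt(int index) { return residues.charAt(index); }

    @Override
    public CharSequence subSequence(int start, int end) {
        return residues.substring(start, end);
    }

    @Override
    public String toString() { return residues.toString(); }
}
```

A GUI component or a cache of derived values could register a listener and
refresh only the affected region, while the existing immutable
implementations would stay untouched.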

Anyway, it was only an idea for discussion; I'm really not saying that we
definitely need to go in that direction. (For my own projects I already have a
mutable sequence model with bridges to the current BioJava model, so I would
be fine there.) Maybe this idea really does come with problems I currently do
not see? In that case we could of course also think about just adding an
interface for sequence parsers that allows them to be used in an abstract
strategy pattern. (That would then really be a slight API change if the
existing readers and writers were to implement such an interface, but it might
be possible, since a version 4 is coming anyway?)

Best
Ben


> It would be fantastic to have some additional development of multiple
> alignments and the phylo package! Thanks for the offer to contribute!

> -Spencer

> On Thu, Nov 6, 2014 at 12:19 PM, Jose Manuel Duarte <jose.duarte at psi.ch>
> wrote:

> > Hi Ben

> > Thanks a lot for all the insights. I am really not the most appropriate
> > person to comment on all the biojava phylogeny and sequence related
> > things but anyway below are some of my opinions.


> > On 05/11/14 17:22, Ben Stöver wrote:



> >> The more interesting/urgent thing though might be parsing the consensus
> >> tree, which is in Nexus format (or writing the input files for MrBayes).
> >> Although the Nexus format is not really state of the art anymore and
> >> replacements like e.g. NeXML (http://nexml.org/), which overcome its
> >> limitations, should be preferred if you implement new software, the
> >> Nexus format is still widely used and supporting it in BioJava 3 (or 4)
> >> would surely be a good idea. There was an extensible Nexus parser in
> >> BioJava 1.x
> >> (http://www.biojava.org/docs/api1.9.1/org/biojavax/bio/phylo/io/nexus/package-summary.html)
> >> which could be ported to BioJava 3 (4). (This has never been done until
> >> now, has it?)


> > If I understand it properly they were not ported yet to 3 because of
> > lack of time, so I think the porting of the nexus stuff would be a great
> > thing. +1 to that.



> >> Therefore I would offer to implement such functionality for BioJava, but
> >> before making a pull request or anything, I wanted to ask for the
> >> opinion of the community on that idea and also whether I might have
> >> missed concepts in BioJava that would already allow something similar.


> > To me the whole idea sounds great. Especially if it can be made
> > compatible with the existing Biojava interfaces. If I understand what
> > you propose, you would only introduce a new way of parsing things which
> > could even live alongside the current parsers. It could even go to its
> > own package (sequence.nio ?). For me this is a +1 too.

> > Cheers

> > Jose




