[Biojava-dev] Initial impressions...
Matthew Pocock
matthew_pocock at yahoo.co.uk
Thu Jul 3 14:03:18 EDT 2003
Hi,
Len Trigg wrote:
> Hi all,
>
> We've just been evaluating BioJava for some bioinformatics work and
> have been going through a few simple examples. I thought I'd share our
> first impressions, in case they're a useful datapoint.
Great. Always nice to hear war-stories.
>
> It's been hard going initially, as the BioJava apis are quite huge,
> and it sometimes feels like a case of "chase the javadocs" from class
> to class in order to find out how to get things done. In some cases the
> javadocs are pretty sparse and we've had to look through source code
> for examples (DNATools is quite instructive). One case where we were
Yes. This is a general criticism of BioJava. I think we need to flag, in
flashing lights, that reading the interface docs and the *Tools or *Utils
classes should be enough to do a lot of things.
> initially confused, is that we thought that there should be an easy
> way to get from a Symbol to its one-character name (something like
> aSymbol.getAsChar()). We've now found out that you have to go via
> aSymbol.getAlphabet().getTokenization("token").tokenizeSymbol(aSymbol);
We need to make this process much easier. Unfortunately, getAsChar()
doesn't really work for us because we can have symbols for things that
don't have a single char representation, such as codons. However, you
shouldn't have to end up going through 20 function calls either.
Is there a BioJava in Anger example of getting letters from symbols?
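
As a rough illustration of why the token has to be a String rather than a
char, here is a minimal sketch in plain Java. It does not use any BioJava
classes: TokenTable, tokenize() and tokenizeAsChar() are hypothetical names
made up for this example, standing in for what a tokenization does.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a tokenization: maps symbol names to text tokens.
// Not part of the BioJava API.
final class TokenTable {
    private final Map<String, String> tokens = new LinkedHashMap<>();

    // Registers a token for a symbol name and returns this table for chaining.
    TokenTable add(String symbolName, String token) {
        tokens.put(symbolName, token);
        return this;
    }

    // Returns the token as a String, because some symbols (for example
    // codons) have no single-character form.
    String tokenize(String symbolName) {
        String token = tokens.get(symbolName);
        if (token == null) {
            throw new IllegalArgumentException("No token for symbol: " + symbolName);
        }
        return token;
    }

    // Convenience in the spirit of getAsChar(): only succeeds when the
    // token happens to be exactly one character long.
    char tokenizeAsChar(String symbolName) {
        String token = tokenize(symbolName);
        if (token.length() != 1) {
            throw new IllegalStateException("Token is not a single char: " + token);
        }
        return token.charAt(0);
    }
}
```

A single-character convenience method can sit on top of the general lookup,
but it has to fail for symbols such as codons, which is why it cannot be the
only way in.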
>
> I have been impressed with how easy it is to parse FASTA files, and
> have used both the method to load all the sequences into a SequenceDB,
> and the low-memory method that returns a SequenceIterator (great for
> large sequence files).
Thanks. This sort of thing works reasonably efficiently for the richer
formats as well, such as EMBL.
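
For anyone wondering why the iterator style stays low on memory, the idea can
be sketched in plain Java as below. This is not the BioJava SequenceIterator
or its FASTA parser: FastaStream is a hypothetical class written for this
example, and it returns each record as a {name, residues} pair of strings.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical streaming FASTA reader; not the BioJava SequenceIterator.
// Only one record is held in memory at a time.
final class FastaStream implements Iterator<String[]> {
    private final BufferedReader in;
    private String pendingHeader; // header of the next record, without '>'

    FastaStream(Reader reader) {
        this.in = new BufferedReader(reader);
        this.pendingHeader = readRecordBody(null);
    }

    // Reads lines up to the next header. Residue lines are appended to
    // 'residues' when it is non-null. Returns the next header, or null at EOF.
    private String readRecordBody(StringBuilder residues) {
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(">")) {
                    return line.substring(1).trim();
                }
                if (residues != null) {
                    residues.append(line.trim());
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return pendingHeader != null;
    }

    // Returns {name, residues} for the next record.
    @Override
    public String[] next() {
        if (pendingHeader == null) {
            throw new NoSuchElementException();
        }
        String name = pendingHeader;
        StringBuilder residues = new StringBuilder();
        pendingHeader = readRecordBody(residues);
        return new String[] { name, residues.toString() };
    }
}
```

Loading everything into a database-like collection is then just a loop over
this iterator, whereas a large file can be processed one record at a time
without keeping earlier records around.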
>
> Another thing we tried out was to show the suffix trees for a
> sequence. One confusing thing here is that there seem to be a couple
> of different independent implementations of suffix trees in
> BioJava. The SuffixTree documentation doesn't explain how you are
> supposed to navigate the tree (in particular that child nodes are
> indexed by symbol, rather than as a list of children, so you have to
> get an AlphabetIndex to find out where you are).
I'll take a look at the docs. To be honest, this is very old code and
hasn't recently been bashed very hard by the core team.
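
In the meantime, the navigation idea Len describes (children held in an array
slot per symbol, with an index telling you which slot belongs to which
symbol) can be sketched in plain Java. This is not the BioJava SuffixTree or
AlphabetIndex code: IndexedSuffixTrie, its fixed "acgt" alphabet and all
method names are assumptions made up for this example.

```java
// Hypothetical suffix trie over DNA; illustrates index-based children only.
// Not the BioJava SuffixTree or AlphabetIndex API.
final class IndexedSuffixTrie {
    // The position of a symbol in this string is its child index.
    private static final String ALPHABET = "acgt";

    static final class Node {
        final Node[] children = new Node[ALPHABET.length()];
        int count; // number of (truncated) suffixes passing through this node
    }

    private final Node root = new Node();

    // Plays the role of the alphabet index: symbol -> child slot.
    static int indexOf(char symbol) {
        int i = ALPHABET.indexOf(symbol);
        if (i < 0) {
            throw new IllegalArgumentException("Not in alphabet: " + symbol);
        }
        return i;
    }

    // Adds every suffix of seq, truncated to at most maxDepth symbols.
    void addSequence(String seq, int maxDepth) {
        for (int start = 0; start < seq.length(); start++) {
            Node node = root;
            int end = Math.min(seq.length(), start + maxDepth);
            for (int pos = start; pos < end; pos++) {
                int i = indexOf(seq.charAt(pos));
                if (node.children[i] == null) {
                    node.children[i] = new Node();
                }
                node = node.children[i];
                node.count++;
            }
        }
    }

    // Counts occurrences of a non-empty word (no longer than the maxDepth
    // used when adding) by following one child index per symbol.
    int count(String word) {
        Node node = root;
        for (int pos = 0; pos < word.length(); pos++) {
            node = node.children[indexOf(word.charAt(pos))];
            if (node == null) {
                return 0;
            }
        }
        return node.count;
    }
}
```

The point of the sketch is that there is no list of children to iterate:
to step down the tree you first turn the symbol into an index, then look in
that slot.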
>
> The UkkonenSuffixTree has a different API to that of the regular
> SuffixTree, and the printTree() method outputs characters that don't
> correspond to the regular symbol representations. Maybe the author of
> this class wasn't aware of how to get the representations of Symbols
> either :-). I have a patch to contribute that addresses this
> (attached).
Francois, would you mind looking at this patch?
>
> Parsing a BLAST output file was also easy, however, I had to use
> "lazy" mode to work with our files (from NCBI BLAST 2.2.1), and I have
> not yet figured out how to extract a) the length of the query
> sequence, and b) the frame of the hits. Any suggestions here?
Is that information in the annotation attached to the
SeqSimilaritySearchSubHit or the SeqSimilaritySearchResult?
>
> That's about it at the moment. Soon I intend to look into GFF file
> handling and BioJava/BioSQL integration. Overall I think there is a
> tonne of useful functionality in BioJava -- I look forward to working
> with the BioJava project and hope to be able to make some useful
> contributions.
Good luck with BioSQL and GFF. These are parts of the library that I use
daily. Oh, and for the GFF, start off by using GFFTools.
>
>
> Cheers,
> Len Trigg.
--
BioJava Consulting LTD - Support and training for BioJava
http://www.biojava.co.uk