[Biojava-l] Creating an alignment object

Richard Holland richard.holland at ebi.ac.uk
Mon May 15 09:49:47 UTC 2006


One way to write a file parser (which I used in all the BioJavaX
parsers) is to write an event-based one, which requires two parts: a
parser, and an event listener. 

Basically, the parser reads a chunk from the file, recognises what kind
of chunk it is and does some pre-parsing on it, for example stripping
whitespace etc. or concatenating lines of sequence data. It then sends a
signal to an event listener saying it has received a chunk of data of a
certain kind, and asks the event listener to process that data. The
event listener could receive this data in any order (and hence one
listener can be adapted to listen for events from many file formats), so
it needs to keep track of its own state at any given point during the
parsing process.
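
To make the parser/listener split concrete, here is a minimal sketch in
Java. It assumes a made-up, simplified FASTA-like format (a ">" header
line followed by lines of sequence data). The names EventParserSketch,
ChunkListener, CollectingListener and parseToMap are hypothetical and
are not BioJava or BioJavaX classes. In this sketch the listener does
the concatenation of sequence lines; the parser could equally do it
before firing the event.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

class EventParserSketch {

    /** Hypothetical listener interface (an assumption, not a BioJava API). */
    interface ChunkListener {
        void startRecord(String name);
        void sequenceData(String residues);
        void endRecord();
    }

    /** Strict parser: recognises each chunk, pre-processes it, fires an event. */
    static void parse(BufferedReader in, ChunkListener listener) throws IOException {
        boolean inRecord = false;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith(">")) {
                if (inRecord) listener.endRecord();
                listener.startRecord(line.substring(1).trim());
                inRecord = true;
            } else if (inRecord) {
                listener.sequenceData(line.trim());
            } else {
                throw new IOException("Sequence data before first header: " + line);
            }
        }
        if (inRecord) listener.endRecord();
    }

    /** Listener that tracks its own state and builds a name -> sequence map. */
    static class CollectingListener implements ChunkListener {
        final Map<String, String> sequences = new LinkedHashMap<>();
        private String currentName;        // null while outside a record
        private StringBuilder currentSeq;

        public void startRecord(String name) {
            if (currentName != null) throw new IllegalStateException("Record already open");
            currentName = name;
            currentSeq = new StringBuilder();
        }

        public void sequenceData(String residues) {
            if (currentName == null) throw new IllegalStateException("Data outside a record");
            currentSeq.append(residues);
        }

        public void endRecord() {
            if (currentName == null) throw new IllegalStateException("No open record");
            sequences.put(currentName, currentSeq.toString());
            currentName = null;
        }
    }

    /** Convenience wrapper: parse a whole string and return the collected map. */
    static Map<String, String> parseToMap(String text) {
        CollectingListener listener = new CollectingListener();
        try {
            parse(new BufferedReader(new StringReader(text)), listener);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return listener.sequences;
    }

    public static void main(String[] args) {
        System.out.println(parseToMap(">seq1\nACGT\nAC--\n>seq2\nACGTACGT\n"));
    }
}
```

Because the listener only sees events, a second parser for a different
file format could drive the same CollectingListener unchanged.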

The code tends to get quite long and convoluted, but the concept is
quite simple. 

Hopefully this gives you an idea of how to do it - you don't necessarily
need to know any particular programming language in order to design this
kind of parser/listener, just a good knowledge of the file format and
the ability to describe the various interesting sections of a file and
how to spot them. You can then convert these descriptions into Java or
any other language once you've learnt the skills to do so. Regular
expressions can be extremely useful, as are the Java String methods
toUpperCase(), toLowerCase(), contains(), equals(), equalsIgnoreCase(),
startsWith() and endsWith().
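
As a rough illustration of "spotting" sections with String methods and
regular expressions, the sketch below classifies single lines of a
ClustalW-style alignment file. The line layout it assumes (a header
starting with "CLUSTAL", sequence lines of the form "name residues
[count]", and a conservation line made of "*", ":" and "." under each
block) is a simplification, and LineClassifierSketch, Kind, classify
and splitSequenceLine are hypothetical names, not BioJava classes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineClassifierSketch {

    enum Kind { HEADER, BLANK, CONSERVATION, SEQUENCE, UNKNOWN }

    // name, whitespace, residues or gap dashes, optional trailing residue count
    private static final Pattern SEQUENCE_LINE =
            Pattern.compile("^(\\S+)\\s+([A-Za-z\\-]+)(\\s+\\d+)?\\s*$");

    // leading whitespace, then only conservation symbols and spaces
    private static final Pattern CONSERVATION_LINE =
            Pattern.compile("^\\s+[*:. ]*[*:.][*:. ]*$");

    /** Decide what kind of line this is (simplified, strict rules). */
    static Kind classify(String line) {
        if (line.trim().isEmpty()) return Kind.BLANK;
        if (line.toUpperCase().startsWith("CLUSTAL")) return Kind.HEADER;
        if (CONSERVATION_LINE.matcher(line).matches()) return Kind.CONSERVATION;
        if (SEQUENCE_LINE.matcher(line).matches()) return Kind.SEQUENCE;
        return Kind.UNKNOWN;
    }

    /** Split a sequence line into { name, residues }. */
    static String[] splitSequenceLine(String line) {
        Matcher m = SEQUENCE_LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a sequence line: " + line);
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] lines = {
            "CLUSTAL W (1.83) multiple sequence alignment",
            "",
            "seq1      ACGT-ACGT",
            "seq2      ACGTTACGT",
            "          **** ****"
        };
        for (String line : lines) {
            System.out.println(classify(line) + " <- \"" + line + "\"");
        }
    }
}
```

A parser built on this would read the file line by line, call
classify(), and fire the matching listener event for each kind.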

It gets a little more complicated once you start allowing for
non-standard files, such as those containing irregular whitespace or extra
blank lines, but if you write a strict parser first (which all the
BioJavaX parsers are), this type of flexibility can be left till later.
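
One possible way to bolt that flexibility on later, sketched here under
the assumption that a separate clean-up pass is acceptable, is to
normalise the lines before the strict parser sees them. The class and
method names (LenientPreprocessorSketch, normalise) are hypothetical.
The sketch turns tabs into spaces, strips trailing whitespace, collapses
runs of blank lines into one, and drops leading and trailing blank
lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LenientPreprocessorSketch {

    /** Return a cleaned-up copy of the lines, suitable for a strict parser. */
    static List<String> normalise(List<String> rawLines) {
        List<String> out = new ArrayList<>();
        boolean previousBlank = true;   // true at the start so leading blanks are dropped
        for (String raw : rawLines) {
            // tabs to spaces, then strip trailing whitespace
            String line = raw.replace('\t', ' ').replaceAll("\\s+$", "");
            boolean blank = line.isEmpty();
            if (!blank) {
                out.add(line);
            } else if (!previousBlank) {
                out.add("");            // keep one blank line as a separator
            }
            previousBlank = blank;
        }
        // drop a single trailing blank line, if any
        if (!out.isEmpty() && out.get(out.size() - 1).isEmpty()) {
            out.remove(out.size() - 1);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("", "seq1  ACGT   ", "", "   ", "seq2\tACGT", "");
        System.out.println(normalise(raw));
    }
}
```

Keeping the clean-up separate means the strict parser itself does not
have to change when a new kind of irregular input turns up.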

Good luck!

cheers,
Richard


On Mon, 2006-05-15 at 10:24 +0100, Nathan S. Haigh wrote:
> That's right, clustalw can output in several formats including fasta. It
> would be nice to have Biojava able to read and write the clustalw format as
> it is a widely used format. How easy is it to write something like this?
> Maybe when I start to learn more about Java I could have a go at doing this.
> 
> Nath
> 
> > -----Original Message-----
> > From: mark.schreiber at novartis.com [mailto:mark.schreiber at novartis.com]
> > Sent: 15 May 2006 10:16
> > To: Richard Holland
> > Cc: biojava-l at lists.open-bio.org; n.haigh at sheffield.ac.uk
> > Subject: Re: [Biojava-l] Creating an alignment object
> > 
> > I think ClustalW can output alignments as fasta alignment format which
> > biojava definitely can read.
> > 
> > - Mark
> > 
> > 
> > 
> > 
> > 
> > Richard Holland <richard.holland at ebi.ac.uk>
> > Sent by: biojava-l-bounces at lists.open-bio.org
> > 05/12/2006 04:34 PM
> > 
> > 
> >         To:     n.haigh at sheffield.ac.uk
> >         cc:     biojava-l at lists.open-bio.org, (bcc: Mark Schreiber/GP/Novartis)
> >         Subject:        Re: [Biojava-l] Creating an alignment object
> > 
> > 
> > Sorry for the delay in replying - I had to leave work a bit early
> > yesterday.
> > 
> > > Nope, I don't need to generate an alignment, I already have an alignment in
> > > a file created by third party software (clustalw).
> > 
> > There is nothing that I know of in BioJava that reads ClustalW files
> > directly into Alignment objects. (If someone else knows different,
> > please correct me). There are certainly methods in BioJava which read
> > the alignments from ClustalW into a set of String objects, each one
> > representing a member sequence (see SequenceAlignmentSAXParser), but I
> > don't know of anything more detailed than that.
> > 
> > The third-party package called Strap which I mentioned yesterday happily
> > reads/writes many of the major alignment formats, and has wrappers for
> > running ClustalW and other aligners programmatically and reading back in
> > the results, so it is definitely worth a look. You can use a lot of its
> > functions without having to run the GUI, including reading/writing
> > various alignment formats.
> > 
> > >
> > > In fact, the app I'd
> > > eventually like to have written in Java would include some sort of wrapper
> > > for clustalw in order to construct the alignments from a set of unaligned
> > > sequences, but algorithms implemented in Biojava would also be a welcome
> > > addition to the app.
> > 
> > If you want to wrap clustalw, the simplest way would be to create
> > Sequence objects in BioJava, write them out to Fasta using the BioJava
> > sequence IO tools, and then use Runtime.exec() (or an alternative such
> > as ProcessBuilder) to run ClustalW. However, you then still have the
> > problem of reading the output back in again.
> > 
> > The classes in org.biojava.bio.alignment that I mentioned yesterday
> > implement several useful alignment algorithms which you can use as an
> > alternative to ClustalW.
> > 
> > > But first things first.
> > > If I didn't have any sequences or an alignment in any files. What is the
> > > easiest way to get an alignment object in Java to have a play around with?
> > 
> > Make an instance of FlexibleAlignment from org.biojava.bio.alignment,
> > and use its methods to add sequences to it. It doesn't do any aligning
> > itself - it is just a placeholder to contain sequences and information
> > about how they align. You have to use its methods to add and remove
> > sequences from the alignment, to add/remove gaps and deletions, and get
> > things like consensus sequences etc.
> > 
> > Technically I suppose you could use FlexibleAlignment in conjunction
> > with SequenceAlignmentSAXParser to read alignment members as strings,
> > construct sequences based on them, and add them to the alignment object,
> > but I haven't tried this myself. It'd probably require some extra
> > processing to convert the dashes (gaps) in the input strings into
> > proper gaps in the alignment.
> > 
> > > Is there a way to just "magically" create a default alignment of say 5
> > > sequences with 20 positions?
> > 
> > You'd have to manually create 5 sequences yourself and add them to a
> > FlexibleAlignment as described above.
> > 
> > cheers,
> > Richard
> > 
> > --
> > Richard Holland (BioMart Team)
> > EMBL-EBI
> > Wellcome Trust Genome Campus
> > Hinxton
> > Cambridge CB10 1SD
> > UNITED KINGDOM
> > Tel: +44-(0)1223-494416
> > 
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> > 
> > 
> 
-- 
Richard Holland (BioMart Team)
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UNITED KINGDOM
Tel: +44-(0)1223-494416



