[Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava alignment lead"]
Andreas Dräger
andreas.draeger at uni-tuebingen.de
Thu Apr 8 07:13:17 UTC 2010
Hi all,
This e-mail is just to let you know about somebody new who'd like
to contribute to our project.
Cheers
Andreas
Subject: Re: Fwd: Proposing a project on "Biojava alignment lead"
From: Andreas Dräger <andreas.draeger at uni-tuebingen.de>
Date: Wed, 07 Apr 2010 09:27:13 +0200
To: Cai Shaojiang <caishaojiang at gmail.com>
Hi Cai Shaojiang,
Thank you for your e-mail! I don't know what happened to the e-mail list.
Sometimes it takes a while due to the spam filters, I guess.
> I am a PhD student from National University of Singapore. My major
> research area is local alignment algorithms and data structures for SNP
> identification. And I have used Java and Eclipse for years for software
> development. I am very interested in your GSoC programme. I find that
> there is a module called "biojava-alignment lead" whose mentor is you. I
> want to propose a new project on this module. I have several questions
> about this module.
Yes, that's me. So great to get your support.
> 1. It seems that pairwise alignment is to find similarity between two
> short sequences. Existing pairwise alignment is based on dynamic
> programming, is it Smith-Waterman algorithm?
So, currently, BioJava contains three different alignment approaches.
There are two deterministic algorithms, i.e., Smith-Waterman for local
alignment and Needleman-Wunsch for global alignment. Third, there is the
possibility to apply Hidden Markov Models for alignment. An example of
the latter approach should be in the cookbook.
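
To make the difference between the two deterministic approaches
concrete, here is a minimal sketch of both scoring recurrences. It is
not BioJava code: the class name, the method names, and the scoring
values (match 2, mismatch -1, linear gap penalty -2) are assumptions
chosen only for illustration.

```java
// Illustrative sketch only, not part of BioJava. All names and scoring
// values below are assumptions made for this example.
class AlignmentScoreSketch {
    static final int MATCH = 2;
    static final int MISMATCH = -1;
    static final int GAP = -2; // linear penalty per gap symbol

    private static int substitution(char x, char y) {
        return x == y ? MATCH : MISMATCH;
    }

    /** Smith-Waterman: best score over all local alignments. */
    static int localScore(String a, String b) {
        // Row 0 and column 0 stay 0: a local alignment may start anywhere.
        int[][] h = new int[a.length() + 1][b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int diag = h[i - 1][j - 1]
                        + substitution(a.charAt(i - 1), b.charAt(j - 1));
                int gapped = Math.max(h[i - 1][j], h[i][j - 1]) + GAP;
                // Clamping at 0 is what makes the alignment local.
                h[i][j] = Math.max(0, Math.max(diag, gapped));
                best = Math.max(best, h[i][j]);
            }
        }
        return best;
    }

    /** Needleman-Wunsch: score of the best end-to-end alignment. */
    static int globalScore(String a, String b) {
        int[][] h = new int[a.length() + 1][b.length() + 1];
        // Leading gaps are charged in a global alignment.
        for (int i = 1; i <= a.length(); i++) h[i][0] = i * GAP;
        for (int j = 1; j <= b.length(); j++) h[0][j] = j * GAP;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int diag = h[i - 1][j - 1]
                        + substitution(a.charAt(i - 1), b.charAt(j - 1));
                int gapped = Math.max(h[i - 1][j], h[i][j - 1]) + GAP;
                h[i][j] = Math.max(diag, gapped);
            }
        }
        return h[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(localScore("TTACGTTT", "GGACGGG"));
        System.out.println(globalScore("ACGT", "AGT"));
    }
}
```

The only differences are the initialization of the first row and column
and the clamp at zero; both variants fill the same dynamic programming
matrix.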
> 2. What is the exact task of "refactoring of underlying data structures"?
Yes, this is something I already did last week, but it could still be
improved. The problem was that the alignment algorithms produced only
a kind of string that looks similar to the output of BLAST. This
string contained the score, the computation time, the length of the
alignment, etc., but people wanted to perform higher-level
computations on the score value or evaluate some of the other
information. Now the alignment produces a data structure that contains
all of this information and can, in addition, still produce such a
BLAST-like output. There is, however, still the following problem: the
data structure requires both sequences in the pairwise alignment to
have identical length. For local alignment this is especially stupid
(actually), because gaps are inserted to pad the sequences. The data
structure also tries to keep the old sequence coordinates, with the
effect that the numbers "query start", "query end", "subject start",
and "subject end" are needed to shift the sequences against each other
when displaying the output. So you cannot simply print the sequences
one below the other; you first have to shift them. Please check out
the latest version of this package via anonymous svn and have a look ;-)
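
As a rough illustration of the idea (a result object that exposes the
score for further computation and can still render a BLAST-like
string), here is a minimal sketch. It is not the class in the BioJava
trunk: the class name, fields, and output layout are hypothetical. It
stores only the aligned region plus the start coordinates, so the two
rows can be printed directly below each other without shifting.

```java
import java.util.Locale;

// Hypothetical result object for illustration; not the BioJava class.
final class AlignmentResultSketch {
    final int score;
    final int queryStart;        // 1-based position in the original query
    final int subjectStart;      // 1-based position in the original subject
    final String alignedQuery;   // aligned region only, '-' marks a gap
    final String alignedSubject; // same length as alignedQuery

    AlignmentResultSketch(int score, int queryStart, int subjectStart,
                          String alignedQuery, String alignedSubject) {
        if (alignedQuery.length() != alignedSubject.length()) {
            throw new IllegalArgumentException("aligned rows differ in length");
        }
        this.score = score;
        this.queryStart = queryStart;
        this.subjectStart = subjectStart;
        this.alignedQuery = alignedQuery;
        this.alignedSubject = alignedSubject;
    }

    // End coordinate = start + number of non-gap symbols - 1.
    private static int end(int start, String row) {
        int residues = 0;
        for (int i = 0; i < row.length(); i++) {
            if (row.charAt(i) != '-') residues++;
        }
        return start + residues - 1;
    }

    int queryEnd() { return end(queryStart, alignedQuery); }

    int subjectEnd() { return end(subjectStart, alignedSubject); }

    /** '|' under identical columns, a blank elsewhere. */
    String midline() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < alignedQuery.length(); i++) {
            char q = alignedQuery.charAt(i);
            sb.append(q != '-' && q == alignedSubject.charAt(i) ? '|' : ' ');
        }
        return sb.toString();
    }

    /** Renders a BLAST-like text block from the structured fields. */
    String toBlastLikeString() {
        return String.format(Locale.ROOT, "Score: %d\n", score)
             + String.format(Locale.ROOT, "Query %6d  %s  %d\n",
                             queryStart, alignedQuery, queryEnd())
             + String.format(Locale.ROOT, "%14s%s\n", "", midline())
             + String.format(Locale.ROOT, "Sbjct %6d  %s  %d\n",
                             subjectStart, alignedSubject, subjectEnd());
    }
}
```

With such a structure, callers can read the score field directly
instead of parsing it out of the text output.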
> 3. My existing research area is aiming to deal with aligning short
> reads (10s~100s bp) against extremely long sequences (e.g., human
> genome). As far as I know, there is no such alignment tool
> implemented in Java. Would you consider this direction?
See, this would be very nice to include. But it requires that we no
longer fill the short sequence with many, many gap symbols (just a
waste of memory) and improve the data structure instead. There is
already an UnequalLengthAlignment (just a data structure, no
algorithm), and I think we could use this as a starting point. Your
algorithm would then only need to produce such a data structure, and
that would be fine.
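
To show why padding is unnecessary, here is a minimal sketch of a
placement that stores only the read and its offset in the long
sequence. It does not reproduce the API of the existing data structure
in BioJava: the class and method names are hypothetical, and it
assumes an ungapped placement to keep the example short.

```java
// Hypothetical sketch, not a BioJava class. Assumes the read is placed
// on the reference without internal gaps.
final class ReadPlacementSketch {
    private final CharSequence reference; // shared, never copied or padded
    private final String read;
    private final int offset;             // 0-based start in the reference

    ReadPlacementSketch(CharSequence reference, String read, int offset) {
        if (offset < 0 || offset + read.length() > reference.length()) {
            throw new IllegalArgumentException("read does not fit at offset");
        }
        this.reference = reference;
        this.read = read;
        this.offset = offset;
    }

    /** 1-based, inclusive start of the read on the reference. */
    int start() { return offset + 1; }

    /** 1-based, inclusive end of the read on the reference. */
    int end() { return offset + read.length(); }

    /** Read symbol at a 1-based reference position, '-' outside the read. */
    char readSymbolAt(int referencePosition) {
        int i = referencePosition - 1 - offset;
        return (i >= 0 && i < read.length()) ? read.charAt(i) : '-';
    }

    /** Number of positions where read and reference disagree. */
    int mismatches() {
        int count = 0;
        for (int i = 0; i < read.length(); i++) {
            if (read.charAt(i) != reference.charAt(offset + i)) count++;
        }
        return count;
    }
}
```

The memory needed per placement grows with the read length only; gap
symbols outside the read are computed on demand instead of being
stored.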
> 4. It seems that the existing tools are just lacking some
> refactoring and representation interfaces. Any more underlying tasks?
Hm. Yes: With the release of BioJava 3, the data structures have changed
again. So maybe some adaptation to the new structure is also required.
> I have been keeping an eye on GSoC since last month, but I am sorry to
> find out that I sent the initial email to the mailing list before I
> subscribed to it...
Ok. Sounds good. Thanks for your interest. So I suggest: Download the
latest trunk, have a look, play around, and if you can improve something,
we'll put it into the trunk and write your name into the authors' tag.
Cheers
Andreas
--
Dipl.-Bioinform. Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 Tübingen
Germany
Phone: +49-7071-29-70436
Fax: +49-7071-29-5091