[Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava alignment lead"]

Thu Apr 8 07:13:17 UTC 2010

Hi all,

This e-mail is just to let you know about somebody new who would like 
to contribute to our project.

Cheers
Andreas

Subject:
Re: Fwd: Proposing a project on "Biojava alignment lead"
From:
Andreas Dräger <andreas.draeger at uni-tuebingen.de>
Date:
Wed, 07 Apr 2010 09:27:13 +0200
To:
Cai Shaojiang <caishaojiang at gmail.com>

Hi Cai Shaojiang,

Thank you for your e-mail! I don't know what happened to the e-mail list. 
Sometimes it takes a while due to the spam filters, I guess.

 > I am a PhD student from National University of Singapore. My major
 > research area is local alignment algorithms and data structures for SNP
 > identification. And I have used Java and Eclipse for years for software
 > development. I am very interested in your GSoC programme. I find that
 > there is a module called "biojava-alignment lead" whose mentor is you. I
 > want to propose a new project on this module. I have several questions
 > about this module.

Yes, that's me. So great to get your support.

 > 1. It seems that pairwise alignment is to find similarity between two
 > short sequences. Existing pairwise alignment is based on dynamic
 > programming; is it the Smith-Waterman algorithm?

So, currently, BioJava contains three different alignment approaches. 
There are two deterministic algorithms, i.e., Smith-Waterman for local 
alignment and Needleman-Wunsch for global alignment. Third, there is the 
possibility to apply Hidden Markov Models for alignment. An example of 
the latter approach should be in the cookbook.
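
To make the local-alignment case concrete, here is a minimal Java sketch of 
the Smith-Waterman recurrence. It is not the BioJava implementation: the 
class name, the method signature, and the linear gap penalty are assumptions 
chosen only for illustration, and it returns just the best local score 
without a traceback.

```java
/**
 * Illustrative sketch only (hypothetical class, not part of BioJava):
 * best local alignment score via the Smith-Waterman recurrence with a
 * simple linear gap penalty.
 */
final class SmithWatermanSketch {

    /**
     * @param match    score for identical symbols (positive)
     * @param mismatch score for differing symbols (negative)
     * @param gap      score for a single gap symbol (negative)
     */
    static int score(String query, String subject, int match, int mismatch, int gap) {
        // h[i][j] = best score of a local alignment ending at query[i-1], subject[j-1]
        int[][] h = new int[query.length() + 1][subject.length() + 1];
        int best = 0;
        for (int i = 1; i <= query.length(); i++) {
            for (int j = 1; j <= subject.length(); j++) {
                boolean same = query.charAt(i - 1) == subject.charAt(j - 1);
                int diag = h[i - 1][j - 1] + (same ? match : mismatch);
                int up = h[i - 1][j] + gap;    // gap in the subject
                int left = h[i][j - 1] + gap;  // gap in the query
                // The floor at zero is what makes the alignment local.
                h[i][j] = Math.max(0, Math.max(diag, Math.max(up, left)));
                best = Math.max(best, h[i][j]);
            }
        }
        return best;
    }
}
```

For example, SmithWatermanSketch.score("TTACGTTT", "GGACGGG", 2, -1, -2) 
scores the shared "ACG" stretch. A Needleman-Wunsch (global) variant would 
drop the floor at zero, initialize the first row and column with 
accumulated gap penalties, and read the score from the last cell.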

 > 2. What is the exact task of "refactoring of underlying data structures"?

Yes, this is something I already did last week, but it could still be 
improved. The problem was that the alignment algorithms produced only a 
string that looks similar to the output of BLAST. This string contained 
the score, the computation time, the length of the alignment, etc. 
People, however, wanted to perform higher-level computations on the 
score value or evaluate some of the other information. Now, the 
alignment produces a data structure that contains all of this 
information and can, in addition, still generate such a BLAST-like 
output. There is, however, still the following problem: The data 
structure requires both sequences in the pairwise alignment to have an 
identical length. In the case of local alignment this is especially 
stupid, because gaps are inserted to fill up the sequences. In 
addition, the data structure tries to keep the old sequence coordinates, 
so the numbers "query start", "query end", "subject start", and 
"subject end" are needed to shift the sequences against each other when 
displaying the output. So you cannot simply print the sequences one 
below the other; you first have to shift them. Please check out the 
latest version of this package via anonymous svn and have a look ;-)
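
As a rough illustration of a result object that keeps the score and 
coordinates accessible while still offering a BLAST-like rendering, 
consider the following Java sketch. All names are hypothetical and do not 
reflect the classes in the BioJava trunk. It assumes 1-based start 
coordinates and stores only the aligned region of each sequence, with '-' 
as the gap symbol, so the two rows can be printed one below the other 
without shifting.

```java
/**
 * Illustrative sketch only (hypothetical class, not the BioJava data
 * structure): a pairwise alignment result with programmatic access to the
 * score and coordinates plus a BLAST-like text rendering.
 */
final class AlignmentResultSketch {
    final int score;
    final int queryStart;        // 1-based position in the original query
    final int subjectStart;      // 1-based position in the original subject
    final String alignedQuery;   // aligned region only, '-' marks a gap
    final String alignedSubject; // aligned region only, '-' marks a gap

    AlignmentResultSketch(int score, int queryStart, String alignedQuery,
                          int subjectStart, String alignedSubject) {
        if (alignedQuery.length() != alignedSubject.length()) {
            throw new IllegalArgumentException("aligned regions must have equal length");
        }
        this.score = score;
        this.queryStart = queryStart;
        this.alignedQuery = alignedQuery;
        this.subjectStart = subjectStart;
        this.alignedSubject = alignedSubject;
    }

    /** Counts the non-gap symbols of an aligned row. */
    private static int residues(String aligned) {
        int n = 0;
        for (int i = 0; i < aligned.length(); i++) {
            if (aligned.charAt(i) != '-') {
                n++;
            }
        }
        return n;
    }

    int queryEnd() {
        return queryStart + residues(alignedQuery) - 1;
    }

    int subjectEnd() {
        return subjectStart + residues(alignedSubject) - 1;
    }

    /** Renders the score, both rows, and a '|' under every identical column. */
    String toBlastLikeString() {
        StringBuilder marks = new StringBuilder();
        for (int i = 0; i < alignedQuery.length(); i++) {
            char q = alignedQuery.charAt(i);
            marks.append(q != '-' && q == alignedSubject.charAt(i) ? '|' : ' ');
        }
        return "Score: " + score + "\n"
            + "Query: " + alignedQuery + "  [" + queryStart + ".." + queryEnd() + "]\n"
            + "       " + marks + "\n"
            + "Sbjct: " + alignedSubject + "  [" + subjectStart + ".." + subjectEnd() + "]";
    }
}
```

Because the end coordinates are derived from the start and the number of 
non-gap symbols, callers can work with the score and the coordinates 
directly instead of parsing them out of the printed text.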

 > 3. My existing research area aims to deal with aligning short
 > reads (10s~100s bp) against extremely long sequences (e.g., the human
 > genome). As far as I know, no such alignment tools are
 > implemented in Java. Would you consider this direction?

See, this would be very nice to include. But this requires that we no 
longer fill the short sequence with many, many gap symbols (just a waste 
of memory), but improve the data structure. There is already an 
UnequalLengthAlignment (just a data structure, no algorithm) and I think 
we could use this as a starting point. Then your algorithm should only 
produce such a data structure and this would be fine.
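
The idea of avoiding gap padding can be sketched as follows. This is not 
the UnequalLengthAlignment interface from BioJava; the class and method 
names are invented for illustration. Each row stores only its own symbols 
plus a 0-based offset into the alignment, and gap symbols are generated on 
demand for columns a row does not cover.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Illustrative sketch only (hypothetical class): an alignment whose rows
 * may have different lengths. Each row keeps its symbols and an offset
 * instead of being padded with gap symbols to the full alignment length.
 */
final class OffsetAlignmentSketch {

    private static final class Row {
        final int offset;      // 0-based column where the row starts
        final String symbols;  // the row's own symbols, no padding

        Row(int offset, String symbols) {
            this.offset = offset;
            this.symbols = symbols;
        }
    }

    private final Map<String, Row> rows = new LinkedHashMap<String, Row>();

    void add(String label, int offset, String symbols) {
        if (offset < 0) {
            throw new IllegalArgumentException("offset must not be negative");
        }
        rows.put(label, new Row(offset, symbols));
    }

    /** Number of columns spanned by all rows together. */
    int length() {
        int max = 0;
        for (Row r : rows.values()) {
            max = Math.max(max, r.offset + r.symbols.length());
        }
        return max;
    }

    /** Symbol of a row at an alignment column; '-' where the row has no symbol. */
    char symbolAt(String label, int column) {
        Row r = rows.get(label);
        if (r == null) {
            throw new IllegalArgumentException("unknown label: " + label);
        }
        int i = column - r.offset;
        return (i >= 0 && i < r.symbols.length()) ? r.symbols.charAt(i) : '-';
    }

    /** Symbols actually held in memory, for comparison with rows * length(). */
    int storedSymbols() {
        int n = 0;
        for (Row r : rows.values()) {
            n += r.symbols.length();
        }
        return n;
    }
}
```

With a reference of ten symbols and a three-symbol read placed at offset 4, 
this stores 13 symbols rather than the 20 a padded two-row alignment would 
need. For a short read against a whole genome the difference grows with 
the length of the long sequence.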

 > 4. It seems that the existing tools are just lacking some
 > refactoring and representation interfaces. Are there any more underlying tasks?

Hm. Yes: With the release of BioJava 3, the data structures have changed 
again. So some adaptation to the new structures may also be required.

 > I have been keeping an eye on GSoC since last month, but I am sorry to
 > find that I sent the initial e-mail to the mailing list before I
 > subscribed to it...

Ok. Sounds good. Thanks for your interest. So I suggest: Download the 
latest trunk, have a look, play around and if you can improve something 
we'll put it into the trunk and write your name into the authors' tag.

Cheers
Andreas

-- 
Dipl.-Bioinform. Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 Tübingen
Germany

Phone: +49-7071-29-70436
Fax:   +49-7071-29-5091