[Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava alignment lead"]

Fri Apr 16 13:57:41 UTC 2010

Dear Sir,

I am very interested in contributing to this project.

I am looking for a good problem, more on the research side. I can also
help with coding (I also work as a software engineer: J2EE, Eclipse,
JBoss, Tomcat, ...).

Anything that I could work on...

Regards,
Jitesh Dundas

On 4/8/10, Andreas Dräger <andreas.draeger at uni-tuebingen.de> wrote:
> Hi all,
>
> This e-mail is just for your information about somebody new, who'd like
> to contribute to our project.
>
> Cheers
> Andreas
>
>
> Subject:
> Re: Fwd: Proposing a project on "Biojava alignment lead"
> From:
> Andreas Dräger <andreas.draeger at uni-tuebingen.de>
> Date:
> Wed, 07 Apr 2010 09:27:13 +0200
> To:
> Cai Shaojiang <caishaojiang at gmail.com>
>
> Hi Cai Shaojiang,
>
> Thank you for your e-mail! I don't know what happened to the e-mail list.
> Sometimes it takes a while due to the spam filters, I guess.
>
> > I am a PhD student from the National University of Singapore. My major
> > research area is local alignment algorithms and data structures for SNP
> > identification. And I have used Java and Eclipse for years for software
> > development. I am very interested in your GSoC programme. I find that
> > there is a module called "biojava-alignment lead" whose mentor is you. I
> > want to propose a new project on this module. I have several questions
> > about this module.
>
> Yes, that's me. So great to get your support.
>
> > 1. It seems that pairwise alignment is to find similarity between two
> > short sequences. Existing pairwise alignment is based on dynamic
> > programming; is it the Smith-Waterman algorithm?
>
> So, currently, BioJava contains three different alignment approaches.
> There are two deterministic algorithms, i.e., Smith-Waterman for local
> alignment and Needleman-Wunsch for global alignment. Third, there is the
> possibility to apply Hidden Markov Models for alignment. An example of
> the latter approach should be in the cookbook.
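
For illustration, the local alignment score that Smith-Waterman computes by
dynamic programming can be sketched as follows. This is a minimal sketch, not
BioJava's implementation: the class name, the linear gap penalty, and the
scoring values are assumptions chosen for the example.

```java
// Minimal Smith-Waterman scoring sketch (hypothetical class, not BioJava API).
final class SmithWatermanSketch {
    // Assumed scoring values, for illustration only.
    static final int MATCH = 2;
    static final int MISMATCH = -1;
    static final int GAP = -2; // linear gap penalty

    /** Returns the best local alignment score between query and subject. */
    static int score(String query, String subject) {
        // h[i][j] = best score of a local alignment ending at query[i-1], subject[j-1].
        int[][] h = new int[query.length() + 1][subject.length() + 1];
        int best = 0;
        for (int i = 1; i <= query.length(); i++) {
            for (int j = 1; j <= subject.length(); j++) {
                int sub = query.charAt(i - 1) == subject.charAt(j - 1) ? MATCH : MISMATCH;
                int diag = h[i - 1][j - 1] + sub; // align the two symbols
                int up = h[i - 1][j] + GAP;       // gap in the subject
                int left = h[i][j - 1] + GAP;     // gap in the query
                // The floor at zero is what makes the alignment local.
                h[i][j] = Math.max(0, Math.max(diag, Math.max(up, left)));
                best = Math.max(best, h[i][j]);
            }
        }
        return best;
    }
}
```

With these assumed values, SmithWatermanSketch.score("ACGT", "ACGT") is 8.
Needleman-Wunsch uses the same recurrence without the floor at zero,
initialises the first row and column with gap penalties, and reads the score
from the last cell instead of the maximum.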
>
> > 2. What is the exact task of "refactoring of underlying data structures"?
>
> Yes, this is something I already did last week, but it could still be
> improved. The problem was that the alignment algorithms actually
> produced a kind of string that looks similar to the output of BLAST.
> This string contained the score, the computation time, the length of the
> alignment etc. The problem was that people wanted to perform
> higher-level computation on the score value or evaluate some other
> information. Now, the alignment will produce a data structure that
> contains all the information and can, in addition to that, also produce
> such a BLAST-like output. There is, however, still the following
> problem: The data structure requires both sequences in the pair-wise
> alignment to have an identical length. In case of local alignment this
> is especially stupid (actually), because gaps are inserted to fill the
> sequences. And then the data structure tries to keep the old sequence
> coordinates, leading to the effect that the numbers "query start",
> "query end", "subject start", and "subject end" are required to shift
> the sequences against each other when displaying the output. So, you
> cannot easily print the sequences below each other; you first have to
> shift them. Please check out the latest version of this package via
> anonymous svn and have a look ;-)
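
The idea of a result object that keeps the score and other values accessible,
and only renders a BLAST-like string on request, might look roughly like this.
It is a sketch under assumptions: the class, field, and method names are
hypothetical and do not reflect the actual BioJava data structure.

```java
// Hypothetical alignment result holder (not the BioJava class).
final class AlignmentResultSketch {
    final String alignedQuery;   // gapped query, '-' marks a gap
    final String alignedSubject; // gapped subject, same length as the query
    final int score;
    final long timeMillis;       // computation time

    AlignmentResultSketch(String alignedQuery, String alignedSubject, int score, long timeMillis) {
        // Mirrors the equal-length requirement described in the e-mail.
        if (alignedQuery.length() != alignedSubject.length()) {
            throw new IllegalArgumentException("aligned sequences must have identical length");
        }
        this.alignedQuery = alignedQuery;
        this.alignedSubject = alignedSubject;
        this.score = score;
        this.timeMillis = timeMillis;
    }

    /** Length of the alignment, including gap columns. */
    int length() {
        return alignedQuery.length();
    }

    /** Number of columns in which both sequences carry the same non-gap symbol. */
    int identities() {
        int n = 0;
        for (int i = 0; i < length(); i++) {
            char q = alignedQuery.charAt(i);
            if (q != '-' && q == alignedSubject.charAt(i)) {
                n++;
            }
        }
        return n;
    }

    /** Renders a BLAST-like text view; the values stay available as fields. */
    String toBlastLikeString() {
        StringBuilder mid = new StringBuilder();
        for (int i = 0; i < length(); i++) {
            char q = alignedQuery.charAt(i);
            mid.append(q != '-' && q == alignedSubject.charAt(i) ? '|' : ' ');
        }
        return "Score: " + score + ", Length: " + length() + ", Time: " + timeMillis + " ms\n"
                + "Query: " + alignedQuery + "\n"
                + "       " + mid + "\n"
                + "Sbjct: " + alignedSubject + "\n";
    }
}
```

Higher-level code can then read score or identities() directly instead of
parsing the printed text.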
>
> > 3. My existing research area aims to deal with aligning short
> > reads (10s~100s bp) against extremely long sequences (e.g., human
> > genome). As far as I know, no such alignment tools exist that are
> > implemented in Java. Would you consider this direction?
>
> See, this would be very nice to include. But this requires that we no
> longer fill the short sequence with many, many gap symbols (just a waste
> of memory), but improve the data structure. There is already an
> UnequalLengthAlignment (just a data structure, no algorithm) and I think
> we could use this as a starting point. Then your algorithm should only
> produce such a data structure and this would be fine.
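
One way to avoid padding a short read with gap symbols is to store the read
together with its start offset on the long sequence. The sketch below shows
only that idea for an ungapped placement; it is not the API of BioJava's
UnequalLengthAlignment, and all names in it are hypothetical.

```java
// Hypothetical offset-based placement of a short read on a long subject.
final class UnequalLengthSketch {
    final String subject;   // long reference sequence, stored once and ungapped
    final String read;      // short read, stored without padding
    final int subjectStart; // 0-based position of the read on the subject

    UnequalLengthSketch(String subject, String read, int subjectStart) {
        if (subjectStart < 0 || subjectStart + read.length() > subject.length()) {
            throw new IllegalArgumentException("read does not fit on subject at this offset");
        }
        this.subject = subject;
        this.read = read;
        this.subjectStart = subjectStart;
    }

    /** Exclusive end position of the read on the subject. */
    int subjectEnd() {
        return subjectStart + read.length();
    }

    /** The part of the subject that the read is placed against. */
    String subjectWindow() {
        return subject.substring(subjectStart, subjectEnd());
    }

    /** Symbols actually stored for the read. */
    int storedSymbols() {
        return read.length();
    }

    /** Symbols a gap-padded copy of the read would need instead. */
    int paddedSymbols() {
        return subject.length();
    }
}
```

For a 4 bp read on a 12 bp subject the difference is small, but for a read of
about 100 bp against a whole genome the padded representation grows with the
genome, while the offset-based one grows only with the read.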
>
> > 4. It seems that the existing tools are just lacking some
> > refactoring and representation interfaces. Are there any more
> > underlying tasks?
>
> Hm. Yes: With the release of BioJava 3, the data structures have changed
> again. So maybe some adaptation to the new structure is also required.
>
> > I have been keeping an eye on GSoC since last month, but I am sorry to
> > find out that I sent the initial email to the mailing list before I
> > subscribed to it...
>
> Ok. Sounds good. Thanks for your interest. So I suggest: Download the
> latest trunk, have a look, play around and if you can improve something
> we'll put it into the trunk and write your name into the authors' tag.
>
> Cheers
> Andreas
>
> --
> Dipl.-Bioinform. Andreas Dräger
> Eberhard Karls University Tübingen
> Center for Bioinformatics (ZBIT)
> Sand 1
> 72076 Tübingen
> Germany
>
> Phone: +49-7071-29-70436
> Fax:   +49-7071-29-5091
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>