[Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava alignment lead"]

Sat Apr 17 13:34:20 UTC 2010

Dear Sir,

Could anyone tell me where I could start? Is there any lead who might need
my help with software development and research-oriented aspects?

Any comments on my previous emails would be most welcome.

Regards,
Jitesh Dundas

On 4/8/10, Andreas Dräger <andreas.draeger at uni-tuebingen.de> wrote:
>
> Hi all,
>
> This e-mail is just for your information about somebody new, who'd like to
> contribute to our project.
>
> Cheers
> Andreas
>
>
> Subject:
> Re: Fwd: Proposing a project on "Biojava alignment lead"
> From:
> Andreas Dräger <andreas.draeger at uni-tuebingen.de>
> Date:
> Wed, 07 Apr 2010 09:27:13 +0200
> To:
> Cai Shaojiang <caishaojiang at gmail.com>
>
> Hi Cai Shaojiang,
>
> Thank you for your e-mail! I don't know what happened to the e-mail list.
> Sometimes it takes a while due to the spam filters, I guess.
>
> > I am a PhD student from the National University of Singapore. My major
> research area is local alignment algorithms and data structures for SNP
> identification. And I have used Java and Eclipse for years for software
> development. I am very interested in your GSoC programme. I find that there
> is a module called "biojava-alignment lead" whose mentor is you. I want to
> propose a new project on this module. I have several questions about this
> module.
>
> Yes, that's me. So great to get your support.
>
> > 1. It seems that pairwise alignment is to find similarity between two
> short sequences. Existing pairwise alignment is based on dynamic
> programming; is it the Smith-Waterman algorithm?
>
> So, currently, BioJava contains three different alignment approaches.
> There are two deterministic algorithms, i.e., Smith-Waterman for local
> alignment and Needleman-Wunsch for global alignment. Third, there is the
> possibility to apply Hidden Markov Models for alignment. An example of the
> latter approach should be in the cookbook.
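
For readers unfamiliar with the two deterministic algorithms named above, here is a minimal sketch of the Smith-Waterman scoring recurrence. It is not BioJava code: the class name `SmithWatermanSketch`, the linear gap penalty, and the fixed match/mismatch scores are assumptions made for illustration. Dropping the clamp at zero and initialising the first row and column with gap penalties would turn it into the Needleman-Wunsch (global) variant.

```java
// Illustrative only; class and method names are hypothetical, not part of BioJava.
public class SmithWatermanSketch {

    /**
     * Returns the best local alignment score of a and b,
     * assuming a linear gap penalty (gap should be negative).
     */
    public static int score(String a, String b, int match, int mismatch, int gap) {
        // h[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
        int[][] h = new int[a.length() + 1][b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch;
                int diag = h[i - 1][j - 1] + sub; // align the two symbols
                int up = h[i - 1][j] + gap;       // gap in b
                int left = h[i][j - 1] + gap;     // gap in a
                // Clamping at zero is what makes the alignment local.
                h[i][j] = Math.max(0, Math.max(diag, Math.max(up, left)));
                best = Math.max(best, h[i][j]);
            }
        }
        return best;
    }
}
```

The sketch only returns the score; a full implementation would also trace back through the matrix to recover the aligned subsequences.
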
>
> > 2. What is the exact task of "refactoring of underlying data structures"?
>
> Yes, this is something I already did last week, but it could still be
> improved. The problem was that the alignment algorithms actually produced a
> kind of string that looks similar to the output of BLAST. This string
> contained the score, the computation time, the length of the alignment etc.
> The problem was that people wanted to perform higher-level computation on
> the score value or evaluate some other information. Now, the alignment will
> produce a data structure that contains all the information and can, in
> addition to that, also produce such a BLAST-like output. There is, however,
> still the following problem: The data structure requires both sequences in
> the pair-wise alignment to have an identical length. In the case of local
> alignment this is especially stupid (actually), because gaps are inserted to
> fill the sequences. And then the data structure tries to keep the old
> sequence coordinates, leading to the effect that the numbers "query start",
> "query end", "subject start", and "subject end" are required to shift the
> sequences against each other when displaying the output. So, you cannot
> easily print the sequences below each other; you first have to shift
> them. Please check out the latest version of this package via anonymous svn
> and have a look ;-)
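
The idea described above (a result object that exposes the score and coordinates and can still render a BLAST-like view) might be sketched as follows. This is not the class in the BioJava trunk: `AlignmentResultSketch` and all of its members are hypothetical. As one possible way around the shifting problem, the sketch stores only the aligned region plus 1-based start coordinates, so both rows can be printed directly below each other.

```java
// Hypothetical result holder; names and layout are assumptions, not BioJava's API.
public final class AlignmentResultSketch {
    private final int score;
    private final int queryStart;        // 1-based start in the original query
    private final int subjectStart;      // 1-based start in the original subject
    private final String alignedQuery;   // aligned region only, '-' marks a gap
    private final String alignedSubject;

    public AlignmentResultSketch(int score, int queryStart, int subjectStart,
                                 String alignedQuery, String alignedSubject) {
        if (alignedQuery.length() != alignedSubject.length()) {
            throw new IllegalArgumentException("aligned rows must have equal length");
        }
        this.score = score;
        this.queryStart = queryStart;
        this.subjectStart = subjectStart;
        this.alignedQuery = alignedQuery;
        this.alignedSubject = alignedSubject;
    }

    public int getScore() { return score; }
    public int getLength() { return alignedQuery.length(); }
    public int getQueryEnd() { return queryStart + residues(alignedQuery) - 1; }
    public int getSubjectEnd() { return subjectStart + residues(alignedSubject) - 1; }

    // Counts the non-gap symbols in one aligned row.
    private static int residues(String row) {
        int n = 0;
        for (int i = 0; i < row.length(); i++) {
            if (row.charAt(i) != '-') n++;
        }
        return n;
    }

    /** Renders a simple BLAST-like text view of the alignment. */
    public String toBlastLikeString() {
        StringBuilder mid = new StringBuilder();
        for (int i = 0; i < alignedQuery.length(); i++) {
            char q = alignedQuery.charAt(i);
            mid.append(q != '-' && q == alignedSubject.charAt(i) ? '|' : ' ');
        }
        String nl = System.lineSeparator();
        return "Score = " + score + ", Length = " + getLength() + nl
            + String.format("%-6s%6d  %s  %d", "Query", queryStart, alignedQuery, getQueryEnd()) + nl
            + String.format("%-6s%6s  %s", "", "", mid) + nl
            + String.format("%-6s%6d  %s  %d", "Sbjct", subjectStart, alignedSubject, getSubjectEnd());
    }
}
```

Callers that need the score for further computation read `getScore()` directly, while `toBlastLikeString()` remains available for display.
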
>
> > 3. My current research aims to align short reads
> (10s~100s bp) against extremely long sequences (e.g., human genome). As far
> as I know, no such alignment tools have been implemented in Java.
> Would you consider this direction?
>
> See, this would be very nice to include. But this requires that we no
> longer fill the short sequence with many, many gap symbols (just a waste of
> memory), but improve the data structure. There is already an
> UnequalLengthAlignment (just a data structure, no algorithm) and I think we
> could use this as a starting point. Then your algorithm should only produce
> such a data structure and this would be fine.
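
To illustrate why padding with gap symbols is avoidable, here is a minimal sketch that records where a short read sits on a long reference instead of materialising the padding. It is not the existing BioJava data structure mentioned above: `ShortReadHitSketch` is a hypothetical name, and the sketch assumes an ungapped placement for brevity.

```java
// Hypothetical, ungapped placement of a short read on a long reference.
public final class ShortReadHitSketch {
    private final String read;
    private final long referenceOffset; // 0-based start position on the reference

    public ShortReadHitSketch(String read, long referenceOffset) {
        if (read.isEmpty() || referenceOffset < 0) {
            throw new IllegalArgumentException("need a non-empty read and an offset >= 0");
        }
        this.read = read;
        this.referenceOffset = referenceOffset;
    }

    public long getReferenceStart() { return referenceOffset; }

    /** End position on the reference, exclusive. */
    public long getReferenceEnd() { return referenceOffset + read.length(); }

    /**
     * Symbol of the read's row at a reference position. Positions outside
     * the read are reported as '-' on demand rather than stored.
     */
    public char symbolAt(long referencePos) {
        long i = referencePos - referenceOffset;
        return (i >= 0 && i < read.length()) ? read.charAt((int) i) : '-';
    }
}
```

Memory use is proportional to the read length rather than the reference length, which matters when the reference is a whole genome.
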
>
> > 4. It seems that the existing tools are just lacking some refactoring
> and representation interfaces. Are there any more underlying tasks?
>
> Hm. Yes: With the release of BioJava 3, the data structures have changed
> again. So maybe some adaptation to the new structure is also required.
>
> > I have been keeping an eye on GSoC since last month, but I am sorry to
> find that I sent the initial email to the mailing list before I subscribed
> to it...
>
> Ok. Sounds good. Thanks for your interest. So I suggest: Download the
> latest trunk, have a look, play around and if you can improve something
> we'll put it into the trunk and write your name into the authors' tag.
>
> Cheers
> Andreas
>
> --
> Dipl.-Bioinform. Andreas Dräger
> Eberhard Karls University Tübingen
> Center for Bioinformatics (ZBIT)
> Sand 1
> 72076 Tübingen
> Germany
>
> Phone: +49-7071-29-70436
> Fax:   +49-7071-29-5091