[Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava alignment lead"]

Fri Apr 16 17:28:33 UTC 2010

A great place to start finding ideas is the wiki.
Both http://biojava.org/wiki/BioJava:Modules
and http://biojava.org/wiki/BioJava3_Proposal
list the next steps planned/desired for BioJava.

What research area did you have in mind?

Have fun,
Mark

On 4/16/2010 8:57 AM, jitesh dundas wrote:
> Dear Sir,
>
> I am very interested in contributing to this project.
>
> I am looking for a good problem, more on the research side. I can also
> help with coding (I also work as a software
> engineer: J2EE/Eclipse/JBoss/Tomcat).
>
> Anything that I could work on...
>
> Regards,
> Jitesh Dundas
>
> On 4/8/10, Andreas Dräger<andreas.draeger at uni-tuebingen.de>  wrote:
>> Hi all,
>>
>> This e-mail is just for your information about somebody new, who'd like
>> to contribute to our project.
>>
>> Cheers
>> Andreas
>>
>>
>> Subject:
>> Re: Fwd: Proposing a project on "Biojava alignment lead"
>> From:
>> Andreas Dräger<andreas.draeger at uni-tuebingen.de>
>> Date:
>> Wed, 07 Apr 2010 09:27:13 +0200
>> To:
>> Cai Shaojiang<caishaojiang at gmail.com>
>>
>> Hi Cai Shaojiang,
>>
>> Thank you for your e-mail! I don't know what happened to the e-mail list.
>> Sometimes it takes a while due to the spam filters, I guess.
>>
>>   >  I am a PhD student from National University of Singapore. My major
>> research area is local alignment algorithms and data structures for SNP
>> identification. And I have used Java and Eclipse for years for software
>> development. I am very interested in your GSoC programme. I find that
>> there is a module called "biojava-alignment lead" whose mentor is you. I
>> want to propose a new project on this module. I have several questions
>> about this module.
>>
>> Yes, that's me. So great to get your support.
>>
>>   >  1. It seems that pairwise alignment is to find similarity between two
>> short sequences. Existing pairwise alignment is based on dynamic
>> programming; is it the Smith-Waterman algorithm?
>>
>> So, currently, BioJava contains three different alignment approaches.
>> There are two deterministic algorithms, i.e., Smith-Waterman for local
>> alignment and Needleman-Wunsch for global alignment. Third, there is the
>> possibility to apply Hidden Markov Models for alignment. An example of
>> the latter approach should be in the cookbook.
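
For readers unfamiliar with the two deterministic algorithms named above, here is a minimal sketch of the Smith-Waterman local score recurrence with a linear gap penalty. It is not BioJava's implementation: the class name, the method name, and the scoring parameters are all made up for illustration.

```java
/** Illustrative only; not part of BioJava. */
final class LocalAlignmentSketch {

    /**
     * Returns the best local alignment score of a and b.
     * match, mismatch and gap are hypothetical scoring parameters
     * (gap is a linear, per-symbol penalty and is usually negative).
     */
    static int score(String a, String b, int match, int mismatch, int gap) {
        // h[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
        // Row 0 and column 0 stay 0, so an alignment may start anywhere.
        int[][] h = new int[a.length() + 1][b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch;
                int diag = h[i - 1][j - 1] + sub; // align a[i-1] with b[j-1]
                int up = h[i - 1][j] + gap;       // gap in b
                int left = h[i][j - 1] + gap;     // gap in a
                // The floor at 0 is what makes the alignment local.
                h[i][j] = Math.max(0, Math.max(diag, Math.max(up, left)));
                best = Math.max(best, h[i][j]);
            }
        }
        return best;
    }
}
```

Needleman-Wunsch uses the same three-way recurrence, but it initialises the first row and column with accumulated gap penalties, drops the floor at 0, and reads the score from the bottom-right cell instead of the matrix maximum.
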
>>
>>   >  2. What is the exact task of "refactoring of underlying data structures"?
>>
>> Yes, this is something I already did last week, but it could still be
>> improved. The problem was that the alignment algorithms actually
>> produced a kind of string that looks similar to the output of BLAST.
>> This string contained the score, the computation time, the length of the
>> alignment etc. The problem was that people wanted to perform
>> higher-level computation on the score value or evaluate some other
>> information. Now, the alignment will produce a data structure that
>> contains all the information and can, in addition to that, also produce
>> such a BLAST-like output. There is, however, still the following
>> problem: The data structure requires both sequences in the pair-wise
>> alignment to have an identical length. In case of local alignment this
>> is especially stupid (actually), because gaps are inserted to fill the
>> sequences. And then the data structure tries to keep the old sequence
>> coordinates, leading to the effect that the numbers "query start",
>> "query end", "subject start", and "subject end" are required to shift
>> the sequences against each other when displaying the output. So, you
>> cannot easily print the sequences one below the other; you first have to
>> shift them. Please check out the latest version of this package via
>> anonymous svn and have a look ;-)
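
To illustrate the idea of returning a data structure rather than a pre-formatted string, here is a rough sketch of a result object that keeps the score and other values accessible and builds a text report only on request. The class, its fields, and the report layout are assumptions for illustration and do not reflect the actual classes in the BioJava trunk.

```java
/** Illustrative only; not the BioJava alignment result class. */
final class PairwiseResultSketch {
    final int score;
    final long timeMillis;       // computation time
    final String alignedQuery;   // may contain '-' gap symbols
    final String alignedSubject; // same length as alignedQuery

    PairwiseResultSketch(int score, long timeMillis,
                         String alignedQuery, String alignedSubject) {
        if (alignedQuery.length() != alignedSubject.length()) {
            throw new IllegalArgumentException("aligned rows must have equal length");
        }
        this.score = score;
        this.timeMillis = timeMillis;
        this.alignedQuery = alignedQuery;
        this.alignedSubject = alignedSubject;
    }

    /** Number of alignment columns. */
    int length() {
        return alignedQuery.length();
    }

    /** Number of columns where both rows carry the same non-gap symbol. */
    int identities() {
        int n = 0;
        for (int i = 0; i < length(); i++) {
            char q = alignedQuery.charAt(i);
            if (q != '-' && q == alignedSubject.charAt(i)) {
                n++;
            }
        }
        return n;
    }

    /** Builds a simple text report on demand (layout is made up). */
    String toReport() {
        StringBuilder mid = new StringBuilder();
        for (int i = 0; i < length(); i++) {
            char q = alignedQuery.charAt(i);
            mid.append(q != '-' && q == alignedSubject.charAt(i) ? '|' : ' ');
        }
        return "Score: " + score + ", Length: " + length()
                + ", Identities: " + identities()
                + ", Time: " + timeMillis + " ms\n"
                + "Query: " + alignedQuery + "\n"
                + "       " + mid + "\n"
                + "Sbjct: " + alignedSubject;
    }
}
```

With this split, code that wants the score reads the `score` field directly, while code that wants human-readable output calls `toReport()`.
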
>>
>>   >  3. My current research aims to deal with aligning short
>> reads (10s~100s bp) against extremely long sequences (e.g., human
>> genome). As far as I know, no such alignment tool has been
>> implemented in Java. Would you consider this direction?
>>
>> See, this would be very nice to include. But this requires that we no
>> longer fill the short sequence with many, many gap symbols (just a waste
>> of memory), but improve the data structure. There is already an
>> UnequalLengthAlignment (just a data structure, no algorithm) and I think
>> we could use this as a starting point. Then your algorithm should only
>> produce such a data structure and this would be fine.
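
As a sketch of what avoiding gap padding could look like, the record below stores only the aligned window of the long sequence plus its start coordinate, so memory grows with the read length rather than the reference length. It is a hypothetical illustration and does not mirror the API of the existing UnequalLengthAlignment data structure; all names are invented.

```java
/** Illustrative only; names and layout are assumptions. */
final class ReadHitSketch {
    final int refStart;       // 0-based, inclusive start on the reference
    final String alignedRef;  // aligned window of the reference only
    final String alignedRead; // the read, aligned to that window

    private ReadHitSketch(int refStart, String alignedRef, String alignedRead) {
        this.refStart = refStart;
        this.alignedRef = alignedRef;
        this.alignedRead = alignedRead;
    }

    /** Places a read at a reference offset without gaps or padding. */
    static ReadHitSketch ungapped(String reference, String read, int refStart) {
        if (refStart < 0 || refStart + read.length() > reference.length()) {
            throw new IllegalArgumentException("read does not fit at offset " + refStart);
        }
        String window = reference.substring(refStart, refStart + read.length());
        return new ReadHitSketch(refStart, window, read);
    }

    /** Exclusive end on the reference; gap symbols consume no reference position. */
    int refEnd() {
        int consumed = 0;
        for (int i = 0; i < alignedRef.length(); i++) {
            if (alignedRef.charAt(i) != '-') {
                consumed++;
            }
        }
        return refStart + consumed;
    }

    /** Prints both rows one below the other, with no shifting needed. */
    String render() {
        String prefix = "ref  " + refStart + " ";
        StringBuilder pad = new StringBuilder("read ");
        while (pad.length() < prefix.length()) {
            pad.append(' ');
        }
        return prefix + alignedRef + " " + refEnd() + "\n" + pad + alignedRead;
    }
}
```

For example, `ReadHitSketch.ungapped("GGGGACGTGGGG", "ACGT", 4)` keeps four symbols per row and the offset 4, instead of a read padded with eight gap symbols.
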
>>
>>   >  4. It seems that the existing tools are just lacking some
>> refactoring and representation interfaces. Any more underlying tasks?
>>
>> Hm. Yes: With the release of BioJava 3, data structures have changed
>> again. So maybe some adaptation to the new structure is also required.
>>
>>   >  I have been keeping an eye on GSoC since last month, but sorry to find
>> out that I sent the initial email to the mailing list before I subscribed
>> to it...
>>
>> Ok. Sounds good. Thanks for your interest. So I suggest: Download the
>> latest trunk, have a look, play around and if you can improve something
>> we'll put it into the trunk and write your name into the authors' tag.
>>
>> Cheers
>> Andreas
>>
>> --
>> Dipl.-Bioinform. Andreas Dräger
>> Eberhard Karls University Tübingen
>> Center for Bioinformatics (ZBIT)
>> Sand 1
>> 72076 Tübingen
>> Germany
>>
>> Phone: +49-7071-29-70436
>> Fax:   +49-7071-29-5091
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>