[Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava alignment lead"]

Andreas Prlic andreas at sdsc.edu
Mon Apr 19 04:14:24 UTC 2010


Hi Jitesh,

BioJava is an open source project whose goal is to support bioinformatics
applications. While we are always happy about any contribution, be it
documentation, bug fixes or email support on the mailing list, for a
research-related project it is probably easier to team up with your local
university and do an internship there.

Andreas



On Sat, Apr 17, 2010 at 6:34 AM, jitesh dundas <jbdundas at gmail.com> wrote:

> Dear Sir,
>
> Could anyone tell me where I could start? Is there any lead who might need
> my help with software development and research-oriented aspects?
>
> Any comments on my previous emails would be most welcome...
>
> Regards,
> Jitesh Dundas
>
>
> On 4/8/10, Andreas Dräger <andreas.draeger at uni-tuebingen.de> wrote:
> >
> > Hi all,
> >
> > This e-mail is just for your information about somebody new who'd like to
> > contribute to our project.
> >
> > Cheers
> > Andreas
> >
> >
> > Subject: Re: Fwd: Proposing a project on "Biojava alignment lead"
> > From: Andreas Dräger <andreas.draeger at uni-tuebingen.de>
> > Date: Wed, 07 Apr 2010 09:27:13 +0200
> > To: Cai Shaojiang <caishaojiang at gmail.com>
> >
> > Hi Cai Shaojiang,
> >
> > Thank you for your e-mail! I don't know what happened to the e-mail list.
> > Sometimes it takes a while due to the spam filters, I guess.
> >
> > > I am a PhD student at the National University of Singapore. My major
> > > research area is local alignment algorithms and data structures for SNP
> > > identification. I have used Java and Eclipse for years for software
> > > development. I am very interested in your GSoC programme. I found that
> > > there is a module called "biojava-alignment lead" whose mentor is you. I
> > > want to propose a new project on this module, and I have several
> > > questions about it.
> >
> > Yes, that's me. So great to get your support.
> >
> > > 1. It seems that pairwise alignment finds similarity between two short
> > > sequences. The existing pairwise alignment is based on dynamic
> > > programming; is it the Smith-Waterman algorithm?
> >
> > So, currently, BioJava contains three different alignment approaches.
> > There are two deterministic algorithms, i.e., Smith-Waterman for local
> > alignment and Needleman-Wunsch for global alignment. Third, there is the
> > possibility to apply Hidden Markov Models for alignment. An example of the
> > latter approach should be in the cookbook.
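
To illustrate the local-alignment approach named above, the following is a
minimal Smith-Waterman scoring sketch in Java. It is not BioJava's
implementation: the class name SmithWatermanSketch, the scoring values, and
the linear gap penalty are all assumptions chosen for illustration, and the
sketch only computes the best score without a traceback.

```java
/** Hypothetical illustration class; not part of BioJava. */
final class SmithWatermanSketch {
    // Illustrative scoring values (assumptions, not BioJava defaults).
    static final int MATCH = 2;
    static final int MISMATCH = -1;
    static final int GAP = -2;

    /** Returns the best local alignment score of a and b (linear gap penalty). */
    static int score(String a, String b) {
        // h[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
        int[][] h = new int[a.length() + 1][b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? MATCH : MISMATCH;
                int diag = h[i - 1][j - 1] + sub; // align the two residues
                int up = h[i - 1][j] + GAP;       // gap in b
                int left = h[i][j - 1] + GAP;     // gap in a
                // Clamping at zero is what makes the alignment local, not global.
                h[i][j] = Math.max(0, Math.max(diag, Math.max(up, left)));
                best = Math.max(best, h[i][j]);
            }
        }
        return best;
    }
}
```

Dropping the clamp at zero and initialising the first row and column with
accumulated gap penalties would turn the same recurrence into the global
(Needleman-Wunsch) variant.
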
> >
> > > 2. What is the exact task of "refactoring of underlying data structures"?
> >
> > Yes, this is something I already did last week, but it could still be
> > improved. The problem was that the alignment algorithms actually produced
> > a kind of string that looks similar to the output of BLAST. This string
> > contained the score, the computation time, the length of the alignment,
> > etc. People, however, wanted to perform higher-level computation on the
> > score value or evaluate some other information. Now, the alignment will
> > produce a data structure that contains all the information and can, in
> > addition to that, also produce such a BLAST-like output. There is,
> > however, still the following problem: The data structure requires both
> > sequences in the pairwise alignment to have an identical length. In case
> > of local alignment this is especially stupid (actually), because gaps are
> > inserted to fill the sequences. And then the data structure tries to keep
> > the old sequence coordinates, with the effect that the numbers "query
> > start", "query end", "subject start", and "subject end" are required to
> > shift the sequences against each other when displaying the output. So, you
> > cannot easily print the sequences one below the other; you first have to
> > shift them. Please check out the latest version of this package via
> > anonymous svn and have a look ;-)
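
The idea described above, a result object that keeps the score and
coordinates as fields and renders the BLAST-like text only on request, could
be sketched as follows. This is not the class in the BioJava trunk: the name
PairwiseResultSketch, its fields, and the exact text layout are assumptions
made for illustration.

```java
/** Hypothetical result holder; not the BioJava data structure. */
final class PairwiseResultSketch {
    final String alignedQuery;   // aligned row, '-' marks a gap
    final String alignedSubject; // aligned row, '-' marks a gap
    final int score;
    final int queryStart;        // 1-based start in the original query
    final int subjectStart;      // 1-based start in the original subject
    final long millis;           // computation time

    PairwiseResultSketch(String alignedQuery, String alignedSubject, int score,
                         int queryStart, int subjectStart, long millis) {
        if (alignedQuery.length() != alignedSubject.length()) {
            throw new IllegalArgumentException("aligned rows must have equal length");
        }
        this.alignedQuery = alignedQuery;
        this.alignedSubject = alignedSubject;
        this.score = score;
        this.queryStart = queryStart;
        this.subjectStart = subjectStart;
        this.millis = millis;
    }

    int length() { return alignedQuery.length(); }

    private boolean isIdentity(int i) {
        char q = alignedQuery.charAt(i);
        return q != '-' && q == alignedSubject.charAt(i);
    }

    int identities() {
        int n = 0;
        for (int i = 0; i < length(); i++) if (isIdentity(i)) n++;
        return n;
    }

    private static int residues(String row) {
        int n = 0;
        for (int i = 0; i < row.length(); i++) if (row.charAt(i) != '-') n++;
        return n;
    }

    /** Renders a BLAST-like text view; the values stay available as fields. */
    String toBlastLikeString() {
        StringBuilder mid = new StringBuilder();
        for (int i = 0; i < length(); i++) mid.append(isIdentity(i) ? '|' : ' ');
        int qEnd = queryStart + residues(alignedQuery) - 1;
        int sEnd = subjectStart + residues(alignedSubject) - 1;
        return String.format("Score: %d, Length: %d, Identities: %d/%d, Time: %d ms%n",
                       score, length(), identities(), length(), millis)
             + String.format("Query: %5d %s %d%n", queryStart, alignedQuery, qEnd)
             + String.format("       %5s %s%n", "", mid)
             + String.format("Sbjct: %5d %s %d%n", subjectStart, alignedSubject, sEnd);
    }
}
```

With such an object, code that needs the score reads the field directly
instead of parsing it out of the printed report.
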
> >
> > > 3. My current research aims at aligning short reads (10s~100s bp)
> > > against extremely long sequences (e.g., the human genome). As far as I
> > > know, no such alignment tool has been implemented in Java. Would you
> > > consider this direction?
> >
> > See, this would be very nice to include. But this requires that we no
> > longer fill the short sequence with many, many gap symbols (just a waste
> > of memory), but improve the data structure. There is already an
> > UnequalLengthAlignment (just a data structure, no algorithm) and I think
> > we could use this as a starting point. Then your algorithm should only
> > produce such a data structure and this would be fine.
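
The memory argument above can be made concrete with a small sketch that
stores a short read together with its offset in the long sequence, instead
of padding the read with gap symbols to the full reference length. This is
not the existing BioJava data structure: the class ReadPlacementSketch and
its methods are hypothetical, and the sketch assumes an ungapped placement
for simplicity.

```java
import java.util.Arrays;

/** Hypothetical illustration class; not part of BioJava. */
final class ReadPlacementSketch {
    final String reference; // long sequence, shared rather than copied
    final String read;      // short read, stored without padding
    final int offset;       // 0-based position in reference where the read starts

    ReadPlacementSketch(String reference, String read, int offset) {
        if (offset < 0 || offset + read.length() > reference.length()) {
            throw new IllegalArgumentException("read does not fit at offset " + offset);
        }
        this.reference = reference;
        this.read = read;
        this.offset = offset;
    }

    /** 1-based, inclusive coordinates on the reference. */
    int subjectStart() { return offset + 1; }
    int subjectEnd() { return offset + read.length(); }

    /** Counts mismatching positions inside the placed window. */
    int mismatches() {
        int n = 0;
        for (int i = 0; i < read.length(); i++) {
            if (read.charAt(i) != reference.charAt(offset + i)) n++;
        }
        return n;
    }

    /** Gap-padded row built only on demand, e.g. for display. */
    String paddedRead() {
        char[] row = new char[reference.length()];
        Arrays.fill(row, '-');
        read.getChars(0, read.length(), row, offset);
        return new String(row);
    }
}
```

Storage per placement is the read plus one integer, independent of the
reference length; the padded row exists only while it is being printed.
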
> >
> > > 4. It seems that the existing tools just lack some refactoring and
> > > representation interfaces. Are there any more underlying tasks?
> >
> > Hm. Yes: With the release of BioJava 3, data structures have changed
> > again. So maybe there's also some adaptation to the new structure required.
> >
> > > I have been keeping an eye on GSoC since last month, but I am sorry to
> > > find out that I sent the initial email to the mailing list before I
> > > subscribed to it...
> >
> > Ok. Sounds good. Thanks for your interest. So I suggest: Download the
> > latest trunk, have a look, play around and if you can improve something
> > we'll put it into the trunk and write your name into the authors' tag.
> >
> > Cheers
> > Andreas
> >
> > --
> > Dipl.-Bioinform. Andreas Dräger
> > Eberhard Karls University Tübingen
> > Center for Bioinformatics (ZBIT)
> > Sand 1
> > 72076 Tübingen
> > Germany
> >
> > Phone: +49-7071-29-70436
> > Fax:   +49-7071-29-5091
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
>



-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------
