[Biojava-l] Interested in the "cloudization" of BioJava

Thu Apr 5 14:39:12 UTC 2012

Hello biojava again,

After giving some thought to the possible ways of applying cloudization to
BioJava modules, I have identified some possibilities:
1) The first one, and the one I find most interesting, is to introduce the
map-reduce framework to speed up the pairwise alignment step in the
creation of a multiple sequence alignment. I see that BioJava implements
the CLUSTAL algorithm, and I have some experience with MSA programs: the
pairwise alignment is known to be the most demanding part of this
algorithm as the number of sequences increases, since n sequences require
n(n-1)/2 pairwise alignments. This map-reduce version of the all-to-all
sequence alignment could also be reused in the future if other progressive
alignment algorithms are implemented (maybe T-Coffee or others).

2) If the input files are big enough, it could be interesting to parse
them on a distributed infrastructure to speed up the process. In this case
the map-reduce framework would parallelize the parsing by splitting the
input file into several chunks and parsing the sequences in each chunk.

3) Another idea is to build a Hadoop-based version of BLAST, in which the
input file is also split and, for each sequence in a chunk, the node
performs a local BLAST query. Since BioJava doesn't implement BLAST yet
(which I see is another GSoC project), this idea would require a wrapper
that executes the NCBI BLAST program and then joins the results.
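
To make idea 1 a bit more concrete, here is a rough sketch in plain Java
(no Hadoop dependency) of how the all-to-all step could be cut into
independent map tasks. Everything in it is an assumption made for
illustration: the class and method names are hypothetical, and
placeholderScore is only a stand-in for a real pairwise aligner, not
BioJava or Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: names and scoring are illustrative only,
// not part of BioJava or Hadoop.
class AllPairsSketch {

    // Enumerate every unordered pair (i, j) with i < j.
    // n sequences give n * (n - 1) / 2 pairs.
    static List<int[]> pairs(int n) {
        List<int[]> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                out.add(new int[] {i, j});
            }
        }
        return out;
    }

    // Split the pair list into at most `tasks` contiguous chunks,
    // one chunk per (hypothetical) map task.
    static List<List<int[]>> partition(List<int[]> pairs, int tasks) {
        if (tasks <= 0) {
            throw new IllegalArgumentException("tasks must be positive");
        }
        List<List<int[]>> chunks = new ArrayList<>();
        int size = (pairs.size() + tasks - 1) / tasks; // ceiling division
        for (int start = 0; start < pairs.size(); start += size) {
            int end = Math.min(start + size, pairs.size());
            chunks.add(pairs.subList(start, end));
        }
        return chunks;
    }

    // Stand-in for a real pairwise alignment score (assumption):
    // counts identical characters at the same position.
    static int placeholderScore(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int same = 0;
        for (int k = 0; k < n; k++) {
            if (a.charAt(k) == b.charAt(k)) {
                same++;
            }
        }
        return same;
    }

    // "Map": score every pair of a chunk independently.
    // "Reduce": collect the scores into one symmetric matrix.
    static int[][] scoreAll(List<String> seqs, int tasks) {
        int n = seqs.size();
        int[][] matrix = new int[n][n];
        // Each chunk is independent, so it could run on a different node.
        for (List<int[]> chunk : partition(pairs(n), tasks)) {
            for (int[] p : chunk) {
                int s = placeholderScore(seqs.get(p[0]), seqs.get(p[1]));
                matrix[p[0]][p[1]] = s;
                matrix[p[1]][p[0]] = s;
            }
        }
        return matrix;
    }
}
```

Calling AllPairsSketch.scoreAll(sequences, numberOfTasks) runs the chunks
one after another here. In a real Hadoop job each chunk would be the input
of one mapper, and the reducer would assemble the matrix that the guide
tree construction needs.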

Thanks in advance for your feedback, which I'm hoping to receive before
submitting my proposal.
Best regards!

On Fri, Mar 30, 2012 at 6:35 PM, Arthur Oviedo <arthur.oviedo at epfl.ch> wrote:

> Hello,
> My name is Arthur, and I'm a master's student in computer science at EPFL
> (École Polytechnique Fédérale de Lausanne).
> I have worked on several projects that are somewhat related to BioJava
> and cloud environments.
> While I was a research assistant, I worked (briefly) on a project called
> UnaCloud (
> http://sistemas.uniandes.edu.co/~unacloud/dokuwiki/doku.php?id=recursos:documentacion)
> which provides an opportunistic grid/cloud infrastructure for running
> scientific experiments. We have used it to help bioinformaticians with
> their different jobs, such as huge BLAST queries, HMMER jobs, etc.
> As part of my assistant work at the same university, I developed a cool
> system called UnaCloud MSA, which integrates some existing and newly
> developed tools to analyze multiple sequence alignments. It even uses the
> BioJava library to perform some verification of the sequences. All of
> this is also done using the UnaCloud infrastructure. This work is still
> in development and in preparation for publication.
> http://unacloudmsa.uniandes.edu.co
> Currently, I'm working on a class project (an implementation of a subset
> of the functionality of a database management system) using the Hadoop
> (map-reduce) framework.
> All of the mentioned projects have been implemented in Java, so I suppose
> that I meet the Java expertise requirement.
> I would like to know more about this project, and also the rough dates
> when the Google Summer of Code will be held (to prepare my schedule).
> Thanks and best regards,
> Arthur Oviedo
>