[Biojava-dev] biojava 3 progress

Wed Mar 17 15:52:01 UTC 2010

Andy

I am working on it at the moment. I am starting with some code I have been using from JavaGene that has a fairly good handle on GFF parsing and on negative strands. I am migrating it to a new project called biojava3-genes (local only at the moment), which will hold code related to GFF parsing and to handling the output of various gene prediction programs. I need to create a training file for GlimmerHMM, so the short-term goal is to take an XML BLAST output of predicted genes that match UniProt and then extract the exon features from DNASequences that have had exon features added from a GFF file. I will then use these validated exon features to create the GlimmerHMM training file. Exon features, with negative strands, frame shifts, and the need to splice together a coding sequence, are probably the most complicated feature example we will encounter. After I get through that, I will see what can be extended or refactored for other, more generic features.
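
To make the negative-strand splicing concrete, here is a minimal sketch, not the biojava3-genes code. The names ExonSplicer, Exon and spliceCodingSequence are hypothetical, coordinates are assumed to be 1-based and inclusive as in GFF, and frame shifts (GFF phase) are deliberately left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration only; not part of any BioJava module.
final class ExonSplicer {

    /** Exon with 1-based, inclusive genomic coordinates (GFF convention). */
    static final class Exon {
        final int start;
        final int end;

        Exon(int start, int end) {
            if (start < 1 || end < start) {
                throw new IllegalArgumentException("bad exon: " + start + ".." + end);
            }
            this.start = start;
            this.end = end;
        }
    }

    /** Complement of a single base; anything unrecognized becomes 'N'. */
    static char complement(char base) {
        switch (Character.toUpperCase(base)) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            default:  return 'N';
        }
    }

    static String reverseComplement(String dna) {
        StringBuilder sb = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            sb.append(complement(dna.charAt(i)));
        }
        return sb.toString();
    }

    /**
     * Joins the exons in ascending genomic order, then reverse-complements
     * the result when the gene is on the '-' strand. Phase is ignored.
     */
    static String spliceCodingSequence(String genomic, List<Exon> exons, char strand) {
        List<Exon> sorted = new ArrayList<>(exons);
        sorted.sort(Comparator.comparingInt(e -> e.start));
        StringBuilder spliced = new StringBuilder();
        for (Exon e : sorted) {
            // Convert 1-based inclusive to Java's 0-based, end-exclusive range.
            spliced.append(genomic, e.start - 1, e.end);
        }
        return strand == '-' ? reverseComplement(spliced.toString()) : spliced.toString();
    }

    public static void main(String[] args) {
        List<Exon> exons = Arrays.asList(new Exon(7, 8), new Exon(1, 3));
        System.out.println(spliceCodingSequence("ACGTACGTAC", exons, '+')); // ACGGT
        System.out.println(spliceCodingSequence("ACGTACGTAC", exons, '-')); // ACCGT
    }
}
```

Sorting by genomic start and reverse-complementing once at the end gives the same result as reverse-complementing each exon and joining them in descending order, but keeps the strand logic in one place.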

I also have some code to gather genome characteristics (GC percent, average gene length, etc.) that can be included in the biojava3-genes module. Do you know how the average number of introns per gene is calculated when some genes have no introns? Do you add a 0 to the average for those genes, or only include genes with at least one intron?
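
The two candidate definitions can be put side by side in a small sketch. The class and method names are hypothetical, and the only assumption is that a gene with n exons has n - 1 introns. Which denominator a given publication uses is not something this code settles; it only shows how far apart the two numbers can be.

```java
// Hypothetical helper for illustration only; not part of any BioJava module.
final class IntronStats {

    /** Mean introns per gene over ALL genes; intronless genes count as 0. */
    static double meanOverAllGenes(int[] exonCountsPerGene) {
        if (exonCountsPerGene.length == 0) {
            return 0.0;
        }
        int introns = 0;
        for (int exons : exonCountsPerGene) {
            introns += Math.max(0, exons - 1);
        }
        return (double) introns / exonCountsPerGene.length;
    }

    /** Mean introns per gene over genes that have at least one intron. */
    static double meanOverIntronContainingGenes(int[] exonCountsPerGene) {
        int introns = 0;
        int genesWithIntrons = 0;
        for (int exons : exonCountsPerGene) {
            if (exons > 1) {
                introns += exons - 1;
                genesWithIntrons++;
            }
        }
        return genesWithIntrons == 0 ? 0.0 : (double) introns / genesWithIntrons;
    }

    public static void main(String[] args) {
        // Four genes with 1, 3, 5 and 1 exons, i.e. 0, 2, 4 and 0 introns.
        int[] exonCounts = {1, 3, 5, 1};
        System.out.println(meanOverAllGenes(exonCounts));              // 1.5
        System.out.println(meanOverIntronContainingGenes(exonCounts)); // 3.0
    }
}
```

Reporting both values, or at least stating the denominator, avoids the ambiguity.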

Can you think of a better name for a package that deals with GFF/GFF3 parsing and provides utilities for working with the inputs and outputs of various gene prediction programs?

Scooter

On Mar 17, 2010, at 11:28 AM, Andy Yates wrote:

> I think features are possible & this is really the missing piece of the puzzle with this project. How far on are you with them Scooter?
> 
> On 16 Mar 2010, at 20:58, Andreas Prlic wrote:
> 
>> Ok, cool. Thanks for all this state-of-the-art pushing there... Which parts do you think would be feasible to finish if we say we are planning a release for, e.g., early May? We can have a follow-up to this release once the next round of features has been added. It probably makes sense to focus on stabilizing and documenting what is currently there rather than trying to be feature-complete. Critical features that are still missing should be added, of course...
>> 
>> Andreas
>> 
>> On Tue, Mar 16, 2010 at 11:51 AM, Scooter Willis <HWillis at scripps.edu> wrote:
>> I am working on adding additional features to the core module to round things out and will be able to do docs/wiki examples. I will be working on Features with the new sequence model and, as an example, the ability to pull features from UniProt based on a UniProt ID. I will use the UniProt XML as the reference when working out the feature data model, so that classes have biological relevance instead of being completely abstract.
>> 
>> I will also see if I can do something with NCBI for genome sequence data, so that you don't need to download the entire sequence but can pull the DNA sequences for the exons of a particular gene based on GFF annotations.
>> 
>> I will also plan on migrating the sequence alignment code as well.
>> 
>> I think the focus for this release should be on the modularization and the Maven integration. We also need to provide a repository for those who are not going to use Maven and need just the jar files. We can then highlight the newer modules as a benefit of the modularization.
>> 
>> I am planning on attending ISMB/BOSC.
>> 
>> Do we want to put some deadlines in place with a mini-project plan?
>> 
>> Thanks
>> 
>> Scooter
>> 
>> 
>> On Mar 16, 2010, at 1:21 PM, Andy Yates wrote:
>> 
>>> It's getting ready very slowly. Currently we need:
>>> 
>>> * Locations correctly implemented
>>> ** There's no way of requesting subseqs from them at the moment
>>> * Feature on sequences support
>>> * Extra attributes which do not fit into top-level attributes
>>> * Mapping between sequences/assemblies
>>> * Circular location support
>>> ** so no checks on start being less than end
>>> * Documentation
>>> 
>>> Think that's it off the top of my head
>>> 
>>> Andy
>>> 
>>> On 16 Mar 2010, at 15:57, Andreas Prlic wrote:
>>> 
>>>> Hi,
>>>> 
>>>> ISMB/BOSC is coming up rapidly and we should start to prepare for the annual
>>>> BioJava release. As such it would be a good moment to discuss the current
>>>> status of the various new BioJava 3 modules.
>>>> 
>>>> The biojava-structure, biojava-structure-gui modules are essentially ready
>>>> for release and I started to update the Cookbook with the latest features
>>>> http://biojava.org/wiki/BioJava:CookBook:PDB:align
>>>> 
>>>> Some of the re-factored modules based on biojava 1.7 could be released
>>>> anytime soon as well. The documentation just needs to be updated to explain
>>>> where the functionality can be found now (e.g. alignment module)
>>>> 
>>>> What about the new code that has been under development since the hackathon?
>>>> Is it getting release ready slowly? Any plans for documentation? What is
>>>> missing before we can make the first Biojava 3 release?
>>>> 
>>>> Andreas
>>>> _______________________________________________
>>>> biojava-dev mailing list
>>>> biojava-dev at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>> 
>>> --
>>> Andrew Yates                   Ensembl Genomes Engineer
>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>> 
>> 
>> 
> 