[Biojava-dev] Introducing the GSoC project

Peter Rose pwrose at ucsd.edu
Wed May 11 21:06:32 UTC 2011


Hi Scooter,

If this type of flexibility is required, I suggest splitting the Element class into two classes:

1. Element (an enum that contains only immutable data such as atomic number, valence electrons, etc.)

2. ElementData (contains data sets loaded from an XML file)
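A minimal sketch of how this split might look. All names, fields, and method signatures here are assumptions for illustration and are not the actual BioJava Element class; the XML loading itself is left out, and double is used only for brevity (the numeric type is discussed elsewhere in this thread).

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch, not the actual BioJava class: the enum holds only
// fixed, well-defined properties of each element.
enum Element {
    H(1, 1), C(6, 4), N(7, 5), O(8, 6), S(16, 6);

    private final int atomicNumber;
    private final int valenceElectronCount;

    Element(int atomicNumber, int valenceElectronCount) {
        this.atomicNumber = atomicNumber;
        this.valenceElectronCount = valenceElectronCount;
    }

    int getAtomicNumber() { return atomicNumber; }
    int getValenceElectronCount() { return valenceElectronCount; }
}

// Hypothetical companion class: values that depend on the chosen data set
// (for example covalent radii) are stored here and could be filled in by
// an XML loader, which is omitted from this sketch.
final class ElementData {
    private final Map<Element, Double> covalentRadii = new EnumMap<>(Element.class);

    void setCovalentRadius(Element element, double radiusInAngstrom) {
        covalentRadii.put(element, radiusInAngstrom);
    }

    double getCovalentRadius(Element element) {
        Double radius = covalentRadii.get(element);
        if (radius == null) {
            throw new IllegalArgumentException("No covalent radius loaded for " + element);
        }
        return radius;
    }
}
```

With this layout, swapping in a different data set only touches ElementData, while code that needs just the atomic number keeps working against the enum.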

For the mono-isotopic mass, I suggest adding the number of neutrons to the descriptions. Some mass spec analyses may require other isotopes, such as 13C and 34S, in addition to the most abundant ones. The only other data that I'm currently using are the covalent radii. These could be added to the XML file as well, taken from the resource I mentioned earlier in this email thread.
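As a rough illustration of why the neutron count is enough to tell isotopes apart, here is a small sketch. The Isotope class and its methods are hypothetical and do not correspond to anything in BioJava or in the XML file; masses are deliberately left out.

```java
// Hypothetical sketch: an isotope identified by its element symbol,
// atomic number (protons), and neutron count.
final class Isotope {
    private final String symbol;
    private final int atomicNumber;
    private final int neutronCount;

    Isotope(String symbol, int atomicNumber, int neutronCount) {
        if (atomicNumber < 1 || neutronCount < 0) {
            throw new IllegalArgumentException("Invalid proton or neutron count");
        }
        this.symbol = symbol;
        this.atomicNumber = atomicNumber;
        this.neutronCount = neutronCount;
    }

    // Mass number = protons + neutrons.
    int getMassNumber() { return atomicNumber + neutronCount; }

    // Conventional label such as "13C" or "34S".
    String getLabel() { return getMassNumber() + symbol; }
}
```

For example, carbon with 7 neutrons gives the label "13C", and sulfur with 18 neutrons gives "34S".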

I think data for amino acids and other compounds should be handled in a separate XML file.

-Peter



From: Scooter Willis [mailto:HWillis at scripps.edu]
Sent: Wednesday, May 11, 2011 1:11 PM
To: Peter Rose; Andreas Prlic
Cc: Chuan Hock Koh; Sergio Pulido; biojava-dev
Subject: Re: [Biojava-dev] Introducing the GSoC project

Peter

I have attached the XML file I put together for an internal software project I developed for finding peptides in mass spec data. I wanted reasonable flexibility in the masses, given ever-increasing "high resolution" mass spec equipment, and the ability to handle a dynamic list of PTMs that the user could change. If we had a Java interface to model the data elements described in the XML, then implementations of that interface could load the actual data from the XML, a flat file, or a user's choice via an interface.

In the XML you will see that I give the composition of each amino acid within the context of a peptide, so that child modifications can be easily reflected in the PTM based on the changes to the amino acid. A key challenge is the string representation of a PTM, so that you can create a protein from a string and still do an MSA with a protein that doesn't have the PTM. The notation I was using in the lab that seems to work is DE[pS]DE[pT], with different examples detailed in the XML. If we go this notational route, then when you create a Protein from a string it is easy to parse into the correct internal representation, and you could then do an MSA of ProteinSequences.
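A minimal sketch of how the bracket notation might be parsed. The class names are hypothetical, and the rule that the last character inside the brackets is the amino acid letter while everything before it is the modification tag is an assumption read off the DE[pS]DE[pT] example; the full set of cases in the attached XML is not covered.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical value class: one residue plus an optional modification tag.
final class ModifiedResidue {
    final char aminoAcid;
    final String modification; // empty string when the residue is unmodified

    ModifiedResidue(char aminoAcid, String modification) {
        this.aminoAcid = aminoAcid;
        this.modification = modification;
    }
}

// Hypothetical parser for strings such as "DE[pS]DE[pT]".
final class PtmNotationParser {

    static List<ModifiedResidue> parse(String sequence) {
        List<ModifiedResidue> residues = new ArrayList<>();
        int i = 0;
        while (i < sequence.length()) {
            char c = sequence.charAt(i);
            if (c == '[') {
                int close = sequence.indexOf(']', i);
                if (close < 0 || close - i < 2) {
                    throw new IllegalArgumentException("Malformed bracket at index " + i);
                }
                // Assumption: last character is the residue, the rest is the tag.
                String token = sequence.substring(i + 1, close);
                char aminoAcid = token.charAt(token.length() - 1);
                String tag = token.substring(0, token.length() - 1);
                residues.add(new ModifiedResidue(aminoAcid, tag));
                i = close + 1;
            } else if (c == ']') {
                throw new IllegalArgumentException("Unmatched ']' at index " + i);
            } else {
                residues.add(new ModifiedResidue(c, ""));
                i++;
            }
        }
        return residues;
    }

    // Drops the modification tags so the sequence can be aligned with
    // unmodified proteins.
    static String toPlainSequence(List<ModifiedResidue> residues) {
        StringBuilder plain = new StringBuilder();
        for (ModifiedResidue residue : residues) {
            plain.append(residue.aminoAcid);
        }
        return plain.toString();
    }
}
```

Parsing "DE[pS]DE[pT]" this way yields six residues, two of them tagged "p", and the plain sequence "DESDET" that an MSA could use directly.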

The attached file is an example of one possible way to model the data attributes of the amino acids. It probably doesn't make sense to have everything in one model, as that makes it harder to change a class of attributes.

Thanks

Scooter



From: Peter Rose <pwrose at ucsd.edu>
Date: Wed, 11 May 2011 13:09:35 -0400
To: Peter Rose <pwrose at ucsd.edu>, Andreas Prlic <andreas at sdsc.edu>
Cc: biojava-dev <biojava-dev at lists.open-bio.org>, Peter Rose <pwrose at ucsd.edu>, Chuan Hock Koh <kohchuanhock at gmail.com>, Scooter Willis <hwillis at scripps.edu>, Sergio Pulido <spulido99 at gmail.com>
Subject: RE: [Biojava-dev] Introducing the GSoC project

I just talked to a mass spec. guy here at UCSD, and he recommends the NIST data (averaged and mono-isotopic masses of the elements): http://physics.nist.gov/cgi-bin/Compositions/stand_alone.pl?ele=&ascii=html&isotype=some as the authoritative resource for these data.

-Peter

From: Peter Rose [mailto:pwrose at sdsc.edu]
Sent: Monday, May 09, 2011 1:03 PM
To: Andreas Prlic
Cc: biojava-dev; Rose, Peter; Chuan Hock Koh; Scooter Willis; Sergio Pulido
Subject: RE: [Biojava-dev] Introducing the GSoC project

A few more comments on the Element class:


1. We are using the covalent radii in the protmod package. The covalent radii came from a variety of sources and are not documented. To have a consistent and up-to-date set of data, I suggest we use the data from: Beatriz Cordero, Verónica Gómez, Ana E. Platero-Prats, Marc Revés, Jorge Echeverría, Eduard Cremades, Flavia Barragán and Santiago Alvarez, in "Covalent radii revisited", Dalton Trans., 2008, [DOI: 10.1039/b801115j].

2. The Element class contains a few other attributes that should be removed for now, until they are used somewhere and have been properly checked/updated. This includes the following attributes: minimumValence, maximumValence, commonValence, maximumCommonValence, and oxidationState. Since many elements have multiple oxidation states, etc., I think these values are of limited use. The only reason they are there is that I used this class in another application, but they may not be useful in the context of BioJava.

3. hillOrder, valenceElectronCount, and coreElectronCount are well defined and can stay. The source for paulingElectronegativity is mentioned in the code, so I think that can stay as is as well.

-Peter

From: Sergio Pulido [mailto:spulido99 at gmail.com]
Sent: Monday, May 09, 2011 12:49 PM
To: Andreas Prlic
Cc: biojava-dev; Rose, Peter; Chuan Hock Koh; Scooter Willis
Subject: Re: [Biojava-dev] Introducing the GSoC project


2. The precision of some of the numbers may exceed float. If so, should we use double?

I would suggest using BigDecimal, given that it will be used for very precise calculations; no one wants Java floating-point error causing damage.
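A small sketch of the difference being described. The class and method names are made up for illustration, and the inputs 0.1 and 0.2 are generic decimals rather than real element masses. Note that BigDecimal values are built from strings here; constructing them from double literals would carry the binary rounding error along.

```java
import java.math.BigDecimal;

// Hypothetical demo class comparing binary floating point with BigDecimal.
final class MassSumDemo {

    // Sums values with double arithmetic; decimal fractions such as 0.1
    // are not exactly representable, so small errors can appear.
    static double sumDouble(double[] values) {
        double total = 0.0;
        for (double value : values) {
            total += value;
        }
        return total;
    }

    // Sums decimal strings exactly using BigDecimal.
    static BigDecimal sumExact(String[] values) {
        BigDecimal total = BigDecimal.ZERO;
        for (String value : values) {
            total = total.add(new BigDecimal(value));
        }
        return total;
    }
}
```

Summing 0.1 and 0.2 with sumDouble does not compare equal to 0.3, while sumExact on "0.1" and "0.2" compares equal to new BigDecimal("0.3"). The trade-off is that BigDecimal arithmetic is slower and more verbose, so whether it is needed depends on how much precision the mass calculations actually require.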



