[Biojava-l] Mass Search Results

Michael Jones mjones@mpi.com
Fri, 04 Jan 2002 12:34:28 -0500


Dr. Old,

These are some great comments. I am interested to hear that you are working 
on an implementation of ProFound. Is there a reference somewhere to the 
algorithm?

At 11:03 AM 1/3/2002 -0700, William.Old@UCHSC.edu wrote:
>Dr. Jones,
>
>I think the time is ripe for biojava interfaces to peptide-mass
>fingerprinting search algorithms, and I am glad to see some interest
>brewing. Let me express my interest, and relate to the list some of the
>reasons. Based on what I have worked with from your previous JPAT library
>and the digestion classes in biojava, I am anxious to see what you come up
>with.
>
>I am currently working on an implementation of the ProFound algorithm,
>eventually to be used for in house peptide mass fingerprinting. The
>performance of my current implementation parallels the web-accessible
>version of ProFound at
>http://129.85.19.192/profound_bin/WebProFound.exe?FORM=1 in terms of speed
>and discriminability. I am designing it to allow the user to search over all
>possible post-translational modifications, with user-defined parameters that
>reduce the false-positive and false-negative rates which confound
>modification searching. Additional parameters include error tolerance, pI,
>protein MW, etc. Because of the size of the search space and the complexity
>of the Bayesian calculations, it is written entirely in ANSI C on Solaris,
>with only a command line interface and textual output. Eventually I would be
>interested in collaborating to design interfaces to the algorithm, but I am
>not sure how robust C/Java interfaces are.

I would like to see how Java performs on this algorithm. It may surprise 
us. Are you increasing the number of variable or fixed modifications? 
Mascot restricts the number of variable mods to 6 or so because 
after that the masses just start to overlap and false results become 
prominent. I have not thought about this a lot, but it sounds like a useful 
feature enhancement.
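
To illustrate why the number of variable modifications matters, here is a 
minimal Java sketch that counts how many modification states (and so 
candidate masses) a single peptide can have. It assumes each residue is 
eligible for at most one modification type; the class name, method name and 
the example peptide are hypothetical and are not taken from Mascot, ProFound 
or biojava.

```java
class VariableModCount {
    /**
     * Counts distinct modification states for one peptide.
     * sitesPerMod[i] is the number of residues eligible for variable
     * modification i; the peptide may carry 0..sitesPerMod[i] copies of it.
     * Assumption: no residue is eligible for more than one modification.
     */
    static long countStates(int[] sitesPerMod) {
        long states = 1;
        for (int sites : sitesPerMod) {
            states *= sites + 1L; // 0, 1, ..., sites copies of this mod
        }
        return states;
    }

    public static void main(String[] args) {
        // Hypothetical peptide: 2 Met (oxidation), 3 Ser/Thr/Tyr
        // (phosphorylation), 1 N-terminus (acetylation).
        int[] sites = {2, 3, 1};
        System.out.println(countStates(sites)); // 3 * 4 * 2 = 24
    }
}
```

Every extra variable modification multiplies the count again, so the number 
of theoretical masses that can fall within the error tolerance of an 
observed mass grows quickly.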

As far as C/Java interfaces go, I would recommend CORBA over JNI. I don't 
know a lot about JNI, but I think a CORBA solution would be more 
generalizable. It is my impression that JNI interfaces may be a little too 
tightly tied to a particular implementation, and I am not sure how system 
independent the marshaling is. Also, with a CORBA interface you would have 
all of the web-service-like support that comes with the technology. Of 
course, the interface would have to be written so that the CORBA calls and 
marshaling do not become a bottleneck. There is also the 
http://biocorba.org/ group, which has interfaces that we may be able to use.

Does anyone else have any thoughts on this?


>I haven't studied the SeqSimilaritySearchResult interface very much, but my
>first impression is that you may want to create a new interface to encompass
>the large amount of data returned from a typical search. In the web version
>of ProFound (as well as the others, like MSFit), the top hits are returned
>with MW, peptide hits, errors, peptide hit sequences, as well as the
>normalized probability score. The advantage of interfacing with an algorithm,
>as opposed to a results parser, would be the opportunity to capture more
>information. The scores from ProFound are displayed as normalized Bayesian
>probabilities which all add up to 1; however, in many cases, it is also
>useful to know the likelihood prior to normalization, as well as a number of
>other functions used in the probability calculation which are very useful in
>the identification process. Obtaining the likelihood for each hit allows one
>to compare scores across different searches, as well as doing statistical
>testing to estimate false-positive rates.  All in all, a clean interface to
>such an algorithm would be immensely useful, not just for doing single
>searches, but also for automation of thousands at a time, and for automated
>statistical testing as in:
>
>J. Eriksson, B.T. Chait, and D. Fenyö, "A Statistical Basis for Testing the
>Significance of Mass Spectrometric Protein Identification Results",
>Analytical Chemistry 72 (2000) 999-1005.
>
>I'm planning to publish the work soon, and subsequently would be interested
>in speaking with you about the design of such interfaces.

SeqSimilaritySearchResult has, among others, the following method:

List getHits();

The List contains SeqSimilaritySearchHit objects, which have methods to 
return a variety of scoring parameters (Score, PValue, EValue) and 
SeqSimilaritySearchSubHit objects for alignments (peptide hits could be 
used here). We could add the masses to the interface or just calculate them 
on the fly.
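
As a rough idea of what calculating the masses on the fly could involve, 
here is a minimal Java sketch that sums standard monoisotopic residue 
masses plus one water and compares the result with an observed mass. The 
class and method names are hypothetical and not part of biojava, and the 
ppm tolerance check is only one possible way to define a match.

```java
import java.util.HashMap;
import java.util.Map;

class PeptideMass {
    static final double WATER = 18.01056; // monoisotopic H2O
    private static final Map<Character, Double> RESIDUE = new HashMap<>();
    static {
        // Monoisotopic residue masses of the 20 standard amino acids (Da).
        RESIDUE.put('G', 57.02146);  RESIDUE.put('A', 71.03711);
        RESIDUE.put('S', 87.03203);  RESIDUE.put('P', 97.05276);
        RESIDUE.put('V', 99.06841);  RESIDUE.put('T', 101.04768);
        RESIDUE.put('C', 103.00919); RESIDUE.put('L', 113.08406);
        RESIDUE.put('I', 113.08406); RESIDUE.put('N', 114.04293);
        RESIDUE.put('D', 115.02694); RESIDUE.put('Q', 128.05858);
        RESIDUE.put('K', 128.09496); RESIDUE.put('E', 129.04259);
        RESIDUE.put('M', 131.04049); RESIDUE.put('H', 137.05891);
        RESIDUE.put('F', 147.06841); RESIDUE.put('R', 156.10111);
        RESIDUE.put('Y', 163.06333); RESIDUE.put('W', 186.07931);
    }

    /** Neutral monoisotopic mass of an unmodified peptide. */
    static double monoisotopicMass(String peptide) {
        double mass = WATER;
        for (char aa : peptide.toCharArray()) {
            Double m = RESIDUE.get(aa);
            if (m == null) {
                throw new IllegalArgumentException("Unknown residue: " + aa);
            }
            mass += m;
        }
        return mass;
    }

    /** True if observed is within tolerancePpm of the theoretical mass. */
    static boolean matches(double observed, double theoretical,
                           double tolerancePpm) {
        return Math.abs(observed - theoretical)
                <= theoretical * tolerancePpm / 1e6;
    }

    public static void main(String[] args) {
        double theoretical = monoisotopicMass("GG");
        System.out.println(matches(132.0540, theoretical, 10.0));
    }
}
```

Computing masses this way keeps them out of the result interface, at the 
cost of recomputing them each time a hit is inspected.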

It would be nice if SeqSimilaritySearchHit and SeqSimilaritySearchResult 
had some sort of map for miscellaneous results and error messages. Some of 
the complex scoring results you discuss above are probably too specific for 
any general interface definition, but they should still be included 
in the results. These types of results could be returned within a general 
properties map. I would like to try to use SeqSimilaritySearchResult if 
possible, just to help with cross talk as the Bio* stuff takes over the world.
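
A minimal Java sketch of the properties-map idea follows. The interface and 
class here (PropertyBearingHit, MassSearchHit) and the property key are 
hypothetical illustrations, not existing biojava types; a real version 
would presumably extend SeqSimilaritySearchHit rather than stand alone.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical hit type: one general score plus a free-form map. */
interface PropertyBearingHit {
    double getScore();

    /** Algorithm-specific results, keyed by name. */
    Map<String, Object> getProperties();
}

class MassSearchHit implements PropertyBearingHit {
    private final double score;
    private final Map<String, Object> properties;

    MassSearchHit(double score, Map<String, Object> properties) {
        this.score = score;
        // Defensive copy, exposed read-only so hits stay immutable.
        this.properties = Collections.unmodifiableMap(
                new LinkedHashMap<String, Object>(properties));
    }

    public double getScore() {
        return score;
    }

    public Map<String, Object> getProperties() {
        return properties;
    }
}

class PropertiesDemo {
    public static void main(String[] args) {
        Map<String, Object> extras = new LinkedHashMap<String, Object>();
        // Hypothetical key: the likelihood before normalization.
        extras.put("rawLikelihood", 1.2e-8);
        MassSearchHit hit = new MassSearchHit(0.97, extras);
        System.out.println(hit.getProperties().get("rawLikelihood"));
    }
}
```

The general interface stays small, while search-specific values such as an 
unnormalized likelihood can still travel with each hit.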



>--
>Will Old, Ph.D.
>Research Associate
>Center for Computational Pharmacology
>http://compbio.uchsc.edu/
>Univ. Colo. Health Sci. Center
>303-315-1102
>William.Old@UCHSC.edu
>
>
>
>-----Original Message-----
>From: Michael Jones [mailto:mjones@mpi.com]
>Sent: Wednesday, December 26, 2001 9:36 AM
>To: biojava-l@biojava.org
>Subject: [Biojava-l] Mass Search Results
>
>
>I am thinking about creating some biojava interfaces and implementations
>for peptide-mass fingerprint and peptide fragment mass searches of sequence
>databases. I would like to make it general enough so that it could be used
>to wrap some of the popular search tools. So I need to abstract out things
>like Scoring schemes.
>
>In general the input would be a set of masses (Protein and peptide or
>Fragments and Parent peptide), error tolerance and other filters. The
>output would be a set of proteins or nucleotide sequences along with their
>associated scores and possibly with the matches as features annotated onto
>the returned sequences.
>
>I have been looking at some of the Interfaces used for FastA searches but I
>am not sure that they are appropriate for the problem above. For Example
>the SearchBuilder has as one of its methods SeqSimilaritySearchResult
>makeSearchResult(). A SeqSimilaritySearchResult has a method
>getQuerySequence() that is not appropriate for the mass search problem.
>What do people think? Should I go ahead and use them and just ignore
>getQuerySequence() or should I create new interfaces? Perhaps I could just
>extend SeqSimilaritySearchResult and add a getQueryMassSet method or just
>use the same interface and just put the masses into the SearchParameters
>Map.
>
>Also, according to the documentation, these interfaces seem to be designed to
>handle parsing of results but not algorithm implementations. Are there
>other interfaces that may be more appropriate for search algorithm
>implementations?
>
>_______________________________________________
>Biojava-l mailing list  -  Biojava-l@biojava.org
>http://biojava.org/mailman/listinfo/biojava-l