[Bioperl-l] A perl regex query

Stéphane Téletchéa stephane.teletchea at jouy.inra.fr
Tue Sep 18 13:48:05 UTC 2007


neeti somaiya wrote:
> My actual problem is a bit more complicated.
> It is not just one string, but lakhs of them; they are actually names of
> chemical compounds.
> 
> The problem is that there are 2 different data sources. I need to match the
> compound names between them, but although the compound may be
> the same in the two, they use different naming formats for it.
> 
> eg 1 : Glucose
> DB1 : D-glucose
> DB2 : alpha-D-Glucose
> 
> eg2 : 2,3-bisphosphoglycerate
> DB1 : Cyclic-2,3-bisphospho-D-Glycerate
> DB2 : 2,3 bisphoshpglycerate
> 
> And these are some simple examples; there are even more complicated ones,
> with many digits, alphas, betas, hyphens, S, R, cis, trans, etc.
> 
> I just want to see if the basic compound is the same, i.e. the first one will
> be glucose and the second one will be 2,3-bisphosphoglycerate (can't take just
> bisphosphoglycerate because 1,3-bisphosphoglycerate would mean something
> else).
> 
> Does anyone have any suggestions on how to tackle this?
> 

I would use a two-step approach:
1 - filter the entries using a convention: for instance, translate all '+' 
into their literal equivalent 'plus', replace spaces with '_', replace all 
'-' with '_' as well, etc.
2 - try matching the results; if the match fails, try matching on a subset 
of the characters (for instance, remove all non-alphabetical 
characters and see whether the resulting strings match).
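
The two steps above could be sketched roughly as follows. This is only an 
illustration, written in Python for brevity (the same regular expressions 
work in Perl). The function names normalize, letters_only and match_level 
are made up for this example, and lowercasing the names is an extra 
assumption that is not part of the convention described above.

```python
import re


def normalize(name):
    """Step 1: rewrite a compound name using one fixed convention."""
    s = name.strip().lower()        # assumption: case is not significant
    s = s.replace("+", "plus")      # '+' becomes its literal equivalent
    s = re.sub(r"[\s-]+", "_", s)   # runs of spaces/hyphens become one '_'
    return s


def letters_only(name):
    """Step 2 fallback: drop every character that is not an ASCII letter."""
    return re.sub(r"[^a-z]", "", name.lower())


def match_level(a, b):
    """Return which step matched the two names, or None if neither did."""
    if normalize(a) == normalize(b):
        return "normalized"
    if letters_only(a) == letters_only(b):
        return "letters-only"
    return None


if __name__ == "__main__":
    pairs = [
        ("2,3 bisphosphoglycerate", "2,3-Bisphosphoglycerate"),
        ("2,3-bisphosphoglycerate", "2,3bisphosphoglycerate"),
        ("1,3-bisphosphoglycerate", "2,3-bisphosphoglycerate"),
        ("D-glucose", "alpha-D-Glucose"),
    ]
    for a, b in pairs:
        print(a, "|", b, "|", match_level(a, b))
```

Note that the letters-only fallback is deliberately loose: it also reports 
1,3-bisphosphoglycerate and 2,3-bisphosphoglycerate as a match, which is 
exactly the confusion mentioned in the question, so hits at that level 
should be treated as candidates to review rather than as confirmed matches. 
Prefixes such as 'alpha' or 'D' are not handled at all in this sketch.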

That's the theory. Now you will need some time for trial and error, but I 
think there is no easy, one-shot solution, nor a BioPerl facility 
for handling (bio)chemical compounds.

Cheers,
Stéphane

-- 
Stéphane Téletchéa, PhD.                  http://www.steletch.org
Unité Mathématique Informatique et Génome http://migale.jouy.inra.fr/mig
INRA, Domaine de Vilvert                  Tél : (33) 134 652 891
78352 Jouy-en-Josas cedex, France         Fax : (33) 134 652 901


