[Bioperl-l] A perl regex query
Stefan Kirov
stefan.kirov at bms.com
Tue Sep 18 13:05:16 UTC 2007
neeti somaiya wrote:
> My actual problem is a bit more complicated.
> It is not just one string, but lakhs of them; they are actually names of
> chemical compounds.
>
> The problem is that there are 2 different data sources and I need to match
> the compound names between them, but although a compound may be the same in
> both, the two sources use different naming formats for it.
>
> eg 1 : Glucose
> DB1 : D-glucose
> DB2 : alpha-D-Glucose
>
> eg 2 : 2,3-bisphosphoglycerate
> DB1 : Cyclic-2,3-bisphospho-D-Glycerate
> DB2 : 2,3 bisphosphoglycerate
>
It seems to me you are trying to match two collections of chemical
compounds. To do this reliably you need to compare structures rather than
names, using canonical SMILES (there may be other solutions, but I am not
aware of them). There are many resources for that, including open-source ones:
http://openbabel.sourceforge.net/wiki/Main_Page
This is not really bioperl's cup of tea; it is much more a
cheminformatics problem. I am not sure whether there is a need for bioperl
to be extended this way. Any thoughts on that?
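
If structures are not available for every record and names are all you have,
a rough normalisation pass can at least shortlist candidate pairs to check by
hand. The following is only a sketch under that assumption: name_key and its
list of descriptors to drop are made up for illustration (they come from
neither database nor from bioperl), and it will not catch synonyms or
misspellings.

```perl
use strict;
use warnings;

# Heuristic only: reduce a compound name to a rough comparison key.
# The descriptor list is an illustrative assumption, not a complete or
# chemically rigorous set.
sub name_key {
    my ($name) = @_;
    my $key = lc $name;
    # Drop stereo/form descriptors, but only when they stand alone as a
    # token (not when they are part of a longer word or number).
    $key =~ s/(?<![a-z0-9])(?:alpha|beta|cyclic|cis|trans|d|l|r|s)(?![a-z0-9])//g;
    # Remove everything except letters, digits and commas, so that
    # locants such as "2,3" survive and stay distinct from "1,3".
    $key =~ s/[^a-z0-9,]//g;
    return $key;
}

# Example names in the style of the two sources quoted in this thread.
my @db1 = ('D-glucose', 'Cyclic-2,3-bisphospho-D-Glycerate');
my @db2 = ('alpha-D-Glucose', '2,3 bisphosphoglycerate', '1,3-bisphosphoglycerate');

# Index the first source by key, then look up each name from the second.
my %by_key;
push @{ $by_key{ name_key($_) } }, $_ for @db1;

for my $name (@db2) {
    my $hits = $by_key{ name_key($name) } or next;
    print "$name <=> $_\n" for @$hits;
}
```

Here 1,3-bisphosphoglycerate stays unmatched because the locants are kept in
the key. Anything paired this way is a candidate only and should still be
confirmed against the structures.
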
Hope this helps, regards
Stefan
> And these are some simple examples; there are even more complicated ones,
> with many digits, alphas, betas, hyphens, S, R, cis, trans etc.
>
> I just want to see if the basic compound is the same, i.e. the first one will
> be glucose and the second one will be 2,3-bisphosphoglycerate (I can't take
> just bisphosphoglycerate because 1,3-bisphosphoglycerate would mean
> something else).
>
> Does anyone have any suggestions on how to tackle this?
>
> Thanks.
>
> On 9/18/07, Spiros Denaxas <spiros at lokku.com> wrote:
>
>> It's not impossible, you just have to use \b to denote the word
>> boundaries :)
>>
>> echo 'this-is-a_test-D-string-D' | perl -ne ' s/\b\-D\-\b//g ; print ;'
>>
>> this-is-a_teststring-D
>>
>> It only gets rid of -D- , all other occurrences of D and - remain intact.
>>
>> Spiros
>>
>>
>> On 9/18/07, neeti somaiya <neetisomaiya at gmail.com> wrote:
>>
>>> Thanks.
>>> It might work, but not always, because the string could be something like
>>> Cyclic-2,3-Bisphospho-D-Glycerate.
>>> Here I will first convert the full thing to lower case and would then try
>>> to get what I want.
>>>
>>> Nothing seems to work: when I try to substitute -D- with nothing, "D" and
>>> "-" occurring separately also get substituted with nothing.
>>>
>>> On 9/18/07, Roy Chaudhuri <rrc22 at cam.ac.uk> wrote:
>>>
>>>>> This isn't really a bioperl query.
>>>>> But does anyone know how I can substitute all special characters (+ some
>>>>> other things) in a string with nothing in perl?
>>>>> I mean, if I have a string like Cyclic-2,3-bisphospho-D-glycerate, I want
>>>>> the output to be bisphosphoglycerate. I want to remove -D-, Cyclic-, 2,3-
>>>>> etc.
>>>>
>>>> A more general approach that might work is to keep lower case words (I
>>>> don't know if that will be true for all your cases):
>>>>
>>>> $_='Cyclic-2,3-bisphospho-D-glycerate';
>>>> print join '', /\b[a-z]+\b/g;
>>>>
>>>> Roy.
>>>> --
>>>> Dr. Roy Chaudhuri
>>>> Department of Veterinary Medicine
>>>> University of Cambridge, U.K.
>>>>
>>>>
>>>
>>> --
>>> -Neeti
>>> Even my blood says, B positive