[Bioperl-l] question about positioning peptide in a full protein sequence

Mon Feb 21 09:26:21 UTC 2011

Hi Mingwei,

I guess this is MS data for phosphorylation sites? We are doing the same
here. I don't know what software you are using in your MS pipeline, but
it may already map the peptides to the full-length protein for you. If
not, you probably get peptide sequences with the probability of each
site carrying a phosphate (or whatever post-translational modification)
encoded in the string, e.g. the data I'm working with will show me
something like "..LKS[0.99]S[0.01]..." to indicate probabilities of 99%
and 1% of those two serines being modified. You then have to extract
that data from the peptide string using a regex. Then you can identify
the most probable site within the string and map the peptide string
(with the annotations stripped out) to the full-length protein sequence
using index (or a regex) as Chris suggested. You can then calculate the
position of the actual modified site by adding the offset of the site
within the peptide to the match position of the peptide. I don't think
there is any ready-made solution for this, as it is basically just
simple string matching, but please do let me know if you get stuck and
I can help you further.
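
The steps above could be sketched roughly as follows. This is only a
minimal illustration, assuming the annotation format is exactly
"residue followed by an optional [probability]" as in the example
string; the sub names (parse_annotated_peptide, map_best_site) and the
toy protein sequence are made up for the example and do not come from
any module.

```perl
use strict;
use warnings;

# Hypothetical helper: parse a peptide annotated like "LKS[0.99]S[0.01]PK".
# Returns the plain peptide and a ref to a list of
# [0-based offset in peptide, probability] pairs.
sub parse_annotated_peptide {
    my ($annotated) = @_;
    my $plain = '';
    my @sites;
    # Walk the string residue by residue; each residue may be
    # followed by a "[probability]" annotation.
    while ($annotated =~ /\G([A-Z])(?:\[([0-9.]+)\])?/gc) {
        push @sites, [ length($plain), $2 ] if defined $2;
        $plain .= $1;
    }
    return ($plain, \@sites);
}

# Hypothetical helper: map the most probable site of a peptide onto the
# full-length protein. Returns a 1-based protein position, or nothing
# if the peptide has no annotated site or is not found in the protein.
sub map_best_site {
    my ($protein, $annotated) = @_;
    my ($plain, $sites) = parse_annotated_peptide($annotated);
    return unless @$sites;
    # Pick the site with the highest probability.
    my ($best) = sort { $b->[1] <=> $a->[1] } @$sites;
    my $start = index($protein, $plain);    # 0-based, -1 if absent
    return if $start < 0;
    return $start + $best->[0] + 1;         # convert to 1-based
}

# Toy protein sequence, invented for this example.
my $protein = 'MAAGLKSSPKTRE';
my $pos = map_best_site($protein, 'LKS[0.99]S[0.01]PK');
print "Most probable modified residue at position $pos\n";
```

Note that index only reports the first occurrence of the peptide. If a
peptide can occur more than once in the protein, you would have to call
index again with its third (POSITION) argument to find the later
matches and decide how to handle the ambiguity.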

Cheers,

Frank

On Sun, 2011-02-20 at 20:57 -0600, Chris Fields wrote:
> If this is a direct string match (no ambiguity), just use perl's index function:
> 
>        index STR,SUBSTR,POSITION
>        index STR,SUBSTR
>                The index function searches for one string within another, but
>                without the wildcard-like behavior of a full regular-expression
>                pattern match.  It returns the position of the first occurrence
>                of SUBSTR in STR at or after POSITION.  If POSITION is omitted,
>                starts searching from the beginning of the string.  POSITION
>                before the beginning of the string or after its end is treated
>                as if it were the beginning or the end, respectively.  POSITION
>                and the return value are based at 0 (or whatever you've set the
>                $[ variable to--but don't do that).  If the substring is not
>                found, "index" returns one less than the base, ordinarily "-1".
> 
> Also see here:
> 
> http://perlmeme.org/howtos/perlfunc/index_function.html
> 
> chris
> 
> On Feb 20, 2011, at 4:28 PM, Mingwei Min wrote:
> 
> > Hi Dave,
> > 
> > Thank you for your suggestion. When I said "too complex for this simple
> > job", I just thought that there might be some particular module that
> > could do this straightforwardly. I'll have a try of BLAST anyway.
> > Thank you.
> > 
> > Mingwei
> > 
> > 2011/2/20 Dave Messina <David.Messina at sbc.su.se>:
> >> Hi Mingwei,
> >> Please remember to "reply all" so others on the mailing list can follow the
> >> conversation.
> >> Unless you have some other way of mapping the coordinates of the
> >> sequence with the post-translational sites to the coordinates of the full
> >> sequence, I think you'll have to do a similarity search of some form.
> >> BLAST may not be best for this, given that it's sloppy with the ends of an
> >> alignment, but there are plenty of options for BLAST that may improve your
> >> results. Again, you'll need to be specific about your problem for us to
> >> help. I don't know what "too complex for this simple job" means. Is it too slow?
> >> Are you getting too many hits?
> >> 
> >> 
> >> Dave
> >> 
> >> 
> >> On Sun, Feb 20, 2011 at 22:35, Mingwei Min <mm809 at cam.ac.uk> wrote:
> >>> 
> >>> Hi Dave,
> >>> 
> >>> Sorry for not making it clear. Yes, I just want to get the coordinates
> >>> of the post-translational sites out of a protein sequence. And what I
> >>> have now is the peptide sequence with marker on the post-translated
> >>> residue... what should i do to map them to the whole protein sequence
> >>> and get the coordinates? The only way I could come up with is blast.
> >>> But it seems to be too complex for this simple job....
> >>> 
> >>> Many thanks,
> >>> 
> >>> Mingwei
> >>> 
> >>> 2011/2/20 Dave Messina <David.Messina at sbc.su.se>:
> >>>> Hi Mingwei,
> >>>> I'm not sure what you mean by "positioning" here. Do you want to get the
> >>>> coordinates of the post-translational sites out of a protein sequence
> >>>> database record? Or do you want to draw the post-translational sites on
> >>>> a picture of the protein sequence? Or something else entirely?
> >>>> 
> >>>> Dave
> >>>> 
> >>>> 
> >>>> 
> >>>> On Sat, Feb 19, 2011 at 15:53, Mingwei Min <mm809 at cam.ac.uk> wrote:
> >>>>> 
> >>>>> Hi,
> >>>>> 
> >>>>> I am trying to position some post-translational modification sites,
> >>>>> which are marked in peptides, in a full-length protein sequence. Would
> >>>>> anyone be kind enough to tell me the module I could use for this?
> >>>>> 
> >>>>> Many thanks
> >>>>> 
> >>>>> Mingwei
> >>>>> _______________________________________________
> >>>>> Bioperl-l mailing list
> >>>>> Bioperl-l at lists.open-bio.org
> >>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>>> 
> >>>> 
> >> 
> >> 
> > 
> > 
> > 
> > -- 
> > Mingwei Min  PhD student
> > University of Cambridge
> > Department of Genetics
> > Downing St
> > CB2 3EH
> > UK
> > 

-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE.