[Biojava-dev] Protein sequence composition

Sat Apr 3 10:18:40 UTC 2010

Hello,

I'm writing an application that processes protein sequences, and I am using
BioJava for a few things.
One of the steps is to parse protein multi-FASTA files and process the
sequences one after the other. One of my goals is to calculate
composition. By composition I mean that, for a given protein sequence, I want
to know what fraction of its residues falls in each of the groups below, and
then the mean and standard deviation of those fractions across sequences:

PAGST
EDNQ
LIVM
KRH
C

Example:

Protein FASTA file:

>SEQ1

DVSFRLSGATSSSYGVFISNLRKALPNERKLYDIPLLRSSLPGSQRYALI
HLTNYADETISVAIDVTNVYIMGYRAGDTSYFFNEASATEAAKYVFKDAM
RKVTLPYSGNYERLQTAAGKIRENIPLGLPALDSAITTLFYYNANSAASA
LMVLIQSTSEAARYKFIEQQIGKRVDKTFLPSLAIISLENSWSALSKQIQ
IASTNNGQFESPVVLINAQNQRVTITNVDAGVVTSNIALLLNRNNMA

>SEQ2

IFPKQYPIINFTTAGATVQSYTNFIRAVRGRLTTGADVRHEIPVLPNRVG
LPINQRFILVELSNHAELSVTLALDVTNAYVVGYRAGNSAYFFHPDNQED
AEAITHLFTDVQNRYTFAFGGNYDRLEQLAGNLRENIELGNGPLEEAISA
LYYYSTGGTQLPTLARSFIICIQMISEAARFQYIEGEMRTRIRYNRRSAP
DPSVITLENSWGRLSTAIQESNQGAFASPIQLQRRNGSKFSVYDVSILIP
IIALMVYRCAPPPSSQF
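
For reference, a file laid out like the one above can be read without any library. What follows is only a minimal plain-Java sketch, not BioJava's own FASTA reader; the class name MultiFastaSketch and its parse method are names I made up for illustration. It assumes a header line starts with ">", blank lines can be ignored, and all following lines up to the next header belong to the same sequence.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, plain Java only (no BioJava classes involved).
class MultiFastaSketch {

    /** Returns header -> concatenated residues, in file order. */
    static Map<String, String> parse(String fastaText) {
        Map<String, StringBuilder> building = new LinkedHashMap<>();
        StringBuilder current = null;
        for (String raw : fastaText.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;                       // skip blank lines
            }
            if (line.startsWith(">")) {
                // New record: the text after '>' is used as the key.
                current = new StringBuilder();
                building.put(line.substring(1).trim(), current);
            } else if (current != null) {
                current.append(line);           // join wrapped sequence lines
            }
        }
        Map<String, String> sequences = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : building.entrySet()) {
            sequences.put(e.getKey(), e.getValue().toString());
        }
        return sequences;
    }

    public static void main(String[] args) {
        // Shortened toy input, not the full sequences from this message.
        String fasta = ">SEQ1\n\nDVSFRLSGAT\nSSSYGVFISN\n\n>SEQ2\n\nIFPKQYPIIN\n";
        for (Map.Entry<String, String> e : parse(fasta).entrySet()) {
            System.out.println(e.getKey() + " length=" + e.getValue().length());
        }
    }
}
```

Note that two records with the same header would overwrite each other in this sketch, since the header is used as the map key.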

I would like to:
1/ parse SEQ1 and calculate the composition of PAGST residues, for
example (number of PAGST residues / length of the sequence)
2/ do the same thing for SEQ2
3/ return the mean of the two values
4/ return the standard deviation of these values.
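
To make the four steps concrete, here is a rough plain-Java sketch of the calculation. It uses no BioJava calls, and the class and method names (GroupCompositionSketch, fraction, mean, sampleStdDev) are hypothetical. One assumption to flag: it uses the sample standard deviation (n - 1 in the denominator); dividing by n instead would give the population version.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, plain Java only (no BioJava classes involved).
class GroupCompositionSketch {

    /** Fraction of residues in the sequence that belong to the group, e.g. "PAGST". */
    static double fraction(String sequence, String group) {
        if (sequence.isEmpty()) {
            throw new IllegalArgumentException("empty sequence");
        }
        int hits = 0;
        for (int i = 0; i < sequence.length(); i++) {
            char residue = Character.toUpperCase(sequence.charAt(i));
            if (group.indexOf(residue) >= 0) {
                hits++;
            }
        }
        return (double) hits / sequence.length();
    }

    /** Arithmetic mean of the values. */
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Sample standard deviation (n - 1 denominator); this choice is an assumption. */
    static double sampleStdDev(double[] values) {
        if (values.length < 2) {
            throw new IllegalArgumentException("need at least two values");
        }
        double m = mean(values);
        double sumSquares = 0.0;
        for (double v : values) {
            sumSquares += (v - m) * (v - m);
        }
        return Math.sqrt(sumSquares / (values.length - 1));
    }

    public static void main(String[] args) {
        String[] groups = {"PAGST", "EDNQ", "LIVM", "KRH", "C"};
        // Shortened toy inputs, not the full SEQ1 and SEQ2.
        List<String> sequences = Arrays.asList("DVSFRLSGATSSSYGVFISN", "IFPKQYPIINFTTAGATVQS");
        for (String group : groups) {
            double[] fractions = new double[sequences.size()];
            for (int i = 0; i < fractions.length; i++) {
                fractions[i] = fraction(sequences.get(i), group);   // steps 1 and 2
            }
            System.out.println(group
                    + " mean=" + mean(fractions)                    // step 3
                    + " sd=" + sampleStdDev(fractions));            // step 4
        }
    }
}
```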

I can do this in plain Java, but since I am already using BioJava I would
like to know whether it is possible with BioJava, and if so which classes to
use.

Cheers