[Biojava-dev] Protein sequence composition
Mark Chapman
chapman at cs.wisc.edu
Sat Apr 3 13:08:23 UTC 2010
Hi Radwen,
The example below solves most of what you asked for. It may not be the most
elegant solution, but it should get you started in the right direction.
Saving the proteins into sample.fasta and running the following command:
> java ProteinComposition sample.fasta PAGST EDNQ LIVM KRH C
produces the output:
SEQ1 247: 0.36032388 0.19433199 0.25506073 0.097165994 0.0
SEQ2 267: 0.3445693 0.20224719 0.23595506 0.101123594 0.007490637
Take care,
Mark
-- ProteinComposition.java --
import java.io.*;
import java.util.NoSuchElementException;
import org.biojava.bio.BioException;
import org.biojava.bio.seq.*;
import org.biojava.bio.seq.db.SequenceDB;
import org.biojava.bio.seq.io.SeqIOTools;
import org.biojava.bio.symbol.*;
@SuppressWarnings("deprecation")
public class ProteinComposition {
/**
* Determines the composition of proteins in a Fasta file.
* @param args <filename> <residues>...
* <filename>: file name of the Fasta file
* <residues>: group(s) of one or more amino acid residues, statistics are
printed out for each group
*/
public static void main(String[] args) {
try {
// load Fasta file into memory
BufferedInputStream is = new BufferedInputStream(new
FileInputStream(args[0]));
Alphabet alpha = AlphabetManager.alphabetForName("PROTEIN");
SequenceDB db = SeqIOTools.readFasta(is, alpha);
// load command line arguments into memory
SymbolList[] res = new SymbolList[args.length-1];
for (int a = 1; a < args.length; a++)
res[a-1] = ProteinTools.createProtein(args[a]);
// store length and composition of each protein
int[] lengths = new int[db.ids().size()];
int[][] counts = new int[lengths.length][res.length];
float[][] means = new float[lengths.length][res.length];
// iterate over proteins in Fasta file
SequenceIterator sI = db.sequenceIterator();
for (int s = 0; sI.hasNext(); s++) {
Sequence seq = sI.nextSequence();
lengths[s] = seq.length();
// iterate over each amino acid
for (Object sr : seq.toList())
// check for amino acid in each residue group
for (int a = 1; a < args.length; a++)
// iterate over each residue in group
for (Object r : res[a-1].toList())
// increment count if amino acid has a match in residue group
if (((Symbol) r).getMatches().contains((Symbol) sr)) {
counts[s][a-1]++;
break;
}
// print "name length: composition" for each protein
System.out.print(seq.getName() + "\t" + seq.length() + ":");
for (int a = 1; a < args.length; a++)
System.out.print("\t" + (means[s][a-1] = (float) counts[s][a-1] /
lengths[s]));
System.out.println();
}
} catch (FileNotFoundException ex) {
System.err.println("Problem reading file...");
ex.printStackTrace();
} catch (BioException ex) {
System.err.println("File not in fasta format or wrong alphabet...");
ex.printStackTrace();
} catch (NoSuchElementException ex) {
System.err.println("No fasta sequences in the file...");
ex.printStackTrace();
}
}
}
On 4/3/2010 5:18 AM, Radwen Aniba wrote:
> Hello,
>
> I'm writing an application that treats protein sequences, and I am using
> Biojava for a couple of things.
> One of these processings is to parse protein multifasta files, and treat the
> sequences one after the other. One of my purposes is to calculate
> composition. By composition I mean that I am interested to know in a given
> protein sequence what is the mean and the standard deviation composition of
> these groups :
>
> PAGST
> EDNQ
> LIVM
> KRH
> C
>
> example :
>
> protein fasta file :
>
>> SEQ1
>
> DVSFRLSGATSSSYGVFISNLRKALPNERKLYDIPLLRSSLPGSQRYALI
> HLTNYADETISVAIDVTNVYIMGYRAGDTSYFFNEASATEAAKYVFKDAM
> RKVTLPYSGNYERLQTAAGKIRENIPLGLPALDSAITTLFYYNANSAASA
> LMVLIQSTSEAARYKFIEQQIGKRVDKTFLPSLAIISLENSWSALSKQIQ
> IASTNNGQFESPVVLINAQNQRVTITNVDAGVVTSNIALLLNRNNMA
>
>> SEQ2
>
> IFPKQYPIINFTTAGATVQSYTNFIRAVRGRLTTGADVRHEIPVLPNRVG
> LPINQRFILVELSNHAELSVTLALDVTNAYVVGYRAGNSAYFFHPDNQED
> AEAITHLFTDVQNRYTFAFGGNYDRLEQLAGNLRENIELGNGPLEEAISA
> LYYYSTGGTQLPTLARSFIICIQMISEAARFQYIEGEMRTRIRYNRRSAP
> DPSVITLENSWGRLSTAIQESNQGAFASPIQLQRRNGSKFSVYDVSILIP
> IIALMVYRCAPPPSSQF
>
> I would like to
> 1/ parse SEQ1 to calculate the composition mean of PAGST residues for
> example ( number of residus/ length of the sequence)
> 2/ do same thing for SEQ2
> 3 / return the average mean of both sequences
> 4/ Return standard deviation of these values.
>
>
> I can do it writing a standard java code, but I would like to know (as I am
> using biojava already) if this is possible or not ( Which class / instances
> to use)
>
> Cheers
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-dev
More information about the biojava-dev
mailing list