[Biojava-dev] Protein sequence composition

Sat Apr 3 13:08:23 UTC 2010

Hi Radwen,

The example below solves most of what you asked for.  It may not be the most 
elegant solution, but it should get you started in the right direction.

Saving the proteins into sample.fasta and running the following command:
 > java ProteinComposition sample.fasta PAGST EDNQ LIVM KRH C

produces the output:
SEQ1	247:	0.36032388	0.19433199	0.25506073	0.097165994	0.0
SEQ2	267:	0.3445693	0.20224719	0.23595506	0.101123594	0.007490637

Take care,
Mark

-- ProteinComposition.java --

import java.io.*;
import java.util.NoSuchElementException;

import org.biojava.bio.BioException;
import org.biojava.bio.seq.*;
import org.biojava.bio.seq.db.SequenceDB;
import org.biojava.bio.seq.io.SeqIOTools;
import org.biojava.bio.symbol.*;

@SuppressWarnings("deprecation")
public class ProteinComposition {

   /**
    * Determines the composition of proteins in a Fasta file.
    * @param args <filename> <residues>...
    *   <filename>: file name of the Fasta file
    *   <residues>: group(s) of one or more amino acid residues, statistics are 
printed out for each group
    */
   public static void main(String[] args) {
     try {

       // load Fasta file into memory
       BufferedInputStream is = new BufferedInputStream(new 
FileInputStream(args[0]));
       Alphabet alpha = AlphabetManager.alphabetForName("PROTEIN");
       SequenceDB db = SeqIOTools.readFasta(is, alpha);

       // load command line arguments into memory
       SymbolList[] res = new SymbolList[args.length-1];
       for (int a = 1; a < args.length; a++)
         res[a-1] = ProteinTools.createProtein(args[a]);

       // store length and composition of each protein
       int[] lengths = new int[db.ids().size()];
       int[][] counts = new int[lengths.length][res.length];
       float[][] means = new float[lengths.length][res.length];

       // iterate over proteins in Fasta file
       SequenceIterator sI = db.sequenceIterator();
       for (int s = 0; sI.hasNext(); s++) {
         Sequence seq = sI.nextSequence();
         lengths[s] = seq.length();

         // iterate over each amino acid
         for (Object sr : seq.toList())

           // check for amino acid in each residue group
           for (int a = 1; a < args.length; a++)

             // iterate over each residue in group
             for (Object r : res[a-1].toList())

               // increment count if amino acid has a match in residue group
               if (((Symbol) r).getMatches().contains((Symbol) sr)) {
                 counts[s][a-1]++;
                 break;
               }

         // print "name length: composition" for each protein
         System.out.print(seq.getName() + "\t" + seq.length() + ":");
         for (int a = 1; a < args.length; a++)
           System.out.print("\t" + (means[s][a-1] = (float) counts[s][a-1] / 
lengths[s]));
         System.out.println();
       }

     } catch (FileNotFoundException ex) {
       System.err.println("Problem reading file...");
       ex.printStackTrace();
     } catch (BioException ex) {
       System.err.println("File not in fasta format or wrong alphabet...");
       ex.printStackTrace();
     } catch (NoSuchElementException ex) {
       System.err.println("No fasta sequences in the file...");
       ex.printStackTrace();
     }
   }

}

On 4/3/2010 5:18 AM, Radwen Aniba wrote:
> Hello,
>
> I'm writing an application that treats protein sequences, and I am using
> Biojava for a couple of things.
> One of these processings is to parse protein multifasta files, and treat the
> sequences one after the other. One of my purposes is to calculate
> composition. By composition I mean that I am interested to know in a given
> protein sequence what is the mean and the standard deviation composition of
> these groups :
>
> PAGST
> EDNQ
> LIVM
> KRH
> C
>
> example :
>
> protein fasta file :
>
>> SEQ1
>
> DVSFRLSGATSSSYGVFISNLRKALPNERKLYDIPLLRSSLPGSQRYALI
> HLTNYADETISVAIDVTNVYIMGYRAGDTSYFFNEASATEAAKYVFKDAM
> RKVTLPYSGNYERLQTAAGKIRENIPLGLPALDSAITTLFYYNANSAASA
> LMVLIQSTSEAARYKFIEQQIGKRVDKTFLPSLAIISLENSWSALSKQIQ
> IASTNNGQFESPVVLINAQNQRVTITNVDAGVVTSNIALLLNRNNMA
>
>> SEQ2
>
> IFPKQYPIINFTTAGATVQSYTNFIRAVRGRLTTGADVRHEIPVLPNRVG
> LPINQRFILVELSNHAELSVTLALDVTNAYVVGYRAGNSAYFFHPDNQED
> AEAITHLFTDVQNRYTFAFGGNYDRLEQLAGNLRENIELGNGPLEEAISA
> LYYYSTGGTQLPTLARSFIICIQMISEAARFQYIEGEMRTRIRYNRRSAP
> DPSVITLENSWGRLSTAIQESNQGAFASPIQLQRRNGSKFSVYDVSILIP
> IIALMVYRCAPPPSSQF
>
> I would like to
> 1/ parse SEQ1 to calculate the composition mean of PAGST residues for
> example ( number of residus/ length of the sequence)
> 2/ do same thing for SEQ2
> 3 / return the average mean of both sequences
> 4/ Return standard deviation of these values.
>
>
> I can do it writing a standard java code, but I would like to know (as I am
> using biojava already) if this is possible or not ( Which class / instances
> to use)
>
> Cheers
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-dev