From aksrao at ucdavis.edu Thu May 30 17:47:04 2019
From: aksrao at ucdavis.edu (Anandkumar Surendrarao)
Date: Thu, 30 May 2019 13:47:04 -0400
Subject: [EMBOSS] EMBOSS for protein alignment stats
Message-ID:

Greetings EMBOSS users!

I have ~18000 files, each with clustal-formatted protein alignments
derived from Pfam-A.full. Some of these files are large (> 500 MB); the
largest alignment is 3 GB!

I need to calculate the following alignment statistics:
A. average aligned length
B. std. dev. of aligned length
C. average of pairwise sequence ID %
D. std. dev. of pairwise sequence ID %

Here are the 2 problems I seek help with:
1. I can calculate A and C using the alistat that comes with Ubuntu, but
not B or D.
2. For the really large alignments, the exact calculation is not an option
because of RAM requirements, so I've used alistat's -f (fast) option, which
estimates average %id by "sampling".

If EMBOSS has tools / tricks to report A - D while keeping reasonable RAM
and disk-usage footprints and quick processing times, please let me know.

I am open to suggestions regarding other tools as well.
I look forward to your replies. Thanks in advance.

Sincerely,
Anand

From idoerg at gmail.com Thu May 30 18:10:43 2019
From: idoerg at gmail.com (Iddo Friedberg)
Date: Thu, 30 May 2019 13:10:43 -0500
Subject: [EMBOSS] EMBOSS for protein alignment stats
In-Reply-To:
References:
Message-ID:

infoalign should give you what you want. It does not do the summary
statistics, but for each sequence it gives the alignment length and the
%ID (note that %ID can mean several things!). You can then
programmatically parse those numbers to calculate mean and standard
deviation.

http://emboss.sourceforge.net/apps/cvs/emboss/apps/infoalign.html#output.8

Iddo

On Thu, May 30, 2019 at 12:48 PM Anandkumar Surendrarao wrote:

> Greetings EMBOSS users!
>
> I have ~ 18000 files, each with clustal formatted protein alignments
> derived from Pfam-A.full.
> Some of these files are large > 500MB in size, the largest alignment is
> 3GB!
>
> I need to calculate the following alignment statistics
> A. average aligned length
> B. std. dev. of aligned length
> C. average of pairwise sequence ID %
> D. std. dev. of pairwise sequence ID %
>
> Here are my 2 problems that I seek help with:
> 1. I can calculate A and C using alistat that comes with UBUNTU, but not B
> or D.
> 2. For the really large alignments, there is no option due to RAM
> requirements, and so I've used alistat's -f (fast) option, which estimates
> average %id by "sampling"
>
> If EMBOSS has tools / tricks to report A - D, while having reasonable RAM
> and disk-usage footprints, and quick processing times, please let me know.
>
> I am open to suggestions regarding other tools as well.
> I look forward to your replies. Thanks, in advance.
>
> Sincerely,
> Anand

--
Iddo Friedberg
http://iddo-friedberg.net/contact.html
++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.>
++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----.
.>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>>
>>----.<--.>++++++.<<<<------------------------------------.
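
For anyone following the thread, the "parse those numbers" step from the
reply can be sketched roughly as below. This is only a minimal sketch under
stated assumptions, not a tested recipe for infoalign: it assumes the
per-sequence report is whitespace-separated text with "#" header lines, and
that you know which column holds the value you want (alignment length, %ID,
or whatever your report provides). The function names, the column indices,
and the sample rows are all made up for illustration; check them against
your own output. The mean and standard deviation are accumulated in one
streaming pass (Welford's method), so the values never need to be held in
memory at once.

```python
import math
from typing import Iterable, Iterator, Tuple


def column_values(lines: Iterable[str], index: int) -> Iterator[float]:
    """Yield one numeric column from a whitespace-separated report.

    Assumption: header/comment lines start with '#', and the wanted
    column sits at position `index` (0-based) on every data row.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        yield float(line.split()[index])


def mean_and_sd(values: Iterable[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) in a single pass.

    Uses Welford's online update, so memory use stays constant
    no matter how many values are streamed in.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    if n < 2:
        raise ValueError("need at least two values for a standard deviation")
    return mean, math.sqrt(m2 / (n - 1))


if __name__ == "__main__":
    # Made-up rows in a hypothetical layout: name, aligned length, %ID.
    sample = [
        "# name  alignlen  pct_id",
        "seqA    100       50.0",
        "seqB    110       60.0",
        "seqC    120       70.0",
    ]
    print("aligned length:", mean_and_sd(column_values(sample, 1)))
    print("percent id:    ", mean_and_sd(column_values(sample, 2)))
```

To run it on a real report, pass an open file handle instead of the sample
list, for example mean_and_sd(column_values(open(path), index)), with path
and index set for your own files. Note that this summarizes whatever
per-sequence figure the report contains; whether that figure matches the
all-against-all pairwise %ID asked for in C and D depends on how the tool
defines %ID, as the reply points out.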