[EMBOSS] EMBOSS for protein alignment stats

Anandkumar Surendrarao aksrao at ucdavis.edu
Thu May 30 17:47:04 UTC 2019


Greetings EMBOSS users!

I have ~18,000 files, each containing a Clustal-formatted protein alignment
derived from Pfam-A.full.
Some of these files are larger than 500 MB; the largest alignment is 3 GB!

I need to calculate the following alignment statistics:
A. average aligned length
B. std. dev. of aligned length
C. average of pairwise sequence ID %
D. std. dev. of pairwise sequence ID %
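
To pin down what A - D could mean, here is a minimal Python sketch for an
alignment small enough to hold in memory. It rests on assumptions that are not
stated above: "aligned length" is taken as the number of non-gap residues per
sequence, % identity is counted only over columns where both sequences have a
residue, and the std. dev. is the population form. The function names are
illustrative only; alistat may define these quantities differently.

```python
from itertools import combinations
from statistics import mean, pstdev

GAP_CHARS = "-."  # assumption: these characters mark gaps


def ungapped_length(seq):
    """Number of residues in one aligned row, ignoring gap characters."""
    return sum(1 for c in seq if c not in GAP_CHARS)


def percent_identity(a, b):
    """% identity over columns where both rows have a residue (assumed definition)."""
    compared = identical = 0
    for x, y in zip(a, b):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue
        compared += 1
        if x.upper() == y.upper():
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def alignment_stats(seqs):
    """Return A - D for a list of aligned rows (needs at least two rows)."""
    lengths = [ungapped_length(s) for s in seqs]
    # All pairs: quadratic in the number of sequences, so small inputs only.
    ids = [percent_identity(a, b) for a, b in combinations(seqs, 2)]
    return {
        "mean_len": mean(lengths),   # A
        "sd_len": pstdev(lengths),   # B
        "mean_id": mean(ids),        # C
        "sd_id": pstdev(ids),        # D
    }
```

The all-pairs loop is what makes C and D expensive: the number of pairs grows
with the square of the number of sequences.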

Here are the two problems I am seeking help with:
1. I can calculate A and C using the alistat program packaged for Ubuntu, but
not B or D.
2. For the really large alignments, an exact calculation is not an option
because of RAM requirements, so I have used alistat's -f (fast) option, which
estimates average %id by "sampling".
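
To illustrate the sampling idea in general terms, the sketch below estimates
both the mean and the std. dev. of pairwise % identity from randomly chosen
pairs. This is not a description of how alistat's -f option works internally;
the function names, the number of sampled pairs, and the identity definition
(columns where both sequences have a residue) are all assumptions.

```python
import random
from statistics import mean, pstdev


def percent_identity(a, b, gaps="-."):
    """% identity over columns where both rows have a residue (assumed definition)."""
    compared = identical = 0
    for x, y in zip(a, b):
        if x in gaps or y in gaps:
            continue
        compared += 1
        if x.upper() == y.upper():
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def sampled_identity_stats(seqs, n_pairs=1000, seed=0):
    """Estimate (mean, std. dev.) of pairwise % identity from random pairs.

    Pairs are drawn with replacement across draws, but the two members of
    each pair are always distinct rows. Needs at least two rows.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    ids = []
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(seqs)), 2)
        ids.append(percent_identity(seqs[i], seqs[j]))
    return mean(ids), pstdev(ids)
```

Sampling cuts the running time, not the memory: this sketch still expects the
rows to be available in a list, so the largest files would need some form of
on-disk indexing instead.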

If EMBOSS has tools / tricks to report A - D, while having reasonable RAM
and disk-usage footprints, and quick processing times, please let me know.
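
As one example of a small RAM footprint, statistics A and B can be gathered in
a single pass over a Clustal file while keeping only one residue count per
sequence name. The sketch below is a hypothetical helper, not an EMBOSS tool.
It assumes a header line starting with "CLUSTAL", conservation lines that begin
with whitespace, unique sequence names, and "aligned length" meaning non-gap
residues. It does not address C or D.

```python
import io
from collections import defaultdict
from math import sqrt


def clustal_length_stats(handle, gaps="-."):
    """Return (n_sequences, mean length, population std. dev.) from a Clustal stream."""
    counts = defaultdict(int)  # one integer per sequence name
    for line in handle:
        # Skip blank lines, the header, and conservation lines (leading whitespace).
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        name, chunk = fields[0], fields[1]  # a trailing residue count is ignored
        counts[name] += sum(1 for c in chunk if c not in gaps)
    n = len(counts)
    if n == 0:
        raise ValueError("no sequence lines found")
    avg = sum(counts.values()) / n
    var = sum((v - avg) ** 2 for v in counts.values()) / n
    return n, avg, sqrt(var)


# Tiny in-memory example; a real run would pass an open file handle instead.
example = io.StringIO(
    "CLUSTAL W (1.83) multiple sequence alignment\n\n"
    "seq1      ACDE-\n"
    "seq2      AC--G\n"
    "          **   \n\n"
    "seq1      FG\n"
    "seq2      -G\n"
)
print(clustal_length_stats(example))
```

Because the file is read line by line, memory use grows with the number of
sequences rather than with the size of the alignment.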

I am open to suggestions regarding other tools as well.
I look forward to your replies. Thanks in advance.

Sincerely,
Anand