(Modest) software for DNA data
Will Fischer
wfischer@sunflower.bio.indiana.edu
Tue, 18 Feb 1997 10:25:02 -0500
Thought this might be of interest to the members of this list.
There is no doubt much duplication of effort here, but the
functions are among those that should be included in the
bioperl modules.
If I only had the time to do it right, I wouldn't have to do it over.
-- WF
Release Announcement: (Modest) Perl programs for molecular sequence data
I often extract data from GenBank, to make nucleotide and
protein alignments (for primer design and for phylogenetic analysis).
Some perl programs I wrote to ease this task are now available
for public use.
Where they are:
http://www.bio.indiana.edu/~wfischer/Perl_Scripts/
What they do:
1. parse features or whole entries from files of GenBank entries;
2. translate DNA sequences into amino-acid sequences;
3. make DNA alignments based on amino-acid alignments.
Complete descriptions are available at the above URL.
Details:
Step One: Parsing features from GenBank files
"gbparse"
extracts either whole GenBank entries matching a pattern,
or new entries consisting of only those subsequences
specified by a FEATURE; I use it mostly to extract coding
sequences from files of entries retrieved from NCBI, but
the code is general. It will concatenate exons and complement
sequence as specified in the FEATURE table.
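
As an illustration only (this is not the gbparse code, which is written in Perl), the exon-joining and complementing step could be sketched in Python roughly as below. The function names are hypothetical, and the sketch assumes a simplified location syntax: plain ranges, join(...) and complement(...), with 1-based inclusive positions.

```python
import re

# Complement table for unambiguous bases plus N (assumption: no other IUPAC codes).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def _split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def extract_location(seq, location):
    """Extract the subsequence named by a simplified feature location.

    Handles 'a..b', a single position 'a', 'join(...)' and 'complement(...)'.
    Positions are 1-based and inclusive.
    """
    location = location.strip()
    if location.startswith("complement(") and location.endswith(")"):
        inner = location[len("complement("):-1]
        return reverse_complement(extract_location(seq, inner))
    if location.startswith("join(") and location.endswith(")"):
        inner = location[len("join("):-1]
        return "".join(extract_location(seq, p) for p in _split_top_level(inner))
    match = re.fullmatch(r"<?(\d+)(?:\.\.>?(\d+))?", location)
    if not match:
        raise ValueError(f"unsupported location: {location!r}")
    start = int(match.group(1))
    end = int(match.group(2) or start)
    return seq[start - 1:end]
```

For example, extract_location(seq, "join(1..3,7..9)") concatenates two exons, and wrapping a location in complement(...) returns the reverse complement of the same region.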
Step Two: Generating amino-acid translations from DNA sequences
"nt2aa" (NucleoTide to AminoAcid)
reads genbank, fasta, or GCG format files (or raw sequence data)
and produces an amino acid sequence in any or all frames,
using any of the genetic codes defined by NCBI.
It will translate degenerate codons as far as possible (which
GCG's "translate" will not), and present a list of possible
amino acids if desired. Output is either raw or fasta format.
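
To make the idea of translating degenerate codons "as far as possible" concrete, here is a small Python sketch (again, not the nt2aa code itself). It assumes the standard genetic code only (NCBI table 1) and the usual IUPAC ambiguity codes; the function names and the choice of "X" for unresolved codons are assumptions made for this example.

```python
from itertools import product

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code (NCBI table 1), codons ordered with bases T, C, A, G.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO)
}


def possible_amino_acids(codon):
    """Return the set of amino acids a (possibly degenerate) codon can encode."""
    choices = [IUPAC[base] for base in codon.upper()]
    return {CODON_TABLE["".join(c)] for c in product(*choices)}


def translate(seq, frame=0, unknown="X"):
    """Translate seq in frame 0, 1 or 2.

    A degenerate codon is translated when every expansion gives the same
    amino acid; otherwise `unknown` is emitted.
    """
    protein = []
    for i in range(frame, len(seq) - 2, 3):
        aas = possible_amino_acids(seq[i:i + 3])
        protein.append(aas.pop() if len(aas) == 1 else unknown)
    return "".join(protein)
```

Here GCN translates cleanly to alanine because all four expansions agree, while RAT (AAT or GAT) could be N or D; possible_amino_acids("RAT") returns both, which corresponds to the "list of possible amino acids" mentioned above.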
Step Three: Generating an amino-acid alignment
You're on your own here: use GCG's pileup, or clustalw, or
mase, or macaw, or seaview, or whatever you like.
Save the output as fasta or genbank or, if Don Gilbert's
excellent "readseq" program (q.v.) is installed, any format
that it can handle.
Step Four: Aligning sequence data to the amino-acid alignment
"align2aa"
reads two input files containing DNA and amino-acid sequences (and an
optional names file if the names are different), and inserts
gaps in the DNA sequences (via reverse-translation) to match
the alignment of the corresponding amino-acid sequences. Takes
input in fasta-format, or genbank, or anything "readseq"
can handle (see above); writes fasta-format output.
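
The gap-insertion step can be pictured with the following Python sketch. It is a simplification and not the align2aa code: it assumes the DNA starts in frame at the first residue, uses "-" as the gap character, does not check that each codon actually encodes the aligned residue, and ignores any DNA left over after the last residue (such as a stop codon). The function name is hypothetical.

```python
def thread_dna_onto_protein(dna, aligned_protein, gap="-"):
    """Insert gaps into an unaligned coding sequence to match a gapped protein.

    Each residue in aligned_protein consumes one codon (3 nt) from dna;
    each gap character in the protein becomes three gap characters.
    """
    residues = sum(1 for ch in aligned_protein if ch != gap)
    if len(dna) < 3 * residues:
        raise ValueError("DNA is too short for the protein alignment")
    out, pos = [], 0
    for ch in aligned_protein:
        if ch == gap:
            out.append(gap * 3)
        else:
            out.append(dna[pos:pos + 3])
            pos += 3
    return "".join(out)
```

For instance, threading ATGGCCAAA onto the aligned protein M-AK gives ATG---GCCAAA, so the DNA alignment inherits its gaps, codon by codon, from the amino-acid alignment.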
Disclaimers:
These perl programs were written to serve my own needs;
they are not sterling examples of good coding practice
(indeed, they may shock or amuse those who write elegant code).
I am making these programs available in the hope that they may be
useful; I cannot guarantee that they will not corrupt your data (so
keep backup copies). Nonetheless, if you find a problem, I will
apologize to you and (my time permitting, and given enough detail)
fix the problem that caused you trouble. No other warranties,
expressed or implied, apply.
The code is not in the public domain. It may not be sold, or used in a
product which is sold, without the express consent of the author.
____________________________________________________________
Will Fischer
Biology Department            wfischer@indiana.edu
Jordan Hall                   http://www.bio.indiana.edu/~wfischer
Indiana University            Lab: 812-855-2549
Bloomington, Indiana 47405 USA    FAX: 812-855-6705