alignment sequence reading with stop codons (bug?)

Jason Stajich jason at cgt.mc.duke.edu
Thu Dec 20 15:12:48 UTC 2001


On Thu, 20 Dec 2001, David Bauer wrote:

> Hi,
>
> the protein alignment programs don't like the '*' in your protein
> sequences. They are designed to align true proteins which usualy do not
> contain stop codons.
> If this are putative ORFs, a solution would be to split them up at the
> stops, creating a separate protein sequence for each ORF.
>

Re-aligning blastx hsps in some distant fungi so am hitting pseudogenes or
sequencing errors, hence the stop codons.

What is confusing to me wrt to the actual alignment programs, is if they
don't like stop codons at all, they still allow an alignment when the
sequence containing the stop codon is the query (-sequencea) but not when
the sequence is in the subject db - ie the behavior in my previous msg.
I may just recode the stop codons as an unknown aa to achieve what I need
for the alignment.  I realize it is silly to try and align these proteins
with stop codons but I am looking for conserved regions for degenerate PCR
primer picking.

[jason at gordola crypto_intergenic]$ head -6 contig5745.water
Local: Contig5745 vs SW-CC27_YEAST
Score: 367.50

Contig5745      1        CLIF*RLLLIQMI.HPQARRAFTFLQQQEPYRIQSMEQLSTLLWH 44
                         ||:    |  ::| :  : : |  |:  :| |:: ||  ||||||
SW-CC27_YEAST   474      CLVQLGKLHFEIINYDMSLKYFNRLKDLQPARVKDMEIFSTLLWH 518

> I also guess you are misinterpreting the -seqall. This means to return
> all sequences from a file containing more than one sequence (like a
> fasta formated file with several sequences separated by theire
> description lines). For me the -seqall option does not make much sense
> in the case of alignment programs which need exactly 2 sequences to
> align.
> There you must always pass the two sequence files which you want align
> as arguments to the alignment program and each file must contain exactly
> one sequence.
>

In the alignment program context -seqall is the name of the db to search
the query (-sequencea) against - so one will get an alignment of the first
sequence against the whole db of sequences.  I am only interested in 1
pairwise comparison so the order of the sequences didn't really matter to
me.  We have a SW alignment module in bioperl (written in C - before you
gag) for protein alignments but was trying out our new EMBOSS wrappers in
bioperl, hence the reported issue.

> I hope this helps,
>
> David Bauer.
>
>
> Jason Stajich wrote:
> >
> > I noticed this in playing with our new bioperl wrappers for EMBOSS.
> > Apparently -seqall does not read sequences with stop codons.
> > I can submit as a bug if that is more appropriate.  Getting warmed up to
> > the EMBOSS dev process.
>

-- 
Jason Stajich
Duke University
jason at cgt.mc.duke.edu






More information about the EMBOSS mailing list