[Bioperl-l] Help on a basic EST-genomic alignment script

Fri Jun 25 22:52:10 EDT 2004

On Fri, 25 Jun 2004, Edward Chuong wrote:

> Hi all,
>
> I'm a very new perl/bioperl user, and I'm running into some trouble so
> I hope I can find some help here.
>
> Here's what I need to do:
>
> (from a large number of fasta files) Extract an EST (peromyscus) from
> the FASTA file, find its closest match in mus musclus, and find the
> dn/ds between the two sequences.
>
> Here's what I'm doing (I know this is probably the least efficient
> way, any suggestions?):
>
> 1) Read in pero EST from a FASTA
> 2) Standaloneblast it to local mus cDNA database, retrieve accession
> from best result

 You know if you already have a fasta file with the ests you don't really
have to run these individually or even with StandAloneBlast and it will
be more efficient to run the search at once and have the report file
ready to parse.

Just do
 blastall -i ests.fa -d mus -p blastn -e evalue ...

Although I think you might do better to run a translated search against
the mouse protein set.  You also will find you can get better results with
FASTX/FASTY as it allows frameshifts whereas blast will only search one
frame at a time.

> 3) Retrieve complete mus sequence with features from genbank using ID from (2)
> 4) Make a clustalw simple align object using the mus protein sequence
> from (3) against the translated pero EST for all 3 frames, and keep
> the one with the best identity %.
> --I'm done up to here--

Why not determine the best frame when doing the search by comparing to the
mouse proteins?

> 5) Convert the aln frrom AA to DNA (there is a builtin aa_to_dna_aln
> but it isn't working for me)

Show the code if you want specific help I guess.

You pass in a protein alignment (Bio::SimpleAlign) and a hash reference
of Bio::Seq objects which are cDNA and keyed on the names of each of the
proteins in the alignment.
  See the script in scripts/utilities/pairwise_kaks.PLS for an example
and working code.

> 6) Pass the aln through a DN/DS module (is paml the only one?)

Depends on how important it is for you to the best answer...
You can use Bio::Align::DNAStatistics calc_KaKs_pair for pairwise
Nei-Gojobori count.  You can use YN00 or Codeml in PAML for .

>
> So I have 2 problems: 7 out of the 20 ESTs return a poor "best
> alignment" with less than 20% identity (and in the printout they
> clearly are not aligning). Does this have something to do with gaps?
>
> In spite of that, the other 13 are aligning pretty well, with 60-100%
> alignment. In order to calculate DN/DS I've looked around and it seems
> I have to use PAML. But before that I think I'm required to have an
> aln object of two DNA sequences, starting at the correct frame. How
> can I do that?
>

You basically need to predict the protein sequence from the EST - you can
use estwise to do this based on the best mouse protein homolog.

-jason

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu