[Biopython] translating 454 data with frameshifts
Jessica Grant
jgrant at smith.edu
Fri Dec 10 14:59:38 UTC 2010
We have some 454 transcriptome data, and quite simply we are trying to
build a protein database from the nucleotide sequences. The problem
is that there are quite a lot of frameshifts in our contig
assemblies, and in the original sequences as well.
We have a list of the best blastx hit for each sequence, and I have tried:
1 - blasting each sequence against its best hit,
2 - taking the hsp_qseqs from the blast output,
3 - sticking them together, in order, if there is more than one HSP.
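The three steps above can be sketched roughly as follows. This is only an illustration, not the script actually used: it assumes the BLAST results were saved as XML, reads the Hsp_query-from, Hsp_query-to and Hsp_qseq elements with the standard library, and the function names and the tiny example record (trimmed to the fields used) are made up.

```python
import xml.etree.ElementTree as ET

# Made-up, heavily trimmed BLAST XML with two HSPs (illustration only).
EXAMPLE_XML = """<BlastOutput><BlastOutput_iterations><Iteration>
<Iteration_hits><Hit><Hit_hsps>
<Hsp><Hsp_query-from>40</Hsp_query-from><Hsp_query-to>69</Hsp_query-to>
<Hsp_qseq>QISF-VKSHFS</Hsp_qseq></Hsp>
<Hsp><Hsp_query-from>1</Hsp_query-from><Hsp_query-to>30</Hsp_query-to>
<Hsp_qseq>MKTAYIAKQR</Hsp_qseq></Hsp>
</Hit_hsps></Hit></Iteration_hits>
</Iteration></BlastOutput_iterations></BlastOutput>"""


def hsp_query_segments(blast_xml):
    """Return (start, end, protein) for every HSP, sorted by query start."""
    root = ET.fromstring(blast_xml)
    segments = []
    for hsp in root.iter("Hsp"):
        start = int(hsp.findtext("Hsp_query-from"))
        end = int(hsp.findtext("Hsp_query-to"))
        # Hsp_qseq is the aligned query; drop the alignment gap characters.
        protein = hsp.findtext("Hsp_qseq").replace("-", "")
        # Use min/max so the order of the two coordinates does not matter.
        segments.append((min(start, end), max(start, end), protein))
    return sorted(segments)


def naive_join(segments):
    """Step 3: concatenate the HSP query sequences in query order."""
    return "".join(protein for _, _, protein in segments)


print(naive_join(hsp_query_segments(EXAMPLE_XML)))
```

Note that this naive join makes no attempt to detect HSPs that overlap on the query.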
This has worked for many of the sequences, but sometimes there are
overlapping "best hsp_qseqs", and when I stick them together I get a
long, made-up protein. Also, for some sequences, the qseq goes past
the point where the alignment should stop, and then when I stick them
together I get a few extra amino acids in my protein that ought not
to be there.
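One possible way to handle the overlap problem is to trim each HSP by the part of the query that an earlier HSP already covers. The sketch below is an assumption-laden illustration, not a port of BioPerl's tiling code: it assumes each HSP is given as a hypothetical (q_start, q_end, protein) tuple with 1-based inclusive nucleotide coordinates on the query and an ungapped protein string, and it approximates the overlap as one residue per three nucleotides, which is only rough when the two HSPs are in different frames.

```python
import math


def tile_hsps(hsps):
    """Join HSP query translations in query order, trimming overlaps.

    hsps: iterable of (q_start, q_end, protein) tuples (hypothetical layout).
    """
    tiled = []
    covered_to = 0  # last query nucleotide already represented in the output
    for q_start, q_end, protein in sorted(hsps):
        if q_end <= covered_to:
            # This HSP lies entirely inside a region already covered: skip it.
            continue
        if q_start <= covered_to:
            # Drop the residues that correspond to the overlapping nucleotides.
            overlap_nt = covered_to - q_start + 1
            protein = protein[math.ceil(overlap_nt / 3):]
        tiled.append(protein)
        covered_to = q_end
    return "".join(tiled)


example = [
    (1, 30, "MKTAYIAKQR"),     # covers query nucleotides 1-30
    (25, 60, "QRQISFVKSHFS"),  # overlaps the first HSP by 6 nucleotides
    (10, 21, "AYIA"),          # fully contained in the first HSP
]
print(tile_hsps(example))
```

Rounding the overlap up discards a residue whenever the overlap is not a multiple of three, so the residues around each junction are worth checking by eye.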
Frank Kauff told me that BioPerl has a "tile_hsp" function, but
before I try to understand how that works in a language I am not
familiar with, I thought I would ask here to see if anyone knows of a
way to do this in Python.
Is there a smart way to concatenate HSPs in Biopython? Does anyone
have a better idea about how to build a protein database from 454
data?
Thank you!
Jessica