[Biopython] Converting transcript coordinates to genome coordinates
Reece Hart
reece at harts.net
Fri Jul 28 14:30:43 UTC 2017
Projecting/mapping locations and variants between genome, transcript, and
protein coordinates is conceptually easy and practically straightforward in
most cases. Unfortunately, there are diabolical real-world challenges. Here
are a few off the cuff, roughly in decreasing order of difficulty:
* Source of alignments. Projecting between genome and transcript
coordinates requires alignments. For the last ~2 years, NCBI has released
alignments (as gff files) for *current* transcripts to *current*
assemblies. That works great for what it is, but you're stuck if you need
historical data (e.g., a variant from an old paper).
* Genome-transcript discrepancies. Genome and transcript sequences are from
different sources. Their alignments are littered with natural polymorphisms
as well as sequencing errors (typically genomic). As a result, accurate
projections in these regions require being aware of indels.
* Mutable alignments. Surprisingly, the alignment of a transcript to a
genome is not necessarily stable at NCBI. Changes happen very rarely, but
it's worth knowing that alignments do change and that nothing about the
process prevents such degeneracy. This mutability has terrible (hypothetical)
consequences for reliably communicating variation and interpreting its
effects (like coding v. non-coding). A related issue is that UCSC uses
blat to generate alignments, which sometimes differ significantly from the
splign alignments. (2014 presentation on slideshare
<https://www.slideshare.net/reecehart/hvp-2014-clinical-significance-of-transcript-alignment-discrepancies>
)
* Alignment representation. This is an extreme corner case that results
from regions of misassembly. For nearly all exons in all transcripts, it's
a terrific and convenient assumption that alignments are exon-wise with
exon-wise cigar strings. For a small number of transcripts (see
biocommons/uta#198
<https://github.com/biocommons/uta/issues/198#issuecomment-273269902>), the
alignments are discontiguous, with unaligned regions possible in transcript
or genome sequence. While CIGARs have an N (typically for introns), there's
no symmetric operator for unaligned sequence in transcripts, which makes it
difficult to represent such regions.
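
To make the indel point concrete, here is a minimal sketch of projecting a
transcript position to a genome position through a single exon's CIGAR. It
is not the code of any particular library. The function name, the 0-based
coordinates, the plus-strand-only handling, and the CIGAR orientation
(transcript as query, genome as reference) are all assumptions for
illustration; real data sources differ on these conventions.

```python
import re

# One CIGAR element: a length followed by an operator.
CIGAR_RE = re.compile(r"(\d+)([MIDN=X])")


def project_tx_to_genome(tx_pos, exon_tx_start, exon_g_start, cigar):
    """Project a 0-based transcript position to a 0-based genome position.

    Hypothetical helper, plus strand only. Assumed CIGAR orientation:
      M, =, X  consume both transcript and genome
      I        consumes transcript only (extra transcript bases)
      D, N     consume genome only (extra genome bases)
    Returns None when the position lies in an insertion, i.e. it has no
    genomic counterpart.
    """
    t = exon_tx_start  # transcript coordinate at the start of the current op
    g = exon_g_start   # genome coordinate at the start of the current op
    for length_str, op in CIGAR_RE.findall(cigar):
        length = int(length_str)
        if op in ("M", "=", "X"):
            if t <= tx_pos < t + length:
                return g + (tx_pos - t)
            t += length
            g += length
        elif op == "I":
            if t <= tx_pos < t + length:
                return None
            t += length
        else:  # "D" or "N": genome-only sequence
            g += length
    raise ValueError("transcript position is outside this alignment")


# Exon starting at transcript 0 and genome 1000, with a 2-base genomic
# deletion and a 1-base transcript insertion relative to the genome.
cigar = "10M2D5M1I4M"
before_indel = project_tx_to_genome(5, 0, 1000, cigar)
after_deletion = project_tx_to_genome(10, 0, 1000, cigar)
in_insertion = project_tx_to_genome(15, 0, 1000, cigar)
after_insertion = project_tx_to_genome(16, 0, 1000, cigar)
```

Positions before the first indel map by a simple offset, which is why the
naive approach appears to work most of the time. Past the indels, the
offset shifts, and a position inside the insertion has no genomic
equivalent at all. Note that this sketch has no way to express unaligned
transcript sequence between aligned blocks, which is the representation
gap described in the last bullet.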
The last case is definitely in the weeds, but exemplifies what you find
when trying to do this at genome scales.
-Reece
On Fri, Jul 28, 2017 at 6:42 AM, Peter Cock <p.j.a.cock at googlemail.com>
wrote:
> Hi Lenna,
>
> Thank you!
>
> Sorry for the delay - the CodeFest and BOSC plus ISMB/ECCB
> was a fun but busy trip.
>
> I know a fair amount about sequence coordinates, but have
> rarely needed to map between protein or gene coordinates
> and genomic coordinates. I would suggest waiting for user
> feedback to identify any missing functionality.
>
> We can include the new code with an experimental warning
> if you feel there is a high chance of the API needing changes?
>
> Regards,
>
> Peter
>
> On Thu, Jul 20, 2017 at 1:13 PM, Lenna Peterson
> <lenna.peterson at gmail.com> wrote:
> > I am willing (and available, now that I've finished my PhD!) to work on
> > getting the code into the main branch. The main barrier is that I am not
> > familiar with the finer details of sequence coordinates, so I would benefit
> > from guidance from a sequence expert for adding missing functionality.
> >
> > Lenna
>