[Biopython] biopython module for variant descriptions?
Peter Cock
p.j.a.cock at googlemail.com
Thu Nov 2 10:40:40 EDT 2023
Thanks David,
That fits with my understanding. I don't work with model organisms, so while
I have used plenty of SAM/BAM files and occasionally fought with VCF files,
I've not worked with the HGVS specification.
Peter
On Thu, Nov 2, 2023 at 12:47 PM David Merberg <merbergd at gmail.com> wrote:
> I guess there is some relationship because they both deal with alterations
> compared to a reference sequence.
>
> To my knowledge, the Variant Call Format, or VCF, file is generally used
> in the context of an NGS experiment. Generating a VCF file is the next step
> after generating a BAM file. The BAM file contains the alignment of each
> sequencing read to the reference sequences, and the VCF file summarizes
> the differences.
>
> The HGVS mutation description is usually used in a lower-throughput
> context. For example, if you’re studying a disease known to be associated
> with mutations in a specific gene, then you might describe the mutations
> using the HGVS specification.
>
> So, for example, one cause of cystic fibrosis is the G542X mutation (i.e.
> Glycine 542 is changed to Termination) in the cystic fibrosis transmembrane
> conductance regulator (CFTR). Similarly, if you go to the gnomAD database
> and search for the IDS gene, in which mutations cause Hunter syndrome, you
> get a table with many variants of this gene, e.g.:
> DNA change             Protein change
> c.1650T>C              p.Pro550Pro
> c.1648C>T              p.Pro550Ser
> c.1645A>G              p.Met549Val
> c.1644G>T              p.Leu548Phe
> c.1642T>C              p.Leu548Leu
> c.1637A>G              p.Gln546Arg
> c.1636C>T              p.Gln546Ter
> c.1181-32_1181-16dup
> c.1181-83_1181-73del
>
> There are 1608 rows in this table for the IDS gene.
>
> If a new mutation is described in the literature, it will be (or at least
> should be) specified in HGVS format. In many older papers that is not the
> case.
>
> Some of the things you might want to do with these HGVS variant
> descriptions are:
> 1. Given the standard (i.e. reference) sequence for a gene and a variant,
> what is the sequence of the mutated gene?
> 2. Given the gene sequence and the HGVS description of the DNA change,
> what is the protein change?
> 3. Given just the protein change, what are the possible DNA changes that
> could cause it?
> 4. Given just the DNA change and reference sequence, is it a missense or
> nonsense mutation?
> 5. Given a variant description, is it consistent with the reference
> sequence? For example, in the CFTR case mentioned above G542X is a mutation
> found in the literature. If I am collecting data and I see a mutation
> described as T542X it is wrong. There is no T at position 542 of CFTR. I
> would determine that by checking the CFTR sequence.
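
A minimal sketch of the five operations listed above, restricted to simple
single-base coding substitutions such as c.4C>T. It uses only the Python
standard library; the helper names (parse_substitution, apply_substitution,
protein_change, possible_substitutions) and the toy coding sequence are
illustrative assumptions, not an existing Biopython API. The consistency
check here is done at the DNA level; the same idea applies to a protein-level
description such as T542X. Protein changes are printed in the same style as
the table above (e.g. p.Leu3Leu for a synonymous change).

```python
import re

# Standard genetic code (NCBI translation table 1), written in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))

# One-letter to three-letter codes, with "Ter" for a stop codon.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys", "Q": "Gln",
    "E": "Glu", "G": "Gly", "H": "His", "I": "Ile", "L": "Leu", "K": "Lys",
    "M": "Met", "F": "Phe", "P": "Pro", "S": "Ser", "T": "Thr", "W": "Trp",
    "Y": "Tyr", "V": "Val", "*": "Ter",
}


def parse_substitution(hgvs):
    """Parse a simple substitution like 'c.4C>T' into (position, ref, alt)."""
    match = re.fullmatch(r"c\.(\d+)([ACGT])>([ACGT])", hgvs)
    if match is None:
        raise ValueError(f"not a simple c. substitution: {hgvs!r}")
    return int(match.group(1)), match.group(2), match.group(3)


def apply_substitution(cds, hgvs):
    """Items 1 and 5: return the mutated CDS, rejecting inconsistent variants."""
    pos, ref, alt = parse_substitution(hgvs)
    if not 1 <= pos <= len(cds):
        raise ValueError(f"{hgvs}: position {pos} is outside the sequence")
    if cds[pos - 1] != ref:
        raise ValueError(f"{hgvs}: reference has {cds[pos - 1]} at {pos}, not {ref}")
    return cds[:pos - 1] + alt + cds[pos:]


def protein_change(cds, hgvs):
    """Items 2 and 4: return (protein description, kind) for the affected codon.

    Assumes the CDS starts at the first base of the start codon and consists
    of complete codons.
    """
    pos, _, _ = parse_substitution(hgvs)
    mutated = apply_substitution(cds, hgvs)
    start = (pos - 1) // 3 * 3  # 0-based index of the first base of the codon
    ref_aa = CODON_TABLE[cds[start:start + 3]]
    alt_aa = CODON_TABLE[mutated[start:start + 3]]
    if alt_aa == ref_aa:
        kind = "synonymous"
    elif alt_aa == "*":
        kind = "nonsense"
    elif ref_aa == "*":
        kind = "stop-loss"
    else:
        kind = "missense"
    return f"p.{THREE_LETTER[ref_aa]}{start // 3 + 1}{THREE_LETTER[alt_aa]}", kind


def possible_substitutions(cds, residue, alt_aa):
    """Item 3: list single-base c. changes that turn a residue into alt_aa."""
    start = (residue - 1) * 3
    codon = cds[start:start + 3]
    hits = []
    for offset, ref in enumerate(codon):
        for alt in "ACGT":
            new_codon = codon[:offset] + alt + codon[offset + 1:]
            if alt != ref and CODON_TABLE[new_codon] == alt_aa:
                hits.append(f"c.{start + offset + 1}{ref}>{alt}")
    return hits


cds = "ATGCAGTTGCCATAA"  # toy CDS (Met-Gln-Leu-Pro-Ter), not a real gene
print(apply_substitution(cds, "c.4C>T"))
print(protein_change(cds, "c.4C>T"))
print(protein_change(cds, "c.7T>C"))
print(possible_substitutions(cds, 3, "L"))
try:
    apply_substitution(cds, "c.4G>T")  # the reference base at position 4 is C
except ValueError as error:
    print(error)
```

Only changes within a single codon are covered. Insertions, deletions,
duplications, intronic positions such as c.1181-32 and multi-base protein
changes would each need their own parsing and handling. Biopython's
Seq.translate() could replace the hand-built codon table.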
>
> In general, I think of VCF as part of an NGS workflow, while HGVS is used
> further downstream in structure-function and genotype-phenotype discussions.
>
> I hope that helps clarify.
>
> It would have helped me to find a biopython module that would instantiate
> classes and subclasses of mutations/variants and provide some basic
> methods. I know that there are other scientists asking the same sorts of
> questions, but I don’t know whether any are attempting to answer them by
> writing python programs.
>
> Dave
>
>
>
> On November 1, 2023 at 3:36:16 PM, Peter Cock (p.j.a.cock at googlemail.com)
> wrote:
>
> I don't think we have anything like this (yet). Are efforts like VCF
> (variant call format) related but separate in your mind?
>
> Peter
>
> On Tue, Oct 31, 2023 at 7:31 PM David Merberg <merbergd at gmail.com> wrote:
>
>> Hello biopython world,
>>
>> For my last job, I wrote some python code to categorize and describe
>> sequence changes of many types. I used biopython to handle sequences and
>> some basic functions like IO and translation, but I did not find a module
>> for reading variants/mutants and applying them to sequences.
>>
>> Some cases are trivial, but some are not. For example, a small deletion
>> in the nucleotide sequence may have no effect on the amino acid
>> corresponding to the position of the affected codon, yet still shift the
>> reading frame and change downstream amino acids. Protein changes caused by
>> deletions or insertions of 3, 6, 9, ... nucleotides can also be tricky to
>> calculate.
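
A minimal sketch of the two deletion cases described above, using only the
Python standard library. The helper names (translate, apply_deletion,
deletion_effect) and the toy coding sequence are illustrative assumptions, not
an existing Biopython API. Positions are 1-based and inclusive, as in HGVS
c. coordinates. The sketch does not parse HGVS strings and does not apply the
HGVS rule of shifting a deletion in a repeat to its most 3' position.

```python
# Standard genetic code (NCBI translation table 1), written in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def translate(cds):
    """Translate complete codons from the first base, stopping after a stop."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)


def apply_deletion(cds, start, end):
    """Delete CDS positions start..end (1-based, inclusive)."""
    if not 1 <= start <= end <= len(cds):
        raise ValueError(f"invalid deletion range {start}-{end}")
    return cds[:start - 1] + cds[end:]


def deletion_effect(cds, start, end):
    """Return (mutated protein, is_frameshift, first changed residue or None)."""
    ref_protein = translate(cds)
    alt_protein = translate(apply_deletion(cds, start, end))
    is_frameshift = (end - start + 1) % 3 != 0
    first_changed = None
    for i in range(max(len(ref_protein), len(alt_protein))):
        # Slicing returns "" past the end, so a length difference also counts.
        if ref_protein[i:i + 1] != alt_protein[i:i + 1]:
            first_changed = i + 1
            break
    return alt_protein, is_frameshift, first_changed


cds = "ATGGGACATTGGTAA"  # toy CDS (Met-Gly-His-Trp-Ter), not a real gene
print(translate(cds))
# Deleting base 6 (inside codon 2): Gly2 is unchanged, residues from 3 on differ.
print(deletion_effect(cds, 6, 6))
# Deleting bases 5-7 keeps the frame but merges codons 2 and 3 into a new codon.
print(deletion_effect(cds, 5, 7))
```

In the second case the three deleted bases span two codons, so two residues
(Gly2 and His3) are replaced by a single Asp. This is why in-frame deletions
cannot be handled by simply dropping one amino acid.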
>>
>> My question is whether there is a biopython module to read variants in a
>> standard format (see for example http://varnomen.hgvs.org/)? Along with
>> the variant objects there could be a set of methods to operate on mutated
>> sequences. Does the community think that this would be useful if it does
>> not already exist?
>>
>> I implemented many functions for these sorts of operations, but I
>> realized soon afterwards that there are probably better ways to do much of
>> it. I always wanted to redo the work, but never had time. Now I have time,
>> but am not at that job. If it would be useful to the community, I may be
>> able to take it on as a contribution to biopython.
>>
>> A caveat is that I don’t have experience contributing to multi-developer
>> projects. I try to write clean, well documented code and I’m familiar with
>> the basics of git. So, it’s OK if you’d prefer that I start with something
>> smaller (like unit tests or documentation). Just let me know.
>>
>> Dave Merberg
>>
>> _______________________________________________
>> Biopython mailing list - Biopython at biopython.org
>> https://mailman.open-bio.org/mailman/listinfo/biopython
>>