[Biopython] How to print variants ?

Mon Jan 18 10:05:05 UTC 2010

On Sun, Jan 17, 2010 at 9:48 PM, Anirban Bhattachariya <anbhat at utu.fi> wrote:
> Hi ,
>
> Suppose we want to study how mutations/SNPs affect binding or some other
> biochemical reaction. Let's also assume that we have a motif or motifs we want
> to test against. These variants are listed in sequence files, but only the
> original protein sequence is given. To test motifs against variants, we need the
> complete protein sequences. Let's say our protein has 75 variants, so we need the
> original + 75 protein sequences to test with motifs. My intention is to make a
> list of those 75 proteins.

From your earlier emails, you are working with a GenBank file for P06276:
http://lists.open-bio.org/pipermail/biopython/2010-January/006120.html
i.e. http://www.ncbi.nlm.nih.gov/protein/116353
or the original SwissProt/UniProt database, as a plain text "swiss" file:
http://www.uniprot.org/uniprot/P06276.txt

Either of the plain text formats (GenBank or SwissProt) will force you to
parse strings like "T -> M (in BChE deficiency; dbSNP:rs56309853)." to
pull out this information in a usable form, whichever GenBank or SwissProt
plain text parser you use. This is possible, but a bit fiddly.
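
For what it's worth, the string route might look something like the rough
sketch below. The regular expression and the helper name parse_variant_text
are my own assumptions, and it only copes with simple substitutions of the
form "X -> Y (note)."; anything else just returns None.

```python
import re

# Assumed pattern for simple substitutions such as
# "T -> M (in BChE deficiency; dbSNP:rs56309853)."
VARIANT_RE = re.compile(
    r"^(?P<original>[A-Z]+) -> (?P<variation>[A-Z]+)"
    r"(?: \((?P<note>.*)\))?\.?$"
)


def parse_variant_text(text):
    """Return (original, variation, note), or None if the text does not fit."""
    match = VARIANT_RE.match(text.strip())
    if match is None:
        return None
    return match.group("original"), match.group("variation"), match.group("note")


print(parse_variant_text("T -> M (in BChE deficiency; dbSNP:rs56309853)."))
```

Every extra description style needs another special case, which is why I
call this approach fiddly.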

Looking at the SwissProt page, they have a table of these variants:
http://www.uniprot.org/uniprot/P06276

UniProt also offer a GFF and FASTA file, neither of which are helpful here:
http://www.uniprot.org/uniprot/P06276.gff
http://www.uniprot.org/uniprot/P06276.fasta

However, the XML format looks much nicer:
http://www.uniprot.org/uniprot/P06276.xml

It has well tagged entries for each variant, e.g.

<feature type="sequence variant" description="In BChE deficiency;
dbSNP:rs56309853." id="VAR_040012">
<original>T</original>
<variation>M</variation>
<location>
<position position="52"/>
</location>
</feature>
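
Entries like this can be pulled out with the standard library alone. Here is
a minimal sketch using xml.etree.ElementTree. The function names are
hypothetical, and SAMPLE is just the fragment above wrapped in a made-up
<entry> root element. I believe the real file puts its elements in an XML
namespace, so the sketch compares local tag names rather than full ones.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def iter_variants(xml_text):
    """Yield (id, position, original, variation) for simple variants."""
    root = ET.fromstring(xml_text)
    for feature in root.iter():
        if local_name(feature.tag) != "feature":
            continue
        if feature.get("type") != "sequence variant":
            continue
        fields = {}
        for child in feature.iter():
            name = local_name(child.tag)
            if name in ("original", "variation"):
                fields[name] = child.text
            elif name == "position":
                fields["position"] = int(child.get("position"))
        # Skip features lacking any of the three pieces (e.g. no single position)
        if len(fields) == 3:
            yield (feature.get("id"), fields["position"],
                   fields["original"], fields["variation"])


# Made-up wrapper around the fragment quoted above, for illustration only
SAMPLE = """<entry>
<feature type="sequence variant" description="In BChE deficiency; dbSNP:rs56309853." id="VAR_040012">
<original>T</original>
<variation>M</variation>
<location>
<position position="52"/>
</location>
</feature>
</entry>"""

for variant in iter_variants(SAMPLE):
    print(variant)
```

Features that do not have an original, a variation and a single position are
skipped rather than guessed at.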

Note there is some work in development to add a UniProt XML parser to SeqIO,
which would load these files as SeqRecord objects, but for your task it would
probably be simpler to parse the XML yourself (using one of the standard
Python XML libraries) to pull out just these variations. See also:
http://lists.open-bio.org/pipermail/biopython-dev/2010-January/007244.html
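
Once each variant is available as (id, position, original, variation),
building the list of variant proteins is mostly string slicing. The sketch
below assumes 1-based positions counted along the full sequence in the
record. The function names and the toy sequence and variants are invented
for illustration; they are not taken from P06276.

```python
def apply_variant(sequence, position, original, variation):
    """Return a copy of sequence with one substitution applied."""
    start = position - 1  # assumption: positions are 1-based
    end = start + len(original)
    found = sequence[start:end]
    if found != original:
        # Guards against off-by-one errors or a mismatched reference sequence
        raise ValueError(
            "Expected %r at position %i, found %r" % (original, position, found)
        )
    return sequence[:start] + variation + sequence[end:]


def build_variant_sequences(sequence, variants):
    """Map each variant id to its full-length variant sequence."""
    return {
        var_id: apply_variant(sequence, position, original, variation)
        for var_id, position, original, variation in variants
    }


# Toy data, invented purely to show the call
toy_sequence = "MKTAYIAK"
toy_variants = [("VAR_TOY1", 3, "T", "M"), ("VAR_TOY2", 5, "Y", "F")]

for var_id, seq in build_variant_sequences(toy_sequence, toy_variants).items():
    print(var_id, seq)
```

Checking the original residue before substituting should catch most
numbering mistakes early.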

Which would you prefer? Working with XML or fuzzy string formats?

Peter