[Biopython] Access Entrez gene DB using rettype 'gb'
Brad Chapman
chapmanb at 50mail.com
Fri Dec 10 13:10:26 UTC 2010
Michiel, David and Peter;
> > That's right. The gene database doesn't give back the "native" XML
> > format which Entrez.read deals with. Instead it's got custom XML output:
> >
> > http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html#gene
>
> Really? What's the difference between "native" and "custom" XML? The
> Entrez Gene XML example from this linked gets parsed perfectly fine by
> Entrez.read.
Nice one. I seriously underestimated the power of your code to deal
with that nastiness. Here's an updated version that uses Entrez.read
instead of the custom XML parsing with ElementTree:
https://gist.github.com/727625
from Bio import Entrez
def fetch_gene_coordinates(search_term):
handle = Entrez.esearch(db="gene", term=search_term)
rec = Entrez.read(handle)
gene_id = rec["IdList"][0] # assuming best match works
handle = Entrez.efetch(db="gene", id=gene_id, retmode="xml")
rec = Entrez.read(handle)[0]
gene_locus = rec["Entrezgene_locus"][0]
region = gene_locus["Gene-commentary_seqs"][0]["Seq-loc_int"]["Seq-interval"]
start = int(region["Seq-interval_from"]) + 1
end = int(region["Seq-interval_to"]) + 1
gi_id = region["Seq-interval_id"]["Seq-id"]["Seq-id_gi"]
strand = region["Seq-interval_strand"]["Na-strand"].attributes["value"]
return gi_id, start, end, strand
def get_fasta_seq(gi_id, start, end, strand):
strand = 2 if strand.lower() == "minus" else 1
handle = Entrez.efetch(db="nucleotide", rettype="fasta", id=gi_id,
seq_start=start, seq_stop=end, strand=strand)
return handle.read()
Entrez.email = "yours at mail.com"
search_term = "fliC ct18"
gi_id, start, end, strand = fetch_gene_coordinates(search_term)
print get_fasta_seq(gi_id, start, end, strand)
Brad
More information about the Biopython
mailing list