[Biopython] fetching chromosome IDs given the organism ID
Vladislav Petyuk
petyuk at gmail.com
Mon Nov 21 22:13:00 UTC 2011
Thanks for pointing in the right direction.
I also e-mailed Entrez help desk about this problem.
Their advice seems to be to go through "nuccore" and to restrict the
results by adding "srcdb_refseq[prop]" to the query line:
http://www.ncbi.nlm.nih.gov/nuccore?term=%22Cyanothece%20sp.%20ATCC%2051142%22[orgn]%20AND%20srcdb_refseq[prop]
That works well for some less extensively studied organisms such as
Cyanothece and returns the right number of records, which is 6.
But for human it returns 62051 records instead of 25 (chromosomes +
mitochondrial DNA).
http://www.ncbi.nlm.nih.gov/nuccore?term=%22homo%20sapiens%22[orgn]%20AND%20srcdb_refseq[prop]
After tuning the query a little, this one seems to give reasonable
results:
(("homo sapiens"[orgn] AND srcdb_refseq[prop]) AND 168[BioProject]) NOT
patches NOT contig
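
For reference, here is a minimal sketch of running the two nuccore queries above through Biopython. The helper names build_refseq_query and search_nuccore are made up for illustration and are not part of Biopython, and the email address is a placeholder. Only Entrez.esearch and Entrez.read are real Biopython calls. I have not checked what counts the NCBI returns for these terms today.

```python
def build_refseq_query(organism, bioproject=None, exclude=()):
    """Compose a nuccore search term restricted to RefSeq records.

    Hypothetical helper: it only assembles the query strings quoted
    in this thread, it does not contact the NCBI.
    """
    query = f'"{organism}"[orgn] AND srcdb_refseq[prop]'
    if bioproject is None and not exclude:
        return query
    query = f"({query})"
    if bioproject is not None:
        # Narrow the search to a single BioProject, e.g. 168 for human.
        query = f"({query} AND {bioproject}[BioProject])"
    for word in exclude:
        # Drop records matching unwanted words such as "patches" or "contig".
        query += f" NOT {word}"
    return query


def search_nuccore(term, email, retmax=100):
    """Run an ESearch against nuccore; return (total count, list of IDs)."""
    # Imported here so the query builder above works without Biopython.
    from Bio import Entrez

    Entrez.email = email  # the NCBI asks for a contact address
    handle = Entrez.esearch(db="nuccore", term=term, retmax=retmax)
    record = Entrez.read(handle)
    handle.close()
    return int(record["Count"]), list(record["IdList"])


# Example use (needs network access and Biopython installed):
# term = build_refseq_query("homo sapiens", bioproject=168,
#                           exclude=("patches", "contig"))
# count, ids = search_nuccore(term, email="you@example.com")
```

Keeping the query construction separate from the network call makes it easy to print the term and paste it into the nuccore web page to compare the results.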
Tweaking nuccore queries seems like a hack rather than a solution.
There ought to be a direct relationship between the databases: Genome ->
Genome Project -> Nuccore
This query returns the right thing, but in HTML format (even if
&rettype=gb).
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&term=59013
or
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&term=168
I guess (hope) that GenBank format is something that will be added in the
future, unless I am overlooking something.
Cheers,
Vlad
On Mon, Nov 21, 2011 at 8:52 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Thu, Nov 17, 2011 at 11:09 PM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
> > On Thu, Nov 17, 2011 at 10:51 PM, Vladislav Petyuk <petyuk at gmail.com>
> wrote:
> >> I am trying to fetch the chromosome IDs for a given genome.
> >> For example Cyanothece sp 51142 has 2 chromosomes and 4 plasmids
> >> http://www.ncbi.nlm.nih.gov/genome?term=1608%5Buid%5D#tabs-1608-2
> >> The piece of Biopython code that used to work for me is:
> >> #---------------------
> >> url = Entrez.esearch(db="genome", term="txid43989")
> >> record = Entrez.read(url)
> >> chromosomeIDs = record["IdList"]
> >> #---------------------
> >> Not anymore. Now it returns the organism id, which is 1608.
> >
> > That's annoying of the NCBI to change things.
> >
>
> The NCBI have just made a public announcement by email today
> (21 Nov 2011), and apologized for the lack of notice:
>
>
> http://www.ncbi.nlm.nih.gov/mailman/pipermail/utilities-announce/2011-November/000083.html
>
> Judging from the URL it was also on their news page the day you
> found the problem, but I hadn't seen that then:
>
> http://www.ncbi.nlm.nih.gov/About/news/17Nov2011.html
>
> It looks like a sensible long term change to the genome database.
>
> Regards,
>
> Peter
>