[Biopython] fetching chromosome IDs given the organism ID

Mon Nov 21 22:13:00 UTC 2011

Thanks for pointing me in the right direction.

I also e-mailed the Entrez help desk about this problem.

Their advice is to go through "nuccore". To restrict the results, they
suggest adding "srcdb_refseq[prop]" to the query:

http://www.ncbi.nlm.nih.gov/nuccore?term=%22Cyanothece%20sp.%20ATCC%2051142%22[orgn]%20AND%20srcdb_refseq[prop]

That works well for some less extensively studied organisms such as
Cyanothece, and returns the right number of records, which is 6.

But for human it returns 62051 records instead of 25 (chromosomes +
mitochondrial DNA):

http://www.ncbi.nlm.nih.gov/nuccore?term=%22homo%20sapiens%22[orgn]%20AND%20srcdb_refseq[prop]

After tuning the query a little, this one seems to give reasonable
results:

(("homo sapiens"[orgn] AND srcdb_refseq[prop]) AND 168[BioProject]) NOT
patches NOT contig
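
For reference, here is a rough Biopython sketch of how that tuned query
could be run against nuccore. It is untested against the live service. The
helper names (build_chromosome_query, fetch_nuccore_ids), the retmax value
and the e-mail address are placeholders of mine, not anything NCBI
suggested. Only Entrez.esearch and Entrez.read are actual Biopython calls.

```python
def build_chromosome_query(organism, bioproject_id):
    """Assemble the nuccore search term quoted in the message above."""
    return (
        f'(("{organism}"[orgn] AND srcdb_refseq[prop]) '
        f"AND {bioproject_id}[BioProject]) NOT patches NOT contig"
    )


def fetch_nuccore_ids(term, email, retmax=100):
    """Run the search against nuccore and return the matching record IDs.

    Hypothetical helper: retmax=100 is an arbitrary illustrative limit.
    """
    # Imported here so the query builder above can be used without Biopython.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks callers to supply a contact address
    handle = Entrez.esearch(db="nuccore", term=term, retmax=retmax)
    try:
        record = Entrez.read(handle)
    finally:
        handle.close()
    return record["IdList"]


if __name__ == "__main__":
    term = build_chromosome_query("homo sapiens", 168)
    print(term)
    # Needs network access; replace the placeholder address before running:
    # ids = fetch_nuccore_ids(term, email="you@example.org")
```

For Cyanothece the same builder would be called with the organism name and
its own project number instead of 168.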

Tweaking nuccore queries seems like a hack rather than a solution.

There ought to be a direct relationship between the databases: Genome ->
Genome Project -> Nuccore
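
Cross-database relationships like this are normally followed with ELink,
so a sketch of what I have in mind is below. Whether NCBI actually exposes
genome -> nuccore links for a given record is an assumption I have not
verified. The function names are made up for illustration, and only
Entrez.elink and Entrez.read are actual Biopython calls.

```python
def linked_ids(elink_record):
    """Collect target IDs from a parsed ELink result (a list of link sets)."""
    ids = []
    for linkset in elink_record:
        # A link set with no links may lack "LinkSetDb" or have it empty.
        for linkset_db in linkset.get("LinkSetDb", []):
            ids.extend(link["Id"] for link in linkset_db["Link"])
    return ids


def genome_to_nuccore(genome_id, email):
    """Hypothetical helper: ask ELink for nuccore records linked to a genome ID."""
    # Imported here so linked_ids() can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks callers to supply a contact address
    handle = Entrez.elink(dbfrom="genome", db="nuccore", id=genome_id)
    try:
        record = Entrez.read(handle)
    finally:
        handle.close()
    return linked_ids(record)


if __name__ == "__main__":
    # Needs network access; 1608 is the Cyanothece genome ID quoted below.
    # print(genome_to_nuccore("1608", email="you@example.org"))
    pass
```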

This query returns the right thing, but in HTML format (even with
&rettype=gb):

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&term=59013

or

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&term=168

I guess (hope) that GenBank format is something that will be added in the
future, unless I am overlooking something.

Cheers,

Vlad

On Mon, Nov 21, 2011 at 8:52 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Thu, Nov 17, 2011 at 11:09 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> > On Thu, Nov 17, 2011 at 10:51 PM, Vladislav Petyuk <petyuk at gmail.com> wrote:
> >> I am trying to fetch the chromosome IDs for a given genome.
> >> For example Cyanothece sp 51142 has 2 chromosomes and 4 plasmids
> >> http://www.ncbi.nlm.nih.gov/genome?term=1608%5Buid%5D#tabs-1608-2
> >> The piece of Biopython code that used to work for me is:
> >> #---------------------
> >> url = Entrez.esearch(db="genome", term="txid43989")
> >> record = Entrez.read(url)
> >> chromosomeIDs = record["IdList"]
> >> #---------------------
> >> Not anymore. Now it returns the organism id, which is 1608.
> >
> > That's annoying of the NCBI to change things.
> >
>
> The NCBI have just made a public announcement by email today
> (21 Nov 2011), and apologized for the lack of notice:
>
>
> http://www.ncbi.nlm.nih.gov/mailman/pipermail/utilities-announce/2011-November/000083.html
>
> Judging from the URL it was also on their news page the day you
> found the problem, but I hadn't seen that then:
>
> http://www.ncbi.nlm.nih.gov/About/news/17Nov2011.html
>
> It looks like a sensible long term change to the genome database.
>
> Regards,
>
> Peter
>