[Bioperl-l] sets of sequences - how to read?

Mon Jun 17 07:37:43 UTC 2013

On 17 May 2013 05:08, Fields, Christopher J <cjfields at illinois.edu> wrote:
> On May 15, 2013, at 8:53 PM, Carnë Draug <carandraug+dev at gmail.com> wrote:
>> Hi
>>
>> when accessing entrez gene using eutils to get multiple genes, NCBI
>> now returns an Entrezgene-Set[1] rather than a list of EntrezGene.
>> This change must have happened sometime in the last 2 months.
>>
>> [...]
>>
>> Carnë
>>
>> [1] http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/asn_spec/Entrezgene-Set.html
>
> This doesn't surprise me too much; I know there have been some changes brewing, but didn't know when they would land.  I guess that would be... <looks at watch>... now.

Hi,

for those interested, I have contacted NCBI about this and they have
reverted the change (see conversation below). Still, Entrezgene-Set is
a valid format, so the issue of reading such sets still exists.

Carnë

---------- Forwarded message ----------
Date: 17 May 2013 00:36
Subject: Entrezegene-Set: recent changes to E-utilities

Hi

I believe there was a recent change to the E-utilities service. When
fetching multiple ASN.1 Entrezgene records from the gene database, it
now returns an Entrezgene-Set instead of the usual list of
Entrezgene records, one after the other.

For example, this request returns an Entrezgene-Set:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=3014,85235&rettype=asn1&retmode=text

which used to be the same as a concatenation of:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=3014&rettype=asn1&retmode=text
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=85235&rettype=asn1&retmode=text

This is something new. I don't know exactly when it was introduced, but
it must have been sometime in the last 2 months.

I don't know about other programming languages, but at least in Perl
there is no module able to parse these files. I have already sent a
patch to the author of the module responsible for reading non-set
Entrezgene records, but who knows when it will be made available. The
only workaround is to make multiple requests, one for each UID, which
will obviously annoy your servers.
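
To make the difference between the two request styles concrete, here is a
minimal sketch in Python (not Perl) that only builds the URLs and makes no
network calls. The helper names efetch_url and per_uid_urls are hypothetical;
the endpoint and parameters are copied from the example URLs above.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the example URLs of this message.
EFETCH = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(uids):
    """Build one efetch URL asking for text ASN.1 gene records (hypothetical helper)."""
    params = {
        "db": "gene",
        "id": ",".join(str(uid) for uid in uids),
        "rettype": "asn1",
        "retmode": "text",
    }
    # safe="," keeps the comma between UIDs unescaped, matching the URLs above.
    return EFETCH + "?" + urlencode(params, safe=",")


def per_uid_urls(uids):
    """The workaround: one request per UID instead of a single batched request."""
    return [efetch_url([uid]) for uid in uids]


if __name__ == "__main__":
    print(efetch_url([3014, 85235]))      # one batched request
    for url in per_uid_urls([3014, 85235]):
        print(url)                        # one request per UID
```

With N UIDs the workaround issues N requests where the batched form issues
one, which is where the extra load comes from.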

As far as I am aware, there was no notification of this change to
E-utilities. We have a lot of code that worked fine for years, until it
started to fail last month. And no one using Perl will be able to parse
these files until a fix is released. Is there any way this change can
be reverted?

---------- Forwarded message ----------
Date: 23 May 2013 04:53
Subject: Re: Entrezegene-Set: recent changes to E-utilities

Thanks very much for your report. I will discuss this with the Gene
development team to see why this change occurred and get back to you.
Out of curiosity, have you considered using the XML format for Gene
(&retmode=xml)? There are a variety of XML parsers for Perl that should be
able to read Gene XML.

---------- Forwarded message ----------
Date: 24 May 2013 13:01
Subject: Re: Entrezegene-Set: recent changes to E-utilities

Thank you for looking into this.

While there are several XML parsers for Perl, there is not one that
will return a Bio::Seq object (that is, one that is Bio::SeqIO
compliant). Of course I could use one of the XML parsers to write my
own, but then I might as well fix the entrezgene parser to deal with
Entrezgene-Sets, which is what I'm doing. I have already proposed a
patch to them, but the new concept of a set of sequences does not
really fit the design of Bio::Seq.
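
One way to sidestep the design problem is to split a set back into standalone
records before the existing parser sees it. What follows is only a rough
sketch of that idea, in Python rather than Perl, and it is not the patch
mentioned above. It assumes the text ASN.1 has the shape
"Entrezgene-Set ::= { {...}, {...} }", that a standalone record starts with
"Entrezgene ::= {", and that strings are double-quoted with "" as an embedded
quote. The function name split_entrezgene_set and the toy input are made up
for illustration.

```python
def split_entrezgene_set(text):
    """Split a text ASN.1 Entrezgene-Set into standalone Entrezgene records.

    Assumes the layout 'Entrezgene-Set ::= { {...}, {...} }' (see lead-in).
    """
    header = "Entrezgene-Set ::="
    start = text.find(header)
    if start < 0:
        raise ValueError("no Entrezgene-Set header found")

    records = []
    depth = 0          # current brace nesting level
    in_string = False  # True while inside a double-quoted string
    rec_start = None   # index of the '{' that opens the current record

    for i in range(start + len(header), len(text)):
        ch = text[i]
        if ch == '"':
            # A doubled quote ("") toggles twice, so the state stays correct.
            in_string = not in_string
        elif in_string:
            continue   # braces inside strings do not count
        elif ch == "{":
            depth += 1
            if depth == 2:       # depth 1 is the set, depth 2 is a record
                rec_start = i
        elif ch == "}":
            if depth == 2:
                records.append("Entrezgene ::= " + text[rec_start:i + 1])
            depth -= 1
            if depth == 0:       # closing brace of the set itself
                break
    return records


if __name__ == "__main__":
    # Toy input, not real NCBI output.
    sample = 'Entrezgene-Set ::= {\n  {\n    gene { locus "A}B" }\n  },\n  {\n    gene { locus "C" }\n  }\n}\n'
    for record in split_entrezgene_set(sample):
        print(record)
        print("--")
```

Each returned string could then go to a parser that only understands single
Entrezgene records, so the Bio::Seq side would not need a notion of sets.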

Please do let me know of any further news on this. Thank you again,

---------- Forwarded message ----------
Date: 13 June 2013 22:08
Subject: Re: Entrezegene-Set: recent changes to E-utilities

The fix for this should now be live. Let us know if you have further
problems with this.

Regards,