[Bioperl-l] sets of sequences - how to read?

Carnë Draug carandraug+dev at gmail.com
Thu May 16 01:53:55 UTC 2013


Hi

When accessing Entrez Gene through eutils to fetch multiple genes, NCBI
now returns an Entrezgene-Set[1] rather than a list of Entrezgene records.
This change must have happened sometime in the last two months. Compare:

use Bio::DB::EUtilities;

my %sets = (
  eutil   => 'efetch',
  db      => 'gene',
  retmode => 'text',
  rettype => 'asn1',
  email   => 'bioperl-l@lists.open-bio.org',
);

## this mimics the previous behaviour of the NCBI server, but the
## multiple requests will annoy their servers
my @ids = (3014, 85235);
my $response;
foreach (@ids) {
  my $fetcher = Bio::DB::EUtilities->new(%sets, id => $_);
  $response .= $fetcher->get_Response->content;
}
## $fetcher is scoped to the loop; print the accumulated content instead
print $response;

## this used to be the right way to do it, but now returns an Entrezgene-Set
my $fetcher = Bio::DB::EUtilities->new(%sets, id => \@ids);
print $fetcher->get_Response->content;

There is no module to read an Entrezgene-Set in Perl at the moment,
since Bio::ASN1::EntrezGene is not able to handle them. I have
contacted the module author and sent him a fix[2], and he said he'll try
to look into it next week.

However, even with the fix there is another problem. How would one
access a set of sequences using the Bio::SeqIO API? There is no method
to do that. One could ignore the sets and make next_seq return the
next sequence regardless of which set it belongs to, but then we lose
the grouping information. After all, it's perfectly valid to have
multiple Entrezgene-Sets in one file. What would be the right way to
do this?
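
To make the question concrete, here is one possible shape, purely as a
sketch: a reader that offers both a set-level and a record-level
iterator. The package name SetReader, the method next_set, and the
placeholder records are all hypothetical; nothing like this exists in
Bio::SeqIO today, and plain strings stand in for parsed Entrezgene
objects.

```perl
use strict;
use warnings;

## Hypothetical reader, not part of BioPerl. It is built from a list of
## sets, each set being an array ref of records.
package SetReader;

sub new {
  my ($class, @sets) = @_;
  return bless { sets => [@sets], current => [] }, $class;
}

## return what is left of the current set, or else the next whole set
sub next_set {
  my $self = shift;
  if (@{ $self->{current} }) {
    my $rest = $self->{current};
    $self->{current} = [];
    return $rest;
  }
  return shift @{ $self->{sets} };
}

## return the next record, crossing set boundaries transparently
sub next_seq {
  my $self = shift;
  until (@{ $self->{current} }) {
    my $set = shift @{ $self->{sets} };
    return unless defined $set;
    $self->{current} = [@$set];
  }
  return shift @{ $self->{current} };
}

package main;

## placeholder records standing in for two sets read from one file
my $in = SetReader->new([qw(rec-A rec-B)], [qw(rec-C)]);
while (my $set = $in->next_set) {
  print scalar(@$set), " record(s): @$set\n";
}
```

With this shape, callers who only care about individual records keep
using next_seq, while callers who need the grouping use next_set. What
next_set should do when a set has been partly consumed is an open
design choice; the sketch returns the remainder.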

Carnë

[1] http://0-www.ncbi.nlm.nih.gov.elis.tmu.edu.tw/IEB/ToolBox/CPP_DOC/asn_spec/Entrezgene-Set.html
[2] https://github.com/carandraug/bio-asn1-entrezgene/commit/69d505056d8b7897df6271ffb7a5f39d58873c6b



