[Bioperl-l] How (from where) to retrieve FieldInfo objects?

Sat Jul 2 13:22:52 UTC 2011

Just a few notes on this thread (just got back from vacation):

On Jul 1, 2011, at 9:36 PM, Carnë Draug wrote:

> 2011/6/30 Carnë Draug <carandraug+dev at gmail.com>:
>> On 29 June 2011 23:30, Smithies, Russell
>> <Russell.Smithies at agresearch.co.nz> wrote:
>>> How about just returning ASN.1 then parsing that?
>>> There's far more data in that format than any of the others.
>>> 
>>> my $factory = Bio::DB::EUtilities->new(-eutil      => 'esearch',
>>>                                        -term       => 'h2afx[sym] AND human[organism]',
>>>                                        -db         => 'gene',
>>>                                        -usehistory => 'y');
>>> 
>>> 
>>> my $hist  = $factory->next_History || die "No history data returned";
>>> 
>>> $factory->set_parameters(-eutil   => 'efetch',-history => $hist);
>>> 
>>> print Dumper $factory->get_Response;
>> 
>> When I do this, I get an XML with the ASN.1 inside the tag pre. Is it
>> supposed to be this way? Should I extract it myself? Shouldn't the
>> method do this? It's nice that I can get so much information but
>> wouldn't it be lighter on the NCBI server if I could ask only for the
>> info that I need rather than the whole record?

No, Bio::DB::EUtilities was intentionally designed to process only the XML that relates specifically to eutil operations, and it is decoupled from every other format (specifically those returned by efetch), as there are just too many.  It does not parse end-point data such as GenBank, Gene XML, or ASN.1.  Those should be handled by outside parsers.

There is a Gene ASN.1 parser, by the way: Bio::SeqIO::entrezgene.
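
A minimal sketch of reading a saved Gene ASN.1 file with that parser.  It assumes Bio::ASN1::EntrezGene is installed (the SeqIO module relies on it) and that next_seq returns the three-element list shown in the module's POD.  The sub name print_gene_ids is made up for illustration, and which fields end up in accession_number and display_id should be checked against the POD:

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Hypothetical helper: print one line per gene record found in $file
# and return the number of records read.
sub print_gene_ids {
    my ($file) = @_;
    my $in = Bio::SeqIO->new(-file => $file, -format => 'entrezgene');
    my $count = 0;
    while (1) {
        # The entrezgene parser returns a list, not a single Bio::Seq object.
        my ($gene, $structure, $uncaptured) = $in->next_seq;
        last unless defined $gene;
        printf "%s\t%s\n",
            $gene->accession_number // '',
            $gene->display_id       // '';
        $count++;
    }
    return $count;
}

# Only touch the filesystem when a file name is given on the command line.
print_gene_ids($ARGV[0]) if @ARGV;
```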

> After much work, I've done this and as such I'm sharing back the code
> in case someone comes across it. Basically, get_Response returns a
> HTML::Message object. Since I couldn't find a method to get it pretty,
> I used HTML::Parser to do it. It seems that the ASN.1/entrezgene are
> all inside the <pre> tag. Also, if there's more than one gene, all
> genes are inside the same <pre> tag. Here's the code I used.

More specifically, get_Response returns an HTTP::Response object, hence the name of the method.  The base class is HTTP::Message, not HTML::Message (very important difference, the message doesn't have to be HTML but can be XML, HTML, plain text, etc).
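
To make the difference concrete, here is a small sketch that builds a stand-in HTTP::Response by hand (no request is sent to NCBI; the status, header and body are invented for illustration) and calls the methods you would normally use on the object returned by get_Response:

```perl
use strict;
use warnings;
use HTTP::Response;

# Stand-in for what get_Response hands back; built by hand, no request made.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/plain' ],
    "Entrezgene-Set ::= {\n}\n",
);

print "class:         ", ref($response), "\n";
print "HTTP::Message: ", ($response->isa('HTTP::Message') ? "yes" : "no"), "\n";
print "success:       ", ($response->is_success ? "yes" : "no"), "\n";
# content_type returns a list in list context, so force scalar context here.
print "content type:  ", scalar($response->content_type), "\n";
print $response->content;
```

Checking content_type first tells you whether the body is XML, HTML or plain text before you pick a parser for it.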

> use Bio::DB::EUtilities;
> use HTML::Parser;
> 
> my @ids = qw(9555 3014);
> my $factory = Bio::DB::EUtilities->new(
>                                      -eutil   => 'efetch',
>                                      -db      => 'gene',
>                                      -id      => \@ids,
>                                      -retmode => 'asn1',
>                                      );
> my $html = $factory->get_Response->content;
> 
> my $parser = HTML::Parser->new(
>                                api_version => 3,
>                                start_h     => [\&handle_start],
>                                end_h       => [\&handle_end],
>                                text_h      => [\&handle_text, 'dtext'],
>                                report_tags => qw(pre),
>                              );
> my $seq;
> {
>  my $inside_tag = 0;
>  sub handle_start {
>    $inside_tag = 1;
>  }
>  sub handle_text {
>    $seq .= $_[0] if $inside_tag;  # append: text can arrive in several chunks
>    return 4;
>  }
>  sub handle_end {
>    $inside_tag = 0;
>  }
> }
> $parser->parse($html);
> 
> After running parse, $seq holds a sequence file that can be opened
> with Bio::SeqIO or written to disk.
> 
> Carnë

The interaction with eutils is supposed to be fairly low-level (in fact, there is a long-overdue refactoring of the internals that simplifies this somewhat, I just haven't had time to work on it).  In general, the rettype and retmode params should *both* be set just to be on the safe side.

For instance, to get plain text ASN.1 (not ASN.1 with HTML tags) use '-rettype' => 'asn1' and '-retmode' => 'text'.  Yeah, one would think it should be easier than that but NCBI has it set up this way.  See the following link for more:

http://www.ncbi.nlm.nih.gov/corehtml/query/static/efetchseq_help.html
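
A rough sketch of the above, using the same gene IDs as the earlier example.  The helper name asn1_text_params is made up for illustration, and whether NCBI honours this rettype/retmode pair for a given database should be checked against the efetch help page:

```perl
use strict;
use warnings;
use Bio::DB::EUtilities;

# Hypothetical helper: efetch parameters asking for plain-text Gene ASN.1.
sub asn1_text_params {
    my (@ids) = @_;
    return (
        -eutil   => 'efetch',
        -db      => 'gene',
        -id      => [@ids],
        -rettype => 'asn1',   # which data to return
        -retmode => 'text',   # how it is wrapped: plain text, no HTML
    );
}

# Only contact NCBI when gene IDs are given on the command line.
if (@ARGV) {
    my $factory  = Bio::DB::EUtilities->new(asn1_text_params(@ARGV));
    my $response = $factory->get_Response;
    die "efetch failed: ", $response->status_line, "\n"
        unless $response->is_success;
    print $response->content;
}
```

Saved under a name of your choosing and run with 9555 3014 as arguments, this should print the ASN.1 directly, so there is no <pre> tag to strip.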

chris