[Bioperl-l] Re: entrezgene binary ASN

Fri Sep 30 12:23:34 EDT 2005

I was halfway through adding pipe support to Bio::ASN1::EntrezGene 
when I realized that it is not a good solution.  My problem with the 
pipe approach is that it adds more trouble without really saving 
anything.

One superficial advantage of using a pipe directly is that you don't 
need to launch gene2xml first.  But:

1. Nobody needs to launch gene2xml manually.  In any shell/perl script 
that automates the download of the NCBI binary ASN files, just add a 
line that launches gene2xml right after the download.

2. Having the EntrezGene module handle it transparently would force the 
module to deal with multiple failure modes (gene2xml not installed? 
gene2xml choked? ...), let alone the hassle of changing the syntax of 
input_file.

Simply put, it's not worth it.
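
As a rough sketch of point 1, a download script could wrap the 
conversion step in a small function like the one below.  The function 
name convert_entrezgene and the GENE2XML variable are made up for 
illustration, and the gene2xml flags shown (-b T -c T -i ... -o ...) 
are my assumption of a typical invocation; please check them, and 
whichever option selects the output format you need, against the help 
output of your own gene2xml build.

```shell
#!/bin/sh
# Sketch only: convert a downloaded gzipped binary ASN.1 file right
# after download, handling the failure cases in the script rather than
# inside the parser module.
# ASSUMPTION: the converter name and its flags are illustrative; verify
# them against your gene2xml installation.
GENE2XML="${GENE2XML:-gene2xml}"

convert_entrezgene() {
    in_file=$1
    out_file=$2

    # Failure case 1: the converter is not installed.
    if ! command -v "$GENE2XML" >/dev/null 2>&1; then
        echo "error: $GENE2XML not found" >&2
        return 2
    fi

    # Write to a temporary name so that a failed run never leaves a
    # partial output file behind for the parser to pick up.
    tmp_file="$out_file.tmp.$$"
    if "$GENE2XML" -b T -c T -i "$in_file" -o "$tmp_file"; then
        mv "$tmp_file" "$out_file"
    else
        # Failure case 2: the converter choked on the input.
        status=$?
        rm -f "$tmp_file"
        echo "error: $GENE2XML failed on $in_file (exit $status)" >&2
        return 1
    fi
}
```

In a download script this would be called right after the download 
step, for example (file names hypothetical): 
convert_entrezgene Homo_sapiens.ags.gz Homo_sapiens.asn || exit 1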

Another proposed advantage is saving disk I/O.  In a sense it does (the 
gzipped binary files are much smaller), but that does not necessarily 
lead to shorter processing time, since the time gene2xml spends 
converting on the fly has to be counted as well.  Not to mention what 
happens if gene2xml chokes for whatever reason.

A major disadvantage of using a pipe is that any sort of seek operation 
on the input would perform abysmally.  For indexing and indexed entry 
retrieval, one simply has to pre-convert those gzipped binary files.
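
To illustrate the seeking problem in general terms, here is a minimal 
sketch of fetching one record by byte offset, which is what an index 
lookup boils down to.  The function names are hypothetical and this is 
not how the Bioperl indexers are implemented; it only contrasts a 
plain file, where the reader can jump to the offset, with a pipe, 
where everything before the offset has to be produced and thrown away 
on every lookup.

```shell
#!/bin/sh
# Sketch only: record retrieval by byte offset (0-based) and length.

# Plain file: dd is given the file itself, so it can position at the
# offset instead of reading everything before it.
fetch_from_file() {
    dd if="$1" bs=1 skip="$2" count="$3" 2>/dev/null
}

# Pipe on stdin: there is nothing to seek on, so the skipped bytes
# must first be generated by the upstream command and then discarded.
fetch_from_stream() {
    dd bs=1 skip="$1" count="$2" 2>/dev/null
}
```

For example, fetch_from_file data.txt 4 4 reads bytes 4 to 7 directly, 
while gzip -dc data.txt.gz | fetch_from_stream 4 4 has to decompress 
from the start of the file each time.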

As such, I feel there are compelling reasons to first convert the 
gzipped binary files to text files, then use the existing Bioperl 
modules to parse, index, and retrieve.  Any further input or discussion 
on the matter is welcome!

Thanks,

Mingyi

Michael Seewald wrote:

>Hi Stefan,
>
>There are ways to capture these errors. Perl exception handling might
>be a way to do it.
>
>On the other hand: Wouldn't incomplete .gz downloads throw an error
>right away? I have to check (but can't right now).
>
>Michael
>