[Bioperl-l] EntrezGene ASN parser

Wed Mar 30 17:31:44 EST 2005

I just finished a Bioperl EntrezGene parser based on Mingyi Liu's ASN 
Gene parser. It creates two main objects: a Bio::Seq object, which 
contains most of the data (references, description, map location, 
etc.), and a Bio::Cluster::SequenceFamily object, which contains the 
refseqs and the gene structure (through NT/NC annotation, represented as 
Bio::SeqFeature::Gene objects). I also make the uncaptured data 
available. Each time some data is transferred from the hash that 
represents the parsed data, I delete the respective key; whatever 
remains is considered uncaptured. I am doing this because some 
records could be non-compliant, or NCBI may simply supply new data. 
Naturally, some data is not interesting and is therefore not captured 
(there is a lot of redundant data in EntrezGene). So the parser would 
be used like this:
    my ($egene, $assoc_seq, $uncaptured) = $egparser->next_seq;
There are a few things I still need to add (Markers and GO are not yet 
in these objects), but most of the work is done. Unless somebody objects, 
I will commit the code (Bio::SeqIO::entrezgene?) once I have written the 
documentation to match the standard.
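
The uncaptured-data bookkeeping described above can be sketched roughly 
as follows. This is only an illustration on a plain hash, not the parser 
code itself: the key names and the capture_fields helper are made up for 
the example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the hash produced by the ASN parser;
# the key names are illustrative, not the real EntrezGene field names.
my %parsed = (
    'gene-id'     => 1234,
    'description' => 'example gene',
    'maploc'      => '1p36',
    'new-field'   => 'something added to the format later',
);

# Move the fields we know about into a captured structure, deleting
# each key from the parsed hash as soon as its value is consumed.
sub capture_fields {
    my ($parsed, @wanted) = @_;
    my %captured;
    for my $key (@wanted) {
        next unless exists $parsed->{$key};
        $captured{$key} = delete $parsed->{$key};
    }
    return \%captured;
}

my $captured = capture_fields(\%parsed, 'gene-id', 'description', 'maploc');

# Whatever was never deleted is, by definition, the uncaptured data.
my $uncaptured = \%parsed;
```

The point of deleting as you go is that no separate list of "known" 
fields has to be maintained for reporting: anything unexpected in a 
record simply stays behind in the hash.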
A few notes:
1. It would be nice if there were a Bio::Annotation::DBLink::url method. 
It makes sense (I think), since most DB links also refer to a webpage.
2. It now takes 45 minutes to parse the whole human ASN file, which is 4 
times slower. Keeping uncaptured data slows things down a bit, so I will 
introduce a -debug option. Anyway, I do not think speed is going to be 
an issue.
3. Due to the cyclic reference in the GeneStructure object, I am removing 
the Transcript->{parent} in the parser. This code should be deleted once 
the Transcript object is fixed.
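
To illustrate note 3, here is a minimal sketch of the cycle using plain 
hashes in place of the real gene structure and transcript objects (the 
hash layout is an assumption made for the example). It shows the 
workaround of deleting the parent back-reference, and, as one possible 
alternative that I am not claiming is how the Transcript object will be 
fixed, weakening the back-reference with Scalar::Util.

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken isweak);

# Plain hashes standing in for the gene structure and its transcripts.
my $gene = { name => 'gene', transcripts => [] };
my $t1   = { name => 'transcript 1', parent => $gene };
my $t2   = { name => 'transcript 2', parent => $gene };
push @{ $gene->{transcripts} }, $t1, $t2;

# At this point gene -> transcript -> gene forms a reference cycle, so
# Perl's reference counting would never free these structures.

# Workaround described in the note: drop the back-reference entirely.
delete $t1->{parent};

# Possible alternative: keep the back-reference but make it weak, so it
# no longer counts toward the gene's reference count.
weaken($t2->{parent});
```

With the weak reference the transcript can still reach its parent for 
as long as the gene itself is alive, whereas after the delete that link 
is simply gone.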
There are also some other minor issues, but I think I will be able to 
fix them by the end of the week.
Please let me know what you think.
Stefan