[Bioperl-l] EntrezGene ASN parser
Stefan Kirov
skirov at utk.edu
Wed Mar 30 17:31:44 EST 2005
I just finished a Bioperl EntrezGene parser based on Mingyi Liu's ASN
Gene parser. It creates two main objects: a Bio::Seq object, which
contains most of the data, such as references, description, map location,
etc.; and a Bio::Cluster::SequenceFamily object, which contains the
refseqs and the gene structure (through NT/NC annotation, represented as
Bio::SeqFeature::Gene objects). I also make the uncaptured data
available: each time some data is transferred from the hash that
represents the parsed data, I delete the respective key.
Everything that remains is considered uncaptured. I am doing this because
some records could be non-compliant, or NCBI may simply supply new data.
Naturally, there will be some data that is not interesting and
therefore is not captured (there is a lot of redundant data in
EntrezGene). So the parser would be called like this:
my ($egene,$assoc_seq,$uncaptured)=$egparser->next_seq;
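The capture-by-deletion idea described above can be sketched in plain Perl. This is only an illustration, not the parser's actual code: the record keys and the capture_fields helper are hypothetical names made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: move the wanted keys out of the parsed hash and
# return them; whatever stays behind in %$parsed is the "uncaptured" data.
sub capture_fields {
    my ($parsed, @wanted) = @_;
    my %captured;
    for my $key (@wanted) {
        next unless exists $parsed->{$key};
        # delete returns the removed value, so transfer and removal
        # happen in a single step
        $captured{$key} = delete $parsed->{$key};
    }
    return \%captured;
}

# Made-up record standing in for one parsed Entrez Gene entry
my %record = (
    description    => 'example gene',
    map_location   => '1p36',
    new_ncbi_field => 'something the parser does not know about',
);

my $captured   = capture_fields(\%record, qw(description map_location));
my $uncaptured = \%record;    # only the keys nobody claimed remain

print join(',', sort keys %$captured),   "\n";
print join(',', sort keys %$uncaptured), "\n";
```

Because every transferred key is deleted, nothing has to be tracked separately: the leftover hash is the uncaptured data by construction.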
There are a few things I still need to add (Markers and GO are not yet in
these objects), but most of the work is done. Unless somebody objects, I
will commit the code (Bio::SeqIO::entrezgene?) once I have written the
documentation to match the standard.
A few notes:
1. It would be nice if there were a Bio::Annotation::DBLink::url method. It
makes sense (I think), since most DB links also refer to a web page.
2. It now takes 45 minutes to parse the whole human ASN file, which is 4
times slower. Keeping the uncaptured data slows things down a bit, so I
will introduce a -debug option. Anyway, I do not think the speed is going
to be an issue.
3. Due to the cyclic reference in the GeneStructure object, I am removing
Transcript->{parent} in the parser. This code should be deleted once
the Transcript object is fixed.
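To illustrate the cycle mentioned in note 3, here is a small sketch using plain hashes in place of the real gene structure and transcript objects. It shows the workaround of dropping the back-reference (commented out) next to a weak reference, which is one common way to handle such cycles in Perl. Whether a weak reference is the right fix for the Transcript object itself is only an assumption here.

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken isweak);

# Plain hashes standing in for the gene structure and a transcript;
# these are NOT the real Bioperl classes.
my $gene       = { name => 'gene1', transcripts => [] };
my $transcript = { name => 'tx1',   parent      => $gene };
push @{ $gene->{transcripts} }, $transcript;    # gene <-> transcript cycle

# Workaround as in note 3: drop the back-reference entirely.
# delete $transcript->{parent};

# Alternative (assumption): keep the back-reference but make it weak,
# so it no longer keeps the parent alive.
weaken($transcript->{parent});
my $is_weak = isweak($transcript->{parent}) ? 1 : 0;

# Once the last strong reference to the gene goes away, the weak
# back-reference becomes undef instead of leaking the whole structure.
undef $gene;
my $parent_freed = defined $transcript->{parent} ? 0 : 1;

print "weak: $is_weak, parent freed: $parent_freed\n";
```

With the strong back-reference left in place, neither hash would be freed until the program exits, because each one keeps the other's reference count above zero.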
There are also some other minor issues, but I think I will be able to
fix them by the end of the week.
Please let me know what you think.
Stefan