[Bioperl-l] Entrez Gene and bioperl-db

Sean Davis sdavis2 at mail.nih.gov
Mon Jan 17 09:09:37 EST 2005


Peter,

Thanks for doing all this!

Just a bit more by way of an update.  I checked with some folks in our (NHGRI)
bioinformatics core.  It sounds like the closest thing to XML that NCBI
might offer would be an ASN.1-to-XML converter and NOT the XML files
themselves, as Peter already stated.  They have a converter (intended for
public use) that works for every ASN.1 file except the gene files.  There is
no definite date for completion as far as I know.  They have also mentioned
a bulk, web-based ASN.1-to-XML tool, but I agree with Peter that this will
have significant limitations for "online" use with large datasets like
human/mouse/rat (though it might work well with a user agent).

Sean

----- Original Message ----- 
From: "Peter Robinson" <Peter.Robinson at t-online.de>
To: "Bioperl list" <bioperl-l at portal.open-bio.org>
Cc: "Peter Robinson" <Peter.Robinson at t-online.de>
Sent: Monday, January 17, 2005 6:06 AM
Subject: Re: [Bioperl-l] Entrez Gene and bioperl-db


> Hi list,
>
> here's an update on Entrez Gene.
> 1) NCBI apparently does not have plans to offer the files in XML format
> for FTP download. It is possible to download the files in XML format
> from the website, even including the files for the entire species with
> corresponding queries (although I haven't tried this yet). It seems this
> might be too complicated for many users and there could be issues of
> stability for browsers downloading files of that size.
>
>
> 2) I have completed two reasonably simple modules for parsing gene_info
> and gene2accession using the SeqIO interface. These are attached
> together with simple demo programs. These modules can be used to do some
> useful things. For instance, we often want to generate a list of
> correspondences between NCBI accession numbers and MGI accession numbers
> so as to be able to use MGI's Gene Ontology annotations for the mouse. I
> have included a script (accession2mgi.pl) that uses the above modules to
> parse gene_info and gene2accession to do this (you need to use both
> files).
>
> 3) In the meantime I have also gotten a lex/yacc parser in C to parse
> the species-specific Gene files (which is by far the most interesting
> file in the Entrez gene system). In principle this approach could be
> done in Perl -- straightforward but a lot of detail work. I will be
> needing this kind of thing for my work, so I will continue to work on
> this, and once it is bug-free in C I will think about ways of porting it
> to Bioperl (this might take a while). As I mentioned before on this
> list, if anybody else can do this more quickly please go ahead (but drop
> me a line); on the other hand, collaborators who like the idea of
> writing a grammar in the style of lex/yacc or ANTLR are also welcome.
>
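As a rough illustration of what point 3 would involve in Perl, here is a minimal sketch of a tokenizer plus a recursive brace parser for ASN.1 text. It is not Peter's lex/yacc grammar and not a BioPerl module: the sub names (tokenize, parse_block, parse_record) are hypothetical, the sample record is a hand-written, heavily abbreviated excerpt, and real Entrez Gene files contain value forms this sketch does not handle.

```perl
use strict;
use warnings;

# Break ASN.1 text into tokens: "::=", braces, commas, double-quoted
# strings (with "" as an escaped quote) and bare words.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    while ($text =~ /\G\s*(?:(::=|[{},])|"((?:[^"]|"")*)"|([^\s{},"]+))/gc) {
        if    (defined $1) { push @tokens, [punct => $1] }
        elsif (defined $2) { (my $s = $2) =~ s/""/"/g; push @tokens, [string => $s] }
        else               { push @tokens, [word => $3] }
    }
    $text =~ /\G\s*/gc;    # skip trailing whitespace
    die "cannot tokenize at offset " . pos($text) . "\n"
        if pos($text) < length $text;
    return @tokens;
}

# Called just after a '{' has been consumed; reads up to the matching '}'.
# Returns an array ref of comma-separated items; each item is an array ref
# of words/strings, with nested blocks represented as nested array refs.
sub parse_block {
    my ($tokens) = @_;
    my @items = ([]);
    while (my $tok = shift @$tokens) {
        my ($type, $val) = @$tok;
        if ($type eq 'punct') {
            return [ grep { scalar @$_ } @items ] if $val eq '}';
            if ($val eq ',') { push @items, []; next }
            if ($val eq '{') { push @{ $items[-1] }, parse_block($tokens); next }
        }
        push @{ $items[-1] }, $val;
    }
    die "unbalanced braces\n";
}

# Parse one "Type ::= { ... }" record into (type name, nested structure).
sub parse_record {
    my ($text) = @_;
    my @tokens = tokenize($text);
    my ($type, $assign, $open) = splice @tokens, 0, 3;
    die "expected 'Type ::= {'\n"
        unless $open && $assign->[1] eq '::=' && $open->[1] eq '{';
    return ($type->[1], parse_block(\@tokens));
}

# Hand-written, abbreviated sample (an assumption, not a real file excerpt).
my $sample = <<'ASN';
Entrezgene ::= {
  track-info {
    geneid 1,
    status live
  },
  type protein-coding,
  gene {
    locus "A1BG"
  }
}
ASN

my ($type, $tree) = parse_record($sample);
print "$type has ", scalar @$tree, " top-level fields\n";
print "first field: $tree->[0][0]\n";
```

The nested array refs are only a generic tree; turning them into Bio::Seq objects with annotation is the "lot of detail work" mentioned above.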
> --peter
>
>
> On Tue, 2005-01-11 at 02:33, Chris Mungall wrote:
>> Hi Peter
>>
>> Have you tried asking NCBI to make XML available as well as ASN? In
>> general they seem keen to offer both for most of their datasets. If not, I
>> believe the NCBI toolkit has an ASN->XML converter.
>>
>> Cheers
>> Chris
>>
>> On Thu, 6 Jan 2005, Peter Robinson wrote:
>>
>> > Dear Bioperlers,
>> >
>> > I have started looking at writing some modules to parse the new Entrez
>> > gene, which is kind of an expanded LocusLink. The really interesting
>> > files are species specific and are in the ASN.1 format, and I am still
>> > experimenting around with the best way of parsing them. To get started,
>> > I am looking at the tab-delimited flat files. It seems to me that it
>> > would be interesting to be able to parse gene_info and gene2accession
>> > using the Bio::SeqIO system, the other files such as gene2unigene seem
>> > less suited for this (the latter has just two entries which could be
>> > parsed ad hoc easily enough).
>> >
>> > In any case, I am sending a proposed module Bio::SeqIO::geneinfo.pm as
>> > well as a test script (which contains a small excerpt of gene_info in
>> > the data section) for comments and criticism to the list. I am presently
>> > working on another module for Bio::SeqIO::gene2accession and plan to
>> > write a demo script using both modules to convert NCBI accession numbers
>> > to MGI accession numbers (which is something one might want to do in
>> > order to use Gene Ontology for Affymetrix data, although one needs
>> > additional work for probesets which are only related to ESTs).
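The accession-to-MGI conversion described above amounts to a join of the two tab-delimited files on GeneID. The following is a minimal plain-Perl sketch of that join, not the attached modules or accession2mgi.pl. It assumes gene_info carries GeneID in column 2 and a '|'-separated dbXrefs list in column 6, and that gene2accession carries GeneID in column 2 and RNA and protein accession.version in columns 4 and 6 with '-' for missing values; the sub names and the sample rows are made up for illustration.

```perl
use strict;
use warnings;

# Build GeneID -> MGI accession from gene_info lines.
# ASSUMPTION: column 2 is GeneID, column 6 is dbXrefs (e.g. "MGI:0000001").
sub mgi_by_geneid {
    my ($fh) = @_;
    my %mgi;
    while (my $line = <$fh>) {
        next if $line =~ /^#/;          # skip header/comment lines
        chomp $line;
        my @f = split /\t/, $line;
        next unless @f >= 6;
        for my $xref (split /\|/, $f[5]) {
            $mgi{ $f[1] } = $1 if $xref =~ /(MGI:\d+)$/;
        }
    }
    return \%mgi;
}

# Build accession -> MGI accession from gene2accession lines.
# ASSUMPTION: column 2 is GeneID, columns 4 and 6 are RNA and protein
# accession.version, and '-' marks a missing value.
sub accession_to_mgi {
    my ($fh, $mgi) = @_;
    my %map;
    while (my $line = <$fh>) {
        next if $line =~ /^#/;
        chomp $line;
        my @f = split /\t/, $line;
        next unless @f >= 6;
        my $id = $mgi->{ $f[1] } or next;   # gene has no MGI cross-reference
        for my $acc (@f[3, 5]) {
            next if $acc eq '-';
            (my $bare = $acc) =~ s/\.\d+$//;  # drop the version suffix
            $map{$bare} = $id;
        }
    }
    return \%map;
}

# Made-up rows, read from in-memory filehandles so the sketch is self-contained.
my $gene_info_txt = join("\t", 10090, 1001, 'GeneA', '-', '-', 'MGI:0000001') . "\n";
my $gene2acc_txt  = join("\t", 10090, 1001, 'REVIEWED',
                         'NM_000001.1', 111, 'NP_000001.1', 222) . "\n";

open my $gi_fh, '<', \$gene_info_txt or die $!;
open my $ga_fh, '<', \$gene2acc_txt  or die $!;

my $map = accession_to_mgi($ga_fh, mgi_by_geneid($gi_fh));
print "$_\t$map->{$_}\n" for sort keys %$map;
```

Both files are needed because gene_info holds the MGI cross-reference while gene2accession holds the sequence accessions; GeneID is the only field they share.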
>> >
>> > For the moment it seemed better to just parse in the NCBI taxon id into
>> > the Bio::Species object (only this info is supplied by gene_info), and
>> > expect users who need the information to use the taxonomy support of
>> > other Bioperl modules in their scripts.
>> >
>> > I will continue to work on parsing the species specific ASN.1 files, but
>> > I will be trying a combination of lex/yacc/C to do this. If that works I
>> > will look into trying perl support for lex/yacc for potential use in
>> > Bioperl, but since I am not sure how long this will take me, I do not
>> > want to scare off anyone else who would like to give this a shot.
>> >
>> > best,
>> > peter
>> >
>> >
>> > On Tue, 2005-01-04 at 22:03, Jason Stajich wrote:
>> > > On Jan 4, 2005, at 3:52 PM, Peter Robinson wrote:
>> > >
>> > > > Hi Jason,
>> > > >
>> > > > thanks for the advice. It seems as if the documentation of
>> > > > Bio::DB::Taxonomy is a bit out of sync.
>> > > >  my $db = new Bio::DB::Taxonomy(-source => 'flatfile',
>> > > >                                  -nodesfile => $nodesfile,
>> > > >                                  -namesfile => $namefile);
>> > > > What does 'flatfile' refer to here? It is not apparent upon looking at
>> > > > the code for new.
>> > > >
>> > > See Bio::DB::Taxonomy::flatfile for more information.  As I mentioned
>> > > in the mail I sent, flatfile is for use with the taxonomy DB downloaded
>> > > from NCBI.  This lets you run it locally using an indexed (BerkeleyDB via
>> > > DB_File) version of the file.
>> > >
>> > > You probably need the most up-to-date version of the modules - works fine
>> > > for me for both the entrez and flatfile code, but you may have to
>> > > upgrade off of the 1.4.0 release. Code from CVS or the bioperl-1.5 RC1
>> > > code should work fine.
>> > >
>> > >
>> > >
>> > > > I had somewhat better luck using the entrez version, but I got a
>> > > > pretty amusing error
>> > > > message:
>> > > >
>> > > > MSG: can't create a species object for Homo sapiens (human) because it
>> > > > isn't a species but is a '' instead
>> > > >
>> > > > ###
>> > > > Full error and a dump of the script follow:
>> > > >
>> > > > my $db = new Bio::DB::Taxonomy(-source => 'entrez'); #
>> > > > my $taxaid = $db->get_taxonid('Homo sapiens');
>> > > > my $species = $db->get_Taxonomy_Node(-taxonid => '9606');
>> > > > print Dumper($species);
>> > > >
>> > > > ###
>> > > >
>> > > > Use of uninitialized value in string eq at
>> > > > /usr/local/share/perl/5.8.4/Bio/DB/Taxonomy/entrez.pm line 192.
>> > > > Use of uninitialized value in sprintf at
>> > > > /usr/local/share/perl/5.8.4/Bio/DB/Taxonomy/entrez.pm line 201.
>> > > >
>> > > > -------------------- WARNING ---------------------
>> > > > MSG: can't create a species object for Homo sapiens (human) because it
>> > > > isn't a species but is a '' instead
>> > > > ---------------------------------------------------
>> > > > Use of uninitialized value in string eq at
>> > > > /usr/local/share/perl/5.8.4/Bio/DB/Taxonomy/entrez.pm line 192.
>> > > > Use of uninitialized value in sprintf at
>> > > > /usr/local/share/perl/5.8.4/Bio/DB/Taxonomy/entrez.pm line 201.
>> > > >
>> > > > -------------------- WARNING ---------------------
>> > > > MSG: can't create a species object for Homo sapiens (human) because it
>> > > > isn't a species but is a '' instead
>> > > > ---------------------------------------------------
>> > > > $VAR1 = {
>> > > >           'TaxId' => '9606',
>> > > >           'Division' => 'mammals',
>> > > >           'GeneNumber' => '32775',
>> > > >           'Rank' => 'species',
>> > > >           'ProtNumber' => '247791',
>> > > >           'ScientificName' => 'Homo sapiens',
>> > > >           'CommonName' => 'human',
>> > > >           'NucNumber' => '9025800',
>> > > >           'GenNumber' => '25',
>> > > >           'StructNumber' => '5638'
>> > > >         };
>> > > > peter at anna:~/programs/bioperlTest$
>> > > >
>> > > >
>> > > > --best, peter
>> > > >
>> > > > On Mon, 2005-01-03 at 23:51, Jason Stajich wrote:
>> > > >> Bio::DB::Taxonomy is the factory code - it is pretty easy to get a
>> > > >> species object (or equivalent) using this code.  But you cannot (or
>> > > >> could not when I wrote this, not sure of the current status) get the
>> > > >> full classification from the NCBI taxonomy retrieval via cgi.  I.e., you
>> > > >> can only get genus and species for a taxon id and I don't know how to
>> > > >> walk up the hierarchy using the web API.  Earlier emails to NCBI seemed
>> > > >> to indicate this is all they intended to provide, but not sure what the
>> > > >> current status is.
>> > > >>
>> > > >>   my $db = new Bio::DB::Taxonomy(-source => 'entrez'); # use NCBI Entrez over HTTP
>> > > >>   my $taxaid = $db->get_taxonid('Homo sapiens');
>> > > >>   my $taxonnode = $db->get_Taxonomy_Node(-taxonid => '9606');
>> > > >>
>> > > >> You can get the full classification if you use the
>> > > >> Bio::DB::Taxonomy::flatfile factory which requires you to have
>> > > >> downloaded the taxonomy db flatfile from NCBI.  Since this is more
>> > > >> reliable (and faster) it is what I have tended to use for grouping sets
>> > > >> of seqDB search results, etc.
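To show why the downloaded flat files make the full classification available, here is a minimal plain-Perl sketch that walks up the hierarchy from a taxon id. It is not how Bio::DB::Taxonomy::flatfile is implemented (that uses DB_File indexes); it only assumes the dump layout of nodes.dmp (tax_id, parent tax_id, rank, ...) and names.dmp (tax_id, name, unique name, name class) with fields separated by tab-pipe-tab. The sub names and the three-node toy tree are made up for illustration.

```perl
use strict;
use warnings;

# Split one taxonomy dump line. ASSUMPTION: fields are separated by
# "\t|\t" and every line ends with "\t|".
sub dmp_fields {
    my ($line) = @_;
    chomp $line;
    $line =~ s/\t\|$//;
    return split /\t\|\t/, $line;
}

# nodes.dmp: tax_id -> [parent tax_id, rank]
sub read_nodes {
    my ($fh) = @_;
    my %node;
    while (my $line = <$fh>) {
        my ($id, $parent, $rank) = dmp_fields($line);
        $node{$id} = [$parent, $rank];
    }
    return \%node;
}

# names.dmp: tax_id -> scientific name (other name classes are ignored)
sub read_names {
    my ($fh) = @_;
    my %name;
    while (my $line = <$fh>) {
        my ($id, $txt, undef, $class) = dmp_fields($line);
        $name{$id} = $txt if defined $class && $class eq 'scientific name';
    }
    return \%name;
}

# Follow parent links from $id to the root; returns labels, root first.
sub lineage {
    my ($node, $name, $id) = @_;
    my @path;
    while (defined $id && exists $node->{$id}) {
        my ($parent, $rank) = @{ $node->{$id} };
        my $label = exists $name->{$id} ? $name->{$id} : "taxon $id";
        unshift @path, "$label ($rank)";
        last if $parent eq $id;    # the root node is its own parent
        $id = $parent;
    }
    return @path;
}

# Toy data (made up): root -> genus -> species.
my $nodes_txt = "1\t|\t1\t|\tno rank\t|\n"
              . "10\t|\t1\t|\tgenus\t|\n"
              . "11\t|\t10\t|\tspecies\t|\n";
my $names_txt = "1\t|\troot\t|\t\t|\tscientific name\t|\n"
              . "10\t|\tExamplegenus\t|\t\t|\tscientific name\t|\n"
              . "11\t|\tExamplegenus exemplum\t|\t\t|\tscientific name\t|\n"
              . "11\t|\texample bug\t|\t\t|\tcommon name\t|\n";

open my $nodes_fh, '<', \$nodes_txt or die $!;
open my $names_fh, '<', \$names_txt or die $!;
my $node = read_nodes($nodes_fh);
my $name = read_names($names_fh);

print join(' > ', lineage($node, $name, 11)), "\n";
```

Holding both tables in memory is fine for a sketch; for the full NCBI dump an on-disk index, as the flatfile factory uses, is the more practical choice.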
>> > > >>
>> > > >> -jason
>> > > >> On Jan 3, 2005, at 5:40 PM, Peter Robinson wrote:
>> > > >>
>> > > >>> Hi Bioperlers, hi Hilmar,
>> > > >>>
>> > > >>> after some thinking I have embarked on a lex/yacc parser for the Entrez
>> > > >>> Gene ASN.1 format as the path of least resistance, although I am not sure
>> > > >>> how that would fit in to BioPerl. If anyone is interested in this (or
>> > > >>> has a better idea of how to go about it..), please drop me a line.
>> > > >>>
>> > > >>> In the meantime I have been looking at writing code to parse some of the
>> > > >>> "easy" Entrez gene documents, starting off with gene_info. This file
>> > > >>> includes the NCBI taxon id for each entry. I would like to convert this
>> > > >>> to a Bio::Species object to pass to the following
>> > > >>> my $seq = $self->sequence_factory->create(
>> > > >>>      -verbose => $self->verbose(),
>> > > >>>      -accession_number => $geneID,
>> > > >>>      -desc => $description,
>> > > >>>      -display_id => $symbol,
>> > > >>>      -species =>  ???
>> > > >>>      -annotation => $ann);
>> > > >>>
>> > > >>> and saw the Bio::Taxonomy::FactoryI code, which appears to want to do
>> > > >>> this sort of thing. However, the code for that is pretty preliminary. Is
>> > > >>> anyone working on this at the moment? Or is there a better way of doing
>> > > >>> this (it seems a shame not to provide the actual species name if one has
>> > > >>> the taxid...)
>> > > >>>
>> > > >>> best
>> > > >>>
>> > > >>> Peter
>> > > >>>
>> > > >>>
>> > > >>>
>> > > >>> On Tue, 2004-12-28 at 07:17, Hilmar Lapp wrote:
>> > > >>>> Great to hear that someone is giving this a shot. Yes, at this point it
>> > > >>>> appears that NCBI is only offering the ASN.1, not a conversion to XML.
>> > > >>>> Their asn2xml tool will not work with this ASN.1 format either, just
>> > > >>>> checked it to be sure. They do seem to be mulling the option of XML
>> > > >>>> though on the Gene FAQ. Maybe if enough people get in their ears they
>> > > >>>> will spend some effort towards that. After all, the entrez gene web
>> > > >>>> interface can display XML on demand - even though it looks fairly
>> > > >>>> hideous.
>> > > >>>>
>> > > >>>> There is no ASN.1 support in bioperl at all. Also, ASN.1 support in
>> > > >>>> perl is actually thin - there is Convert::ASN1 at version 0.18 from two
>> > > >>>> years ago that I could find ... doesn't make me feel warm and fuzzy.
>> > > >>>>
>> > > >>>> In the absence of any XML available from NCBI, gene_info might be the
>> > > >>>> best start. An option could be to check for the presence of the other
>> > > >>>> tab-delimited files and use those that are present. These are
>> > > >>>> tab-delimited and hence the format itself is trivial so you can focus
>> > > >>>> entirely on setting up a Bio::Seq plus annotation that's
>> > > >>>> comparable/compatible to what the current SeqIO::locuslink does.
>> > > >>>>
>> > > >>>> My $0.02 (worth less and less almost every day).
>> > > >>>>
>> > > >>>> -hilmar
>> > > >>>>
>> > > >>>> On Thursday, December 23, 2004, at 10:51  AM, Peter Robinson wrote:
>> > > >>>>
>> > > >>>>> Hi,
>> > > >>>>>
>> > > >>>>> I have been thinking about giving a BioPerl EntrezGene parser a try since
>> > > >>>>> I have been a heavy user of LocusLink to date. One issue is that the
>> > > >>>>> files that correspond to LL_tmpl (which was a flat file) are now in
>> > > >>>>> ASN.1 format
>> > > >>>>> http://www.ncbi.nlm.nih.gov/entrez/query/static/help/genehelp.html#query
>> > > >>>>> Although I saw some mention of ASN support in Bioperl by googling, I
>> > > >>>>> can't seem to find any module that does this in the present
>> > > >>>>> distribution. What is the status on that? In any case, I will be working
>> > > >>>>> on this in the next month or two and if anything nice comes of it I will
>> > > >>>>> send it to you / BioPerl.
>> > > >>>>>
>> > > >>>>> best wishes & happy holidays
>> > > >>>>>
>> > > >>>>> Peter
>> > > >>>>>
>> > > >>>>> On Tue, 2004-12-14 at 09:00, Hilmar Lapp wrote:
>> > > >>>>>> Since load_seqdatabase.pl will use bioperl's SeqIO parsers for parsing
>> > > >>>>>> any input file, what you're asking is whether or not there is a SeqIO
>> > > >>>>>> parser for NCBI Gene.
>> > > >>>>>>
>> > > >>>>>> The answer to that question is no, not yet. Anybody who feels motivated
>> > > >>>>>> is welcome to give it a try ... Since I'll need it, I'll write the
>> > > >>>>>> parser if nobody else does within the next 3 months, but I'm not going
>> > > >>>>>> to promise when exactly this will happen.
>> > > >>>>>>
>> > > >>>>>> -hilmar
>> > > >>>>>>
>> > > >>>>>> On Monday, December 13, 2004, at 08:03  AM, Law, Annie wrote:
>> > > >>>>>>
>> > > >>>>>>> Hi,
>> > > >>>>>>>
>> > > >>>>>>> I was wondering, with regard to the bioperl-db scripts, schema, and
>> > > >>>>>>> load_seqdatabase.pl, whether there has been any preparation for
>> > > >>>>>>> integrating Entrez Gene information when LocusLink is phased out?
>> > > >>>>>>> Or, if it has already been changed, could somebody point me to the
>> > > >>>>>>> documentation or changed code?
>> > > >>>>>>>
>> > > >>>>>>> Thanks,
>> > > >>>>>>> Annie.
>> > > >>>>>>> _______________________________________________
>> > > >>>>>>> Bioperl-l mailing list
>> > > >>>>>>> Bioperl-l at portal.open-bio.org
>> > > >>>>>>> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>> > > >>>>>>>
>> > > >>>>>>>
>> > > >>>>> --
>> > > >>>>> Peter N. Robinson
>> > > >>>>> peter.robinson at t-online.de
>> > > >>>>> peter.robinson at charite.de
>> > > >>>>> http://www.charite.de/ch/medgen/robinson/
>> > > >>>>>
>> > > >>>>>
>> > > >>> --
>> > > >>> Peter N. Robinson
>> > > >>> peter.robinson at t-online.de
>> > > >>> peter.robinson at charite.de
>> > > >>> http://www.charite.de/ch/medgen/robinson/
>> > > >>>
>> > > >>>
>> > > >>>
>> > > >> --
>> > > >> Jason Stajich
>> > > >> jason.stajich at duke.edu
>> > > >> http://www.duke.edu/~jes12/
>> > > > --
>> > > > Peter N. Robinson
>> > > > peter.robinson at t-online.de
>> > > > peter.robinson at charite.de
>> > > > http://www.charite.de/ch/medgen/robinson/
>> > > >
>> > > >
>> > > --
>> > > Jason Stajich
>> > > jason.stajich at duke.edu
>> > > http://www.duke.edu/~jes12/
>> > >
>> >
>
> -- 
> Peter N. Robinson
> peter.robinson at t-online.de
> peter.robinson at charite.de
> http://www.charite.de/ch/medgen/robinson/
>
>

