[Bioperl-l] Entrez Gene and bioperl-db

Peter Robinson Peter.Robinson at t-online.de
Mon Jan 17 06:06:02 EST 2005


Hi list,

here's an update on Entrez Gene.
1) NCBI apparently has no plans to offer the files in XML format for FTP
download. It is possible to download XML from the website, apparently
even for an entire species given the corresponding query (although I
haven't tried this yet). This might be too complicated for many users,
and browsers could have stability problems downloading files of that
size.


2) I have completed two reasonably simple modules for parsing gene_info
and gene2accession using the SeqIO interface. These are attached
together with simple demo programs. These modules can be used to do some
useful things. For instance, we often want to generate a list of
correspondences between NCBI accession numbers and MGI accession numbers
so as to be able to use MGI's Gene Ontology annotations for the mouse. I
have included a script (accession2mgi.pl) that uses the above modules to
parse gene_info and gene2accession to do this (you need to use both
files).
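
To illustrate why both files are needed, here is a minimal sketch in
plain Perl. It is not the attached accession2mgi.pl and does not use the
SeqIO modules; it only shows the join on GeneID. The sample rows and IDs
are made up, and the column positions (GeneID in column 2 of both files,
dbXrefs in column 6 of gene_info, the RNA accession in column 4 of
gene2accession) are assumptions about the tab-delimited layout. The sub
names are hypothetical.

```perl
use strict;
use warnings;

# Build GeneID -> MGI id from gene_info text.
# Assumption: dbXrefs is column 6 and holds '|'-separated entries.
sub mgi_by_geneid {
    my ($text) = @_;
    my %mgi;
    for my $line (split /\n/, $text) {
        next if $line =~ /^#/ or $line !~ /\S/;   # skip header/blank lines
        my @f = split /\t/, $line;
        my ($gene_id, $xrefs) = @f[1, 5];
        next unless defined $xrefs;
        my ($id) = grep { /^MGI:/ } split /\|/, $xrefs;   # first MGI xref
        $mgi{$gene_id} = $id if defined $id;
    }
    return \%mgi;
}

# Build accession -> MGI id from gene2accession text plus the hash above.
# Assumption: the RNA accession.version is column 4, '-' meaning "none".
sub accession_to_mgi {
    my ($text, $mgi) = @_;
    my %map;
    for my $line (split /\n/, $text) {
        next if $line =~ /^#/ or $line !~ /\S/;
        my @f = split /\t/, $line;
        my ($gene_id, $acc) = @f[1, 3];
        next unless defined $acc and $acc ne '-';
        next unless exists $mgi->{$gene_id};      # gene has no MGI xref
        $acc =~ s/\.\d+$//;                       # drop the version suffix
        $map{$acc} = $mgi->{$gene_id};
    }
    return \%map;
}

# Made-up rows for illustration only (not real NCBI data).
my $gene_info = join "\n",
    "#tax_id\tGeneID\tSymbol\tLocusTag\tSynonyms\tdbXrefs",
    join("\t", 10090, 1001, 'GeneA', '-', '-', 'MGI:0000001'),
    join("\t", 10090, 1002, 'GeneB', '-', '-', '-'),
    join("\t", 10090, 1003, 'GeneC', '-', '-', 'Other:5|MGI:0000003');

my $gene2accession = join "\n",
    "#tax_id\tGeneID\tstatus\tRNA_nucleotide_accession.version",
    join("\t", 10090, 1001, 'REVIEWED',    'NM_000001.1'),
    join("\t", 10090, 1002, 'PROVISIONAL', 'NM_000002.1'),
    join("\t", 10090, 1003, '-',           '-'),
    join("\t", 10090, 1003, 'PROVISIONAL', 'NM_000003.2');

my $mgi = mgi_by_geneid($gene_info);
my $map = accession_to_mgi($gene2accession, $mgi);
print "$_\t$map->{$_}\n" for sort keys %$map;
```

With the made-up rows above, only the accessions of GeneA and GeneC end
up in the map, since GeneB has no MGI cross-reference. For the real
files one would read line by line from a filehandle rather than from a
string, and check the column positions against the current file headers.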

3) In the meantime I have also gotten a lex/yacc parser in C to parse
the species-specific Gene files (which are by far the most interesting
files in the Entrez Gene system). In principle the same approach could
be taken in Perl -- straightforward, but a lot of detail work. I will be
needing this kind of thing for my work, so I will continue to work on
this, and once it is bug-free in C I will think about ways of porting it
to Bioperl (this might take a while). As I mentioned before on this
list, if anybody else can do this more quickly please go ahead (but drop
me a line); on the other hand, collaborators who like the idea of
writing a grammar in the style of lex/yacc or ANTLR are also welcome.

--peter


On Tue, 2005-01-11 at 02:33, Chris Mungall wrote: 
> Hi Peter
> 
> Have you tried asking NCBI to make XML available as well as ASN? In
> general they seem keen to offer both for most of their datasets. If not, I
> believe the NCBI toolkit has an ASN->XML converter.
> 
> Cheers
> Chris
> 
> On Thu, 6 Jan 2005, Peter Robinson wrote:
> 
> > Dear Bioperlers,
> >
> > I have started looking at writing some modules to parse the new Entrez
> > gene, which is kind of an expanded LocusLink. The really interesting
> > files are species specific and are in the ASN.1 format, and I am still
> > experimenting around with the best way of parsing them. To get started,
> > I am looking at the tab-delimited flat files. It seems to me that it
> > would be interesting to be able to parse gene_info and gene2accession
> > using the Bio::SeqIO system, the other files such as gene2unigene seem
> > less suited for this (the latter has just two entries which could be
> > parsed ad hoc easily enough).
> >
> > In any case, I am sending a proposed module Bio::SeqIO::geneinfo.pm as
> > well as a test script (which contains a small excerpt of gene_info in
> > the data section) for comments and criticism to the list. I am presently
> > working on another module for Bio::SeqIO::gene2accession and plan to
> > write a demo script using both modules to convert NCBI accession numbers
> > to MGI accession numbers (which is something one might want to do in
> > > order to use Gene Ontology for Affymetrix data, although one needs
> > additional work for probesets which are only related to ESTs).
> >
> > For the moment it seemed better to just parse in the NCBI taxon id into
> > the Bio::Species object (only this info is supplied by gene_info), and
> > expect users who need the information to use the taxonomy support of
> > other Bioperl modules in their scripts.
> >
> > I will continue to work on parsing the species specific ASN.1 files, but
> > I will be trying a combination of lex/yacc/C to do this. If that works I
> > will look into trying perl support for lex/yacc for potential use in
> > Bioperl, but since I am not sure how long this will take me, I do not
> > want to scare off anyone else who would like to give this a shot.
> >
> > best,
> > peter
> >
> >
> > On Tue, 2005-01-04 at 22:03, Jason Stajich wrote:
> > > On Jan 4, 2005, at 3:52 PM, Peter Robinson wrote:
> > >
> > > > Hi Jason,
> > > >
> > > > thanks for the advice. It seems as if the documentation of
> > > > Bio::DB::Taxonomy is a bit out of sync.
> > > > >  my $db = new Bio::DB::Taxonomy(-source => 'flatfile',
> > > >                                  -nodesfile => $nodesfile,
> > > >                                  -namesfile => $namefile);
> > > > What does 'flatfile' refer to here? It is not apparent upon looking at
> > > > the code for new.
> > > >
> > > See Bio::DB::Taxonomy::flatfile for more information.  As I mentioned
> > > in the mail I sent, flatfile is for downloading the taxonomy DB from
> > > > NCBI.  This lets you run it locally using an indexed (BerkeleyDB via
> > > > DB_File) version of the file.
> > > >
> > > > You probably need the most up-to-date version of the modules - works fine
> > > for me for both the entrez and flatfile code, but you may have to
> > > upgrade off of the 1.4.0 release. Code from CVS or the bioperl-1.5 RC1
> > > code should work fine.
> > >
> > >
> > >
> > > > I had somewhat better luck using the entrez version, but I got a
> > > > pretty amusing error
> > > > message:
> > > >
> > > > MSG: can't create a species object for Homo sapiens (human) because it
> > > > isn't a species but is a '' instead
> > > >
> > > > ###
> > > > Full error and a dump of the script follow:
> > > >
> > > > my $db = new Bio::DB::Taxonomy(-source => 'entrez'); #
> > > > my $taxaid = $db->get_taxonid('Homo sapiens');
> > > > my $species = $db->get_Taxonomy_Node(-taxonid => '9606');
> > > > print Dumper($species);
> > > >
> > > > ###
> > > >
> > > > Use of uninitialized value in string eq at
> > > > /usr/local/share/perl/5.8.4/Bio/DB/Taxonomy/entrez.pm line 192.
> > > > Use of uninitialized value in sprintf at
> > > > /usr/local/share/perl/5.8.4/Bio/DB/Taxonomy/entrez.pm line 201.
> > > >
> > > > -------------------- WARNING ---------------------
> > > > MSG: can't create a species object for Homo sapiens (human) because it
> > > > isn't a species but is a '' instead
> > > > ---------------------------------------------------
> > > > Use of uninitialized value in string eq at
> > > > /usr/local/share/perl/5.8.4/Bio/DB/Taxonomy/entrez.pm line 192.
> > > > Use of uninitialized value in sprintf at
> > > > /usr/local/share/perl/5.8.4/Bio/DB/Taxonomy/entrez.pm line 201.
> > > >
> > > > -------------------- WARNING ---------------------
> > > > MSG: can't create a species object for Homo sapiens (human) because it
> > > > isn't a species but is a '' instead
> > > > ---------------------------------------------------
> > > > $VAR1 = {
> > > >           'TaxId' => '9606',
> > > >           'Division' => 'mammals',
> > > >           'GeneNumber' => '32775',
> > > >           'Rank' => 'species',
> > > >           'ProtNumber' => '247791',
> > > >           'ScientificName' => 'Homo sapiens',
> > > >           'CommonName' => 'human',
> > > >           'NucNumber' => '9025800',
> > > >           'GenNumber' => '25',
> > > >           'StructNumber' => '5638'
> > > >         };
> > > > peter at anna:~/programs/bioperlTest$
> > > >
> > > >
> > > > --best, peter
> > > >
> > > > On Mon, 2005-01-03 at 23:51, Jason Stajich wrote:
> > > > >> Bio::DB::Taxonomy is the factory code - it is pretty easy to get a
> > > > >> species object (or equivalent) using this code.  But you cannot (or
> > > > >> could not when I wrote this, not sure of the current status) get the
> > > > >> full classification from the NCBI taxonomy retrieval via cgi.  i.e.
> > > > >> you can only get genus and species for a taxon id and I don't know
> > > > >> how to walk up the hierarchy using the web API.  Earlier emails to
> > > > >> NCBI seemed to indicate this is all they intended to provide, but
> > > > >> not sure what the current status is.
> > > > >>
> > > > >>   # use NCBI Entrez over HTTP
> > > > >>   my $db = new Bio::DB::Taxonomy(-source => 'entrez');
> > > > >>   my $taxaid = $db->get_taxonid('Homo sapiens');
> > > > >>   my $taxonnode = $db->get_Taxonomy_Node(-taxonid => '9606');
> > > > >>
> > > > >> You can get the full classification if you use the
> > > > >> Bio::DB::Taxonomy::flatfile factory which requires you to have
> > > > >> downloaded the taxonomy db flatfile from NCBI.  Since this is more
> > > > >> reliable (and faster) it is what I have tended to use for grouping
> > > > >> sets of seqDB search results, etc.
> > > >>
> > > >> -jason
> > > >> On Jan 3, 2005, at 5:40 PM, Peter Robinson wrote:
> > > >>
> > > >>> Hi Bioperlers, hi Hilmar,
> > > >>>
> > > > >>> after some thinking I have embarked on a lex/yacc parser for the
> > > > >>> Entrez Gene ASN.1 format as the path of least resistance, although
> > > > >>> I am not sure how that would fit in to BioPerl. If anyone is
> > > > >>> interested in this (or has a better idea of how to go about it..),
> > > > >>> please drop me a line.
> > > > >>>
> > > > >>> In the meantime I have been looking at writing code to parse some of
> > > > >>> the "easy" Entrez gene documents, starting off with gene_info. This
> > > > >>> file includes the NCBI taxon id for each entry. I would like to
> > > > >>> convert this to a Bio::Species object to pass to the following
> > > >>> 	my $seq = $self->sequence_factory->create(
> > > >>> 			     -verbose => $self->verbose(),
> > > >>> 			     -accession_number => $geneID,
> > > >>> 			     -desc => $description,
> > > >>> 			     -display_id => $symbol,
> > > >>> 			     -species =>  ???
> > > >>> 			     -annotation => $ann);
> > > >>>
> > > > >>> and saw the Bio::Taxonomy::FactoryI code, which appears to want to do
> > > > >>> this sort of thing. However, the code for that is pretty preliminary.
> > > > >>> Is anyone working on this at the moment? Or is there a better way of
> > > > >>> doing this (it seems a shame not to provide the actual species name
> > > > >>> if one has the taxid...)
> > > >>>
> > > >>> best
> > > >>>
> > > >>> Peter
> > > >>>
> > > >>>
> > > >>>
> > > >>> On Tue, 2004-12-28 at 07:17, Hilmar Lapp wrote:
> > > > >>>> Great to hear that someone is giving this a shot. Yes, at this point
> > > > >>>> it appears that NCBI is only offering the ASN.1, not a conversion to
> > > > >>>> XML. Their asn2xml tool will not work with this ASN.1 format either,
> > > > >>>> just checked it to be sure. They do seem to be mulling the option of
> > > > >>>> XML though on the Gene FAQ. Maybe if enough people get in their ears
> > > > >>>> they will spend some effort towards that. After all, the entrez gene
> > > > >>>> web interface can display XML on demand - even though it looks
> > > > >>>> fairly hideous.
> > > > >>>>
> > > > >>>> There is no ASN.1 support in bioperl at all. Also, ASN.1 support in
> > > > >>>> perl is actually thin - there is Convert::ASN1 at version 0.18 two
> > > > >>>> years ago that I could find ... doesn't make me feel warm and fuzzy.
> > > > >>>>
> > > > >>>> In the absence of any XML available from NCBI, gene_info might be
> > > > >>>> the best start. An option could be to check for the presence of the
> > > > >>>> other tab-delimited files and use those that are present. These are
> > > > >>>> tab-delimited and hence the format itself is trivial so you can
> > > > >>>> focus entirely on setting up a Bio::Seq plus annotation that's
> > > > >>>> comparable/compatible to what the current SeqIO::locuslink does.
> > > >>>>
> > > >>>> My $0.02 (worth less and less almost every day).
> > > >>>>
> > > >>>> 	-hilmar
> > > >>>>
> > > >>>> On Thursday, December 23, 2004, at 10:51  AM, Peter Robinson wrote:
> > > >>>>
> > > >>>>> Hi,
> > > >>>>>
> > > > >>>>> I have been thinking about giving a BioPerl EntrezGene parser a try
> > > > >>>>> since I have been a heavy user of locus link to date. One issue is
> > > > >>>>> that the files that correspond to LL_tmpl (which was a flat file)
> > > > >>>>> are now in asn format
> > > > >>>>> http://www.ncbi.nlm.nih.gov/entrez/query/static/help/genehelp.html#query
> > > > >>>>> Although I saw some mention of ASN support in Bioperl by googling,
> > > > >>>>> I can't seem to find any module that does this in the present
> > > > >>>>> distribution. What is the status on that? In any case, I will be
> > > > >>>>> working on this in the next month or two and if anything nice comes
> > > > >>>>> of it I will send it to you / BioPerl.
> > > >>>>>
> > > >>>>> best wishes & happy holidays
> > > >>>>>
> > > >>>>> Peter
> > > >>>>>
> > > >>>>> On Tue, 2004-12-14 at 09:00, Hilmar Lapp wrote:
> > > > >>>>>> Since load_seqdatabase.pl will use bioperl's SeqIO parsers for
> > > > >>>>>> parsing any input file, what you're asking is whether or not there
> > > > >>>>>> is a SeqIO parser for NCBI Gene.
> > > > >>>>>>
> > > > >>>>>> The answer to that question is no, not yet. Anybody who feels
> > > > >>>>>> motivated is welcome to give it a try ... Since I'll need it, I'll
> > > > >>>>>> write the parser if nobody else does within the next 3 months, but
> > > > >>>>>> I'm not going to promise when exactly this will happen.
> > > >>>>>>
> > > >>>>>> 	-hilmar
> > > >>>>>>
> > > >>>>>> On Monday, December 13, 2004, at 08:03  AM, Law, Annie wrote:
> > > >>>>>>
> > > >>>>>>> Hi,
> > > >>>>>>>
> > > > >>>>>>> I was wondering, with regard to the bioperl-db scripts, schema and
> > > > >>>>>>> load_seqdatabase.pl, whether there has been preparation for
> > > > >>>>>>> integrating Entrez Gene information when locuslink is phased out?
> > > > >>>>>>> Or if it has already been changed, could somebody point me to the
> > > > >>>>>>> documentation or changed code?
> > > >>>>>>>
> > > >>>>>>> Thanks,
> > > >>>>>>> Annie.
> > > >>>>>>> _______________________________________________
> > > >>>>>>> Bioperl-l mailing list
> > > >>>>>>> Bioperl-l at portal.open-bio.org
> > > >>>>>>> http://portal.open-bio.org/mailman/listinfo/bioperl-l
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>> --
> > > >>>>> Peter N. Robinson
> > > >>>>> peter.robinson at t-online.de
> > > >>>>> peter.robinson at charite.de
> > > >>>>> http://www.charite.de/ch/medgen/robinson/
> > > >>>>>
> > > >>>>>
> > > >>> --
> > > >>> Peter N. Robinson
> > > >>> peter.robinson at t-online.de
> > > >>> peter.robinson at charite.de
> > > >>> http://www.charite.de/ch/medgen/robinson/
> > > >>>
> > > >>> _______________________________________________
> > > >>> Bioperl-l mailing list
> > > >>> Bioperl-l at portal.open-bio.org
> > > >>> http://portal.open-bio.org/mailman/listinfo/bioperl-l
> > > >>>
> > > >>>
> > > >> --
> > > >> Jason Stajich
> > > >> jason.stajich at duke.edu
> > > >> http://www.duke.edu/~jes12/
> > > > --
> > > > Peter N. Robinson
> > > > peter.robinson at t-online.de
> > > > peter.robinson at charite.de
> > > > http://www.charite.de/ch/medgen/robinson/
> > > >
> > > >
> > > --
> > > Jason Stajich
> > > jason.stajich at duke.edu
> > > http://www.duke.edu/~jes12/
> > >
> >

-- 
Peter N. Robinson
peter.robinson at t-online.de
peter.robinson at charite.de
http://www.charite.de/ch/medgen/robinson/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: accession2mgi.pl
Type: application/x-perl
Size: 2507 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/bioperl-l/attachments/20050117/2c5de30b/accession2mgi-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gene2accession.pm
Type: application/x-perl
Size: 8148 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/bioperl-l/attachments/20050117/2c5de30b/gene2accession-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gene2accession_test.pl
Type: application/x-perl
Size: 5968 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/bioperl-l/attachments/20050117/2c5de30b/gene2accession_test-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: geneinfo.pm
Type: application/x-perl
Size: 10515 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/bioperl-l/attachments/20050117/2c5de30b/geneinfo-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: geneinfotest.pl
Type: application/x-perl
Size: 11225 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/bioperl-l/attachments/20050117/2c5de30b/geneinfotest-0001.bin

