[Bioperl-l] Entrez Gene and bioperl-db

Sean Davis sdavis2 at mail.nih.gov
Fri Jan 7 06:41:49 EST 2005


I think the power of bioperl is in dealing with entire Entrez Gene 
objects.  For dealing with gene_info, gene2unigene, or generifs files 
in isolation, I'm not sure that an object model is necessary or 
efficient.  However, as many of us do want to deal with Gene objects, I 
think that having a parser that constructs these rich objects is 
important.  That said, I think there may be a need for two parsers, one for 
species-specific ASN.1 files and one for the tab-delimited files.  The 
ASN.1 parser fits the SeqIO model rather well, I would suppose, but is 
limited by the fact that each species must be downloaded and parsed 
separately.  However, for the vast majority of folks dealing with only 
one or two species, the ease of downloading a single, self-contained 
file for a species or two of interest and passing the file through an 
ASN.1 Gene parser is quite appealing.  Then, for the comparative 
genomicists or those with a need for more than a few species, the 
tab-delimited option could be made available for parsing the text 
files.  Despite my second sentence above, I agree with Stefan that 
having a parser that deals with each text file in isolation (with the 
only required file being gene_info) is quite appealing, allowing the 
user to have a way to choose what files to parse and add to the object. 
  (This is only important because of the number of Gene records and 
needing to complete the parse/object construction in a reasonable 
amount of time.)
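
To make the "gene_info required, everything else optional" idea concrete, here is a minimal plain-Perl sketch. It is not the proposed Bio::SeqIO module and does not use BioPerl objects. The column positions (tax_id, GeneID, Symbol in columns 1-3, description in column 9) are an assumption to check against your own download, and the sub names and sample records are made up for illustration.

```perl
use strict;
use warnings;

# Read gene_info-style lines (the one required file) into a hash keyed by GeneID.
# Assumed columns: 1 = tax_id, 2 = GeneID, 3 = Symbol, 9 = description.
sub parse_gene_info {
    my ($fh) = @_;
    my %genes;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^#/ || $line !~ /\S/;    # skip header and blank lines
        my @f = split /\t/, $line, -1;
        $genes{ $f[1] } = {
            tax_id      => $f[0],
            symbol      => $f[2],
            description => $f[8],
            unigene     => [],    # stays empty unless gene2unigene is added
        };
    }
    return \%genes;
}

# Optionally attach UniGene clusters from "geneid\tunigeneid" lines.
# Lines whose GeneID has no gene_info record are counted, not stored.
sub add_unigene {
    my ($genes, $fh) = @_;
    my $skipped = 0;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^#/ || $line !~ /\S/;
        my ($gene_id, $cluster) = split /\t/, $line;
        if (exists $genes->{$gene_id}) {
            push @{ $genes->{$gene_id}{unigene} }, $cluster;
        }
        else {
            $skipped++;
        }
    }
    return $skipped;
}

# Made-up excerpts for illustration; these are not real NCBI records.
my $gene_info_text = join '', map { join("\t", @$_) . "\n" } (
    [ '#tax_id', 'GeneID', 'Symbol' ],
    [ 9606,  1001, 'GENE1', '-', '-', '-', '1', '1p1',  'example gene one' ],
    [ 10090, 2001, 'Gene2', '-', '-', '-', '2', '2 A1', 'example gene two' ],
);
my $unigene_text = "1001\tHs.0001\n1001\tHs.0002\n9999\tHs.0003\n";

open my $info_fh, '<', \$gene_info_text or die "cannot open in-memory file: $!";
my $genes = parse_gene_info($info_fh);
close $info_fh;

open my $ug_fh, '<', \$unigene_text or die "cannot open in-memory file: $!";
my $skipped = add_unigene($genes, $ug_fh);
close $ug_fh;

for my $id (sort { $a <=> $b } keys %$genes) {
    my $g = $genes->{$id};
    printf "%s\t%s\t%s\n", $id, $g->{symbol}, join(',', @{ $g->{unigene} }) || '-';
}
print "skipped $skipped UniGene line(s) with no gene_info record\n";
```

Because each optional file is folded in by its own sub, a user who only needs gene_info pays nothing for the rest, which is the point about keeping parse time reasonable.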

I know that having two parsers is not ideal (and that suggesting this 
is a bit of a cop-out), but NCBI has chosen a path that may necessitate 
both solutions to meet the needs of all users.  I would also certainly 
be willing to help out.

Sean

On Jan 7, 2005, at 1:51 AM, Peter Robinson wrote:

> Hi Stefan,
> happy to team up with you for Entrez Gene parsing. Since gene2unigene
> has entries of the form "geneid\tunigeneid", it didn't seem worth the
> trouble putting this information into a Bio::Annotation object in
> isolation. On the other hand, parsing multiple Entrez Gene files at once
> in order to synthesize various forms of information about an Entrez Gene
> id seemed to depart from the style of the rest of Bio::SeqIO code.
>
> Suggestions/thoughts, anyone?
>
> -peter
>
> On Fri, 2005-01-07 at 03:33, Stefan A Kirov wrote:
>> Peter,
>> Why can't unigene be added as a Bio::Annotation object, for example?
>> Peter, would you mind if I give you a hand, as I am also doing some
>> Entrez Gene DB parsing?
>> Hilmar,
>> Getting back to your post, I have some concerns about automatic
>> parsing of multiple files (if I got this right...). If one downloads
>> the whole Entrez Gene set and all is OK, I don't see why this can't be
>> done. But if something goes wrong (and occasionally it will), it will be
>> really hard for the user to understand that parts of the data are
>> missing. Of course this could be done through warnings, but what about
>> people who intentionally parse only part of the DB? I guess one can add
>> something like -suppress_warning=>1/0.
>> Another issue that comes to mind is that the stream approach is fine for
>> people who want the whole DB. But if you need a particular record, I
>> guess you could index the files, but this is a totally different
>> game. Any volunteers?
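
The indexing idea can be sketched in a few lines of plain Perl: record the byte offset of every line once, then seek straight to a record on demand. This is only an illustration, not an existing BioPerl indexer. The sub names and sample lines are made up, GeneID is assumed to sit in the second column, and the index lives in an in-memory hash (a persistent index would need something like a DB_File-tied hash instead).

```perl
use strict;
use warnings;

# Build a GeneID -> byte-offset index in one pass over the file.
# Assumes the GeneID is the second tab-separated column.
sub build_offset_index {
    my ($fh) = @_;
    my %offset;
    while (1) {
        my $pos  = tell $fh;          # offset of the line about to be read
        my $line = <$fh>;
        last unless defined $line;
        chomp $line;
        next if $line =~ /^#/ || $line !~ /\S/;
        my (undef, $gene_id) = split /\t/, $line;
        $offset{$gene_id} = $pos if defined $gene_id;
    }
    return \%offset;
}

# Fetch one record by GeneID without re-reading the whole file.
sub fetch_record {
    my ($fh, $index, $gene_id) = @_;
    return unless exists $index->{$gene_id};
    seek($fh, $index->{$gene_id}, 0) or die "seek failed: $!";
    my $line = <$fh>;
    chomp $line;
    return [ split /\t/, $line, -1 ];
}

# Made-up lines for illustration; these are not real NCBI records.
my $text = join '', map { join("\t", @$_) . "\n" } (
    [ '#tax_id', 'GeneID', 'Symbol' ],
    [ 9606,  1001, 'GENE1' ],
    [ 10090, 2001, 'Gene2' ],
    [ 10090, 2002, 'Gene3' ],
);

open my $fh, '<', \$text or die "cannot open in-memory file: $!";
my $index  = build_offset_index($fh);
my $record = fetch_record($fh, $index, 2001);
my $absent = fetch_record($fh, $index, 4242);
print join("\t", @$record), "\n" if $record;
close $fh;
```

The index has to be rebuilt whenever the underlying file is replaced, since the offsets are only valid for the exact file they were computed from.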
>>
>>
>> On Thu, 6 Jan 2005, Peter Robinson wrote:
>>
>>> Dear Bioperlers,
>>>
>>> I have started looking at writing some modules to parse the new Entrez
>>> Gene, which is kind of an expanded LocusLink. The really interesting
>>> files are species-specific and are in the ASN.1 format, and I am still
>>> experimenting with the best way of parsing them. To get started,
>>> I am looking at the tab-delimited flat files. It seems to me that it
>>> would be interesting to be able to parse gene_info and gene2accession
>>> using the Bio::SeqIO system; the other files such as gene2unigene seem
>>> less suited for this (the latter has just two fields per line, which
>>> could be parsed ad hoc easily enough).
>>>
>>> In any case, I am sending a proposed module Bio::SeqIO::geneinfo.pm as
>>> well as a test script (which contains a small excerpt of gene_info in
>>> the data section) for comments and criticism to the list. I am presently
>>> working on another module for Bio::SeqIO::gene2accession and plan to
>>> write a demo script using both modules to convert NCBI accession numbers
>>> to MGI accession numbers (which is something one might want to do in
>>> order to use Gene Ontology for Affymetrix data, although one needs
>>> additional work for probesets which are only related to ESTs).
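
The accession-to-MGI conversion described above amounts to two lookups joined on GeneID. The plain-Perl sketch below is not the demo script mentioned in the message and does not use the proposed Bio::SeqIO modules. It assumes that gene_info carries pipe-separated cross-references in its sixth column and that gene2accession carries the RNA accession.version in its fourth column. The sub names and all sample identifiers are made up.

```perl
use strict;
use warnings;

# GeneID -> MGI cross-reference, taken from gene_info.
# Assumed: GeneID in column 2, dbXrefs in column 6, entries separated by "|".
sub mgi_by_gene {
    my ($fh) = @_;
    my %mgi;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^#/ || $line !~ /\S/;
        my @f = split /\t/, $line, -1;
        next unless defined $f[5];
        my ($xref) = grep { /^MGI:/ } split /\|/, $f[5];
        $mgi{ $f[1] } = $xref if defined $xref;
    }
    return \%mgi;
}

# RNA accession (version suffix removed) -> GeneID, taken from gene2accession.
# Assumed: GeneID in column 2, RNA accession.version in column 4, "-" if absent.
sub gene_by_accession {
    my ($fh) = @_;
    my %gene;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^#/ || $line !~ /\S/;
        my @f = split /\t/, $line, -1;
        next unless defined $f[3] && $f[3] ne '-';
        (my $acc = $f[3]) =~ s/\.\d+$//;
        $gene{$acc} = $f[1];
    }
    return \%gene;
}

# Join the two tables; returns undef when either lookup fails.
sub accession_to_mgi {
    my ($acc, $gene_for, $mgi_for) = @_;
    $acc =~ s/\.\d+$//;
    my $gene_id = $gene_for->{$acc};
    return defined $gene_id ? $mgi_for->{$gene_id} : undef;
}

# Made-up excerpts for illustration; none of these identifiers are real.
my $info_text = join '', map { join("\t", @$_) . "\n" } (
    [ 10090, 2001, 'Gene2', '-', '-', 'MGI:0000001|Ensembl:EXAMPLE', '2', '2 A1', 'example' ],
    [ 10090, 2002, 'Gene3', '-', '-', '-', '3', '3 B', 'no cross-references' ],
);
my $acc_text = join '', map { join("\t", @$_) . "\n" } (
    [ 10090, 2001, 'REVIEWED', 'NM_000001.2', '-', 'NP_000001.1', '-' ],
    [ 10090, 2002, '-',        '-',           '-', '-',           '-' ],
);

open my $info_fh, '<', \$info_text or die "cannot open in-memory file: $!";
my $mgi_for = mgi_by_gene($info_fh);
close $info_fh;

open my $acc_fh, '<', \$acc_text or die "cannot open in-memory file: $!";
my $gene_for = gene_by_accession($acc_fh);
close $acc_fh;

my $hit  = accession_to_mgi('NM_000001.2', $gene_for, $mgi_for);
my $miss = accession_to_mgi('NM_999999',   $gene_for, $mgi_for);
print "NM_000001.2 => ", (defined $hit ? $hit : 'no MGI id'), "\n";
```

Stripping the version suffix on both sides is a design choice made here so that a query for an older accession version still resolves. Drop it if exact versions matter.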
>>>
>>> For the moment it seemed better to just parse the NCBI taxon id into
>>> the Bio::Species object (only this info is supplied by gene_info), and
>>> expect users who need more information to use the taxonomy support of
>>> other Bioperl modules in their scripts.
>>>
>>> I will continue to work on parsing the species-specific ASN.1 files, but
>>> I will be trying a combination of lex/yacc/C to do this. If that works I
>>> will look into trying Perl support for lex/yacc for potential use in
>>> Bioperl, but since I am not sure how long this will take me, I do not
>>> want to scare off anyone else who would like to give this a shot.
>>>
>>> best,
>>> peter
>>>
>>>
>>> On Tue, 2005-01-04 at 22:03, Jason Stajich wrote:
>>>> On Jan 4, 2005, at 3:52 PM, Peter Robinson wrote:
>>>>
>>>>> Hi Jason,
>>>>>
>>>>> thanks for the advice. It seems as if the documentation of
>>>>> Bio::DB::Taxonomy is a bit out of sync.
>>>>>  my $db = new Bio::DB::Taxonomy(-source => 'flatfile',
>>>>>                                  -nodesfile => $nodesfile,
>>>>>                                  -namesfile => $namefile);
>>>>> What does 'flatfile' refer to here? It is not apparent upon looking at
>>>>> the code for new.
>>>>>
>>>> See Bio::DB::Taxonomy::flatfile for more information.  As I mentioned
>>>> in the mail I sent, flatfile is for when you have downloaded the
>>>> taxonomy DB from NCBI.  This lets you run it locally using an indexed
>>>> (BerkeleyDB via DB_File) version of the file.
>>>>
>>>> You probably need the most up-to-date version of the modules - works
>>>> fine for me for both the entrez and flatfile code, but you may have to
>>>> upgrade from the 1.4.0 release. Code from CVS or the bioperl-1.5 RC1
>>>> code should work fine.
>>>>
>>>>
>>>>
>>>>> I had somewhat better luck using the entrez version, but I got a
>>>>> pretty amusing error
>>>>> message:
>>>>>
>>>>> MSG: can't create a species object for Homo sapiens (human) because it
>>>>> isn't a species but is a '' instead
>>>>>
>>>>> ###
>>>>> Full error and a dump of the script follow:
>>>>>
>>>>> use Bio::DB::Taxonomy;
>>>>> use Data::Dumper;
>>>>> my $db = new Bio::DB::Taxonomy(-source => 'entrez');
>>>>> my $taxaid = $db->get_taxonid('Homo sapiens');
>>>>> my $species = $db->get_Taxonomy_Node(-taxonid => '9606');
>>>>> print Dumper($species);
>>>>>
>>>>> ###
>>>>>
>>>>> Use of uninitialized value in string eq at
>>>>> /usr/local/share/perl/5.8.4/Bio/DB/Taxonomy/entrez.pm line 192.
>>>>> Use of uninitialized value in sprintf at
>>>>> /usr/local/share/perl/5.8.4/Bio/DB/Taxonomy/entrez.pm line 201.
>>>>>
>>>>> -------------------- WARNING ---------------------
>>>>> MSG: can't create a species object for Homo sapiens (human) because it
>>>>> isn't a species but is a '' instead
>>>>> ---------------------------------------------------
>>>>> Use of uninitialized value in string eq at
>>>>> /usr/local/share/perl/5.8.4/Bio/DB/Taxonomy/entrez.pm line 192.
>>>>> Use of uninitialized value in sprintf at
>>>>> /usr/local/share/perl/5.8.4/Bio/DB/Taxonomy/entrez.pm line 201.
>>>>>
>>>>> -------------------- WARNING ---------------------
>>>>> MSG: can't create a species object for Homo sapiens (human) because it
>>>>> isn't a species but is a '' instead
>>>>> ---------------------------------------------------
>>>>> $VAR1 = {
>>>>>           'TaxId' => '9606',
>>>>>           'Division' => 'mammals',
>>>>>           'GeneNumber' => '32775',
>>>>>           'Rank' => 'species',
>>>>>           'ProtNumber' => '247791',
>>>>>           'ScientificName' => 'Homo sapiens',
>>>>>           'CommonName' => 'human',
>>>>>           'NucNumber' => '9025800',
>>>>>           'GenNumber' => '25',
>>>>>           'StructNumber' => '5638'
>>>>>         };
>>>>> peter at anna:~/programs/bioperlTest$
>>>>>
>>>>>
>>>>> --best, peter
>>>>>
>>>>> On Mon, 2005-01-03 at 23:51, Jason Stajich wrote:
>>>>>> Bio::DB::Taxonomy is the factory code - it is pretty easy to get a
>>>>>> species object (or equivalent) using this code.  But you cannot (or
>>>>>> could not when I wrote this, not sure of the current status) get the
>>>>>> full classification from the NCBI taxonomy retrieval via cgi, i.e. you
>>>>>> can only get genus and species for a taxon id and I don't know how to
>>>>>> walk up the hierarchy using the web API.  Earlier emails to NCBI seemed
>>>>>> to indicate this is all they intended to provide, but not sure what the
>>>>>> current status is.
>>>>>>
>>>>>>   # use NCBI Entrez over HTTP
>>>>>>   my $db = new Bio::DB::Taxonomy(-source => 'entrez');
>>>>>>   my $taxaid = $db->get_taxonid('Homo sapiens');
>>>>>>   my $taxonnode = $db->get_Taxonomy_Node(-taxonid => '9606');
>>>>>>
>>>>>> You can get the full classification if you use the
>>>>>> Bio::DB::Taxonomy::flatfile factory which requires you to have
>>>>>> downloaded the taxonomy db flatfile from NCBI.  Since this is more
>>>>>> reliable (and faster) it is what I have tended to use for grouping sets
>>>>>> of seqDB search results, etc.
>>>>>>
>>>>>> -jason
>>>>>> On Jan 3, 2005, at 5:40 PM, Peter Robinson wrote:
>>>>>>
>>>>>>> Hi Bioperlers, hi Hilmar,
>>>>>>>
>>>>>>> after some thinking I have embarked on a lex/yacc parser for the Entrez
>>>>>>> Gene ASN.1 format as the path of least resistance, although I am not
>>>>>>> sure how that would fit into BioPerl. If anyone is interested in this
>>>>>>> (or has a better idea of how to go about it...), please drop me a line.
>>>>>>>
>>>>>>> In the meantime I have been looking at writing code to parse some of the
>>>>>>> "easy" Entrez Gene documents, starting off with gene_info. This file
>>>>>>> includes the NCBI taxon id for each entry. I would like to convert this
>>>>>>> to a Bio::Species object to pass to the following
>>>>>>> 	my $seq = $self->sequence_factory->create(
>>>>>>> 			     -verbose => $self->verbose(),
>>>>>>> 			     -accession_number => $geneID,
>>>>>>> 			     -desc => $description,
>>>>>>> 			     -display_id => $symbol,
>>>>>>> 			     -species =>  ???
>>>>>>> 			     -annotation => $ann);
>>>>>>>
>>>>>>> and saw the Bio::Taxonomy::FactoryI code, which appears to want to do
>>>>>>> this sort of thing. However, the code for that is pretty preliminary. Is
>>>>>>> anyone working on this at the moment? Or is there a better way of doing
>>>>>>> this (it seems a shame not to provide the actual species name if one has
>>>>>>> the taxid...)?
>>>>>>>
>>>>>>> best
>>>>>>>
>>>>>>> Peter
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, 2004-12-28 at 07:17, Hilmar Lapp wrote:
>>>>>>>> Great to hear that someone is giving this a shot. Yes, at this point it
>>>>>>>> appears that NCBI is only offering the ASN.1, not a conversion to XML.
>>>>>>>> Their asn2xml tool will not work with this ASN.1 format either; I just
>>>>>>>> checked it to be sure. They do seem to be mulling the option of XML,
>>>>>>>> though, on the Gene FAQ. Maybe if enough people get in their ears they
>>>>>>>> will spend some effort towards that. After all, the Entrez Gene web
>>>>>>>> interface can display XML on demand - even though it looks fairly
>>>>>>>> hideous.
>>>>>>>>
>>>>>>>> There is no ASN.1 support in bioperl at all. Also, ASN.1 support in
>>>>>>>> Perl is actually thin - there is Convert::ASN1 at version 0.18 from two
>>>>>>>> years ago that I could find ... doesn't make me feel warm and fuzzy.
>>>>>>>>
>>>>>>>> In the absence of any XML available from NCBI, gene_info might be the
>>>>>>>> best start. An option could be to check for the presence of the other
>>>>>>>> tab-delimited files and use those that are present. These are
>>>>>>>> tab-delimited and hence the format itself is trivial so you can focus
>>>>>>>> entirely on setting up a Bio::Seq plus annotation that's
>>>>>>>> comparable/compatible to what the current SeqIO::locuslink does.
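
Checking which tab-delimited files are present before parsing could look roughly like the sketch below. It is a plain-Perl illustration, not BioPerl code. The sub name find_gene_files is hypothetical, and so is its suppress_warning option, which follows the -suppress_warning idea floated elsewhere in this thread. The optional-file list is limited to files named in the discussion.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Optional companion files named in this thread; extend as needed.
my @OPTIONAL = qw(gene2accession gene2unigene);

# Return a hash of file name -> path for every file found in $dir.
# gene_info is required; missing optional files only trigger a warning,
# which the (hypothetical) suppress_warning option silences.
sub find_gene_files {
    my ($dir, %opt) = @_;
    my $required = File::Spec->catfile($dir, 'gene_info');
    die "gene_info not found in $dir\n" unless -e $required;

    my %found = (gene_info => $required);
    for my $name (@OPTIONAL) {
        my $path = File::Spec->catfile($dir, $name);
        if (-e $path) {
            $found{$name} = $path;
        }
        elsif (!$opt{suppress_warning}) {
            warn "$name not found in $dir; records will lack that data\n";
        }
    }
    return \%found;
}

# Demonstration against a temporary directory with two empty stand-in files.
my $dir = tempdir(CLEANUP => 1);
for my $name (qw(gene_info gene2unigene)) {
    open my $out, '>', File::Spec->catfile($dir, $name)
        or die "cannot create $name: $!";
    close $out;
}

my $files = find_gene_files($dir, suppress_warning => 1);
print "will parse: ", join(', ', sort keys %$files), "\n";
```

Warning by default and letting the caller opt out keeps an accidental partial download visible while still supporting people who deliberately parse only part of the data.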
>>>>>>>>
>>>>>>>> My $0.02 (worth less and less almost every day).
>>>>>>>>
>>>>>>>> 	-hilmar
>>>>>>>>
>>>>>>>> On Thursday, December 23, 2004, at 10:51  AM, Peter Robinson wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I have been thinking about giving a BioPerl EntrezGene parser a try
>>>>>>>>> since I have been a heavy user of LocusLink to date. One issue is that
>>>>>>>>> the files that correspond to LL_tmpl (which was a flat file) are now in
>>>>>>>>> ASN.1 format
>>>>>>>>> http://www.ncbi.nlm.nih.gov/entrez/query/static/help/genehelp.html#query
>>>>>>>>> Although I saw some mention of ASN support in Bioperl by googling, I
>>>>>>>>> can't seem to find any module that does this in the present
>>>>>>>>> distribution. What is the status on that? In any case, I will be working
>>>>>>>>> on this in the next month or two and if anything nice comes of it I will
>>>>>>>>> send it to you / BioPerl.
>>>>>>>>>
>>>>>>>>> best wishes & happy holidays
>>>>>>>>>
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>> On Tue, 2004-12-14 at 09:00, Hilmar Lapp wrote:
>>>>>>>>>> Since load_seqdatabase.pl will use bioperl's SeqIO parsers for parsing
>>>>>>>>>> any input file, what you're asking is whether or not there is a SeqIO
>>>>>>>>>> parser for NCBI Gene.
>>>>>>>>>>
>>>>>>>>>> The answer to that question is no, not yet. Anybody who feels motivated
>>>>>>>>>> is welcome to give it a try ... Since I'll need it, I'll write the
>>>>>>>>>> parser if nobody else does within the next 3 months, but I'm not going
>>>>>>>>>> to promise when exactly this will happen.
>>>>>>>>>>
>>>>>>>>>> 	-hilmar
>>>>>>>>>>
>>>>>>>>>> On Monday, December 13, 2004, at 08:03  AM, Law, Annie wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I was wondering, with regard to the bioperl-db scripts, schema, and
>>>>>>>>>>> load_seqdatabase.pl, has there been preparation for integration of
>>>>>>>>>>> Entrez Gene information when LocusLink is phased out?  Or if it has
>>>>>>>>>>> already been changed, could somebody point me to the documentation
>>>>>>>>>>> or changed code?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Annie.
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Bioperl-l mailing list
>>>>>>>>>>> Bioperl-l at portal.open-bio.org
>>>>>>>>>>> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Peter N. Robinson
>>>>>>>>> peter.robinson at t-online.de
>>>>>>>>> peter.robinson at charite.de
>>>>>>>>> http://www.charite.de/ch/medgen/robinson/
>>>>>>>>>
>>>>>>>>>
>>>>>> --
>>>>>> Jason Stajich
>>>>>> jason.stajich at duke.edu
>>>>>> http://www.duke.edu/~jes12/