[Bioperl-l] Entrez Gene and bioperl-db

Tue Jan 4 16:03:42 EST 2005

On Jan 4, 2005, at 3:52 PM, Peter Robinson wrote:

> Hi Jason,
>
> thanks for the advice. It seems as if the documentation of
> Bio::DB::Taxonomy is a bit out of sync.
>  my $db = new Bio::DB::Taxonomy(-source => 'flatfile',
>                                  -nodesfile => $nodesfile,
>                                  -namesfile => $namefile);
> What does 'flatfile' refer to here? It is not apparent upon looking at 
> the code for new.
>
See Bio::DB::Taxonomy::flatfile for more information.  As I mentioned
in the mail I sent, flatfile is for when you have downloaded the
taxonomy DB from NCBI.  This lets you run it locally using an indexed
(BerkeleyDB via DB_File) version of the file.

You probably need the most up-to-date version of the modules.  Both the
entrez and flatfile code work fine for me, but you may have to upgrade
from the 1.4.0 release.  Code from CVS or the bioperl-1.5 RC1 code
should work fine.
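
A rough sketch of the flatfile setup, assuming you have unpacked NCBI's
taxdump.tar.gz so that nodes.dmp and names.dmp are on disk.  The sub name
lookup_taxon and its arguments are made up for illustration; -directory
is where the DB_File indexes get written.  Treat it as a starting point,
not as the documented way to do it:

```perl
use strict;
use warnings;

# Sketch only: look up a taxon by name in a local copy of the NCBI
# taxonomy dump. lookup_taxon and its arguments are hypothetical names.
sub lookup_taxon {
    my ($name, $nodesfile, $namesfile, $idx_dir) = @_;

    # Loaded lazily so the sub can be defined without BioPerl installed.
    require Bio::DB::Taxonomy;

    my $db = Bio::DB::Taxonomy->new(
        -source    => 'flatfile',
        -directory => $idx_dir,      # where the index files are written
        -nodesfile => $nodesfile,    # path to nodes.dmp
        -namesfile => $namesfile,    # path to names.dmp
    );

    # Resolve the name to a taxon id, then fetch the node for that id.
    my $taxonid = $db->get_taxonid($name);
    return unless defined $taxonid;
    return $db->get_Taxonomy_Node(-taxonid => $taxonid);
}
```

You would call it as lookup_taxon('Homo sapiens', $nodesfile, $namesfile,
'/tmp').  The first run has to build the indexes, so expect it to be
slow; later runs reuse them.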

> I had somewhat better luck using the entrez version, but I got a 
> pretty amusing error
> message:
>
> MSG: can't create a species object for Homo sapiens (human) because it
> isn't a species but is a '' instead
>
> ###
> Full error and a dump of the script follow:
>
> my $db = new Bio::DB::Taxonomy(-source => 'entrez'); #
> my $taxaid = $db->get_taxonid('Homo sapiens');
> my $species = $db->get_Taxonomy_Node(-taxonid => '9606');
> print Dumper($species);
>
> ###
>
> Use of uninitialized value in string eq at
> /usr/local/share/perl/5.8.4/Bio/DB/Taxonomy/entrez.pm line 192.
> Use of uninitialized value in sprintf at
> /usr/local/share/perl/5.8.4/Bio/DB/Taxonomy/entrez.pm line 201.
>
> -------------------- WARNING ---------------------
> MSG: can't create a species object for Homo sapiens (human) because it
> isn't a species but is a '' instead
> ---------------------------------------------------
> Use of uninitialized value in string eq at
> /usr/local/share/perl/5.8.4/Bio/DB/Taxonomy/entrez.pm line 192.
> Use of uninitialized value in sprintf at
> /usr/local/share/perl/5.8.4/Bio/DB/Taxonomy/entrez.pm line 201.
>
> -------------------- WARNING ---------------------
> MSG: can't create a species object for Homo sapiens (human) because it
> isn't a species but is a '' instead
> ---------------------------------------------------
> $VAR1 = {
>           'TaxId' => '9606',
>           'Division' => 'mammals',
>           'GeneNumber' => '32775',
>           'Rank' => 'species',
>           'ProtNumber' => '247791',
>           'ScientificName' => 'Homo sapiens',
>           'CommonName' => 'human',
>           'NucNumber' => '9025800',
>           'GenNumber' => '25',
>           'StructNumber' => '5638'
>         };
> peter at anna:~/programs/bioperlTest$
>
>
> --best, peter
>
> On Mon, 2005-01-03 at 23:51, Jason Stajich wrote:
>> Bio::DB::Taxonomy is the factory code - it is pretty easy to get a
>> species object (or equivalent) using this code.  But you cannot (or
>> could not when I wrote this, not sure of the current status) get the
>> full classification from the NCBI taxonomy retrieval via cgi.  i.e.
>> you can only get genus and species for a taxon id and I don't know how
>> to walk up the hierarchy using the web API.  Earlier emails to NCBI
>> seemed to indicate this is all they intended to provide, but not sure
>> what the current status is.
>>
>>   # use NCBI Entrez over HTTP
>>   my $db = new Bio::DB::Taxonomy(-source => 'entrez');
>>   my $taxaid = $db->get_taxonid('Homo sapiens');
>>   my $taxonnode = $db->get_Taxonomy_Node(-taxonid => '9606');
>>
>> You can get the full classification if you use the
>> Bio::DB::Taxonomy::flatfile factory which requires you to have
>> downloaded the taxonomy db flatfile from NCBI.  Since this is more
>> reliable (and faster) it is what I have tended to use for grouping
>> sets of seqDB search results, etc.
>>
>> -jason
>> On Jan 3, 2005, at 5:40 PM, Peter Robinson wrote:
>>
>>> Hi Bioperlers, hi Hilmar,
>>>
>>> after some thinking I have embarked on a lex/yacc parser for the
>>> Entrez Gene ASN.1 format as the path of least resistance, although I
>>> am not sure how that would fit in to BioPerl. If anyone is interested
>>> in this (or has a better idea of how to go about it..), please drop
>>> me a line.
>>>
>>> In the meantime I have been looking at writing code to parse some of
>>> the "easy" Entrez gene documents, starting off with gene_info. This
>>> file includes the NCBI taxon id for each entry. I would like to
>>> convert this to a Bio::Species object to pass to the following
>>> 	my $seq = $self->sequence_factory->create(
>>> 			     -verbose => $self->verbose(),
>>> 			     -accession_number => $geneID,
>>> 			     -desc => $description,
>>> 			     -display_id => $symbol,
>>> 			     -species =>  ???
>>> 			     -annotation => $ann);
>>>
>>> and saw the Bio::Taxonomy::FactoryI code, which appears to want to do
>>> this sort of thing. However, the code for that is pretty preliminary.
>>> Is anyone working on this at the moment? Or is there a better way of
>>> doing this (it seems a shame not to provide the actual species name
>>> if one has the taxid...)
>>>
>>> best
>>>
>>> Peter
>>>
>>>
>>>
>>> On Tue, 2004-12-28 at 07:17, Hilmar Lapp wrote:
>>>> Great to hear that someone is giving this a shot. Yes, at this point
>>>> it appears that NCBI is only offering the ASN.1, not a conversion to
>>>> XML. Their asn2xml tool will not work with this ASN.1 format either,
>>>> just checked it to be sure. They do seem to be mulling the option of
>>>> XML though on the Gene FAQ. Maybe if enough people get in their ears
>>>> they will spend some effort towards that. After all, the entrez gene
>>>> web interface can display XML on demand - even though it looks
>>>> fairly hideous.
>>>>
>>>> There is no ASN.1 support in bioperl at all. Also, ASN.1 support in
>>>> perl is actually thin - all I could find is Convert::ASN1, at
>>>> version 0.18 from two years ago ... doesn't make me feel warm and
>>>> fuzzy.
>>>>
>>>> In the absence of any XML available from NCBI, gene_info might be
>>>> the best start. An option could be to check for the presence of the
>>>> other tab-delimited files and use those that are present. The format
>>>> itself is trivial, so you can focus entirely on setting up a Bio::Seq
>>>> plus annotation that's comparable/compatible to what the current
>>>> SeqIO::locuslink does.
>>>>
>>>> My $0.02 (worth less and less almost every day).
>>>>
>>>> 	-hilmar
>>>>
>>>> On Thursday, December 23, 2004, at 10:51  AM, Peter Robinson wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have been thinking about giving a BioPerl EntrezGene parser a try
>>>>> since I have been a heavy user of LocusLink to date. One issue is
>>>>> that the files that correspond to LL_tmpl (which was a flat file)
>>>>> are now in ASN.1 format:
>>>>> http://www.ncbi.nlm.nih.gov/entrez/query/static/help/genehelp.html#query
>>>>> Although I saw some mention of ASN.1 support in BioPerl by googling,
>>>>> I can't seem to find any module that does this in the present
>>>>> distribution. What is the status on that? In any case, I will be
>>>>> working on this in the next month or two and if anything nice comes
>>>>> of it I will send it to you / BioPerl.
>>>>>
>>>>> best wishes & happy holidays
>>>>>
>>>>> Peter
>>>>>
>>>>> On Tue, 2004-12-14 at 09:00, Hilmar Lapp wrote:
>>>>>> Since load_seqdatabase.pl will use bioperl's SeqIO parsers for
>>>>>> parsing any input file, what you're asking is whether or not there
>>>>>> is a SeqIO parser for NCBI Gene.
>>>>>>
>>>>>> The answer to that question is no, not yet. Anybody who feels
>>>>>> motivated is welcome to give it a try ... Since I'll need it, I'll
>>>>>> write the parser if nobody else does within the next 3 months, but
>>>>>> I'm not going to promise when exactly this will happen.
>>>>>>
>>>>>> 	-hilmar
>>>>>>
>>>>>> On Monday, December 13, 2004, at 08:03  AM, Law, Annie wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I was wondering, with regard to the bioperl-db scripts and schema
>>>>>>> and load_seqdatabase.pl, has there been any preparation for
>>>>>>> integrating Entrez Gene information when LocusLink is phased out?
>>>>>>> Or, if it has already been changed, could somebody point me to
>>>>>>> the documentation or changed code?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Annie.
>>>>>>> _______________________________________________
>>>>>>> Bioperl-l mailing list
>>>>>>> Bioperl-l at portal.open-bio.org
>>>>>>> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>>>>>>>
>>>>>>>
>>>>> -- 
>>>>> Peter N. Robinson
>>>>> peter.robinson at t-online.de
>>>>> peter.robinson at charite.de
>>>>> http://www.charite.de/ch/medgen/robinson/
>>>>>
>>>>>
>> --
>> Jason Stajich
>> jason.stajich at duke.edu
>> http://www.duke.edu/~jes12/
>
>
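
On the gene_info parsing discussed in the quoted thread above: since the
file is tab-delimited, pulling out the taxon id and the fields needed for
the sequence object can be sketched roughly as follows.  The column order
is an assumption on my part (tax_id, GeneID, Symbol, LocusTag, Synonyms,
dbXrefs, chromosome, map_location, description), so check it against the
header line of your copy.  The sub name parse_gene_info_line and the
sample line are made up for illustration:

```perl
use strict;
use warnings;

# Assumed column order: tax_id, GeneID, Symbol, LocusTag, Synonyms,
# dbXrefs, chromosome, map_location, description. Verify against the
# header line of the actual file before relying on the indexes below.
sub parse_gene_info_line {
    my ($line) = @_;
    chomp $line;
    return if $line =~ /^#/ or $line !~ /\S/;   # skip header and blank lines

    my @f = split /\t/, $line, -1;              # -1 keeps trailing empty fields
    return if @f < 9;                           # too short to be a record

    # A lone '-' is treated as "no value" (assumption about the format).
    my ($taxid, $gene_id, $symbol, $desc) =
        map { (defined($_) && $_ ne '-') ? $_ : undef } @f[0, 1, 2, 8];

    return {
        taxid       => $taxid,
        gene_id     => $gene_id,
        symbol      => $symbol,
        description => $desc,
    };
}

# Hand-typed sample line, for illustration only.
my $rec = parse_gene_info_line(
    join("\t", 9606, 1, 'A1BG', '-', 'A1B', '-', 19, '19q13.4',
         'alpha-1-B glycoprotein') . "\n"
);
print "$rec->{gene_id} $rec->{symbol} (taxon $rec->{taxid})\n";
```

The taxid field is what you would then hand to
get_Taxonomy_Node(-taxonid => ...) to get something to put in -species,
and gene_id, symbol and description map onto -accession_number,
-display_id and -desc.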
--
Jason Stajich
jason.stajich at duke.edu
http://www.duke.edu/~jes12/