[Bioperl-l] Bio::*Taxonomy* changes

Sendu Bala bix at sendu.me.uk
Thu Jul 20 22:47:33 UTC 2006


Chris Fields wrote:
> As for caching,
> do you mean caching of the tax information or the sequence ID information?

Anything you get from entrez.


> Caching of tax information would be great, but how would you go about it?  I
> can see how it would be easy to have a cache for the flatfile using a local
> index, but not so much for XML data retrieved from Entrez (a
> continually-appended local file, maybe, with an accompanying index file?).

I didn't actually mean a stored file (though that would be possible with a 
tied hash or something: DB_File, just like flatfile), but an in-memory 
cache for use during the course of program execution. A stored file would 
probably be dangerous because you wouldn't know whether the data had become 
stale, and checking whether it had would defeat the point.
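
As a rough sketch of what I mean by an in-memory cache (written in Python 
rather than Perl purely for illustration; the class name, the fetch 
function and the record layout are all made up and are not part of any 
Bioperl module):

```python
from typing import Callable, Dict


class InMemoryTaxonCache:
    """Memoize lookups for the lifetime of the process; nothing goes to disk."""

    def __init__(self, fetch: Callable[[str], dict]) -> None:
        # 'fetch' stands in for a remote lookup such as an Entrez query
        # (hypothetical; any callable taking an id and returning a record works).
        self._fetch = fetch
        self._cache: Dict[str, dict] = {}

    def get(self, taxon_id: str) -> dict:
        # Only go to the remote source the first time an id is requested.
        if taxon_id not in self._cache:
            self._cache[taxon_id] = self._fetch(taxon_id)
        return self._cache[taxon_id]


# Stand-in for the remote call, recording how often it is actually hit.
calls = []


def fake_entrez_fetch(taxon_id: str) -> dict:
    calls.append(taxon_id)
    return {"id": taxon_id, "name": "taxon-" + taxon_id}


cache = InMemoryTaxonCache(fake_entrez_fetch)
first = cache.get("9606")
second = cache.get("9606")  # served from memory, no second remote call
```

Because the cache dies with the program, staleness is limited to a single 
run, which is the trade-off I was after.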


>> The problem is, genus() and species() are special cases that aren't
>> normally directly set. They get their values from the classification
>> array: genus() returns (classification())[1] and species() returns
>> (classification())[0]. They set the same values. Doing this is only sane
>> (though is still likely to be wrong, given that there can be ranks
>> between species and genus) when the node is of rank 'species', hence the
>> warnings.
>>
>> I imagine this is to work with pesky file formats like genbank, so I
>> can't really change anything here without major overhaul. And my plans
>> for overhaul involve getting rid of genus() and species(), so I'll just
>> leave them be for now.
> 
> This would all depend on where the information came from; if the information
> came from the Entrez XML <LineageEx> element data:
> 
[snip]
> 
> The subspecies(), genus(), and species() could all be set from this instead
> of the classification array.  The problem lies then with the flatfile data
> and how it would be parsed out, if that's at all possible with the flatfile
> data.  If not, I see why you would rather have this return a stripped-down
> Bio::Taxonomy::Node object.
> 
> I would have to look at how everything is indexed in
> Bio::DB::Taxonomy::entrez, but I think it's feasible.

entrez already parses through LineageEx to build the classification 
array. flatfile walks up all the parents to do the same. Having the 
information isn't the issue. We have the information. The methods 
genus() and species() need to work with the genbank file format; that is 
the problem.
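
To make the positional lookup concrete, here is a small sketch (again in 
Python for illustration only; the function names mirror the Perl methods 
but this is not the Bioperl code, and the lineages are simplified, 
illustrative lists rather than real database output). It assumes the 
classification array runs from most specific to least specific, with 
species at index 0 and genus at index 1:

```python
from typing import List


def species(classification: List[str]) -> str:
    # Positional lookup: element 0 is assumed to be the species.
    return classification[0]


def genus(classification: List[str]) -> str:
    # Positional lookup: element 1 is assumed to be the genus.
    return classification[1]


# Simplified lineage where genus sits directly above species.
plain = ["sapiens", "Homo", "Hominidae"]

# Simplified lineage with an extra rank (a subgenus) between species and genus.
with_subgenus = ["melanogaster", "Sophophora", "Drosophila", "Drosophilidae"]

plain_genus = genus(plain)
shifted_genus = genus(with_subgenus)  # picks up the subgenus, not the genus
```

With the second list, genus() returns the intermediate rank rather than 
the genus, which is the case the warnings are there for.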


