[Bioperl-l] embl.pm and virus names

Heikki Lehvaslaiho heikki at nildram.co.uk
Sat Nov 8 09:05:00 EST 2003


Neil,

The Bio::Species class is relatively new in Bioperl and has not been
extensively tested. The EMBL parser simply expects to find a normal
binomial scientific name in every entry. 

I'll try to fix this for viruses. The current EMBL parser generates this
kind of structure:

$VAR1 = bless( {
                 '_sub_species' => 'virus',
                 '_classification' => [
                                        'immunodeficiency',
                                        'Human',
                                        'Primate lentivirus group',
                                        'Lentivirus',
                                        'Retroviridae',
                                        'Retroid viruses',
                                        'Viruses'
                                      ]
               }, 'Bio::Species' );

I've now changed my copy of the parser to produce:

$VAR1 = bless( {
                 '_classification' => [
                                        'Human immunodeficiency virus',
                                        'Primate lentivirus group',
                                        'Lentivirus',
                                        'Retroviridae',
                                        'Retroid viruses',
                                        'Viruses'
                                      ]
               }, 'Bio::Species' );


There is no subspecies, and the whole OS line is in the first item of the
array. I am reasonably happy with this. My only gripe is that if you call
binomial() on this object, you get:

'Primate lentivirus group Human immunodeficiency virus'

while genus() gives:

'Primate lentivirus group'

Is this good enough, or can anyone suggest a better solution?
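
To make the gripe concrete: the results above are what you would get if the accessors read the classification array purely by position, with the species in slot 0 and the genus in slot 1. The following is a minimal pure-Perl sketch under that assumption. The names species_of, genus_of and binomial_of are hypothetical stand-ins for illustration, not the Bio::Species source.

```perl
use strict;
use warnings;

# Classification as the patched parser stores it: most specific rank first.
my @classification = (
    'Human immunodeficiency virus',
    'Primate lentivirus group',
    'Lentivirus',
    'Retroviridae',
    'Retroid viruses',
    'Viruses',
);

# Hypothetical stand-ins for the accessors, assuming purely positional lookup.
sub species_of  { my ($c) = @_; return $c->[0] }    # slot 0: species
sub genus_of    { my ($c) = @_; return $c->[1] }    # slot 1: genus
sub binomial_of { my ($c) = @_; return join ' ', genus_of($c), species_of($c) }

print binomial_of(\@classification), "\n";
# prints: Primate lentivirus group Human immunodeficiency virus
print genus_of(\@classification), "\n";
# prints: Primate lentivirus group
```

Under this assumption the odd binomial is simply a consequence of the next rank up being treated as the genus, which viruses in this lineage do not have in the binomial sense.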

In addition to EMBL, I'll try to make sure that the GenBank and SWISS-PROT
parsers treat viruses similarly.

	-Heikki

P.S. I could not find any EMBL entries with PCC6803 in the OS line, but
given an OS line like 'Synechocystis sp. PCC6803', 'PCC6803' should end up
in subspecies().
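
As a rough illustration of the intended split, here is a small pure-Perl sketch. The helper split_os_line is hypothetical, and the /virus/i test is an assumed heuristic for spotting viral names; neither is taken from embl.pm.

```perl
use strict;
use warnings;

# Hypothetical helper, not the embl.pm code.
# Returns the list (genus, species, sub_species) for an OS line.
sub split_os_line {
    my ($os) = @_;
    # Assumed heuristic: a name containing "virus" is kept whole as the species.
    return (undef, $os, undef) if $os =~ /virus/i;
    # Otherwise: first word is the genus, second the species, the rest the subspecies.
    my ($genus, $species, $sub_species) = split ' ', $os, 3;
    return ($genus, $species, $sub_species);
}

for my $os ('Synechocystis sp. PCC6803', 'Apple chlorotic leaf spot virus') {
    my @parts = map { defined $_ ? $_ : '(none)' } split_os_line($os);
    print join(' | ', @parts), "\n";
}
# prints: Synechocystis | sp. | PCC6803
# prints: (none) | Apple chlorotic leaf spot virus | (none)
```

A plain binomial such as 'Drosophila melanogaster' falls through the same split and simply comes back with no subspecies.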

	-H

On Thu, 2003-11-06 at 16:49, Neil Rawlings wrote:
> I am trying to use Bio::SeqIO::embl to parse EMBL database entries,
> but I have problems retrieving the organism name whenever the EMBL
> entry is for a viral sequence.  I am using the embl.pm module and a
> line such as:
> 
> my ($spec, $genus) = $entry->species->classification();
> 
> But for a virus (which doesn't have a species name - for example "apple
> chlorotic leaf spot virus") I get "Apple chlorotic" as the organism
> name.  I'm not just interested in viruses, so I'm happy when the name
> comes back as "Drosophila melanogaster".  The problem is also apparent
> for some bacteria, especially something like Synechocystis sp. PCC6803
> in which PCC6803 is lost (probably assumed to be a subspecies name). 
> 
> If a solution exists to this problem, please let me know.
> 
> ========================================================================
> Neil D. Rawlings
> Sanger Institute
> Wellcome Trust Genome Campus
> Hinxton, Cambs CB10 1SA, UK
> 
> Tel: +1223 495330
> Fax: +1223 494919
> E-mail: ndr at sanger.ac.uk
> ========================================================================
> Please visit the MEROPS database for peptidase classification.  The URL
> is: merops.sanger.ac.uk
> 
> 
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l


