[Bioperl-l] Oddness in Bio::SeqIO

Chris Fields cjfields at uiuc.edu
Wed May 10 15:46:27 UTC 2006


This actually pops up when using $seq->species->common_name; using
$seq->species->binomial chops some of the strain designations off, so really
neither one works optimally for bacterial genus-species-strain taxonomy.
Hilmar made the suggestion that it's probably best to grab the NCBI TaxID
and parse it out that way by looking it up in the taxonomy database (using
Bio::DB::Taxonomy), but at the moment that's not what Bio::SeqIO::genbank
does.  
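
Short of a full taxonomy lookup, here is a minimal sketch of the kind of splitting that would suit genus-species-strain names. It assumes the /strain and /sub_species qualifiers from the source feature are available and falls back to the ORGANISM string when they are not. The helper name split_bacterial_name is hypothetical and not part of BioPerl; this is only an illustration of the idea, not what Bio::SeqIO::genbank does.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): split a bacterial ORGANISM
# name into parts, preferring the /strain and /sub_species qualifiers
# from the source feature when they are supplied.
sub split_bacterial_name {
    my ($organism, %qual) = @_;
    my ($genus, $species, @rest) = split ' ', $organism;
    my %name = (genus => $genus, species => $species);

    # "Genus sp." / "Genus spp." means no species has been assigned yet.
    if (defined $species && $species =~ /^spp?\.$/) {
        $name{species} = undef;
    }

    # Subspecies: trust the qualifier, else look for a "subsp." marker.
    if (defined $qual{sub_species}) {
        $name{sub_species} = $qual{sub_species};
    }
    elsif (@rest >= 2 && $rest[0] eq 'subsp.') {
        $name{sub_species} = $rest[1];
    }

    # Drop the subspecies tokens so they are not mistaken for the strain.
    if (@rest && $rest[0] eq 'subsp.') {
        splice @rest, 0, 2;
    }

    # Strain: trust the qualifier, else whatever is left of the name.
    $name{strain} = defined $qual{strain} ? $qual{strain}
                  : @rest                 ? join(' ', @rest)
                  :                         undef;
    return \%name;
}

my $name = split_bacterial_name(
    'Mycobacterium avium subsp. paratuberculosis K-10',
    strain      => 'K-10',
    sub_species => 'paratuberculosis',
);
# Prints: Mycobacterium avium subsp. paratuberculosis strain K-10
printf "%s %s subsp. %s strain %s\n",
    @{$name}{qw(genus species sub_species strain)};
```

Keeping each part in its own slot means the strain is stored once, so it cannot end up repeated when the name is put back together for display.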

I wonder whether we should be trying to shove most of this stuff into species
objects right from the beginning.  Maybe we should instead store the
information in Bio::Annotation objects during parsing and then, once the
parsing/IO is finished, have a method that builds Bio::Species objects
when wanted/needed; a check against the NCBI Taxonomy database could be
added there.

Anyway, I really haven't looked at how the species lines are parsed and don't
have the time at the moment.  I may look into this as well, but not until I
get back from a conference (end of May).  Jason and Brian have been calling
for a refactoring of Bio::SeqIO::genbank for a while; maybe it's about time
to do something about it...

Chris 

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Torsten Seemann
> Sent: Tuesday, May 09, 2006 6:42 PM
> To: Chris Fields
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Oddness in Bio::SeqIO
> 
> Chris,
> 
> > I noticed an odd thing with SeqIO parsing of species lines (those
> > problematic bacterial tax names again).  I have a simple script that
> > runs output to STDOUT to generate a list of hits.  Here's what I get:
> 
> > Bacterium: Mycobacterium avium subsp. paratuberculosis K-10 paratuberculosis K-10 <--
> 
> In this case,
> 
> Genus = Mycobacterium
> Species = avium
> Subspecies = paratuberculosis
> Strain = K-10
> 
> which suggests that BioPerl is trying to handle something special,
> because the 'subsp.' is gone?
> 
> Here's the pertinent parts of the Genbank file
> (apologies for the wrapping):
> 
> LOCUS       NC_002944            4829781 bp    DNA     circular BCT 18-JAN-2006
> DEFINITION  Mycobacterium avium subsp. paratuberculosis K-10, complete
>             genome.
> SOURCE      Mycobacterium avium subsp. paratuberculosis K-10
>   ORGANISM  Mycobacterium avium subsp. paratuberculosis K-10
>             Bacteria; Actinobacteria; Actinobacteridae; Actinomycetales;
>             Corynebacterineae; Mycobacteriaceae; Mycobacterium;
>             Mycobacterium avium complex (MAC).
> 
>                      /organism="Mycobacterium avium subsp. paratuberculosis K-10"
>                      /strain="K-10"
>                      /sub_species="paratuberculosis"
> 
> 
> > Most (but not all) of the strain numbers get repeated (marked with
> > arrows).  This is actually in the GenBank file itself, downloaded via
> > Bio::DB::GenBank (and thus passed through Bio::SeqIO).  Anyone seen
> > this before?
> 
> The problem is mentioned in the wiki so it must have come up before?
> http://bioperl.org/wiki/Project_priority_list#Taxonomy_.2F_Species_data
> 
> I also deal with Bacteria mainly, and should also look into this. I
> haven't been using the genbank headers directly, only the features, so I
> never came across this.
> 
> Another thing which may crop up is when no species has been allocated
> yet but the genus is known (or something like that). In that case the
> name is written as "Genus spp.", e.g. "Gallibacterium spp."
> 
> --Torsten
> 
> 



