[Bioperl-l] Oddness in Bio::SeqIO
Torsten Seemann
torsten.seemann at infotech.monash.edu.au
Tue May 9 23:42:29 UTC 2006
Chris,
> I noticed an odd thing with SeqIO parsing of species lines (those
> problematic bacterial tax names again). I have a simple script that runs
> output to STDOUT to generate a list of hits. Here's what I get:
> Bacterium: Mycobacterium avium subsp. paratuberculosis K-10 paratuberculosis K-10 <--
In this case,
Genus = Mycobacterium
Species = avium
Subspecies = paratuberculosis
Strain = K-10
which suggests that BioPerl is trying to do something special with these
names, because the 'subsp.' is gone from the repeated part?
Here are the pertinent parts of the GenBank file
(apologies for the wrapping):
LOCUS NC_002944 4829781 bp DNA circular BCT
18-JAN-2006
DEFINITION Mycobacterium avium subsp. paratuberculosis K-10, complete
genome.
SOURCE Mycobacterium avium subsp. paratuberculosis K-10
ORGANISM Mycobacterium avium subsp. paratuberculosis K-10
Bacteria; Actinobacteria; Actinobacteridae; Actinomycetales;
Corynebacterineae; Mycobacteriaceae; Mycobacterium;
Mycobacterium
avium complex (MAC).
/organism="Mycobacterium avium subsp.
paratuberculosis K-10"
/strain="K-10"
/sub_species="paratuberculosis"
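To make the breakdown above concrete, here is a minimal Python sketch (not BioPerl code) of how the ORGANISM line could be split into genus, species, subspecies and strain, and how a display name could be built without repeating qualifiers that are already in it. The function names split_organism and display_name are hypothetical, the regular expression only covers the "Genus species [subsp. X] [strain]" pattern seen in this record, and I have not checked how Bio::SeqIO actually does it.

```python
import re

def split_organism(name):
    """Split a bacterial organism name into rough taxonomic parts.

    Assumes the pattern "Genus species [subsp. X] [strain]"; anything
    else raises ValueError.
    """
    m = re.match(
        r"(?P<genus>\S+)\s+(?P<species>\S+)"
        r"(?:\s+subsp\.\s+(?P<subspecies>\S+))?"
        r"(?:\s+(?P<strain>.+))?$",
        name.strip(),
    )
    if m is None:
        raise ValueError("unrecognised organism name: %r" % name)
    return m.groupdict()

def display_name(name, extras):
    """Append each extra qualifier only if the name lacks it already.

    The check is a crude substring test, good enough for a sketch.
    """
    out = name
    for extra in extras:
        if extra and extra not in out:
            out += " " + extra
    return out

organism = "Mycobacterium avium subsp. paratuberculosis K-10"
parts = split_organism(organism)
print(parts)
print(display_name(organism, [parts["subspecies"], parts["strain"]]))
```

With the check in display_name, the subspecies and strain from the feature qualifiers are not appended a second time when the ORGANISM line already carries them.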
> Most (but not all) of the strain numbers get repeated (marked with arrows).
> This is actually in the GenBank file itself, downloaded via Bio::DB::GenBank
> (and thus passed through Bio::SeqIO). Anyone seen this before?
The problem is mentioned on the wiki, so it must have come up before?
http://bioperl.org/wiki/Project_priority_list#Taxonomy_.2F_Species_data
I also deal mainly with Bacteria, and should look into this too. I
haven't been using the GenBank headers directly, only the features, so I
never came across this.
Another thing which may crop up is when no Species has been allocated
yet but the genus is known (or something like that). In that case the
name is written as "Genus spp.", e.g. Gallibacterium spp.
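A parser would need to recognise that case so the placeholder is not stored as a species name. A minimal Python sketch, with a hypothetical helper has_named_species; treating "sp." as a placeholder alongside "spp." is my assumption:

```python
# Tokens that stand in for a species that has not been named.
# "spp." is the form mentioned above; "sp." is an assumed variant.
PLACEHOLDERS = {"sp.", "spp."}

def has_named_species(name):
    """Return True if the second word of the name is a real species epithet."""
    tokens = name.split()
    return len(tokens) >= 2 and tokens[1] not in PLACEHOLDERS

for name in ("Gallibacterium spp.", "Mycobacterium avium"):
    print(name, "->", has_named_species(name))
```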
--Torsten