[Bioperl-l] Re: [Bioperl-guts-l] [Bug 1600] New: $gb->species->ncbi_taxid

Heikki Lehvaslaiho heikki at ebi.ac.uk
Thu Mar 11 10:06:10 EST 2004


James,

I think you are on the right track here.  Getting the taxid and the higher 
taxa is more important than the lack of binomial name. 
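
A minimal sketch of why the taxid stays recoverable even when the ORGANISM
name is not a binomial: the source feature in the quoted record carries a
/db_xref="taxon:..." qualifier. The helper name taxid_from_record is
hypothetical and not part of BioPerl; it works on the raw flat-file text
only, as an illustration rather than a replacement for the parser.

```perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl function): return the NCBI taxid found
# in a /db_xref="taxon:NNN" qualifier of a GenBank record, or undef.
sub taxid_from_record {
    my ($record) = @_;
    return $record =~ m{/db_xref="taxon:(\d+)"} ? $1 : undef;
}

# Excerpt of the record attached to bug 1600.
my $record = <<'END';
  ORGANISM  unknown marine gamma proteobacterium NOR5
            Bacteria; Proteobacteria; Gammaproteobacteria.
FEATURES             Location/Qualifiers
     source          1..1389
                     /organism="unknown marine gamma proteobacterium NOR5"
                     /db_xref="taxon:145658"
END

my $taxid = taxid_from_record($record);
print defined $taxid ? "taxid: $taxid\n" : "no taxid found\n";
```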

I am moving this discussion from guts into bioperl-l list where it belongs.

Thank you for your contribution. Putting a patch into bugzilla as an attachment 
is exactly the right thing to do.

	-Heikki

On Thursday 11 Mar 2004 14:39, James Wasmuth wrote:
> Brian and all at bioperl-guts,
>
>
> below is the comment I've added to the bug[1600].  I think it may need
> some discussion, but the patch I've added works to the extent that it
> allows creation of a Bio::Species object but the subsequent genus,
> species, subspecies calls will be 'wrong'.  Personally I'm more
> concerned with the taxid, which I think will be sufficient.
>
> If you want to see the size of this problem go to NCBI taxonomy and
> enter the term identified as a token set!  I think that maintaining the
> taxid is enough, otherwise the artificial split of terms such as
> "unidentified diatom endosymbiont of Peridinium foliaceum"
> <http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&id=42247&lvl=3&lin=f&keep=1&srchmode=3&unlock>
> may be a problem, though some of them are intuitive.
>
> One last question, I've never tried to fix a bug before, so I've
> committed a patch as an attachment to Bugzilla for the bug.  Do others
> check this and if okay place it in the code...
> apologies for the newbie bit...
>
> -james
>
>
>
> genbank.pm
>
> line 1123: return unless $genus and  $genus !~ /^(Unknown|None)$/oi;
>
> a number of species are described as Unknown blah blah blah.
>
> The NCBI taxid assigned to unknown taxa is 32644 and has a number of
> synonyms, none of which are 'unknown'.
>
> The list includes: other, unknown organism, not specified, not shown,
> unspecified, Unknown, None, unclassified , unidentified organism
>
> I've changed the _read_GenBank_Species subroutine to allow organism
> names such as 'unknown marine gamma proteobacterium NOR5'.  This will
> create a Bio::Species object, but with genus=unknown, species=marine,
> subspecies=gamma.
>
> There is a whole host of species names that ignore the nice rules in
> _read_GenBank_Species!  However this fix will allow the correct taxid to
> be provided, which I think matters more than the name!
>
>
>
> sub _read_GenBank_Species {
>     my( $self,$buffer) = @_;
>     my @organell_names = ("chloroplast", "mitochondr");
>      # only those carrying DNA, apart from the nucleus
>
>     #CHANGE
>     my @unkn_names = ('other', 'unknown organism', 'not specified',
>                       'not shown', 'Unspecified', 'Unknown', 'None',
>                       'unclassified', 'unidentified organism');
>
>     $_ = $$buffer;
>
>     my( $sub_species, $species, $genus, $common, $organelle, @class,
>         $ns_name );
>     # upon first entering the loop, we must not read a new line -- the
>     # SOURCE line is already in the buffer (HL 05/10/2000)
>     while (defined($_) || defined($_ = $self->_readline())) {
>         # de-HTMLify (links that may be encountered here don't contain
>         # escaped '>', so a simple-minded approach suffices)
>         s/<[^>]+>//g;
>         if (/^SOURCE\s+(.*)/o) {
>             # FIXME this is probably mostly wrong (e.g., it yields things like
>             # Homo sapiens adult placenta cDNA to mRNA
>             # which is certainly not what you want)
>             $common = $1;
>             $common =~ s/\.$//; # remove trailing dot
>         } elsif (/^\s{2}ORGANISM/o) {
>             my @spflds = split(' ', $_);
>             ($ns_name) = $_ =~ /\w+\s+(.*)/o;
>             shift(@spflds); # ORGANISM
>
>             if (grep { $_ =~ /^$spflds[0]/i; } @organell_names) {
>                 $organelle = shift(@spflds);
>             }
>             $genus = shift(@spflds);
>             if (@spflds) {
>                 $species = shift(@spflds);
>             } elsif (grep { lc($_) eq lc($genus) } @unkn_names) {
>                 # a lone placeholder word such as 'Unknown' or 'None';
>                 # compare against $genus (a bare grep { $genus } is
>                 # true for every element)
>                 $species = '';
>             } else {
>                 # there's no species name but it isn't unclassified
>                 $species = 'sp.';
>             }
>             $sub_species = shift(@spflds) if (@spflds);
>         } elsif (/^\s+(.+)/o) {
>             # only split on ';' or '.' so that
>             # classification that is 2 words will
>             # still get matched
>             # use map to remove trailing/leading spaces
>             push(@class, map { s/^\s+//; s/\s+$//; $_; } split /[;\.]+/, $1);
>         } else {
>             last;
>         }
>
>         $_ = undef; # Empty $_ to trigger read of next line
>     }
>
>      $$buffer = $_;
>
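>     # Don't make a species object if the name is empty or is one of the
>     # placeholder names; compare the full "genus species" string, since
>     # /^$genus$species/ has no space and cannot match 'unknown organism'
>     my $name = join ' ', grep { defined($_) && length($_) } ($genus, $species);
>     my $unkn = grep { lc($_) eq lc($name) } @unkn_names;
>
>     return unless $genus and $unkn == 0;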
>
>      # Bio::Species array needs array in Species -> Kingdom direction
>     if ($class[0] eq 'Viruses') {
>         push( @class, $ns_name );
>     }
>     elsif ($class[$#class] eq $genus) {
>         push( @class, $species );
>     } else {
>         push( @class, $genus, $species );
>     }
>     @class = reverse @class;
>
>     my $make = Bio::Species->new();
>     $make->classification( \@class, "FORCE" ); # no name validation please
>     $make->common_name( $common      ) if $common;
>     unless ($class[-1] eq 'Viruses') {
>         $make->sub_species( $sub_species ) if $sub_species;
>     }
>     $make->organelle($organelle) if $organelle;
>     return $make;
> }
>
> Brian Osborne wrote:
> >James,
> >
> >Your guess is right, no Species is made because of the name. That's
> > because genbank.pm normally looks at:
> >
> >ORGANISM Bos taurus
> >
> >And makes "Bos" the genus, and so on.
> >
> >If it sees:
> >
> >ORGANISM Unknown
> >
> >It refuses to make a Species object, and it's interpreting your ORGANISM
> >line in the same way because it can't make a valid genus, that's the
> > current rule. Personally I'd say that I agree with its principle - how
> > can we make a Species object without genus and species?
> >
> >You can get the taxid from a SeqFeature object, you already knew that.
> >
> >Brian O.
> >
> >
> >-----Original Message-----
> >From: bioperl-guts-l-bounces at portal.open-bio.org
> >[mailto:bioperl-guts-l-bounces at portal.open-bio.org]On Behalf Of
> >bugzilla-daemon at portal.open-bio.org
> >Sent: Thursday, March 11, 2004 4:21 AM
> >To: bioperl-guts-l at bioperl.org
> >Subject: [Bioperl-guts-l] [Bug 1600] New: $gb->species->ncbi_taxid
> >
> >http://bugzilla.bioperl.org/show_bug.cgi?id=1600
> >
> >           Summary: $gb->species->ncbi_taxid
> >           Product: Bioperl
> >           Version: unspecified
> >          Platform: PC
> >        OS/Version: Linux
> >            Status: NEW
> >          Severity: normal
> >          Priority: P2
> >         Component: Bio::SeqIO
> >        AssignedTo: bioperl-guts-l at bioperl.org
> >        ReportedBy: james.wasmuth at ed.ac.uk
> >
> >
> >I've included a genbank file from which I have been unable to extract the
> >ncbi_taxid using
> >
> >$gb->species->ncbi_taxid
> >
> >the error is:
> >Can't call method "ncbi_taxid" on an undefined value
> >
> >In fact I don't get a Bio::Species object.  I'm sure it's because of the
> >name, which is correct.
> >
> >I've tried looking into it, but could not find which Seq object creates
> > the Bio::Species object.
> >
> >
> >
> >LOCUS       AY007676                1389 bp    DNA     linear   BCT 29-OCT-2001
> >DEFINITION  Unknown marine gamma proteobacterium NOR5 16S ribosomal RNA,
> >            partial sequence.
> >ACCESSION   AY007676
> >VERSION     AY007676.1  GI:12000362
> >KEYWORDS    .
> >SOURCE      unknown marine gamma proteobacterium NOR5
> >  ORGANISM  unknown marine gamma proteobacterium NOR5
> >            Bacteria; Proteobacteria; Gammaproteobacteria.
> >REFERENCE   1  (bases 1 to 1389)
> >  AUTHORS   Eilers,H., Pernthaler,J., Peplies,J., Glockner,F.O., Gerdts,G. and
> >            Amann,R.
> >  TITLE     Isolation of novel pelagic bacteria from the German bight and their
> >            seasonal contributions to surface picoplankton
> >  JOURNAL   Appl. Environ. Microbiol. 67 (11), 5134-5142 (2001)
> >  MEDLINE   21536174
> >   PUBMED   11679337
> >REFERENCE   2  (bases 1 to 1389)
> >  AUTHORS   Eilers,H., Pernthaler,J., Peplies,J., Gloeckner,F.O.,
> >            Gerdts,G., Schuett,C. and Amann,R.
> >  TITLE     Identification and seasonal dominance of culturable marine bacteria
> >  JOURNAL   Unpublished
> >REFERENCE   3  (bases 1 to 1389)
> >  AUTHORS   Eilers,H., Pernthaler,J., Peplies,J., Gloeckner,F.O.,
> >            Gerdts,G., Schuett,C. and Amann,R.
> >  TITLE     Direct Submission
> >  JOURNAL   Submitted (29-AUG-2000) Molecular Ecology,
> >            Max-Planck-Institute, Celsiusstrasse 1, Bremen 28359, Germany
> >FEATURES             Location/Qualifiers
> >     source          1..1389
> >                     /organism="unknown marine gamma proteobacterium NOR5"
> >                     /mol_type="genomic DNA"
> >                     /db_xref="taxon:145658"
> >     rRNA            <1..>1389
> >                     /product="16S ribosomal RNA"
> >BASE COUNT      343 a    319 c    453 g    274 t
> >ORIGIN
> >        1 cgcgaaagta cttcggtatg agtagagcgg cggacgggtg agtaacgcgt aggaatctat
> >       61 ccagtagtgg gggacaactc ggggaaactc gagctaatac cgcatacgtc ctaagggaga
> >      121 aagcggggga tcttcggacc tcgcgctatt ggaggagcct gcgttggatt agctagttgg
> >      181 tggggtaaag gcctaccaag gcgacgatcc atagctggtc tgagaggatg atcagccaca
> >      241 ccgggactga gacacggccc ggactcctac gggaggcagc agtggggaat attgcgcaat
> >      301 gggcgaaagc ctgacgcagc catgccgcgt gtgtgaagaa ggccttcggg ttgtaaagca
> >      361 ctttcaattg ggaagaaagg ttagtagtta ataactgcta gctgtgacat tacctttaga
> >      421 agaagcaccg gctaactccg tgccagcagc cgcggtaata cggaggtgcg agcgttaatc
> >      481 ggaattactg ggcgtaaagc gcgcgtaggc ggtctgttaa gtcggatgtg aaagccccgg
> >      541 gctcaacctg ggaattgcac ccgatactgg ccgactggag tgcgagagag ggaggtagaa
> >      601 ttccacgtgt agcggtgaaa tgcgtagata tgtggaggaa taccggtggc gaaggcggcc
> >      661 tcctggctcg acactgacgc tgaggtgcga aagcgtgggg agcaaacagg attagatacc
> >      721 ctggtagtcc acgccgtaaa cgatgtctac tagccgttgg gagacttgat ttcttggtgg
> >      781 cgaagttaac gcgataagta gaccgcctgg ggagtacggc cgcaaggtta aaactcaaat
> >      841 gaattgacgg gggcccgcac aagcggtgga gcatgtggtt taattcgatg caacgcgaag
> >      901 aaccttacca ggccttgaca tcctaggaat cctgtagaga tacgggagtg ccttcgggaa
> >      961 tctagtgaca ggtgctgcat ggctgtcgtc agctcgtgtc gtgagatgtt gggttaagtc
> >     1021 ccgtaacgag cgcaaccctt gtccttagtt gccagcgcgt aatggcggga actctaagga
> >     1081 gactgccggt gacaaaccgg aggaaggtgg ggacgacgtc aagtcatcat ggcccttacg
> >     1141 gcctgggcta cacacgtgct acaatggaac gcacagaggg cagcaaaccc gcgaggggga
> >     1201 gcgaatccca caaaacgttt cgtagtccgg atcggagtct gcaactcgac tccgtgaagt
> >     1261 cggaatcgct agtaatcgtg aatcagaatg tcacggtgaa tacgttcccg ggccttgtac
> >     1321 acaccgcccg tcacaccatg ggagtgggtt gctccagaag tggttagcct aaccttcggg
> >     1381 agggcgatc
> >//
> >
> >
> >
> >------- You are receiving this mail because: -------
> >You are the assignee for the bug, or are watching the assignee.
> >_______________________________________________
> >Bioperl-guts-l mailing list
> >Bioperl-guts-l at portal.open-bio.org
> >http://portal.open-bio.org/mailman/listinfo/bioperl-guts-l
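
The name splitting discussed in the quoted thread can be sketched in a few
lines of plain Perl. This is an illustration under stated assumptions, not
the code in genbank.pm: is_placeholder_name and split_organism are
hypothetical helpers, and the placeholder list is the one given in the
quoted patch.

```perl
use strict;
use warnings;

# Placeholder names listed in the quoted patch (synonyms for taxid 32644).
my @unkn_names = ('other', 'unknown organism', 'not specified', 'not shown',
                  'Unspecified', 'Unknown', 'None', 'unclassified',
                  'unidentified organism');

# Hypothetical helper: true when the whole ORGANISM name is a placeholder.
sub is_placeholder_name {
    my ($name) = @_;
    return scalar grep { lc($_) eq lc($name) } @unkn_names;
}

# Hypothetical helper: naive whitespace split into the first three fields,
# mirroring the genus / species / subspecies assignment in the thread.
sub split_organism {
    my ($name) = @_;
    my ($genus, $species, $sub_species) = split ' ', $name;
    return ($genus, $species, $sub_species);
}

for my $name ('Bos taurus', 'Unknown',
              'unknown marine gamma proteobacterium NOR5') {
    if (is_placeholder_name($name)) {
        print "$name: placeholder, no species object\n";
        next;
    }
    my ($g, $s, $ss) = split_organism($name);
    printf "%s: genus=%s species=%s subspecies=%s\n",
        $name, $g, $s // '', $ss // '';
}
```

Comparing the whole name against the list, rather than only its first word,
lets 'Unknown' be rejected while 'unknown marine gamma proteobacterium NOR5'
still goes through, with the 'wrong' genus/species fields noted in the thread.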

-- 
______ _/      _/_____________________________________________________
      _/      _/                      http://www.ebi.ac.uk/mutations/
     _/  _/  _/  Heikki Lehvaslaiho    heikki_at_ebi ac uk
    _/_/_/_/_/  EMBL Outstation, European Bioinformatics Institute
   _/  _/  _/  Wellcome Trust Genome Campus, Hinxton
  _/  _/  _/  Cambs. CB10 1SD, United Kingdom
     _/      Phone: +44 (0)1223 494 644   FAX: +44 (0)1223 494 468
___ _/_/_/_/_/________________________________________________________

