[Bioperl-l] Re: New: $gb->species->ncbi_taxid
Heikki Lehvaslaiho
heikki at ebi.ac.uk
Mon Mar 15 07:59:58 EST 2004
James,
Would it make sense to always leave genus() undef when the species name
is tentative and does not follow the binomial rule? That would then provide a
way to test for "well-behaved", standard species objects.
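
A minimal sketch of that idea in plain Perl, not the actual genbank.pm code.
It assumes a hypothetical helper split_organism() (not part of BioPerl), a
simple capitalisation heuristic for what counts as a binomial name, and an
illustrative, incomplete list of "tentative" first words:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioPerl: split an ORGANISM name into
# genus/species/sub_species, leaving genus undef for non-binomial names.
sub split_organism {
    my ($name) = @_;
    my @words = split ' ', $name;

    # Assumption: a binomial starts with a capitalised genus that is not a
    # placeholder word, followed by a lower-case epithet or an abbreviation
    # such as "sp." or "var.".
    my $binomial = @words >= 2
        && $words[0] =~ /^[A-Z][a-z]+$/
        && $words[0] !~ /^(?:unknown|uncultured|unidentified)$/i
        && $words[1] =~ /^[a-z][a-z.-]*$/;

    my %parts = (genus => undef, species => $name, sub_species => undef);

    if ($binomial) {
        my ($genus, $species, @rest) = @words;
        $parts{genus}   = $genus;
        $parts{species} = $species;
        # everything left over goes into the subspecies
        $parts{sub_species} = join(' ', @rest) if @rest;
    }
    return \%parts;
}

# A caller can then test for a "well-behaved" species object:
for my $organism ('Bos taurus', 'marine bacterium HP3',
                  q{Drosophila sp. 'white tip scutellum'}) {
    my $sp = split_organism($organism);
    printf "%-40s %s\n", $organism,
        defined $sp->{genus} ? "genus=$sp->{genus}" : 'genus undef (tentative)';
}
```

With this convention, "defined $sp->{genus}" is the whole test; the taxid
would still have to come from elsewhere (e.g. the source feature).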
-Heikki
On Monday 15 Mar 2004 12:00, James Wasmuth wrote:
> Dear All,
>
> As no-one objected, I'm going ahead with the plan:
>
>
> ORGANISM: unknown marine gamma proteobacterium NOR5
> $genus = "not given" and $species = "unknown marine gamma
> proteobacterium NOR5"
>
> ORGANISM: uncultured gamma proteobacterium
> $genus = "not given" and $species= "uncultured gamma proteobacterium"
> <http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&id=86473&lvl=3&p=17&p=20&p=37&p=38&lin=f&keep=1&srchmode=3&unlock>
>
> ORGANISM: leaf litter basidiomycete sp. Collb2-39
> $genus = "leaf", $species="litter", $subspecies="basidiomycete sp.
> Collb2-39"
>
> ORGANISM: Drosophila sp. 'white tip scutellum'
> $genus = "Drosophila", $species="sp.", $subspecies="'white tip scutellum'"
>
> At present: $genus = "Drosophila", $species="sp.", $subspecies="'white"
>
>
> ORGANISM: marine bacterium HP3
> $genus = "marine", $species="bacterium", $subspecies="HP3"
>
> for the plant people, I'll make var. (varietas) similar to sp.
>
> I'm not completely happy with the third example but if someone could
> suggest something better and robust, then I'll code it.
>
> The organisation of the taxonomy is suitably complex, so I think that
> the user should be given some credit, and if they see that the genus is
> "marine", then they can investigate further...
>
>
> -james
>
> James Wasmuth wrote:
> > Brian,
> >
> >> Yes, I could read your patch but I'm lazy. You said:
> >>>> create a Bio::Species object, but the genus=unknown species=marine
> >>>> subspecies=gamma.
> >
> > I was highlighting the problem my patch would create. I hadn't
> > thought too hard about its consequences, but realise that it may have
> > some knock-on effects.
> >
> >> Shouldn't the values be the same for all these "species" for which
> >> the genus is not known? Like:
> >>
> >> Genus=unknown, species=unknown, subspecies=unknown
> >>
> >> That way you can check, since one can no longer use "unless defined
> >> $species_object" to see if real species information is lacking or
> >> not. Have I missed something here?
> >
> > NCBI taxonomy considers the term 'unknown marine gamma proteobacterium
> > NOR5' to represent a species, though for this example the remaining
> > taxonomy classification is awarded no rank until 'class'.
> > So one possible fix would be:
> >
> > if (ORGANISM ne "synonym for taxid 32644") {
> > then add rest of name into $species.
> > }
> >
> > therefore $genus = unknown and $species = marine gamma
> > proteobacterium NOR5
> >
> > other problems are organisms such as "leaf litter basidiomycete sp.
> > Collb2-39". Currently $genus = leaf, $species = litter and $subsp =
> > basidiomycete. Perhaps $subsp should contain everything left over?
> > Thoughts? Also does anyone know off hand if it copes with 'varietas'
> > and 'var.' for plants? I expect not. I will have a look at the
> > genbank.pm on Monday and suggest a patch. I expect these issues will
> > also be pertinent in embl.pm and other database format modules...
> >
> >
> > hmmmm, anyone with thoughts?
> >
> > -james
> >
> >> Brian O.
> >>
> >>
> >> -----Original Message-----
> >> From: James Wasmuth [mailto:james.wasmuth at ed.ac.uk]
> >> Sent: Thursday, March 11, 2004 9:40 AM
> >> To: Brian Osborne
> >> Cc: bioperl-guts-l at bioperl.org
> >> Subject: Re: [Bioperl-guts-l] [Bug 1600] New: $gb->species->ncbi_taxid
> >>
> >> Brian and all at bioperl-guts,
> >>
> >>
> >> below is the comment I've added to the bug[1600]. I think it may need
> >> some discussion, but the patch I've added works to the extent that it
> >> allows creation of a Bio::Species object but the subsequent genus,
> >> species, subspecies calls will be 'wrong'. Personally I'm more
> >> concerned with the taxid, which I think will be sufficient.
> >>
> >> If you want to see the size of this problem go to NCBI taxonomy and
> >> enter the term identified as a token set! I think that maintaining the
> >> taxid is enough, otherwise the artificial split of terms such as
> >> 'unidentified diatom endosymbiont of Peridinium foliaceum'
> >> <http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&id=42247&lvl=3&lin=f&keep=1&srchmode=3&unlock>
> >> may be a problem, though some of them are intuitive.
> >>
> >> One last question: I've never tried to fix a bug before, so I've
> >> committed a patch as an attachment to Bugzilla for the bug. Do others
> >> check this and, if okay, place it in the code?
> >> Apologies for the newbie bit...
> >>
> >> -james
> >>
> >>
> >>
> >> genbank.pm
> >>
> >> line 1123: return unless $genus and $genus !~ /^(Unknown|None)$/oi;
> >>
> >> a number of species are described as Unknown blah blah blah.
> >>
> >> The NCBI taxid assigned to unknown taxa is 32644 and has a number of
> >> synonyms, none of which are 'unknown'.
> >>
> >> The list includes: other, unknown organism, not specified, not shown,
> >> unspecified, Unknown, None, unclassified , unidentified organism
> >>
> >> I've changed the _read_GenBank_Species subroutine to allow organism
> >> names such as 'unknown marine gamma proteobacterium NOR5'. This will
> >> create a Bio::Species object, but the genus=unknown species=marine
> >> subspecies=gamma.
> >>
> >> There is a whole host of species names that ignore the nice rules in
> >> _read_GenBank_Species! However this fix will allow the correct taxid to
> >> be provided which I think is more than the name!
> >>
> >>
> >>
> >> sub _read_GenBank_Species {
> >> my( $self,$buffer) = @_;
> >> my @organell_names = ("chloroplast", "mitochondr");
> >> # only those carrying DNA, apart from the nucleus
> >>
> >> #CHANGE
> >> my @unkn_names = ('other', 'unknown organism', 'not specified',
> >> 'not shown', 'Unspecified', 'Unknown', 'None', 'unclassified',
> >> 'unidentified organism');
> >>
> >> $_ = $$buffer;
> >>
> >> my( $sub_species, $species, $genus, $common, $organelle, @class,
> >> $ns_name );
> >> # upon first entering the loop, we must not read a new line -- the
> >> # SOURCE line is already in the buffer (HL 05/10/2000)
> >> while (defined($_) || defined($_ = $self->_readline())) {
> >> # de-HTMLify (links that may be encountered here don't contain
> >> # escaped '>', so a simple-minded approach suffices)
> >> s/<[^>]+>//g;
> >> if (/^SOURCE\s+(.*)/o) {
> >> # FIXME this is probably mostly wrong (e.g., it yields things like
> >> # Homo sapiens adult placenta cDNA to mRNA
> >> # which is certainly not what you want)
> >> $common = $1;
> >> $common =~ s/\.$//; # remove trailing dot
> >> } elsif (/^\s{2}ORGANISM/o) {
> >> my @spflds = split(' ', $_);
> >> ($ns_name) = $_ =~ /\w+\s+(.*)/o;
> >> shift(@spflds); # ORGANISM
> >>
> >> if(grep { $_ =~ /^$spflds[0]/i; } @organell_names) {
> >> $organelle = shift(@spflds);
> >> }
> >> $genus = shift(@spflds);
> >> if(@spflds) {
> >> $species = shift(@spflds);
> >> } elsif ( grep { lc($_) eq lc($genus) } @unkn_names ) {
> >> # the lone word is one of the 'unknown' synonyms
> >> $species = '';
> >> } else {
> >> # there's no species name but it isn't unclassified
> >> $species = 'sp.';
> >> }
> >> $sub_species = shift(@spflds) if(@spflds);
> >> } elsif (/^\s+(.+)/o) {
> >> # only split on ';' or '.' so that
> >> # classification that is 2 words will
> >> # still get matched
> >> # use map to remove trailing/leading spaces
> >> push(@class, map { s/^\s+//; s/\s+$//; $_; } split /[;\.]+/,
> >> $1);
> >> } else {
> >> last;
> >> }
> >>
> >> $_ = undef; # Empty $_ to trigger read of next line
> >> }
> >>
> >> $$buffer = $_;
> >>
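> >> # Don't make a species object if the name is empty or is one of the
> >> # 'unknown' synonyms (exact, case-insensitive match on "genus species")
> >> my $name = join ' ', grep { defined && length } $genus, $species;
> >> my $unkn = grep { lc($_) eq lc($name) } @unkn_names;
> >>
> >> return unless $genus and $unkn==0;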
> >>
> >> # Bio::Species array needs array in Species -> Kingdom direction
> >> if ($class[0] eq 'Viruses') {
> >> push( @class, $ns_name );
> >> }
> >> elsif ($class[$#class] eq $genus) {
> >> push( @class, $species );
> >> } else {
> >> push( @class, $genus, $species );
> >> }
> >> @class = reverse @class;
> >>
> >> my $make = Bio::Species->new();
> >> $make->classification( \@class, "FORCE" ); # no name validation please
> >> $make->common_name( $common ) if $common;
> >> unless ($class[-1] eq 'Viruses') {
> >> $make->sub_species( $sub_species ) if $sub_species;
> >> }
> >> $make->organelle($organelle) if $organelle;
> >> return $make;
> >> }
> >>
> >> Brian Osborne wrote:
> >>> James,
> >>>
> >>> Your guess is right, no Species is made because of the name. That's
> >>> because genbank.pm normally looks at:
> >>>
> >>> ORGANISM Bos taurus
> >>>
> >>> And makes "Bos" the genus, and so on.
> >>>
> >>> If it sees:
> >>>
> >>> ORGANISM Unknown
> >>>
> >>> It refuses to make a Species object, and it's interpreting your
> >>> ORGANISM line in the same way because it can't make a valid genus;
> >>> that's the current rule. Personally I'd say that I agree with its
> >>> principle - how can we make a Species object without genus and species?
> >>>
> >>> You can get the taxid from a SeqFeature object, you already knew that.
> >>>
> >>> Brian O.
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: bioperl-guts-l-bounces at portal.open-bio.org
> >>> [mailto:bioperl-guts-l-bounces at portal.open-bio.org]On Behalf Of
> >>> bugzilla-daemon at portal.open-bio.org
> >>> Sent: Thursday, March 11, 2004 4:21 AM
> >>> To: bioperl-guts-l at bioperl.org
> >>> Subject: [Bioperl-guts-l] [Bug 1600] New: $gb->species->ncbi_taxid
> >>>
> >>> http://bugzilla.bioperl.org/show_bug.cgi?id=1600
> >>>
> >>> Summary: $gb->species->ncbi_taxid
> >>> Product: Bioperl
> >>> Version: unspecified
> >>> Platform: PC
> >>> OS/Version: Linux
> >>> Status: NEW
> >>> Severity: normal
> >>> Priority: P2
> >>> Component: Bio::SeqIO
> >>> AssignedTo: bioperl-guts-l at bioperl.org
> >>> ReportedBy: james.wasmuth at ed.ac.uk
> >>>
> >>>
> >>> I've included a GenBank file from which I have been unable to extract
> >>> the ncbi_taxid using
> >>>
> >>> $gb->species->ncbi_taxid
> >>>
> >>> the error is:
> >>> Can't call method "ncbi_taxid" on an undefined value
> >>>
> >>> In fact I don't get a Bio::Species object. I'm sure it's because of the
> >>> name, which is correct.
> >>>
> >>> I've tried looking into it, but could not find which Seq object
> >>> creates the
> >>> Bio::Species object.
> >>>
> >>>
> >>>
> >>> LOCUS AY007676 1389 bp DNA linear BCT 29-OCT-2001
> >>> DEFINITION Unknown marine gamma proteobacterium NOR5 16S ribosomal RNA,
> >>> partial sequence.
> >>> ACCESSION AY007676
> >>> VERSION AY007676.1 GI:12000362
> >>> KEYWORDS .
> >>> SOURCE unknown marine gamma proteobacterium NOR5
> >>> ORGANISM unknown marine gamma proteobacterium NOR5
> >>> Bacteria; Proteobacteria; Gammaproteobacteria.
> >>> REFERENCE 1 (bases 1 to 1389)
> >>> AUTHORS Eilers,H., Pernthaler,J., Peplies,J., Glockner,F.O., Gerdts,G.
> >>> and Amann,R.
> >>> TITLE Isolation of novel pelagic bacteria from the German bight and
> >>> their seasonal contributions to surface picoplankton
> >>> JOURNAL Appl. Environ. Microbiol. 67 (11), 5134-5142 (2001)
> >>> MEDLINE 21536174
> >>> PUBMED 11679337
> >>> REFERENCE 2 (bases 1 to 1389)
> >>> AUTHORS Eilers,H., Pernthaler,J., Peplies,J., Gloeckner,F.O., Gerdts,G.,
> >>> Schuett,C. and Amann,R.
> >>> TITLE Identification and seasonal dominance of culturable marine
> >>> bacteria
> >>> JOURNAL Unpublished
> >>> REFERENCE 3 (bases 1 to 1389)
> >>> AUTHORS Eilers,H., Pernthaler,J., Peplies,J., Gloeckner,F.O., Gerdts,G.,
> >>> Schuett,C. and Amann,R.
> >>> TITLE Direct Submission
> >>> JOURNAL Submitted (29-AUG-2000) Molecular Ecology, Max-Planck-Institute,
> >>> Celsiusstrasse 1, Bremen 28359, Germany
> >>> FEATURES Location/Qualifiers
> >>> source 1..1389
> >>> /organism="unknown marine gamma proteobacterium NOR5"
> >>> /mol_type="genomic DNA"
> >>> /db_xref="taxon:145658"
> >>> rRNA <1..>1389
> >>> /product="16S ribosomal RNA"
> >>> BASE COUNT 343 a 319 c 453 g 274 t
> >>> ORIGIN
> >>> 1 cgcgaaagta cttcggtatg agtagagcgg cggacgggtg agtaacgcgt
> >>> aggaatctat
> >>> 61 ccagtagtgg gggacaactc ggggaaactc gagctaatac cgcatacgtc
> >>> ctaagggaga
> >>> 121 aagcggggga tcttcggacc tcgcgctatt ggaggagcct gcgttggatt
> >>> agctagttgg
> >>> 181 tggggtaaag gcctaccaag gcgacgatcc atagctggtc tgagaggatg
> >>> atcagccaca
> >>> 241 ccgggactga gacacggccc ggactcctac gggaggcagc agtggggaat
> >>> attgcgcaat
> >>> 301 gggcgaaagc ctgacgcagc catgccgcgt gtgtgaagaa ggccttcggg
> >>> ttgtaaagca
> >>> 361 ctttcaattg ggaagaaagg ttagtagtta ataactgcta gctgtgacat
> >>> tacctttaga
> >>> 421 agaagcaccg gctaactccg tgccagcagc cgcggtaata cggaggtgcg
> >>> agcgttaatc
> >>> 481 ggaattactg ggcgtaaagc gcgcgtaggc ggtctgttaa gtcggatgtg
> >>> aaagccccgg
> >>> 541 gctcaacctg ggaattgcac ccgatactgg ccgactggag tgcgagagag
> >>> ggaggtagaa
> >>> 601 ttccacgtgt agcggtgaaa tgcgtagata tgtggaggaa taccggtggc
> >>> gaaggcggcc
> >>> 661 tcctggctcg acactgacgc tgaggtgcga aagcgtgggg agcaaacagg
> >>> attagatacc
> >>> 721 ctggtagtcc acgccgtaaa cgatgtctac tagccgttgg gagacttgat
> >>> ttcttggtgg
> >>> 781 cgaagttaac gcgataagta gaccgcctgg ggagtacggc cgcaaggtta
> >>> aaactcaaat
> >>> 841 gaattgacgg gggcccgcac aagcggtgga gcatgtggtt taattcgatg
> >>> caacgcgaag
> >>> 901 aaccttacca ggccttgaca tcctaggaat cctgtagaga tacgggagtg
> >>> ccttcgggaa
> >>> 961 tctagtgaca ggtgctgcat ggctgtcgtc agctcgtgtc gtgagatgtt
> >>> gggttaagtc
> >>> 1021 ccgtaacgag cgcaaccctt gtccttagtt gccagcgcgt aatggcggga
> >>> actctaagga
> >>> 1081 gactgccggt gacaaaccgg aggaaggtgg ggacgacgtc aagtcatcat
> >>> ggcccttacg
> >>> 1141 gcctgggcta cacacgtgct acaatggaac gcacagaggg cagcaaaccc
> >>> gcgaggggga
> >>> 1201 gcgaatccca caaaacgttt cgtagtccgg atcggagtct gcaactcgac
> >>> tccgtgaagt
> >>> 1261 cggaatcgct agtaatcgtg aatcagaatg tcacggtgaa tacgttcccg
> >>> ggccttgtac
> >>> 1321 acaccgcccg tcacaccatg ggagtgggtt gctccagaag tggttagcct
> >>> aaccttcggg
> >>> 1381 agggcgatc
> >>> //
> >>>
> >>>
> >>>
> >>> ------- You are receiving this mail because: -------
> >>> You are the assignee for the bug, or are watching the assignee.
> >>> _______________________________________________
> >>> Bioperl-guts-l mailing list
> >>> Bioperl-guts-l at portal.open-bio.org
> >>> http://portal.open-bio.org/mailman/listinfo/bioperl-guts-l
> >>
> >> --
> >> "I have not failed. I've just found 10,000 ways that don't work."
> >> --- Thomas Edison
> >>
> >> Nematode Bioinformatics ||
> >> Blaxter Nematode Genomics Group ||
> >> School of Biological Sciences ||
> >> Ashworth Laboratories ||
> >> King's Buildings || tel: +44 131 650 7403
> >> University of Edinburgh || web: www.nematodes.org
> >> Edinburgh ||
> >> EH9 3JT ||
> >> UK ||
--
______ _/ _/_____________________________________________________
_/ _/ http://www.ebi.ac.uk/mutations/
_/ _/ _/ Heikki Lehvaslaiho heikki_at_ebi ac uk
_/_/_/_/_/ EMBL Outstation, European Bioinformatics Institute
_/ _/ _/ Wellcome Trust Genome Campus, Hinxton
_/ _/ _/ Cambs. CB10 1SD, United Kingdom
_/ Phone: +44 (0)1223 494 644 FAX: +44 (0)1223 494 468
___ _/_/_/_/_/________________________________________________________