[Bioperl-l] Re: [Bioperl-guts-l] [Bug 1600] New: $gb->species->ncbi_taxid

James Wasmuth james.wasmuth at ed.ac.uk
Sat Mar 13 10:34:01 EST 2004


Brian,

>Yes, I could read your patch but I'm lazy. You said:
>
>>>create a Bio::Species object, but the genus=unknown species=marine
>>>subspecies=gamma.
>
I was highlighting the problem my patch would create.  I hadn't thought 
too hard about its consequences, but realise that it may have some 
knock-on effects.

>
>Shouldn't the values be the same for all these "species" for which the genus
>is not known? Like:
>
>Genus=unknown, species=unknown, subspecies=unknown
>
>That way you can check, since one can no longer use "unless defined
>$species_object" to see if real species information is lacking or not. Have
>I missed something here?
>

NCBI taxonomy considers the term 'unknown marine gamma proteobacterium 
NOR5' to represent a species, though for this example the remaining 
taxonomic classification is awarded no rank until 'class'. 

So one possible fix would be:

if (the ORGANISM name is not itself a synonym for taxid 32644)    {
    put the rest of the name into $species
}

so that $genus = unknown  and $species = marine gamma proteobacterium NOR5
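
A rough sketch of that rule in plain Perl, outside of genbank.pm. The helper
name split_organism is made up for illustration, and the synonym list is the
one given in the bug comment quoted further down; treat it as one possible
reading of the idea rather than the actual patch.

```perl
use strict;
use warnings;

# Synonyms for taxid 32644, as listed in the bug comment quoted below.
my @unkn_names = ('other', 'unknown organism', 'not specified', 'not shown',
                  'unspecified', 'Unknown', 'None', 'unclassified',
                  'unidentified organism');

# Hypothetical helper (not part of genbank.pm).
# Returns (genus, species), or an empty list when the whole name is
# just a synonym for "unknown".
sub split_organism {
    my ($name) = @_;
    # A bare synonym for taxid 32644 carries no species information.
    return if grep { lc($_) eq lc($name) } @unkn_names;
    # First word becomes the genus, the rest of the name the species.
    my ($genus, $species) = split ' ', $name, 2;
    return ($genus, defined $species ? $species : '');
}

my ($genus, $species) =
    split_organism('unknown marine gamma proteobacterium NOR5');
print "genus=$genus species=$species\n";
# prints: genus=unknown species=marine gamma proteobacterium NOR5
```

With this reading, a bare "Unknown" or "unknown organism" still yields no
genus or species, while longer names that merely start with "unknown" are
kept whole.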

Other problems are organisms such as "leaf litter basidiomycete sp. 
Collb2-39".   Currently $genus = leaf, $species = litter and $subsp = 
basidiomycete.  Perhaps $subsp should contain everything left over?  
Thoughts?  Also, does anyone know offhand if it copes with 'varietas' 
and 'var.' for plants?  I expect not.  I will have a look at 
genbank.pm on Monday and suggest a patch.  I expect these issues will 
also be pertinent in embl.pm and other database format modules...
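
A minimal sketch of the "everything left over" idea, again in plain Perl
rather than inside genbank.pm. The helper name split_with_leftover and the
"variety" key are invented here, and the 'var.'/'varietas' check is only an
assumption about what such a test could look like (one word following the
marker).

```perl
use strict;
use warnings;

# Hypothetical helper: first word -> genus, second word -> species,
# everything left over -> sub_species.
sub split_with_leftover {
    my ($name) = @_;
    my ($genus, $species, $rest) = split ' ', $name, 3;
    my %parts = (genus => $genus, species => $species, sub_species => $rest);
    # Assumption: a plant variety appears as "var." or "varietas"
    # followed by a single word at the start of the leftover text.
    if (defined $rest && $rest =~ /^(?:var\.|varietas)\s+(\S+)/) {
        $parts{variety} = $1;
    }
    return \%parts;
}

my $p = split_with_leftover('leaf litter basidiomycete sp. Collb2-39');
print "$p->{genus} | $p->{species} | $p->{sub_species}\n";
# prints: leaf | litter | basidiomycete sp. Collb2-39
```

Note that for this organism the genus still comes out as "leaf", so keeping
the leftover text only stops the name being truncated; it does not make the
genus meaningful.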


hmmmm, anyone with thoughts?

-james







>
>Brian O.
>
>
>-----Original Message-----
>From: James Wasmuth [mailto:james.wasmuth at ed.ac.uk]
>Sent: Thursday, March 11, 2004 9:40 AM
>To: Brian Osborne
>Cc: bioperl-guts-l at bioperl.org
>Subject: Re: [Bioperl-guts-l] [Bug 1600] New: $gb->species->ncbi_taxid
>
>Brian and all at bioperl-guts,
>
>
>below is the comment I've added to the bug[1600].  I think it may need
>some discussion, but the patch I've added works to the extent that it
>allows creation of a Bio::Species object but the subsequent genus,
>species, subspecies calls will be 'wrong'.  Personally I'm more
>concerned with the taxid, which I think will be sufficient.
>
>If you want to see the size of this problem go to NCBI taxonomy and
>enter the term identified as a token set!  I think that maintaining the
>taxid is enough, otherwise the artificial split of terms such as
>"unidentified diatom endosymbiont of Peridinium foliaceum"
><http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&id=42247&lvl=3&lin=f&keep=1&srchmode=3&unlock>
>may be a problem, though some of them are intuitive.
>
>One last question, I've never tried to fix a bug before, so I've
>committed a patch as an attachment to Bugzilla for the bug.  Do others
>check this and, if okay, place it in the code?
>Apologies for the newbie bit...
>
>-james
>
>
>
>genbank.pm
>
>line 1123: return unless $genus and  $genus !~ /^(Unknown|None)$/oi;
>
>a number of species are described as Unknown blah blah blah.
>
>The NCBI taxid assigned to unknown taxa is 32644 and has a number of
>synonyms, none of which are 'unknown'.
>
>The list includes: other, unknown organism, not specified, not shown,
>unspecified, Unknown, None, unclassified , unidentified organism
>
>I've changed the _read_GenBank_Species subroutine to allow organism
>names such as 'unknown marine gamma proteobacterium NOR5'.  This will
>create a Bio::Species object, but the genus=unknown species=marine
>subspecies=gamma.
>
>There is a whole host of species names that ignore the nice rules in
>_read_GenBank_Species!  However this fix will allow the correct taxid to
>be provided, which I think matters more than the name!
>
>
>
>sub _read_GenBank_Species {
>    my( $self,$buffer) = @_;
>    my @organell_names = ("chloroplast", "mitochondr");
>     # only those carrying DNA, apart from the nucleus
>
>    #CHANGE
>     my @unkn_names=("other", 'unknown organism', 'not specified',
>         'not shown', 'Unspecified', 'Unknown', 'None', 'unclassified',
>         'unidentified organism');
>
>    $_ = $$buffer;
>
>    my( $sub_species, $species, $genus, $common, $organelle, @class,
>        $ns_name );
>    # upon first entering the loop, we must not read a new line -- the
>    # SOURCE line is already in the buffer (HL 05/10/2000)
>    while (defined($_) || defined($_ = $self->_readline())) {
>    # de-HTMLify (links that may be encountered here don't contain
>    # escaped '>', so a simple-minded approach suffices)
>        s/<[^>]+>//g;
>    if (/^SOURCE\s+(.*)/o) {
>        # FIXME this is probably mostly wrong (e.g., it yields things like
>        # Homo sapiens adult placenta cDNA to mRNA
>        # which is certainly not what you want)
>        $common = $1;
>        $common =~ s/\.$//; # remove trailing dot
>    } elsif (/^\s{2}ORGANISM/o) {
>        my @spflds = split(' ', $_);
>            ($ns_name) = $_ =~ /\w+\s+(.*)/o;
>        shift(@spflds); # ORGANISM
>
>         if(grep { $_ =~ /^$spflds[0]/i; } @organell_names) {
>        $organelle = shift(@spflds);
>        }
>            $genus = shift(@spflds);
>        if(@spflds) {
>        $species = shift(@spflds);
>        } elsif ( grep { lc($_) eq lc($genus) } @unkn_names){
>        $species = '';
>        } else {$species='sp.';}      # no species name, but it isn't unclassified
>        $sub_species = shift(@spflds) if(@spflds);
>        } elsif (/^\s+(.+)/o) {
>        # only split on ';' or '.' so that
>        # classification that is 2 words will
>        # still get matched
>        # use map to remove trailing/leading spaces
>            push(@class, map { s/^\s+//; s/\s+$//; $_; }
>                 split /[;\.]+/, $1);
>        } else {
>            last;
>        }
>
>        $_ = undef; # Empty $_ to trigger read of next line
>    }
>
>     $$buffer = $_;
>
>     # Don't make a species object if the whole name is empty or is just a
>     # synonym for unknown, such as "Unknown" or "None"
>    my $unkn = grep { lc($_) eq lc($ns_name || '') } @unkn_names;
>
>     return unless $genus and  $unkn==0;
>
>     # Bio::Species array needs array in Species -> Kingdom direction
>    if ($class[0] eq 'Viruses') {
>        push( @class, $ns_name );
>    }
>    elsif ($class[$#class] eq $genus) {
>        push( @class, $species );
>    } else {
>        push( @class, $genus, $species );
>    }
>    @class = reverse @class;
>
>    my $make = Bio::Species->new();
>    $make->classification( \@class, "FORCE" ); # no name validation please
>    $make->common_name( $common      ) if $common;
>    unless ($class[-1] eq 'Viruses') {
>        $make->sub_species( $sub_species ) if $sub_species;
>    }
>    $make->organelle($organelle) if $organelle;
>    return $make;
>}
>
>
>
>
>Brian Osborne wrote:
>
>  
>
>>James,
>>
>>Your guess is right, no Species is made because of the name. That's because
>>genbank.pm normally looks at:
>>
>>ORGANISM Bos taurus
>>
>>And makes "Bos" the genus, and so on.
>>
>>If it sees:
>>
>>ORGANISM Unknown
>>
>>It refuses to make a Species object, and it's interpreting your ORGANISM
>>line in the same way because it can't make a valid genus, that's the current
>>rule. Personally I'd say that I agree with its principle - how can we make a
>>Species object without genus and species?
>>
>>You can get the taxid from a SeqFeature object, you already knew that.
>>
>>Brian O.
>>
>>
>>-----Original Message-----
>>From: bioperl-guts-l-bounces at portal.open-bio.org
>>[mailto:bioperl-guts-l-bounces at portal.open-bio.org]On Behalf Of
>>bugzilla-daemon at portal.open-bio.org
>>Sent: Thursday, March 11, 2004 4:21 AM
>>To: bioperl-guts-l at bioperl.org
>>Subject: [Bioperl-guts-l] [Bug 1600] New: $gb->species->ncbi_taxid
>>
>>http://bugzilla.bioperl.org/show_bug.cgi?id=1600
>>
>>          Summary: $gb->species->ncbi_taxid
>>          Product: Bioperl
>>          Version: unspecified
>>         Platform: PC
>>       OS/Version: Linux
>>           Status: NEW
>>         Severity: normal
>>         Priority: P2
>>        Component: Bio::SeqIO
>>       AssignedTo: bioperl-guts-l at bioperl.org
>>       ReportedBy: james.wasmuth at ed.ac.uk
>>
>>
>>I've included a genbank file from which I have been unable to extract the
>>ncbi_taxid using
>>
>>$gb->species->ncbi_taxid
>>
>>the error is:
>>Can't call method "ncbi_taxid" on an undefined value
>>
>>in fact I don't get a Bio::Species object.  I'm sure it's because of the name,
>>which is correct.
>>
>>I've tried looking into it, but could not find which Seq object creates the
>>Bio::Species object.
>>
>>
>>
>>LOCUS       AY007676                1389 bp    DNA     linear   BCT 29-OCT-2001
>>DEFINITION  Unknown marine gamma proteobacterium NOR5 16S ribosomal RNA,
>>           partial sequence.
>>ACCESSION   AY007676
>>VERSION     AY007676.1  GI:12000362
>>KEYWORDS    .
>>SOURCE      unknown marine gamma proteobacterium NOR5
>> ORGANISM  unknown marine gamma proteobacterium NOR5
>>           Bacteria; Proteobacteria; Gammaproteobacteria.
>>REFERENCE   1  (bases 1 to 1389)
>> AUTHORS   Eilers,H., Pernthaler,J., Peplies,J., Glockner,F.O., Gerdts,G. and
>>           Amann,R.
>> TITLE     Isolation of novel pelagic bacteria from the German bight and their
>>           seasonal contributions to surface picoplankton
>> JOURNAL   Appl. Environ. Microbiol. 67 (11), 5134-5142 (2001)
>> MEDLINE   21536174
>>  PUBMED   11679337
>>REFERENCE   2  (bases 1 to 1389)
>> AUTHORS   Eilers,H., Pernthaler,J., Peplies,J., Gloeckner,F.O., Gerdts,G.,
>>           Schuett,C. and Amann,R.
>> TITLE     Identification and seasonal dominance of culturable marine bacteria
>> JOURNAL   Unpublished
>>REFERENCE   3  (bases 1 to 1389)
>> AUTHORS   Eilers,H., Pernthaler,J., Peplies,J., Gloeckner,F.O., Gerdts,G.,
>>           Schuett,C. and Amann,R.
>> TITLE     Direct Submission
>> JOURNAL   Submitted (29-AUG-2000) Molecular Ecology, Max-Planck-Institute,
>>           Celsiusstrasse 1, Bremen 28359, Germany
>>FEATURES             Location/Qualifiers
>>    source          1..1389
>>                    /organism="unknown marine gamma proteobacterium NOR5"
>>                    /mol_type="genomic DNA"
>>                    /db_xref="taxon:145658"
>>    rRNA            <1..>1389
>>                    /product="16S ribosomal RNA"
>>BASE COUNT      343 a    319 c    453 g    274 t
>>ORIGIN
>>       1 cgcgaaagta cttcggtatg agtagagcgg cggacgggtg agtaacgcgt aggaatctat
>>      61 ccagtagtgg gggacaactc ggggaaactc gagctaatac cgcatacgtc ctaagggaga
>>     121 aagcggggga tcttcggacc tcgcgctatt ggaggagcct gcgttggatt agctagttgg
>>     181 tggggtaaag gcctaccaag gcgacgatcc atagctggtc tgagaggatg atcagccaca
>>     241 ccgggactga gacacggccc ggactcctac gggaggcagc agtggggaat attgcgcaat
>>     301 gggcgaaagc ctgacgcagc catgccgcgt gtgtgaagaa ggccttcggg ttgtaaagca
>>     361 ctttcaattg ggaagaaagg ttagtagtta ataactgcta gctgtgacat tacctttaga
>>     421 agaagcaccg gctaactccg tgccagcagc cgcggtaata cggaggtgcg agcgttaatc
>>     481 ggaattactg ggcgtaaagc gcgcgtaggc ggtctgttaa gtcggatgtg aaagccccgg
>>     541 gctcaacctg ggaattgcac ccgatactgg ccgactggag tgcgagagag ggaggtagaa
>>     601 ttccacgtgt agcggtgaaa tgcgtagata tgtggaggaa taccggtggc gaaggcggcc
>>     661 tcctggctcg acactgacgc tgaggtgcga aagcgtgggg agcaaacagg attagatacc
>>     721 ctggtagtcc acgccgtaaa cgatgtctac tagccgttgg gagacttgat ttcttggtgg
>>     781 cgaagttaac gcgataagta gaccgcctgg ggagtacggc cgcaaggtta aaactcaaat
>>     841 gaattgacgg gggcccgcac aagcggtgga gcatgtggtt taattcgatg caacgcgaag
>>     901 aaccttacca ggccttgaca tcctaggaat cctgtagaga tacgggagtg ccttcgggaa
>>     961 tctagtgaca ggtgctgcat ggctgtcgtc agctcgtgtc gtgagatgtt gggttaagtc
>>    1021 ccgtaacgag cgcaaccctt gtccttagtt gccagcgcgt aatggcggga actctaagga
>>    1081 gactgccggt gacaaaccgg aggaaggtgg ggacgacgtc aagtcatcat ggcccttacg
>>    1141 gcctgggcta cacacgtgct acaatggaac gcacagaggg cagcaaaccc gcgaggggga
>>    1201 gcgaatccca caaaacgttt cgtagtccgg atcggagtct gcaactcgac tccgtgaagt
>>    1261 cggaatcgct agtaatcgtg aatcagaatg tcacggtgaa tacgttcccg ggccttgtac
>>    1321 acaccgcccg tcacaccatg ggagtgggtt gctccagaag tggttagcct aaccttcggg
>>    1381 agggcgatc
>>//
>>
>>
>>
>
>--
>"I have not failed. I've just found 10,000 ways that don't work."
>               --- Thomas Edison
>
>Nematode Bioinformatics           ||
>Blaxter Nematode Genomics Group   ||
>School of Biological Sciences     ||
>Ashworth Laboratories             ||
>King's Buildings                  ||    tel: +44 131 650 7403
>University of Edinburgh           ||    web: www.nematodes.org
>Edinburgh                         ||
>EH9 3JT                           ||
>UK                                ||
>
>
>  
>


