[Bioperl-l] [Gmod-schema] Loading NCBI/GenBank bacteria into CHADO: Chromosome/Plasmid gene name conflicts

Tue Mar 2 16:11:13 UTC 2010

Hi Leighton,

Wow, that is a lot of text; I really appreciate your thoroughness in
describing the problem.  I have a few suggestions to get the ball
rolling.  First, I am working on the 1.1 release of gmod/chado, and it
may fix some of the problems you are describing.  Certainly, ID
collisions between GFF files should not be a problem (I didn't think
they were in the 1.0 release, but that was a long time ago).  Please
try a checkout of the schema trunk in the gmod svn:

  http://gmod.org/wiki/SVN

Another thing you may want to look at is that just last week, a
developer at Texas A&M, Nathan Liles, contributed code to the
bioperl-live trunk for the genbank2gff3.pl script that will do a much
better job of converting bacterial genbank files to GFF3; perhaps that
will help too.  Working with a svn checkout of bioperl-live shouldn't
be too scary either; the pieces you are interested in (that work with
Chado and GBrowse) are quite stable.

Let us know how it goes,
Scott

On Mon, Mar 1, 2010 at 6:32 AM, Leighton Pritchard <lpritc at scri.ac.uk> wrote:
> Hi,
>
> I've tried going back through the mailing list, Googling the answer, and
> reading the documentation and wiki to find a solution for this.  I've either
> missed it, or it's not there yet.  Hopefully there's a simple solution, or
> an option that I'm just not seeing.  I'm sure other people must be using
> CHADO for bacterial genomes, and I would be interested in hearing about best
> practice for using CHADO/GBROWSE with these sequences (I've seen
> http://gmod.org/wiki/Chado_for_prokaryotes - but there's not much in
> there...).
>
> I have a working CHADO(GMOD-1.0)/GBROWSE2/BioPerl 1.6.1 setup on CentOS 5.4,
> and I'm trying to load some bacterial data.  Specifically for this example,
> I'm trying to get the GenBank sequences for E.coli S88: NC_011742 and
> NC_011747 into CHADO.  I've been following instructions from a number of
> locations, including http://gmod.org/wiki/Artemis-Chado_Integration_Tutorial
> and http://gmod.org/wiki/Chado_Tutorial, but there's an issue with these two
> files, in that the NC_011742 (chromosome) and NC_011747 (plasmid) sequences
> contain genes that have the same names (and several genes with the same name
> in the same sequence!), and this appears to be a problem.  Here's what's
> going wrong:
>
> I start off with the two GenBank files:
>
> """
> [lpritc at localhost ~]$ ls -1 *.gbk
> NC_011742.gbk
> NC_011747.gbk
> """
>
> And convert these to .gff3 using the BioPerl script (it doesn't seem to
> matter whether I pass them with the wildcard, or convert separately, though
> passing multiple sequences for conversion might be a good place to check for
> unique IDs):
>
> """
> [lpritc at localhost ~]$ bp_genbank2gff3.pl -s *.gbk
> # Input: NC_011742.gbk
> # working on region:NC_011742, Escherichia coli S88, 19-DEC-2008,
> Escherichia coli S88, complete genome.
> # GFF3 saved to ./NC_011742.gbk.gff
> # Summary:
> # Feature    Count
> # -------    -----
> # mRNA  4696
> # gene  4898
> # region  1
> # pseudogene  151
> # CDS  4696
> # RESIDUES(tr)  1442813
> # RESIDUES  5032268
> # processed_transcript  89
> # rRNA  22
> # pseudogenic_region  151
> # exon  4899
> # tRNA  91
> #
> # Input: NC_011747.gbk
> # working on region:NC_011747, Escherichia coli S88, 18-AUG-2009,
> Escherichia coli S88 plasmid pECOS88, complete sequence.
> # GFF3 saved to ./NC_011747.gbk.gff
> # Summary:
> # Feature    Count
> # -------    -----
> # mRNA  4832
> # gene  5037
> # region  2
> # pseudogene  159
> # CDS  4832
> # RESIDUES(tr)  1477756
> # RESIDUES  5166121
> # processed_transcript  92
> # rRNA  22
> # pseudogenic_region  159
> # exon  5038
> # tRNA  91
> #
> """
>
> I can then use the gmod_bulk_load_gff3.pl script to load either file, but
> only singly.  This appears to work, and the result is visible and seemingly
> correctly navigable in GBROWSE (using NC_011747 as the first sequence here,
> but the order is unimportant):
>
> """
> [lpritc at localhost ~]$ gmod_bulk_load_gff3.pl --organism E.coli --dbxref
> GeneID --noexon --recreate_cache --gfffile NC_011747.gbk.gff
> (Re)creating the uniquename cache in the database...
> Creating table...
> Populating table...
> Creating indexes...Done.
> Preparing data for inserting into the chado database
> (This may take a while ...)
> Dropping cds temp tables...
> Creating cds temp tables...
> NOTICE:  CREATE TABLE will create implicit sequence
> "tmp_cds_handler_cds_row_id_seq" for serial column
> "tmp_cds_handler.cds_row_id"
> NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
> "tmp_cds_handler_pkey" for table "tmp_cds_handler"
> NOTICE:  CREATE TABLE will create implicit sequence
> "tmp_cds_handler_relationship_rel_row_id_seq" for serial column
> "tmp_cds_handler_relationship.rel_row_id"
> NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
> "tmp_cds_handler_relationship_pkey" for table "tmp_cds_handler_relationship"
> Loading data into feature table ...
> Loading data into featureloc table ...
> Loading data into feature_relationship table ...
> Loading data into featureprop table ...
> Skipping feature_cvterm table since the load file is empty...
> Skipping synonym table since the load file is empty...
> Skipping feature_synonym table since the load file is empty...
> Skipping dbxref table since the load file is empty...
> Loading data into feature_dbxref table ...
> Skipping analysisfeature table since the load file is empty...
> Skipping cvterm table since the load file is empty...
> Skipping db table since the load file is empty...
> Skipping cv table since the load file is empty...
> Skipping analysis table since the load file is empty...
> Skipping organism table since the load file is empty...
> Adding cvtermprop=MapReferenceType for 'region' ...
> Loading sequences (if any) ...
> Optimizing database (this may take a while) ...
>  (feature featureloc feature_relationship featureprop feature_cvterm
> synonym feature_synonym dbxref feature_dbxref analysisfeature cvterm db cv
> analysis organism ) Done.
>
> While this script has made an effort to optimize the database, you
> should probably also run VACUUM FULL ANALYZE on the database as well
> """
>
> """
> chado=> SELECT feature_id, organism_id, name, uniquename FROM feature WHERE
> name='NC_011747';
>  feature_id | organism_id |   name    | uniquename
> ------------+-------------+-----------+------------
>     146917 |          99 | NC_011747 | NC_011747
> """
>
> However, attempting to load in the second sequence throws an error (though
> this might also be a good point to check for ID uniqueness with a database
> check, and appropriate modification to the ID, if necessary - problems could
> arise if we were trying to add genuine duplicates, though...):
>
> """
> [lpritc at localhost ~]$ gmod_bulk_load_gff3.pl --organism E.coli --dbxref
> GeneID --noexon --recreate_cache --gfffile NC_011742.gbk.gff
> (Re)creating the uniquename cache in the database...
> Creating table...
> Populating table...
> Creating indexes...Done.
> Preparing data for inserting into the chado database
> (This may take a while ...)
> Dropping cds temp tables...
> Creating cds temp tables...
> NOTICE:  CREATE TABLE will create implicit sequence
> "tmp_cds_handler_cds_row_id_seq" for serial column
> "tmp_cds_handler.cds_row_id"
> NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
> "tmp_cds_handler_pkey" for table "tmp_cds_handler"
> NOTICE:  CREATE TABLE will create implicit sequence
> "tmp_cds_handler_relationship_rel_row_id_seq" for serial column
> "tmp_cds_handler_relationship.rel_row_id"
> NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
> "tmp_cds_handler_relationship_pkey" for table "tmp_cds_handler_relationship"
>
> no parent yacC;
> you probably need to rerun the loader with the --recreate_cache option
>
> Issuing rollback() due to DESTROY without explicit disconnect() of
> DBD::Pg::db handle dbname=chado;port=5432;host=localhost.
> """
>
> This, of course, prevents the upload of the sequence and its annotations, as
> a whole.
>
> The script recommends that the --recreate_cache option should be used, but I
> am already using it.  If the same process is run, reversing the order of the
> input files, the same error is reported, but for the gene with name 'int'.
> Both sequences contain genes with the names 'int' and 'yacC' (NC_011742
> appears to contain four genes with the name 'int'):
>
> """
> [lpritc at localhost ~]$ grep 'ID=yacC;' *.gbk.gff
> NC_011742.gbk.gff:NC_011742    GenBank    gene    142755    143273    .    -
> .    ID=yacC;Dbxref=GeneID:7130628;gene=yacC;locus_tag=ECS88_0131
> NC_011747.gbk.gff:NC_011747    GenBank    gene    85083    85931    .    +
> .    ID=yacC;Dbxref=GeneID:7119486;gene=yacC;locus_tag=pECS88_0103
>
> [lpritc at localhost ~]$ grep 'ID=int;' *.gbk.gff
> NC_011742.gbk.gff:NC_011742    GenBank    gene    1182443    1183585    .
> -    .    ID=int;Dbxref=GeneID:7131611;gene=int;locus_tag=ECS88_1152
> NC_011742.gbk.gff:NC_011742    GenBank    pseudogene    1998684    1999646
> .    +    .
> ID=int;Dbxref=GeneID:7128964;gene=int;locus_tag=ECS88_2031;pseudo=_no_value
> NC_011742.gbk.gff:NC_011742    GenBank    gene    2829972    2830991    .
> +    .    ID=int;Dbxref=GeneID:7131911;gene=int;locus_tag=ECS88_2851
> NC_011742.gbk.gff:NC_011742    GenBank    gene    3220074    3221336    .
> +    .    ID=int;Dbxref=GeneID:7129893;gene=int;locus_tag=ECS88_3250
> NC_011747.gbk.gff:NC_011747    GenBank    gene    132    872    .    +    .
> ID=int;Dbxref=GeneID:7119360;gene=int;locus_tag=pECS88_0001
> """
>
> Commenting out either of these genes, and their child features, defers the
> error to another gene that has the same name in both sequences in each case.
> It seems that the problem might derive from attempting to uniquely associate
> each gene uniquely with its 'gene' tag in the GenBank file and, as there are
> several points in the process where it would be sensible to check for name
> collisions, so that the feature:uniquename column can be modified to reflect
> this, I looked for command-line options to each script, but didn't see one
> that could help.  Examining the manual for gmod_bulk_load_gff3.pl suggests
> that this might be the problem (though I might be misunderstanding it):
>
> """
>       Column 9 (group)
>           Here is where the magic happens.
>
>           Assigning feature.name, feature.uniquename
>               The values of feature.name and feature.uniquename are
> assigned according to these simple rules:
>
>               If there is an ID tag, that is used as feature.uniquename
>                   otherwise, it is assigned a uniquename that is equal to
> ¹auto¹ concatenated with the feature_id.
>
>                   (Note that this is a potential problem as there is no
> check to make sure that it is appropriately unique.)
>
>               If there is a Name tag, it¹s value is set to feature.name;
>                   otherwise it is null.
>
>                   Note that these rules are much more simple than that
> those that Bio::DB::GFF uses, and may need to be revisited.
> """
>
> I suspect that, as the bp_genbank2gff3.pl script converts gene names (which
> are not guaranteed to be unique) to ID tags, the problem recognised in the
> manual is cropping up at this point.  Luckily, the GenBank files come with
> locus_tag tags, which should be unique for each gene (see
> http://www.ncbi.nlm.nih.gov/Genbank/genomesubmit.html#locus_tag).  For
> bacteria, at least, using the locus_tag values might be a more robust option
> for the bp_genbank2gff3.pl; this already appears to have been recognised in
> the script comments:
>
> """
>            #?? should gene_name from
> /locus_tag,/gene,/product,/transposon=xxx
>            # be converted to or added as  Name=xxx (if not ID= or as well)
>            ## problematic: convert_to_name ($feature); # drops
> /locus_tag,/gene, tags
> """
>
> I can get round the upload problem somewhat suckily by changing the priority
> given to 'locus_tag' and 'gene' tags for generating the .gff ID tag in the
> bp_genbank2gff3.pl script:
>
> """
> [lpritc at localhost ~]$ diff bp_genbank2gff3.pl /usr/bin/bp_genbank2gff3.pl
> 976,977c976,977
> <     if ($g->has_tag('locus_tag')) {
> <         ($gene_id) = $g->get_tag_values('locus_tag');
> ---
>>     if ($g->has_tag('gene')) {
>>         ($gene_id) = $g->get_tag_values('gene');
> 979,980c979,980
> <     elsif ($g->has_tag('gene')) {
> <         ($gene_id) = $g->get_tag_values('gene');
> ---
>>     elsif ($g->has_tag('locus_tag')) {
>>         ($gene_id) = $g->get_tag_values('locus_tag');
> """
>
> But this isn't a complete solution, as GBROWSE searches by gene name don't
> work after making this change, and presumably some further configuration or
> hacking about is required to sort that out (advice welcome).
>
> So, what are other people doing to overcome this issue (if you've seen it),
> and would a change to the bp_genbank2gff.pl script along the lines I mention
> be useful to others?
>
> Cheers,
>
> L.
>
>
> --
> Dr Leighton Pritchard MRSC
> D131, Plant Pathology Programme, SCRI
> Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
> e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
> gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405
>
>
> ______________________________________________________
> SCRI, Invergowrie, Dundee, DD2 5DA.
> The Scottish Crop Research Institute is a charitable company limited by guarantee.
> Registered in Scotland No: SC 29367.
> Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.
>
>
> DISCLAIMER:
>
> This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries.  This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed.  It may not be disclosed or used by any other than that
> addressee.
> If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system.
>
> Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any).
> ______________________________________________________
>
> ------------------------------------------------------------------------------
> Download Intel&#174; Parallel Studio Eval
> Try the new software tools for yourself. Speed compiling, find bugs
> proactively, and fine-tune applications for parallel performance.
> See why Intel Parallel Studio got high marks during beta.
> http://p.sf.net/sfu/intel-sw-dev
> _______________________________________________
> Gmod-schema mailing list
> Gmod-schema at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/gmod-schema
>

-- 
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)                     216-392-3087
Ontario Institute for Cancer Research