[Bioperl-l] bp_genbank2gff3.pl
Scott Cain
scott at scottcain.net
Sat Sep 18 14:03:43 UTC 2010
The only thing I can add is that I did a 'git diff genbank2gff3.PLS'
and found no differences. It occurred to me that perhaps I'd done
some fixing and not committed it, but it looks to me like that's not
the case (assuming I've managed to use git correctly; not a great
assumption, but I don't have another one to work with :-).
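One caveat on that check: a plain 'git diff <file>' compares the working
tree with the index, so a change that was already staged with 'git add'
would not show up. Below is a minimal sketch of a stricter check that
compares against the last commit instead. file_state is only an
illustrative helper name, not something that exists in the BioPerl
repository.

```shell
# Report whether FILE differs from the last commit (HEAD), counting both
# staged and unstaged changes. Must be run inside a git work tree.
# NOTE: file_state is a hypothetical helper name used for illustration.
file_state() {
    # --quiet suppresses output and makes the exit status reflect
    # whether any difference was found (0 = none, 1 = differences).
    if git diff --quiet HEAD -- "$1"; then
        echo "unchanged since HEAD"
    else
        echo "modified since HEAD"
    fi
}
```

Run from inside the checkout, e.g.
'file_state scripts/Bio-DB-GFF/genbank2gff3.PLS'. This still says nothing
about commits that were made locally but never pushed.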
Scott
On Sat, Sep 18, 2010 at 2:57 PM, David Breimann
<david.breimann at gmail.com> wrote:
> So let's do an intermediate summary of my situation:
> I'm using Ubuntu 10.04 and Perl 5.10.1.
> I get unexpected results when using bp_genbank2gff3.pl ("Name=" instead of
> "locus_tag=" in the last GFF3 column), while Scott gets the expected results
> while using the latest version of bioperl.
> I cloned a fresh version of bioperl live into my ~/src:
> $ cd ~/src
> $ git clone http://github.com/bioperl/bioperl-live.git
>
> I then added the following line to the end of ~/.profile:
> export PERL5LIB="$HOME/src/bioperl-live:$PERL5LIB"
> and ran
> $ source ~/.profile
>
> I then downloaded a small genome from NCBI
> $ wget ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_E24377A/NC_009789.gbk
> and tested the script:
> $ ~/src/bioperl-live/scripts/Bio-DB-GFF/genbank2gff3.PLS NC_009789.gbk
>
> Following are the top 10 lines of the resulting GFF3:
>
> ##gff-version 3
> # sequence-region NC_009789 1 6199
> # conversion-by bp_genbank2gff3.pl
> # organism Escherichia coli E24377A
> # date 06-JAN-2010
> # Note Escherichia coli E24377A plasmid pETEC_6, complete sequence.
> NC_009789 GenBank region 1 6199 . + 1 ID=NC_009789;Dbxref=Project:13960,taxon:331111;Name=NC_009789;Note=Escherichia coli E24377A plasmid pETEC_6%2C complete sequence.,PROVISIONAL REFSEQ: This record has not yet been subject to final NCBI review. The reference sequence was derived from CP000798. Source DNA and bacteria available from Jacques Ravel (jravel at tigr.org). COMPLETENESS: full length. ;comment1=PROVISIONAL REFSEQ: This record has not yet been subject to final NCBI review. The reference sequence was derived from CP000798. Source DNA and bacteria available from Jacques Ravel (jravel at tigr.org). COMPLETENESS: full length. ;date=06-JAN-2010;mol_type=genomic DNA;organism=Escherichia coli E24377A;plasmid=pETEC_6;strain=E24377A
> NC_009789 GenBank gene 665 781 . - 1 ID=EcE24377A_B0001;Dbxref=GeneID:5585816;Name=EcE24377A_B0001
> NC_009789 GenBank mRNA 665 781 . - 1 ID=EcE24377A_B0001.t01;Parent=EcE24377A_B0001
> NC_009789 GenBank CDS 665 781 . - 1 ID=EcE24377A_B0001.p01;Parent=EcE24377A_B0001.t01;Dbxref=GI:157149501,GeneID:5585816;Name=EcE24377A_B0001;Note=identified by glimmer%3B putative;codon_start=1;product=hypothetical protein;protein_id=YP_001451539.1;transl_table=11;translation=length.38
>
> while these are from Scott's file:
> ##gff-version 3
> # sequence-region NC_009789 1 6199
> # conversion-by bp_genbank2gff3.pl
> # organism Escherichia coli E24377A
> # date 06-JAN-2010
> # Note Escherichia coli E24377A plasmid pETEC_6, complete sequence.
> NC_009789 GenBank region 1 6199 . + 1 ID=NC_009789;Dbxref=Project:13960,taxon:331111;Note=Escherichia coli E24377A plasmid pETEC_6%2C complete sequence.,PROVISIONAL REFSEQ: This record has not yet been subject to final NCBI review. The reference sequence was derived from CP000798. Source DNA and bacteria available from Jacques Ravel (jravel at tigr.org). COMPLETENESS: full length. ;comment1=PROVISIONAL REFSEQ: This record has not yet been subject to final NCBI review. The reference sequence was derived from CP000798. Source DNA and bacteria available from Jacques Ravel (jravel at tigr.org). COMPLETENESS: full length. ;date=06-JAN-2010;mol_type=genomic DNA;organism=Escherichia coli E24377A;plasmid=pETEC_6;strain=E24377A
> NC_009789 GenBank gene 665 781 . - 1 ID=EcE24377A_B0001;Dbxref=GeneID:5585816;locus_tag=EcE24377A_B0001
> NC_009789 GenBank mRNA 665 781 . - 1 ID=EcE24377A_B0001.t01;Parent=EcE24377A_B0001
> NC_009789 GenBank CDS 665 781 . - 1 ID=EcE24377A_B0001.p01;Parent=EcE24377A_B0001.t01;Dbxref=GI:157149501,GeneID:5585816;Note=identified by glimmer%3B putative;codon_start=1;locus_tag=EcE24377A_B0001;product=hypothetical protein;protein_id=YP_001451539.1;transl_table=11;translation=length.38
>
>
>
> Note the "Name=" tags in my version are replaced by "locus_tag=" in Scott's,
> as desired.
> I have no idea what is going on here...
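A quick way to see exactly which attribute keys differ between the two
outputs, as a rough sketch only: the helper names (gff3_attr_keys,
gff3_key_diff) and the file names in the usage line are made up for
illustration, and it assumes the real files are tab-delimited with the
attributes in column 9.

```shell
# List the attribute keys used in column 9 of a GFF3 file, one
# "<feature type><TAB><key>" pair per line, sorted and de-duplicated.
# NOTE: both function names are hypothetical, for illustration only.
gff3_attr_keys() {
    awk -F '\t' '
        /^#/ || NF < 9 { next }      # skip directives, comments, short lines
        {
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                split(attrs[i], kv, "=")
                if (kv[1] != "") print $3 "\t" kv[1]
            }
        }
    ' "$1" | LC_ALL=C sort -u
}

# Diff the key lists of two GFF3 files; exit status follows diff
# (0 = same keys, 1 = keys differ).
gff3_key_diff() {
    a=$(mktemp) || return 2
    b=$(mktemp) || return 2
    gff3_attr_keys "$1" > "$a"
    gff3_attr_keys "$2" > "$b"
    status=0
    diff "$a" "$b" || status=$?
    rm -f "$a" "$b"
    return "$status"
}
```

Usage would be something like 'gff3_key_diff mine.gff3 scott.gff3'. For
the gene lines quoted above, this should report Name on one side and
locus_tag on the other.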
>
> Best,
> Dave
>
> On Sat, Sep 18, 2010 at 3:40 PM, Scott Cain <scott at scottcain.net> wrote:
>>
>> Hi Dave,
>>
>> Let's keep the discussion on the mailing list so we can make sure that
>> when this problem is solved, its resolution will be archived.
>>
>> I don't really understand what is going on either, though it would
>> probably be a good idea to set your PERL5LIB env variable so that when
>> you execute this script from the git repository, it also uses the
>> BioPerl modules in the git repository instead of the ones that are
>> installed in your "normal" path.
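As a rough way to check which copy of a module that setting would put
first, the sketch below walks the PERL5LIB entries in order and lists
every directory that holds the given module file. perl5lib_hits is just
an illustrative name, and it only looks at PERL5LIB, not at the rest of
perl's @INC.

```shell
# Print every copy of a module file found under the PERL5LIB directories,
# in search order. The first line printed is the copy that comes first.
# NOTE: perl5lib_hits is a hypothetical helper; it ignores the default
# @INC directories that perl searches after PERL5LIB.
perl5lib_hits() {
    # $1 is the module path relative to a library root, e.g. Bio/SeqIO.pm
    printf '%s\n' "$PERL5LIB" | tr ':' '\n' | while IFS= read -r dir; do
        if [ -n "$dir" ] && [ -f "$dir/$1" ]; then
            printf '%s\n' "$dir/$1"
        fi
    done
}
```

For example, 'perl5lib_hits Bio/SeqIO.pm' should list the git checkout
first if PERL5LIB is set as intended. The authoritative answer is what
perl itself loaded:
perl -MBio::SeqIO -e 'print $INC{"Bio/SeqIO.pm"}, "\n"'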
>>
>> Also, are you using any command line flags when executing it? I didn't.
>>
>> Scott
>>
>>
>> On Sat, Sep 18, 2010 at 2:14 PM, David Breimann
>> <david.breimann at gmail.com> wrote:
>> > Yes, I'm using Ubuntu 10.04.
>> >
>> > That is really weird. I tried running the script from the bioperl-live dir
>> > (which I just pulled using git), and I get the same results as before
>> > (`Name` instead of `locus_tag`):
>> >
>> > $ wget ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_E24377A/NC_009789.gbk
>> > $ /home/dave/src/bioperl-live/blib/script/bp_genbank2gff3.pl -y NC_009789.genbank
>> >
>> > Attached is the resulting GFF3.
>> > I also attach a copy of bp_genbank2gff3.pl as found under
>> > /home/dave/src/bioperl-live/blib/script.
>> >
>> > This is a real mystery for me!
>> >
>> > On Sat, Sep 18, 2010 at 2:54 PM, Scott Cain <scott at scottcain.net> wrote:
>> >>
>> >> Typically I do build and install, but you can run it directly from the
>> >> git checkout directory.
>> >>
>> >> For locating other versions of the script: are you running Linux? If
>> >> so, are you familiar with the "locate" command?
>> >>
>> >> locate bp_genbank2gff3.pl
>> >>
>> >> If you've never used it before, you may need to update the database
>> >> the locate command uses as root:
>> >>
>> >> sudo updatedb
>> >>
>> >> Scott
>> >>
>> >>
>> >> On Sat, Sep 18, 2010 at 1:46 PM, David Breimann
>> >> <david.breimann at gmail.com> wrote:
>> >> > Your GFF seems fine. I get a very similar one, but with `Name=` instead
>> >> > of `locus_tag=`.
>> >> >
>> >> > I don't really know how to check for multiple bioperl installations.
>> >> > I'm using my personal server, so I don't mind removing and installing
>> >> > everything from scratch -- but I don't know how to do that.
>> >> >
>> >> > Also, what I don't get with git is how the scripts are supposed to be
>> >> > updated (unless you build and install).
>> >> >
>> >> > Thank you!
>> >> >
>> >> > On Sat, Sep 18, 2010 at 2:38 PM, Scott Cain <scott at scottcain.net> wrote:
>> >> >>
>> >> >> Well, if you aren't getting the same results as me then I'd say you
>> >> >> aren't using the same version of the script :-)
>> >> >>
>> >> >> Unfortunately, the scripts are no longer automatically marked with the
>> >> >> "internal" version information when committed, so there really isn't
>> >> >> anything in the script I can tell you to look for. Check for more
>> >> >> than one bioperl instance on your computer.
>> >> >>
>> >> >> I've attached the GFF3 file I got so you can look at it and tell me if
>> >> >> it is what you expect.
>> >> >>
>> >> >> Scott
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Sat, Sep 18, 2010 at 12:26 PM, David Breimann
>> >> >> <david.breimann at gmail.com> wrote:
>> >> >> > Hi Scott,
>> >> >> >
>> >> >> > I just pulled the latest bioperl-live using git.
>> >> >> > I'm not sure how the scripts are updated, so I built and installed
>> >> >> > anyway (perhaps exporting the path is supposed to be enough?).
>> >> >> > Anyway, I still get the same results: no locus_tag.
>> >> >> > How can I tell if I'm using the latest version of the script?
>> >> >> >
>> >> >> > Thanks again.
>> >> >> >
>> >> >> > On Sat, Sep 18, 2010 at 1:07 PM, Scott Cain <scott at scottcain.net> wrote:
>> >> >> >>
>> >> >> >> Hi Dave,
>> >> >> >>
>> >> >> >> A fresh "pull" of the bioperl git repository shows that
>> >> >> >> bp_genbank2gff3.pl already does this. It creates a locus_tag for all
>> >> >> >> features that have a locus_tag, and uses the locus_tag for the ID when
>> >> >> >> it can (it can't blindly use the locus tag for the ID since both the
>> >> >> >> gene and the CDS have the same tag).
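To make the ID point concrete: in the output quoted in this thread, the
gene keeps the bare locus tag as its ID while the mRNA and CDS get '.t01'
and '.p01' suffixes. A small sketch of that naming pattern follows. It
only mirrors what is visible in that output; feature_id is a made-up
helper, the fallback branch is an assumption, and the script's real logic
may well differ.

```shell
# Derive a unique GFF3 ID from a shared locus tag, following the suffix
# pattern seen in the quoted output (gene: bare tag, mRNA: .tNN, CDS: .pNN).
# NOTE: feature_id is a hypothetical helper, not code from the script.
feature_id() {
    ftype=$1          # feature type: gene, mRNA, CDS, ...
    tag=$2            # locus tag shared by the gene and its children
    n=${3:-1}         # ordinal of this feature under the gene (default 1)
    case $ftype in
        gene) printf '%s\n' "$tag" ;;
        mRNA) printf '%s.t%02d\n' "$tag" "$n" ;;
        CDS)  printf '%s.p%02d\n' "$tag" "$n" ;;
        # Assumed fallback for other types; not taken from the thread.
        *)    printf '%s.%s%02d\n' "$tag" "$ftype" "$n" ;;
    esac
}
```

With this, 'feature_id CDS EcE24377A_B0001' gives EcE24377A_B0001.p01, so
the gene and its CDS no longer collide on ID even though they carry the
same locus_tag.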
>> >> >> >>
>> >> >> >> Scott
>> >> >> >>
>> >> >> >>
>> >> >> >> On Sat, Sep 18, 2010 at 11:20 AM, David Breimann
>> >> >> >> <david.breimann at gmail.com> wrote:
>> >> >> >> > Hi Scott,
>> >> >> >> >
>> >> >> >> > Here is a very short GenBank file:
>> >> >> >> >
>> >> >> >> > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_E24377A/NC_009789.gbk
>> >> >> >> >
>> >> >> >> > Note all genes in the GenBank file have locus tags. In the resulting GFF3,
>> >> >> >> > however, only the last gene (EcE24377A_B0005) gets a locus_tag. I have
>> >> >> >> > no idea why it deserves special treatment... :)
>> >> >> >> >
>> >> >> >> > p.s. making this change (i.e., copying locus_tag to the GFF3 last
>> >> >> >> > column whenever available) will really make my life easier.
>> >> >> >> >
>> >> >> >> > Thank you,
>> >> >> >> > Dave
>> >> >> >> >
>> >> >> >> > On Sat, Sep 18, 2010 at 12:08 PM, Scott Cain <scott at scottcain.net> wrote:
>> >> >> >> >>
>> >> >> >> >> Hi Dave,
>> >> >> >> >>
>> >> >> >> >> That seems perfectly reasonable. If you could point out a GenBank
>> >> >> >> >> entry for which that does not happen, I could try to figure out why
>> >> >> >> >> not.
>> >> >> >> >>
>> >> >> >> >> Scott
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> On Sat, Sep 18, 2010 at 10:20 AM, David Breimann
>> >> >> >> >> <david.breimann at gmail.com> wrote:
>> >> >> >> >> > Since locus_tag is an essential tag in GenBank, I suggest that
>> >> >> >> >> > locus_tag always be added to the GFF last column if it exists in the
>> >> >> >> >> > GenBank file, whether or not it is used as the ID in the GFF.
>> >> >> >> >> >
>> >> >> >> >> > On Sat, Sep 18, 2010 at 11:17 AM, Scott Cain <scott at scottcain.net> wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> Hi Dave,
>> >> >> >> >> >>
>> >> >> >> >> >> bp_genbank2gff3.pl suffers from the fact that it has to deal with
>> >> >> >> >> >> GenBank files :-) It was designed initially to work on whole genome
>> >> >> >> >> >> refseqs, and contains several ad hoc rules for trying to make it "do
>> >> >> >> >> >> the right thing." In practice, it is not unusual for a post-processing
>> >> >> >> >> >> step (either by hand or a quick Perl script) to be required to really
>> >> >> >> >> >> get it right. I don't recall the specifics (if I ever knew :-) for
>> >> >> >> >> >> when and how the locus tag is used, but I do know that there is a list
>> >> >> >> >> >> of things that it will try to use for the ID, and while the locus is on
>> >> >> >> >> >> the list, I don't know where it comes in the list, so it's possible
>> >> >> >> >> >> that other items might supersede it.
>> >> >> >> >> >>
>> >> >> >> >> >> Scott
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> On Sat, Sep 18, 2010 at 10:05 AM, David Breimann
>> >> >> >> >> >> <david.breimann at gmail.com> wrote:
>> >> >> >> >> >> > Hello,
>> >> >> >> >> >> >
>> >> >> >> >> >> > I'm not sure how bp_genbank2gff3.pl works. Sometimes it adds a
>> >> >> >> >> >> > `locus_tag` in the fields and sometimes it doesn't, even though the
>> >> >> >> >> >> > GenBank file has a locus tag.
>> >> >> >> >> >> > Also, is the ID always equivalent to the locus tag?
>> >> >> >> >> >> >
>> >> >> >> >> >> > Thanks,
>> >> >> >> >> >> > Dave
>> >> >> >> >> >> > _______________________________________________
>> >> >> >> >> >> > Bioperl-l mailing list
>> >> >> >> >> >> > Bioperl-l at lists.open-bio.org
>> >> >> >> >> >> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> >> >> >> >> >> >
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> --
>> >> >> >> >> >> ------------------------------------------------------------------------
>> >> >> >> >> >> Scott Cain, Ph. D. scott at scottcain dot net
>> >> >> >> >> >> GMOD Coordinator (http://gmod.org/) 216-392-3087
>> >> >> >> >> >> Ontario Institute for Cancer Research
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> >
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >
>> >
>>
>>
>>
>
>
--
------------------------------------------------------------------------
Scott Cain, Ph. D. scott at scottcain dot net
GMOD Coordinator (http://gmod.org/) 216-392-3087
Ontario Institute for Cancer Research