[Bioperl-l] GFF file and load_gff.pl

Scott Cain scott at scottcain.net
Wed Jan 28 18:51:16 UTC 2009


Hi Richard,

A few items:

* It looks as though the loader didn't know that it was loading GFF3
(you can tell it's GFF3 by the = between the tags and values in the
ninth column; in GFF2, there would be a space).  As a result, the
classes weren't created properly, which is presumably why your class
list shows entries like "ID=YKR067W".  Check that there is a line at
the top of your GFF file that looks like "##gff-version 3".

* You may not want to use a Bio::DB::GFF database anyway.  Since you
are just getting started and have GFF3, you might be better off using
a Bio::DB::SeqFeature::Store database, which was designed to work with
GFF3 data (Bio::DB::GFF works better with GFF2).  The loader for a
SeqFeature::Store database is called bp_seqfeature_load.pl.
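
For the first item, a quick way to check the header is a few lines of
plain Perl.  This is only a sketch: gff_version is a helper name I
made up (it is not part of BioPerl), and it assumes the pragma appears
before any feature lines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the version declared by a "##gff-version N" pragma, or undef if
# the pragma is missing.  Blank lines are skipped; scanning stops at the
# first non-blank line that does not start with "##".
# (gff_version is a hypothetical helper, not a BioPerl function.)
sub gff_version {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $version;
    while (my $line = <$fh>) {
        next if $line =~ /^\s*$/;
        last unless $line =~ /^##/;
        if ($line =~ /^##gff-version\s+(\d+)/) {
            $version = $1;
            last;
        }
    }
    close $fh;
    return $version;
}

# Usage: perl check_gff_version.pl genome.gff
if (@ARGV) {
    my $v = gff_version($ARGV[0]);
    print defined $v
        ? "gff-version $v\n"
        : "no ##gff-version pragma found\n";
}
```

If it reports that no pragma was found, adding "##gff-version 3" as
the first line and reloading should let the loader treat the file as
GFF3.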

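Once the data are in a SeqFeature::Store database, looking up a gene
by name might look roughly like the following.  Treat it as a sketch
rather than tested code: the database name "yeast" and the credentials
are placeholders I made up, and it assumes the file was loaded with
bp_seqfeature_load.pl.

```perl
use strict;
use warnings;
use Bio::DB::SeqFeature::Store;

# Assumed: a MySQL database called "yeast" that was already loaded with
# bp_seqfeature_load.pl.  The name and the credentials are placeholders.
my $db = Bio::DB::SeqFeature::Store->new(
    -adaptor => 'DBI::mysql',
    -dsn     => 'dbi:mysql:yeast',
    -user    => 'root',
    -pass    => 'mypassword',
);

# In the example file both the gene line and the CDS line carry
# Name=YAL061W, so more than one feature may come back.
for my $f ($db->get_features_by_name('YAL061W')) {
    printf "%s\t%s\t%d\t%d\t%s\n",
        $f->primary_tag, $f->seq_id, $f->start, $f->end, $f->strand;
}
```

Each line printed is the feature type, the reference sequence, start,
end and strand, which should cover the CDS coordinates you are after.
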
Scott


On Wed, Jan 28, 2009 at 12:36 PM, Richard Harrison
<richard.harrison at ed.ac.uk> wrote:
> Thank you Chris, Scott and Adam,
> You are right, I was confused. I have now managed to create a Bio::DB::GFF
> database with my genome annotation loaded into it. One further question:
> I am having trouble retrieving the desired info from the database.  Shown
> below is a typical entry from the GFF file for a gene:
>
>
> #chr01  SGD     gene    33449   34702   .       +       .
> ID=YAL061W;Name=YAL061W;gene=BDH2;Alias=BDH2;Ontology_term=GO:0008150,GO:0005634,GO:0005737,GO:0016616,GO:0008270,GO:0016491,GO:0046872;Note=Putative%20medium-chain%20alcohol%20dehydrogenase%20with%20similarity%20to%20BDH1%3B%20transcription%20induced%20by%20constitutively%20active%20PDR1%20and%20PDR3%3B%20BDH2%20is%20an%20essential%20gene;dbxref=SGD:S000000057;orf_classification=Uncharacterized
>
> #chr01  SGD     CDS     33449   34702   .       +       0
> Parent=YAL061W;Name=YAL061W;gene=BDH2;Alias=BDH2;Ontology_term=GO:0008150,GO:0005634,GO:0005737,GO:0016616,GO:0008270,GO:0016491,GO:0046872;Note=Putative%20medium-chain%20alcohol%20dehydrogenase%20with%20similarity%20to%20BDH1%3B%20transcription%20induced%20by%20constitutively%20active%20PDR1%20and%20PDR3%3B%20BDH2%20is%20an%20essential%20gene;dbxref=SGD:S000000057;orf_classification=Uncharacterized
>
>
> I would like to search the database for YAL061W and retrieve the CDS
> coordinates, details about introns, etc. I don't need the sequence, as I
> have separate multiple genome alignments.
>
>
> At present all I can work out how to do is get all the feature types and
> classes in the database (see code below):
>
>
> use Bio::DB::GFF;
>
> # Open the Bio::DB::GFF database held in the MySQL database "biosql"
> my $db      = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
>                                   -dsn     => 'dbi:mysql:biosql',
>                                   -user    => 'root',
>                                   -pass    => '*******'
>                                 );
>        #get types
>        my @types = $db->types;
>
> e.g.:
> #telomere:SGD intron:SGD insertion:SGD chromosome:SGD region:landmark
> #ncRNA:SGD transposable_element_gene:SGD region:SGD ARS:SGD snRNA:SGD
> #snoRNA:SGD nc_primary_transcript:SGD rRNA etc...
>
>
>
>        #get classes
>        my @classes = $db->classes;
>
> ID=YKR067W
> ID=YKR068C
> ID=YKR069W
> ID=YKR070W
> ID=YKR071C
> ID=YKR072C
> ID=YKR073C
> ID=YKR074W
>
> etc...
>
> Could someone point me towards a useful set of pointers for this? I've tried
> reading the documentation but it doesn't seem to illustrate what I want to
> do.
>
> Best wishes and thanks for the help so far,
>
> Richard
>
>
>
>
>
>
>
> On 28 Jan 2009, at 16:15, Scott Cain wrote:
>
>> Hi Richard,
>>
>> You're mixing up two database schemas.  Do you want to use a BioSQL
>> database (bioperl-db) or a Bio::DB::GFF database?  I'm guessing that
>> you want the latter, so I'll answer that question (as it's the easier
>> one anyway).  You need to add the "-c" flag (for --create) to the
>> load_gff.pl command to create the Bio::DB::GFF schema.
>>
>> If you really want a BioSQL database, you'll have to wait for help
>> from someone else more knowledgeable about it.
>>
>> Scott
>>
>>
>>
>>
>> On Wed, Jan 28, 2009 at 10:22 AM, Richard Harrison
>> <richard.harrison at ed.ac.uk> wrote:
>>>
>>> Dear all,
>>>
>>> I am running BioPerl 1.6 on OS X Leopard on a MacBook Pro.
>>>
>>> I have installed mysql-5.1.30-osx10.5-x86, DBD-mysql-4.010, the
>>> biosql-schema for mysql and bioperl-db.  As per the instructions I have a
>>> database called biosql, which I associated with the SQL dialect
>>> biosqldb-mysql.sql.
>>>
>>> After much fannying, the install seems fine, although I can't be sure
>>> (I have never used mysql before).
>>>
>>> I am having problems with the script load_gff.pl.
>>>
>>> I want to load a database with the data from a genome.gff file (for
>>> Saccharomyces cerevisiae). I don't want to add sequence to it, as all I
>>> need is the annotation.
>>>
>>> I have tried the following command(s):
>>>
>>> ./bp_load_gff.pl -d biosql -user root -pass mypassword genome.gff
>>> ./bp_load_gff.pl -d biosql -user root -pass mypassword --adaptor=dbi::mysql genome.gff
>>>
>>> With both I get the following error:
>>>
>>> No ftype id for CDS:SGD Table 'biosql.ftype' doesn't exist Record skipped.
>>> (then another few thousand of these)
>>> then:
>>>
>>> genome.gff: 16379 records loaded
>>>
>>>
>>> Any ideas where I'm going wrong?
>>>
>>> Thanks,
>>>
>>> Richard
>>>
>>> ____________________________
>>> Dr Richard Harrison
>>> 127 Ashworth Labs
>>> Institute of Evolutionary Biology
>>> King's Buildings
>>> West Mains Road
>>> Edinburgh EH9 3JT
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> The University of Edinburgh is a charitable body, registered in
>>> Scotland, with registration number SC005336.
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>
>>
>>
>> --
>> ------------------------------------------------------------------------
>> Scott Cain, Ph. D.                                   scott at scottcain dot net
>> GMOD Coordinator (http://gmod.org/)                     216-392-3087
>> Ontario Institute for Cancer Research
>>
>
>
>
>



-- 
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)                     216-392-3087
Ontario Institute for Cancer Research



