[Bioperl-l] Re: GFF3

Scott Cain cain at cshl.edu
Mon Jan 17 12:04:54 EST 2005


Hi Rob,

Thanks for your work on this--I've put several comments in your
original message below.

Scott

---------Original Message--------
Date: Sat, 15 Jan 2005 15:22:23 -0800
From: Rob Edwards <rob at salmonella.org>
Subject: [Bioperl-l] GFF3
To: Bioperl list <bioperl-l at portal.open-bio.org>

Because I need it for some things that I am doing, I have worked quite 
a bit on the GFF3 parser Bio::FeatureIO::gff. Several people have 
written this module; I have just made some cosmetic changes:

I have improved the validation processes that are applied as a gff3 
file is parsed, and the module should now validate essentially 
everything in the file except alignments. Validation is optional and is 
based on the specification described at:
http://song.sourceforge.net/gff3.shtml

SC> Excellent--Did you happen to relax the requirement that ID be unique
SC> for each line of the GFF?  Allen and I put that in due to a misreading
SC> of the spec.  The ID has to be unique for a *feature*, which can be
SC> spread across several lines.
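
A minimal sketch of that distinction, in Python rather than Perl and not taken from Bio::FeatureIO::gff: rows that share an ID are collected into one feature instead of being rejected as duplicates. The names parse_attributes and group_by_id, the "_line_N" fallback key, and the sample rows are all hypothetical.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column (tag=value;tag=value) into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if pair:
            tag, _, value = pair.partition("=")
            attrs[tag] = value
    return attrs


def group_by_id(lines):
    """Group GFF3 rows into features keyed by their ID attribute.

    Rows sharing an ID are treated as parts of one feature.
    Rows without an ID get a made-up per-line key (an assumption
    of this sketch, not something the spec prescribes).
    """
    features = {}
    for n, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        attrs = parse_attributes(cols[8])
        key = attrs.get("ID", "_line_%d" % n)
        part = (cols[2], int(cols[3]), int(cols[4]))  # type, start, end
        features.setdefault(key, []).append(part)
    return features


# Hypothetical sample: one CDS spread across two lines, plus its mRNA.
rows = [
    "ctg1\t.\tCDS\t100\t200\t.\t+\t0\tID=cds1;Parent=mRNA1",
    "ctg1\t.\tCDS\t300\t400\t.\t+\t0\tID=cds1;Parent=mRNA1",
    "ctg1\t.\tmRNA\t100\t400\t.\t+\t.\tID=mRNA1",
]
features = group_by_id(rows)
```

Here the two CDS rows end up under the single key "cds1", so a uniqueness check would be applied to the keys of the result rather than to individual lines.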

For clarification and edification I have created a couple of tables 
describing the module and the validation that is applied to GFF3 files, 
which you can see online: http://www.salmonella.org/bioperl/gff3.html

SC> Very nice and well done--do you happen to have a pod-ified version 
SC> of this page?  It would be nice to include in the pod for 
SC> Bio::FeatureIO::gff.

I also wrote a Bio::SeqIO::gff module. Since gff3 files can hold 
sequences, it seems that you'd want to be able to call the next_seq 
method, and therefore SeqIO is more appropriate than FeatureIO for 
those aspects. Currently the SeqIO module uses the FeatureIO module to 
parse the file and just reorganizes the results.

This provides two different interfaces for getting objects out of GFF3 
files:
	Bio::FeatureIO::gff will return Bio::SeqFeature::Annotated objects 
representing the annotations.
	Bio::SeqIO::gff will return Bio::Seq objects representing the 
sequences with all the annotations attached.

The other difference between the two is that the former passes out the 
objects as they are read, but the latter has to read the whole file to 
get the annotations and the sequences.
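
The difference in reading behaviour can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the code of either module: iter_features and read_sequences are hypothetical names, features are reduced to plain tuples, and the sequence section is assumed to start with an explicit ##FASTA directive.

```python
def iter_features(lines):
    """Yield one (seqid, type, start, end) tuple per row as it is read."""
    for line in lines:
        if line.startswith("##FASTA"):
            return  # annotations end where the sequence section begins
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        yield cols[0], cols[2], int(cols[3]), int(cols[4])


def read_sequences(lines):
    """Read everything, then return {seqid: {"seq": str, "features": list}}."""
    lines = list(lines)  # the whole input is held in memory
    records = {}
    for seqid, ftype, start, end in iter_features(lines):
        records.setdefault(seqid, {"seq": "", "features": []})
        records[seqid]["features"].append((ftype, start, end))
    in_fasta = False
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith("##FASTA"):
            in_fasta = True
        elif in_fasta and line.startswith(">"):
            current = line[1:].split()[0]
            records.setdefault(current, {"seq": "", "features": []})
        elif in_fasta and line and current is not None:
            records[current]["seq"] += line
    return records


# Hypothetical sample input.
gff = [
    "##gff-version 3",
    "ctg1\t.\tgene\t1\t8\t.\t+\t.\tID=gene1",
    "ctg1\t.\tCDS\t2\t7\t.\t+\t0\tID=cds1;Parent=gene1",
    "##FASTA",
    ">ctg1",
    "ACGTACGT",
]
first = next(iter_features(gff))  # available after reading a single row
records = read_sequences(gff)     # available only after the whole input
```

The generator can hand back the first feature immediately, while the sequence-oriented reader cannot return a complete record until it has seen the sequence data at the end of the file.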

SC> I thought about doing something similar with SeqIO, but I am worried 
SC> about the case where somebody tries to use SeqIO on a well-annotated 
SC> human Chr1 GFF3 file (if one were ever to exist :-). Then again, I
SC> suppose someone could kill their machine just as easily by using
SC> SeqIO on a GenBank file of Chr1.

At the moment I have focused on reading GFF3 files.

I have not committed these to CVS yet, pending comments from others. I 
have some specific questions:
	Should I wait until after 1.5 is out?

SC> I don't have the definitive answer, but I would say it doesn't
SC> matter much, as long as it passes tests.  Bio::FeatureIO::gff is
SC> hardly a fully functional module as it is, so if we could 
SC> squeeze a little more functionality into it before we
SC> release it, that would be fine with me.

	Are two separate modules really the right way to go about this?

SC> As long as it works for this case, I don't mind:  calling
SC> 'next_feature' on a FeatureIO object until I run out of features
SC> and then calling 'next_sequence' (and getting a Bio::PrimarySeq) on
SC> the same FeatureIO object until I run out of sequences.
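
One way to picture that calling pattern is the sketch below. It is written in Python and is not the Bio::FeatureIO interface: the class Gff3Reader is hypothetical, features are returned as lists of columns and sequences as (id, string) tuples rather than Bio::PrimarySeq objects, and an explicit ##FASTA directive is assumed.

```python
class Gff3Reader:
    """Hypothetical reader: features first, then sequences, from one handle."""

    def __init__(self, lines):
        self._lines = iter(lines)
        self._in_fasta = False
        self._pending_header = None

    def next_feature(self):
        """Return the next feature row as a list of columns, or None when done."""
        if self._in_fasta:
            return None
        for line in self._lines:
            line = line.rstrip("\n")
            if line.startswith("##FASTA"):
                self._in_fasta = True
                return None
            if line.strip() and not line.startswith("#"):
                return line.split("\t")
        return None

    def next_sequence(self):
        """Return the next (id, sequence) pair, or None when there are no more.

        Also returns None if the features have not been exhausted yet.
        """
        if not self._in_fasta:
            return None
        header = self._pending_header
        self._pending_header = None
        chunks = []
        for line in self._lines:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    self._pending_header = line  # belongs to the next call
                    break
                header = line
            elif line and header is not None:
                chunks.append(line)
        if header is None:
            return None
        return header[1:].split()[0], "".join(chunks)


# Hypothetical sample input.
reader = Gff3Reader([
    "##gff-version 3",
    "ctg1\t.\tgene\t1\t8\t.\t+\t.\tID=gene1",
    "ctg2\t.\tgene\t1\t2\t.\t-\t.\tID=gene2",
    "##FASTA",
    ">ctg1",
    "ACGT",
    "ACGT",
    ">ctg2",
    "GG",
])

features = []
while (feature := reader.next_feature()) is not None:
    features.append(feature)

sequences = []
while (seq := reader.next_sequence()) is not None:
    sequences.append(seq)
```

Because the reader never looks ahead past the row it returns, features are still streamed one at a time, and only the sequence currently being assembled is held in memory.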

	What about other GFF modules (like Bio::Tools::GFF)?

SC> I am willing to let Bio::Tools::GFF die a terrible death.  While
SC> it will have to be kept around for apps that depend on it, I don't
SC> see adding any major functionality as time well spent.

	Could someone give the modules a workout and let me know about bugs? I 
am sure there are many.

SC> I will try to soon, but it won't be until next week at 
SC> the earliest.

I have posted these modules online via anonymous ftp at 
ftp://ftp.salmonella.org/rob/bioperl/GFF_modules.tgz
Take a look and let me know what you do and don't like!

Rob


----------------------------------------------------------------------
Scott Cain, Ph. D.				 	 cain at cshl.org
GMOD Coordinator, http://www.gmod.org/			 (216)392-3087
----------------------------------------------------------------------




