[Bioperl-l] help creating de novo GFF3
Joshua Orvis
jorvis at gmail.com
Fri Aug 15 19:45:23 UTC 2008
I don't have a lot of experience with Bioperl and have used it mostly for
simple format conversions or parsing GenBank files. I need to write a
quick script that produces GFF3 and decided to give Bioperl another try
instead of just printing the columns myself, but I have had a few problems.
My apologies for the narrative here, but I know it can sometimes be
informative to hear 'how' a user arrived at a problem rather than just
knowing the problem itself.
Is there a documented explicit mapping between the GFF3 columns and the
predefined tags (ID, Name, etc.) and their Bioperl object attribute
equivalents? Is it preferable to create Bio::SeqFeature::Generic objects
and pass them to Bio::Tools::GFF->write_feature, or rather to create
Bio::SeqFeature::Annotated objects and pass them to Bio::FeatureIO::gff? I
may be overlooking it, but a simple tutorial showing how to create and define
a new sequence object, attach annotations to it, and dump it in GFF format
seems to be missing. This seems like a basic thing to do - most of the
documentation I find is about converting between formats rather than
creating new annotation.
Here are some of the problems I (a typical naive user?) ran into while
experimenting with Bioperl. My first attempt resulted in the string
"SEQ" as the first column in all my GFF output. I thought that maybe this
was because my features weren't 'attached' to a sequence, so I created a
Bio::Seq::RichSeq object and tried both (separately):
$seq->add_SeqFeature( $feat );
and
$feat->attach_seq( $seq );
Neither changed the first column of output. Looking at the docs.bioperl.org
methods for Bio::SeqFeature::Generic I found the seq_id attribute, which
came with the warning: "This attribute should *not* be used in GFF dumping"
- but since it's the only thing I did that worked, I used it anyway.
Next I wanted to have ID tags within my last column. I first tried setting
all the relevant attributes I could see on my features (id, primary_tag,
display_name, display_id, etc.) but none of these caused ID=? to be
created. Next, I tried something like this:
my $feat = Bio::SeqFeature::Annotated->new(
    -start   => $start,
    -end     => $end,
    -strand  => $strand,
    -primary => 'gene',
    -seq_id  => $asmbl_id,  ## this works but is discouraged
    -tag     => {
        ID           => $transcript->{pub_locus},
        product_name => $transcript->{com_name},
        ec_number    => $transcript->{'ec#'},
        gene_symbol  => $transcript->{gene_sym},
    },
);
I hoped that passing it via the -tag option would do the trick, but it
created a line like this instead:
10263 . gene 58512 56983 . + . iD=AN9220.4;
Notice the 'ID' -> 'iD' transformation (without any command-line warnings).
I'm still stuck on this one (Parent would be next) but overall guidance or
pointers to a tutorial/documentation I'm overlooking would be great.
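
For comparison, the hand-rolled alternative mentioned at the top (printing
the columns myself) would look roughly like the sketch below. It uses no
Bioperl at all: format_gff3_line and its argument names are made up for
illustration, and the column order is the one given in the GFF3
specification (seqid, source, type, start, end, score, strand, phase,
attributes). It is only meant to show the line I am aiming for, not how
Bioperl does or should produce it.

```perl
use strict;
use warnings;

# Hypothetical helper (NOT part of Bioperl): build one GFF3 line by hand.
sub format_gff3_line {
    my (%f) = @_;

    # GFF3 expects start <= end; orientation lives in the strand column.
    my ($start, $end) = sort { $a <=> $b } @f{qw(start end)};

    # Column 9: tag=value pairs joined by ';', with ID listed first.
    my %attr = %{ $f{attributes} || {} };
    my @tags = (
        (grep { $_ eq 'ID' } keys %attr),
        (sort grep { $_ ne 'ID' } keys %attr),
    );

    my @pairs;
    for my $tag (@tags) {
        # Skip attributes that have no value at all.
        next unless defined $attr{$tag} && length $attr{$tag};
        # Percent-encode characters that are reserved in column 9.
        (my $value = $attr{$tag}) =~ s/([;=%&,\t\n\r])/sprintf('%%%02X', ord $1)/ge;
        push @pairs, "$tag=$value";
    }

    # Undefined optional columns are written as '.'.
    return join("\t",
        $f{seq_id},
        $f{source} // '.',
        $f{type},
        $start,
        $end,
        $f{score}  // '.',
        $f{strand} // '.',
        $f{phase}  // '.',
        @pairs ? join(';', @pairs) : '.',
    );
}

print format_gff3_line(
    seq_id     => '10263',
    type       => 'gene',
    start      => 58512,
    end        => 56983,
    strand     => '+',
    attributes => { ID => 'AN9220.4', gene_symbol => undef },
), "\n";
```

With the values from my example line, this prints a single tab-separated
line whose last column is ID=AN9220.4 (upper-case tag preserved) and whose
coordinates are reordered so that start comes before end.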
JO