[Bioperl-l] methods, etc. for Bio::SearchIO on exonerate output

Cook, Malcolm MEC at Stowers-Institute.org
Fri Sep 2 11:22:15 EDT 2005


Hmmmm - I'd better get some clarification from Guy too.  
 
Guy, if you don't mind reading the thread below and chiming in on our
discussion of interpreting the output of your excellent exonerate
program:
 
The sections of the manpage (
<http://www.ebi.ac.uk/~guy/exonerate/exonerate.man.1.html>
http://www.ebi.ac.uk/~guy/exonerate/exonerate.man.1.html) that appear
relevant are these 2 excerpts:
 
 1) When an alignment is reported on the reverse complement of a
sequence, the coordinates are simply given on the reverse complement
copy of the sequence. Hence positions on the sequences are never
negative. Generally, the forward strand is indicated by '+', the reverse
strand by '-', and an unknown or not-applicable strand (as in the case
of a protein sequence) is indicated by '.' "
 
2)  --forwardcoordinates <boolean> By default, all coordinates are
reported on the forward strand. Setting this option to false reverts to
the old behaviour (pre-0.8.3) whereby alignments on the reverse
complement of a sequence are reported using coordinates on the reverse
complement. 
 
We see GFF DUMP coordinates still reported on the reverse stand
regardless of the setting of --forwardcoordinates.  So these two
excerpts from you manpage seem contradictory to me.     Unless I
understand `--forwardcoordinates FALSE` to only effect the coordinates
reported in the alignment section, not in the GFF DUMP section, which is
what it appears to do in practice.
 
Guy, can you confirm that the --forwardcoordinates option has no effect
on GFF output?
 
Further, can you tell us if you plan to comport more closely to the GFF
spec, in particular in this case by reporting forwardcoordinates in the
GFF DUMP section too?   I see 
I see in your TODO list "    o Should GFF show all coordinates on the
+ve strand? (jason_p2g eg)".  Hear hear!  I second the motion.
 
And TODO item " GFF3 support ? http://song.sf.net/" gets my vote too....
though this is more of a sticky wicket....
 
Cheers and Thanks!
 
Malcolm Cook
 
 
-----Original Message-----
From: Jason Stajich [mailto:jason.stajich at duke.edu] 
Sent: Friday, September 02, 2005 9:46 AM
To: Cook, Malcolm
Cc: bioperl-l
Subject: Re: [Bioperl-l] methods, etc. for Bio::SearchIO on exonerate
output


I've already talked to Guy about some of this and I assume fixes will be
part of the next release, but it can't hurt to have more people
requesting.  The main problem right now is reverse strand hits in GFF
output are still screwed up even if you provide the --forwardcoordinates
option. 

If someone wanted to write/donate a VULGAR to GFF subroutine (okay
VULGAR to a list of Bio::Search::HSP::GenericHSP).  We can also
reconstruct everything needed from that, I gave a stab at it once, but
there was something missing (or maybe it was pre --forwardcoordinates
option).   


-jason 

On Sep 2, 2005, at 10:36 AM, Cook, Malcolm wrote:


Jason,
 
Thanks for the scripts and clues (esp re: using the --ryo option to
inject the needed length into the exonerate output to compensate).
 
I'm considering asking exonerate author to comport with GFF spec.  Do
you think this is a road to take?
 
Cheers,
 
Malcolm
 
-----Original Message-----
From: Jason Stajich [mailto:jason.stajich at duke.edu] 
Sent: Wednesday, August 31, 2005 12:35 PM
To: Cook, Malcolm
Cc: bioperl-l
Subject: Re: [Bioperl-l] methods, etc. for Bio::SearchIO on exonerate
output


http://fungal.genome.duke.edu/~jes12/software/scripts/process_exonerate_
gff3.pl

You may still want to massage it some, but I use the script in this
basic form, maybe with a few tweaks:

Note that it requires you to run exonerate with specific --ryo options
so that it includes the length of the query and hit sequences in the
report output. should be covered in the perldoc in the script.

Without the ryo options enabled,  you'll need to modify the script more
to have access to the original sequence db, use Bio::DB::Fasta,  and put
in some $dbh->length($seqid) calls instead.

I don't think the part which writes HSP/match lines is actually correct
- it is trying to roll gapped HSPs from the similarity features. 

I end up ignoring all but the 'exon' and 'gene' lines for my gbrowse
instance and/or grepping out the lines I really think I need.  
You may want to s/exon/CDS/ for the protein2genome output as well.

-jason

On Aug 31, 2005, at 1:04 PM, Cook, Malcolm wrote:


Jason, 

This message is in regards to an old thread  in which you offered to
shared a 'script for munging over' exonerate output for lading in
DB::GFF (c.f.
<http://bioperl.org/pipermail/bioperl-l/2005-April/018741.html>
http://bioperl.org/pipermail/bioperl-l/2005-April/018741.html)

Would you be willing to still share that script, if you've got it
around? 

Thanks, and regards, 

Malcolm Cook -  <mailto:mec at stowers-institute.org>
mec at stowers-institute.org - 816-926-4449
Database Applications Manager - Bioinformatics
Stowers Institute for Medical Research - Kansas City, MO  USA





--
Jason Stajich
Duke University
http://www.duke.edu/~jes12




--
Jason Stajich
Duke University
http://www.duke.edu/~jes12





More information about the Bioperl-l mailing list