extractfeat with gff files?
David Mathog
mathog at mendel.bio.caltech.edu
Wed May 21 22:04:02 UTC 2003
EMBOSS 2.6.0
I cannot seem to locate the magic incantation that will make gff
files work as desired with fasta files. HELP!
Here's the sort of line I want to extract (sorry about the wrap):
X gadfly translation 1880 3119 . - .
genegrp=CG3038; transgrp=CG3038-RB
There are many other lines in the gff for transcription,exon,
gene, etc. which should not be extracted. The fasta
input file currently has entries with names like X,2L,etc.
which correspond to the first column of the gff file.
Ideally I'd like to be able to use one gff file
(with X->3R in the first column) to extract
from one fasta file (again, with X->3R for the fasta name),
and have the descriptions of X act only on the sequence X,
and so forth. The idea being to be able to extract features
on a genomic level using only one fasta/gff pair, rather than
N (=#of scaffolds) pairs.
First though I tried an input fasta file containing
(11 X 10kb entries, the first being X) and a gff file also starting
only with X (but with references for the whole chromosome, 22101
lines). The following command sat for about two minutes, burned
a lot of CPU time, but emitted nothing:
extractfeat -sequence=dmel_genome_frag.nfa\
-ufo=x.gff -type=translation -outseq=x.nfa
When the -type qualifier was removed it went nuts and emitted
over 40000 entries (more than there were lines in the gff file!)
before I killed it. Clearly there was no error checking for
size of gff entry versus size of sequence. The input fasta
file had 11 entries of 10000 bp each. The first was X. Yet
a bunch of lines like:
>X_12390_12854 [exon] X release:3 length:21780003bp Assembled X
chromosome arm sequence md5:f3fbbb4c44f0d30d1effeecc87b5bd18
T
were emitted. So the fasta file was reduced to just one entry
(X, 10kb) and this time the output fasta file held 22101 entries.
As before, those beyond 10kb were emitted with a single base.
So apparently the entire gff description is applied to each fasta
sequence and there's no checking of the first column against
the sequence name. That's ok - we can live with that for now,
but it would be better if the descriptions could automatically
matched to the sequence names.
I'm not sure though that we can live with it emitting single
bp sequences when the description is outside of the sequence.
If the feature is beyond the end of the input sequence it just
isn't there, right?
Just to spite me "translation" was never emitted.
There were only lines for gene,exon.misc_feature,tRNA,snoRNA.
So I tried:
extractfeat -sequence=dmel_x.nfa \
-ufo=x.gff -outseq=x.nfa -type=gene
And it emitted a single whole gene match at (1488,3280,-) correctly.
The next one at (3445,11463,+) partially (and correctly, ending
at the end of the sequence - a warning would have been nice)
and then a slew of (>2000) single base pair "empty" entries
outside of the input sequence. Note also that there's no indication
on the fasta header line in the output of the strand which
was selected.
So, how does one get extractfeat to emit only matches
to "translation"? Please tell me there's some way other
than by extracting those lines into a separate
gff file and renaming them all "gene"!
Extractfeat seems to have a predefined set of "features" that it's
willing to work with and doesn't handle others well. To
narrow this down a bit more I made a small gff file containing
"fred" where "gene" had been and specifying positions <10kb.
The features were emitted but all were labeled "misc_feature".
Is this documented somewhere? It isn't in an obvious
place in the on line help, as both of these searches come up
empty.
extractfeat -h 2>&1 | grep -i misc
tfm extractfeat 2>&1 | grep -i misc
It would also be nice if there was some way to get column 9 from
the gff file onto the fasta header line somewhere. (It can
then be rearranged to suit later.) Currently even if one
has the gene names lined up with the gene entries in the gff
file the resulting fasta file just says "X_100_123 [gene]..."
without any of the comment info. You've got the sequence
but not the names of the genes. Very painful to work with if
the output is the coding sequences for an entire genome.
Is there a switch (or bug fix) that stops extractfeat
from emitting garbage single bp entries for descriptions
outside the sequence?
Thanks,
David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
More information about the EMBOSS
mailing list