<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">
On Jun 2, 2015, at 5:49 AM, Peter Cock &lt;<a href="mailto:p.j.a.cock@googlemail.com" class="">p.j.a.cock@googlemail.com</a>&gt; wrote:<br class="">
<div>
<blockquote type="cite" class=""><br class="Apple-interchange-newline">
<div class="">On Tue, Jun 2, 2015 at 11:32 AM, Peter Cock &lt;<a href="mailto:p.j.a.cock@googlemail.com" class="">p.j.a.cock@googlemail.com</a>&gt; wrote:<br class="">
<blockquote type="cite" class="">On Tue, Jun 2, 2015 at 11:11 AM, &lt;<a href="mailto:Atteyet-Alla.Yassin@ukb.uni-bonn.de" class="">Atteyet-Alla.Yassin@ukb.uni-bonn.de</a>&gt; wrote:<br class="">
<blockquote type="cite" class="">I would like to convert a GFF file (which I received on converting a<br class="">
sequence in GenBank format using BioPerl) into a table, e.g. like the following<br class="">
one:<br class="">
<br class="">
Seqname Source feature Start End Score Strand Frame Attributes<br class="">
chr1 hg19_gold exon 67088326 67183780 0,000000 + . gene_id "AL139147.7";<br class="">
transcript_id "AL139147.7"<br class="">
<br class="">
In my GFF file you will observe the following:<br class="">
<br class="">
Lines are doubled, i.e. repeated, e.g.<br class="">
<br class="">
<br class="">
CP008802 Genbank gene 417 638 . + . ID=FB03_00010<br class="">
CP008802 Genbank CDS 417 638 . + .<br class="">
Parent=FB03_00010.t00;db_xref=EnsemblGenomes-Gn%3AFB03_00010,EnsemblGenomes-Tr%3AAIE81925,UniProtKB%2FTrEMBL%3AA0A068NGQ6;codon_start=1;inference=COORDINATES%3Aab%20initio%20prediction%3AGeneMarkS%2B;product=hypothetical%20protein;translation=MAKRKKKDRGGVLTWVGIFAIVLASIADFVLFFFDNGSRYILYTLPLWFLGIGCFAWLGRAEERRNNTKRTGN;transl_table=11;note=Derived%20by%20automated%20computational%20analysis%20using%20gene%20prediction%20method%3A%20GeneMarkS%2B.;protein_id=AIE81925.1<br class="">
<br class="">
<br class="">
</blockquote>
<br class="">
I assume this is a continuation of your past email, i.e.<br class="">
<a href="http://lists.open-bio.org/pipermail/biopython/2015-May/015641.html" class="">http://lists.open-bio.org/pipermail/biopython/2015-May/015641.html</a><br class="">
<br class="">
You posted the full GFF file then:<br class="">
http://mailman.open-bio.org/pipermail/biopython/attachments/20150530/dd32ee7e/attachment-0001.obj<br class="">
<br class="">
Note that these "repeated" GFF lines are normal - you have a line<br class="">
describing a "gene" at 417..638, and a matching "CDS" at 417..638.<br class="">
In the original GenBank file there would also have been two entries<br class="">
for the "gene" and "CDS".<br class="">
<br class="">
So, given this example gene/CDS, what would you like to have<br class="">
in the output file? Maybe something like this?<br class="">
<br class="">
Seqname Source feature Start End Score Strand Frame Attributes<br class="">
CP008802 Genbank gene 417 638 0,000000 + . gene_id "FB03_00010";<br class="">
transcript_id "FB03_00010"<br class="">
<br class="">
Peter<br class="">
</blockquote>
<br class="">
You've not explained this file format, so I am guessing here<br class="">
(e.g. should start/end be counting from one, should the frame<br class="">
be just plus or minus, should feature be of type "gene"?).<br class="">
<br class="">
I would work from the original GenBank file rather than a<br class="">
conversion to GFF which may introduce additional problems.<br class="">
There's an example at the end of this email - but note this<br class="">
does not handle complex locations like FB03_00005 which<br class="">
appears to span the origin.<br class="">
<br class="">
Peter</div>
</blockquote>
</div>
<div class=""><br class="">
</div>
<div class="">Atteyet-Alla,</div>
<div class=""><br class="">
</div>
<div class="">My guess is that this is using one of the various GenBank-to-GFF scripts in BioPerl. Most of those are designed to work with RefSeq data, where the features are halfway consistent. Also, I believe features spanning the origin are supported, but this depends
on which version of BioPerl you are using for the conversion (I believe support was added in the last few releases).</div>
<div class=""><br class="">
</div>
<div class="">Depending on the script you use and its settings, they do a fairly decent job, but in many cases the output needs some tweaks to get it where you want it. Frankly, they have been superseded by using the NCBI GFF3 data directly.</div>
<div class=""><br class="">
</div>
<div class="">Speaking of, is there any reason you aren’t simply using the NCBI GFF3 and bypassing GenBank altogether? They have been working pretty hard to make their output GFF3-compliant, and last I checked they work with most genome browsers and parsers:</div>
<div class=""><br class="">
</div>
<div class="">GenBank: <a href="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000724605.1_ASM72460v1/GCA_000724605.1_ASM72460v1_genomic.gff.gz" class="">ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000724605.1_ASM72460v1/GCA_000724605.1_ASM72460v1_genomic.gff.gz</a></div>
<div class="">RefSeq: <a href="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000724605.1_ASM72460v1/GCF_000724605.1_ASM72460v1_genomic.gff.gz" class="">ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000724605.1_ASM72460v1/GCF_000724605.1_ASM72460v1_genomic.gff.gz</a></div>
<div class=""><br class="">
</div>
<div class="">(not sure about Biopython support here, but I would be really surprised if there are problems)</div>
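<div class=""><br class="">
</div>
<div class="">As a rough illustration, a minimal Python sketch (standard library only, not Biopython) of turning GFF3 lines into the kind of tab-separated table asked for above might look like the following. The function names gff3_rows, parse_attributes and write_table are made up for this example. The two sample feature lines are copied from the thread, with the CDS attributes trimmed and tabs assumed as the column separator. The choice of output columns is also an assumption, since the target format was never fully specified.</div>
<pre class="">
```python
import csv
import io
import sys
from urllib.parse import unquote

# The first eight GFF columns; the ninth (attributes) is parsed separately.
GFF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame"]

# Sample data: the gene/CDS pair quoted in the thread. The CDS attribute
# list is trimmed here, and the version pragma on the first line is the
# standard GFF3 header rather than something taken from the thread.
EXAMPLE = (
    "##gff-version 3\n"
    "CP008802\tGenbank\tgene\t417\t638\t.\t+\t.\tID=FB03_00010\n"
    "CP008802\tGenbank\tCDS\t417\t638\t.\t+\t.\t"
    "Parent=FB03_00010.t00;product=hypothetical%20protein;"
    "protein_id=AIE81925.1\n"
)


def parse_attributes(text):
    """Split a 'key=value;key=value' column into a dict.

    Splitting on ';' happens before percent-decoding, so an escaped
    semicolon (%3B) inside a value is not mistaken for a separator.
    """
    attributes = {}
    for pair in text.strip().split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attributes[unquote(key)] = unquote(value)
    return attributes


def gff3_rows(handle, feature_type="gene"):
    """Yield one dict per feature line of the requested type."""
    for line in handle:
        if line.startswith("##FASTA"):
            break  # anything after this pragma is sequence, not features
        if line.startswith("#") or not line.strip():
            continue  # skip comments, pragmas and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != feature_type:
            continue
        row = dict(zip(GFF_COLUMNS, fields[:8]))
        row["attributes"] = parse_attributes(fields[8])
        yield row


def write_table(rows, out):
    """Write a header plus one tab-separated line per row."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(GFF_COLUMNS + ["ID"])
    for row in rows:
        values = [row[name] for name in GFF_COLUMNS]
        writer.writerow(values + [row["attributes"].get("ID", "")])


if __name__ == "__main__":
    write_table(gff3_rows(io.StringIO(EXAMPLE)), sys.stdout)
```
</pre>
<div class="">Filtering on a single feature type (here "gene" by default) is one way to avoid the apparently doubled gene/CDS lines in the table. Passing feature_type="CDS" instead would give access to the decoded product and protein_id attributes.</div>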
<div class=""><br class="">
</div>
<div class="">I personally find GenBank to be a legacy format, useful for human readability but little more, and a huge pain to deal with from the parsing end due to lack of a true specification (no, the ‘Sample GenBank file’ at NCBI doesn’t count in my book
when they change it at will).</div>
<div class=""><br class="">
</div>
<div class="">Apologies for the snarkiness, need coffee :)</div>
<div class=""><br class="">
</div>
<div class="">chris</div>
<div class=""><br class="">
</div>
</body>
</html>