<html>


<head>


<meta http-equiv="Content-Type" content="text/html; charset=utf-8">


</head>


<body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">


On Jun 2, 2015, at 5:49 AM, Peter Cock &lt;<a href="mailto:p.j.a.cock@googlemail.com" class="">p.j.a.cock@googlemail.com</a>&gt; wrote:<br class="">


<div>


<blockquote type="cite" class=""><br class="Apple-interchange-newline">


<div class="">On Tue, Jun 2, 2015 at 11:32 AM, Peter Cock &lt;<a href="mailto:p.j.a.cock@googlemail.com" class="">p.j.a.cock@googlemail.com</a>&gt; wrote:<br class="">


<blockquote type="cite" class="">On Tue, Jun 2, 2015 at 11:11 AM, &nbsp;&lt;<a href="mailto:Atteyet-Alla.Yassin@ukb.uni-bonn.de" class="">Atteyet-Alla.Yassin@ukb.uni-bonn.de</a>&gt; wrote:<br class="">


<blockquote type="cite" class="">I would like to convert a GFF file (which I received on converting a<br class="">


sequence in GenBank format using BioPerl) into a table, e.g. like the following<br class="">


one:<br class="">


<br class="">


Seqname Source feature Start End Score Strand Frame Attributes<br class="">


chr1 hg19_gold exon 67088326 67183780 0,000000 &#43; . gene_id &quot;AL139147.7&quot;;<br class="">


transcript_id &quot;AL139147.7&quot;<br class="">


<br class="">


In my gff file you will observe the following :<br class="">


<br class="">


Lines are doubled, i.e. repeated, e.g.<br class="">


<br class="">


<br class="">


CP008802 &nbsp;&nbsp;&nbsp;Genbank &nbsp;&nbsp;&nbsp;gene &nbsp;&nbsp;&nbsp;417 &nbsp;&nbsp;&nbsp;638 &nbsp;&nbsp;&nbsp;. &nbsp;&nbsp;&nbsp;&#43; &nbsp;&nbsp;&nbsp;. &nbsp;&nbsp;&nbsp;ID=FB03_00010<br class="">


CP008802 &nbsp;&nbsp;&nbsp;Genbank &nbsp;&nbsp;&nbsp;CDS &nbsp;&nbsp;&nbsp;417 &nbsp;&nbsp;&nbsp;638 &nbsp;&nbsp;&nbsp;. &nbsp;&nbsp;&nbsp;&#43; &nbsp;&nbsp;&nbsp;.<br class="">


Parent=FB03_00010.t00;db_xref=EnsemblGenomes-Gn%3AFB03_00010,EnsemblGenomes-Tr%3AAIE81925,UniProtKB%2FTrEMBL%3AA0A068NGQ6;codon_start=1;inference=COORDINATES%3Aab%20initio%20prediction%3AGeneMarkS%2B;product=hypothetical%20protein;translation=MAKRKKKDRGGVLTWVGIFAIVLASIADFVLFFFDNGSRYILYTLPLWFLGIGCFAWLGRAEERRNNTKRTGN;transl_table=11;note=Derived%20by%20automated%20computational%20analysis%20using%20gene%20prediction%20method%3A%20GeneMarkS%2B.;protein_id=AIE81925.1<br class="">


<br class="">


<br class="">


</blockquote>


<br class="">


I assume this is a continuation of your past email, i.e.<br class="">


<a href="http://lists.open-bio.org/pipermail/biopython/2015-May/015641.html" class="">http://lists.open-bio.org/pipermail/biopython/2015-May/015641.html</a><br class="">


<br class="">


You posted the full GFF file then:<br class="">


http://mailman.open-bio.org/pipermail/biopython/attachments/20150530/dd32ee7e/attachment-0001.obj<br class="">


<br class="">


Note that these &quot;repeated&quot; GFF lines are normal - you have a line<br class="">


describing a &quot;gene&quot; at 417..638, and a matching &quot;CDS&quot; at 417..638.<br class="">


In the original GenBank file there would also have been two entries<br class="">


for the &quot;gene&quot; and &quot;CDS&quot;.<br class="">


<br class="">


So, given this example gene/CDS, what would you like to have<br class="">


in the output file? Maybe something like this?<br class="">


<br class="">


Seqname Source feature Start End Score Strand Frame Attributes<br class="">


CP008802 Genbank gene 417 638 0,000000 &#43; . gene_id &quot;FB03_00010&quot;;<br class="">


transcript_id &quot;FB03_00010&quot;<br class="">


<br class="">


Peter<br class="">


</blockquote>


<br class="">


You've not explained this file format, so I am guessing here<br class="">


(e.g. should start/end be counting from one, should the strand<br class="">


be just plus or minus, should feature be of type &quot;gene&quot;?).<br class="">


<br class="">


I would work from the original GenBank file rather than a<br class="">


conversion to GFF which may introduce additional problems.<br class="">


There's an example at the end of this email - but note this<br class="">


does not handle complex locations like FB03_00005 which<br class="">


appears to span the origin.<br class="">


<br class="">


Peter</div>


</blockquote>


</div>


<div class=""><br class="">


</div>


<div class="">Atteyet-Alla,</div>


<div class=""><br class="">


</div>


<div class="">My guess: this is using one of the various genbank-to-GFF scripts in BioPerl? &nbsp;Most of those are designed to work with RefSeq data, where the features are halfway consistent. &nbsp;Also, I believe features spanning the origin are supported, but this depends on which version of BioPerl you are using for the conversion (support was added in the last few releases, I believe).</div>


<div class=""><br class="">


</div>


<div class="">Depending on the script and settings you use, they do a fairly decent job, but in many cases the output needs some tweaks to get it where you want it. &nbsp;Frankly, they have been superseded by using the NCBI GFF3 data directly.</div>


<div class=""><br class="">


</div>


<div class="">Speaking of which, is there any reason you aren’t simply using the NCBI GFF3 and bypassing GenBank altogether? &nbsp;They have been working pretty hard to make their output GFF3-compliant, and last I checked it works with most genome browsers and parsers:</div>


<div class=""><br class="">


</div>


<div class="">GenBank:&nbsp;<a href="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000724605.1_ASM72460v1/GCA_000724605.1_ASM72460v1_genomic.gff.gz" class="">ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000724605.1_ASM72460v1/GCA_000724605.1_ASM72460v1_genomic.gff.gz</a></div>


<div class="">RefSeq:&nbsp;<a href="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000724605.1_ASM72460v1/GCF_000724605.1_ASM72460v1_genomic.gff.gz" class="">ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000724605.1_ASM72460v1/GCF_000724605.1_ASM72460v1_genomic.gff.gz</a></div>


<div class=""><br class="">


</div>


<div class="">(not sure about Biopython support here, but I would be really surprised if there are problems)</div>


<div class=""><br class="">


</div>
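<div class="">For what it's worth, here is a minimal sketch of turning GFF3 lines into the kind of tab-separated table asked about in the quoted message, using only the Python standard library (no Biopython or BioPerl). The function names parse_attributes and gff3_to_rows are hypothetical, and the gene_id/transcript_id layout of the last column is only a guess based on the example table quoted above, not a defined format:</div>

```python
import io
from urllib.parse import unquote

HEADER = ["Seqname", "Source", "Feature", "Start", "End",
          "Score", "Strand", "Frame", "Attributes"]


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of percent-decoded values."""
    attributes = {}
    for pair in column9.strip().split(";"):
        if "=" not in pair:
            continue  # skip empty or malformed entries
        key, value = pair.split("=", 1)
        attributes[key] = unquote(value)  # e.g. %20 becomes a space
    return attributes


def gff3_to_rows(handle, feature_types=("gene",)):
    """Yield one nine-column row per feature whose type is in feature_types."""
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # comments, directives and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] not in feature_types:
            continue  # not a feature line, or not a wanted feature type
        gene_id = parse_attributes(fields[8]).get("ID", "")
        # Assumption: the requested table repeats the ID in both slots.
        fields[8] = 'gene_id "%s"; transcript_id "%s"' % (gene_id, gene_id)
        yield fields


# Two feature lines modelled on the gene/CDS pair quoted above.
example = (
    "##gff-version 3\n"
    "CP008802\tGenbank\tgene\t417\t638\t.\t+\t.\tID=FB03_00010\n"
    "CP008802\tGenbank\tCDS\t417\t638\t.\t+\t.\t"
    "Parent=FB03_00010.t00;product=hypothetical%20protein\n"
)

rows = list(gff3_to_rows(io.StringIO(example)))
print("\t".join(HEADER))
for row in rows:
    print("\t".join(row))
```

<div class="">Only the "gene" line is kept by default, which drops the matching "CDS" line at the same coordinates; passing feature_types=("gene", "CDS") keeps both. &nbsp;For the gzipped NCBI files, the handle could come from gzip.open(path, "rt") instead of io.StringIO. &nbsp;This does not attempt anything special for features that span the origin.</div>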


<div class="">I personally find GenBank to be a legacy format, useful for human readability but little more, and a huge pain to deal with from the parsing end due to the lack of a true specification (no, the ‘Sample GenBank file’ at NCBI doesn’t count in my book when they change it at will).</div>


<div class=""><br class="">


</div>


<div class="">Apologies for the snarkiness, need coffee :)</div>


<div class=""><br class="">


</div>


<div class="">chris</div>


<div class=""><br class="">


</div>


</body>


</html>