[Biopython] gff3 file

Tue Jun 2 10:49:41 UTC 2015

On Tue, Jun 2, 2015 at 11:32 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Tue, Jun 2, 2015 at 11:11 AM,  <Atteyet-Alla.Yassin at ukb.uni-bonn.de> wrote:
>> I would like to convert a gff file (which I received on converting a
>> sequence in Genbank format using bioperl) into a table, e.g. like the following
>> one:
>>
>> Seqname Source feature Start End Score Strand Frame Attributes
>> chr1 hg19_gold exon 67088326 67183780 0,000000 + . gene_id "AL139147.7";
>> transcript_id "AL139147.7"
>>
>> In my gff file you will observe the following:
>>
>> Lines are doubled, i.e. repeated, e.g.
>>
>>
>> CP008802    Genbank    gene    417    638    .    +    .    ID=FB03_00010
>> CP008802    Genbank    CDS    417    638    .    +    .
>> Parent=FB03_00010.t00;db_xref=EnsemblGenomes-Gn%3AFB03_00010,EnsemblGenomes-Tr%3AAIE81925,UniProtKB%2FTrEMBL%3AA0A068NGQ6;codon_start=1;inference=COORDINATES%3Aab%20initio%20prediction%3AGeneMarkS%2B;product=hypothetical%20protein;translation=MAKRKKKDRGGVLTWVGIFAIVLASIADFVLFFFDNGSRYILYTLPLWFLGIGCFAWLGRAEERRNNTKRTGN;transl_table=11;note=Derived%20by%20automated%20computational%20analysis%20using%20gene%20prediction%20method%3A%20GeneMarkS%2B.;protein_id=AIE81925.1
>>
>>
>
> I assume this is a continuation of your past email, i.e.
> http://lists.open-bio.org/pipermail/biopython/2015-May/015641.html
>
> You posted the full GFF file then:
> http://mailman.open-bio.org/pipermail/biopython/attachments/20150530/dd32ee7e/attachment-0001.obj
>
> Note that these "repeated" GFF lines are normal - you have a line
> describing a "gene" at 417..638, and a matching "CDS" at 417..638.
> In the original GenBank file there would also have been two entries
> for the "gene" and "CDS".
>
> So, given this example gene/CDS, what would you like to have
> in the output file? Maybe something like this?
>
> Seqname Source feature Start End Score Strand Frame Attributes
> CP008802 Genbank gene 417 638 0,000000 + . gene_id "FB03_00010";
> transcript_id "FB03_00010"
>
> Peter

You've not explained this file format, so I am guessing here
(e.g. should start/end be counting from one, should the strand
be just plus or minus, should feature be of type "gene"?).
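
To make those guesses concrete, here is a minimal sketch of the conventions
I am assuming. Biopython stores locations Python style (counting from zero,
end exclusive), whereas GFF-like tables normally count from one with an
inclusive end and write the strand as + or -. The helper name
to_table_coords is made up for illustration, and using "." for an unknown
strand is my assumption:

```python
def to_table_coords(start, end, strand):
    """Convert a zero-based, end-exclusive location to one-based inclusive.

    start, end -- integers as Biopython reports them,
                  e.g. int(f.location.start) and int(f.location.end)
    strand     -- +1, -1, or None as Biopython reports it
    Returns (start, end, symbol) in the form assumed for the output table.
    """
    symbol = {1: "+", -1: "-"}.get(strand, ".")  # "." if strand is unknown
    return start + 1, end, symbol


# GenBank's 417..638 is held by Biopython as start=416, end=638
print(to_table_coords(416, 638, 1))  # -> (417, 638, '+')
```

Only the start needs adjusting, because an exclusive zero-based end and an
inclusive one-based end are the same number.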

I would work from the original GenBank file rather than a
conversion to GFF which may introduce additional problems.
There's an example at the end of this email - but note this
does not handle complex locations like FB03_00005 which
appears to span the origin.

Peter

from Bio import SeqIO

# Map Biopython's strand values (+1, -1) to the +/- symbols used in the table.
STRAND = {1: "+", -1: "-"}

with open("CP008802.txt", "w") as output:
    output.write("Seqname\tSource\tfeature\tStart\tEnd\tScore\tStrand\tFrame\tAttributes\n")
    for record in SeqIO.parse("CP008802.gbk", "genbank"):
        print("Converting %s" % record.name)
        for f in record.features:
            # Only the "gene" features are wanted, skip CDS etc.
            if f.type != "gene":
                continue
            locus_tag = f.qualifiers["locus_tag"][0]
            if len(f.location.parts) > 1:
                print("What should we do for %s (compound location)? %s"
                      % (locus_tag, f.location))
                continue
            # Biopython counts from zero with an exclusive end, so add one
            # to the start to get one-based inclusive coordinates.
            output.write('%s\tGenBank\t%s\t%i\t%i\t0,000000\t%s\t.\tgene_id "%s"; transcript_id "%s"\n'
                         % (record.name, f.type,
                            f.location.start + 1, f.location.end,
                            STRAND.get(f.location.strand, "."),
                            locus_tag, locus_tag))
print("Done")
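
One possible answer for the compound locations skipped above would be to
write one row per part. This is only a guess at what you might want. The
sketch below uses plain (start, end, strand) tuples so that it runs without
Biopython; with a real feature the list could be built from
f.location.parts as shown in the docstring. The function name
rows_for_parts, the identifiers and the coordinates are all made up:

```python
def rows_for_parts(seqname, feature_type, locus_tag, parts):
    """Return one tab separated table row per location part.

    parts -- list of (start, end, strand) tuples in Biopython's
             zero-based, end-exclusive coordinates, e.g. built with
             [(int(p.start), int(p.end), p.strand) for p in f.location.parts]
    """
    strand_symbol = {1: "+", -1: "-"}
    rows = []
    for start, end, strand in parts:
        rows.append("\t".join([
            seqname, "GenBank", feature_type,
            str(start + 1), str(end),  # one-based, inclusive
            "0,000000", strand_symbol.get(strand, "."), ".",
            'gene_id "%s"; transcript_id "%s"' % (locus_tag, locus_tag),
        ]))
    return rows


# Made-up gene spanning the origin of a hypothetical 1000 bp circular record
for row in rows_for_parts("example", "gene", "EXAMPLE_0001",
                          [(899, 1000, 1), (0, 120, 1)]):
    print(row)
```

Whether two rows sharing one gene_id are acceptable depends on whatever
will read the table, so please check that before relying on this.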