[Biopython] bcbio-gff
Mic
mictadlo at gmail.com
Mon Sep 2 07:55:23 UTC 2019
Hi all,
*The below script parsed BLAST XML output and read a GFF3 file which
contains an annotation. If mRNA's ID is in the BLAST output the script will
add 'Note=Gene description' to mRNA feature.*
*#!/usr/bin/python3import clickfrom Bio.Blast import NCBIXMLfrom BCBio
import GFFfrom pprint import pprintclass Hits(): def __init__(self,
hit_id, hit_def): self.hit_id = hit_id self.hit_def =
hit_defdef retrieve_hits_data(blast_XML_file): with open(blast_XML_file)
as bf: blast_records = NCBIXML.parse(bf) hits_all = {}
for blast_record in blast_records: query_name =
blast_record.query try: hit_id =
blast_record.alignments[0].hit_id.split('|')[1] hit_def =
blast_record.alignments[0].hit_def.split("OS=")[0]
hits_all[query_name] = Hits(hit_id,hit_def) except IndexError:
continue return
hits_all at click.command()@click.option('--gff3', help="Provide GFF3 file",
required=True)@click.option('--keep', help="Keep GFF3 file",
required=True)@click.option('--reject', help="Reject GFF3 file",
required=True)@click.option('--xml', help="Blast XML file",
required=True)def run(gff3, keep, reject, xml): blast_hits =
retrieve_hits_data(xml) # pprint(blast_hits) with open(keep, "w") as
out_keep: with open(reject, "w") as out_reject: for rec
in GFF.parse(gff3): for feature in rec.features:
for counter, sub_feature in enumerate(feature.sub_features):
print(sub_feature.id <http://sub_feature.id>)
mrna_id = sub_feature.id <http://sub_feature.id>
try: blast_hit = blast_hits[mrna_id]
# feature.qualifiers["Name"] = blast_hit.hit_id
sub_feature.qualifiers["Note"] = blast_hit.hit_def
print("accepted" + mrna_id + blast_hit.hit_def)
GFF.write([rec], out_keep)
except KeyError: print("rejected" + mrna_id)
GFF.write([rec], out_reject)
continue*
*if __name__ == '__main__': run()*
*I failed to extend the above script to only keep mRNAs which have a BLAST
hit and the rejected mRNA's features to write to a different file. Below is
an example of a GFF3 file and I would like to reject g1 and g3 because mRNA
does not have any BLAST hit. The only g2 I would like to keep. What did I
miss in the above code?*
*##gff-version 3##sequence-region NbV1Ch08 1 129222376NbV1Ch08
AUGUSTUS gene 7015 29794 0.01 - .
ID=g1NbV1Ch08 AUGUSTUS mRNA 7015 29794 0.01 -
. ID=g1.t1;Parent=g1NbV1Ch08 AUGUSTUS
transcription_end_site 7015 7015 . - .
Parent=g1.t1NbV1Ch08 AUGUSTUS three_prime_utr 7015 8531
0.2 - . ID=g1.t1.3UTR1;Parent=g1.t1NbV1Ch08
AUGUSTUS exon 7015 8747 . - .
ID=g1.t1.exon1;Parent=g1.t1NbV1Ch08 AUGUSTUS stop_codon
8532 8534 . - 0 Parent=g1.t1NbV1Ch08
AUGUSTUS CDS 8532 8747 0.31 - 0
ID=g1.t1.CDS1;Parent=g1.t1NbV1Ch08 AUGUSTUS intron 8748
9191 0.49 - . Parent=g1.t1NbV1Ch08 AUGUSTUS
CDS 9192 9342 0.66 - 1
ID=g1.t1.CDS2;Parent=g1.t1NbV1Ch08 AUGUSTUS exon 9192
9342 . - . ID=g1.t1.exon2;Parent=g1.t1NbV1Ch08
AUGUSTUS intron 9343 9915 0.58 - .
Parent=g1.t1NbV1Ch08 AUGUSTUS CDS 9916 10006 0.71
- 2 ID=g1.t1.CDS3;Parent=g1.t1NbV1Ch08 AUGUSTUS
exon 9916 10006 . - .
ID=g1.t1.exon3;Parent=g1.t1NbV1Ch08 AUGUSTUS intron 10007
10101 0.74 - . Parent=g1.t1NbV1Ch08 AUGUSTUS
CDS 10102 10201 0.78 - 0
ID=g1.t1.CDS4;Parent=g1.t1NbV1Ch08 AUGUSTUS exon 10102
10201 . - . ID=g1.t1.exon4;Parent=g1.t1NbV1Ch08
AUGUSTUS intron 10202 10712 0.8 - .
Parent=g1.t1NbV1Ch08 AUGUSTUS CDS 10713 11107 0.11
- 2 ID=g1.t1.CDS5;Parent=g1.t1NbV1Ch08 AUGUSTUS
exon 10713 11107 . - .
ID=g1.t1.exon5;Parent=g1.t1NbV1Ch08 AUGUSTUS intron 11108
11569 0.07 - . Parent=g1.t1NbV1Ch08 AUGUSTUS
CDS 11570 12151 0.09 - 2
ID=g1.t1.CDS6;Parent=g1.t1NbV1Ch08 AUGUSTUS exon 11570
12151 . - . ID=g1.t1.exon6;Parent=g1.t1NbV1Ch08
AUGUSTUS intron 12152 12588 0.34 - .
Parent=g1.t1NbV1Ch08 AUGUSTUS CDS 12589 12717 0.39
- 2 ID=g1.t1.CDS7;Parent=g1.t1NbV1Ch08 AUGUSTUS
exon 12589 12717 . - .
ID=g1.t1.exon7;Parent=g1.t1NbV1Ch08 AUGUSTUS intron 12718
12789 0.42 - . Parent=g1.t1NbV1Ch08 AUGUSTUS
CDS 12790 13075 0.39 - 0
ID=g1.t1.CDS8;Parent=g1.t1NbV1Ch08 AUGUSTUS exon 12790
13075 . - . ID=g1.t1.exon8;Parent=g1.t1NbV1Ch08
AUGUSTUS intron 13076 14832 0.51 - .
Parent=g1.t1NbV1Ch08 AUGUSTUS CDS 14833 15009 0.39
- 0 ID=g1.t1.CDS9;Parent=g1.t1NbV1Ch08 AUGUSTUS
exon 14833 15009 . - .
ID=g1.t1.exon9;Parent=g1.t1NbV1Ch08 AUGUSTUS intron 15010
15278 0.59 - . Parent=g1.t1NbV1Ch08 AUGUSTUS
CDS 15279 15415 0.56 - 2
ID=g1.t1.CDS10;Parent=g1.t1NbV1Ch08 AUGUSTUS exon 15279
15415 . - . ID=g1.t1.exon10;Parent=g1.t1NbV1Ch08
AUGUSTUS intron 15416 15487 0.58 - .
Parent=g1.t1NbV1Ch08 AUGUSTUS CDS 15488 15612 0.96
- 1 ID=g1.t1.CDS11;Parent=g1.t1NbV1Ch08 AUGUSTUS
exon 15488 15612 . - .
ID=g1.t1.exon11;Parent=g1.t1NbV1Ch08 AUGUSTUS intron 15613
15706 0.96 - . Parent=g1.t1NbV1Ch08 AUGUSTUS
CDS 15707 15957 0.98 - 0
ID=g1.t1.CDS12;Parent=g1.t1NbV1Ch08 AUGUSTUS exon 15707
15958 . - . ID=g1.t1.exon12;Parent=g1.t1NbV1Ch08
AUGUSTUS start_codon 15955 15957 . - 0
Parent=g1.t1NbV1Ch08 AUGUSTUS five_prime_utr 15958 15958
0.99 - . ID=g1.t1.5UTR1;Parent=g1.t1NbV1Ch08 AUGUSTUS
five_prime_utr 27458 28250 0.37 - .
ID=g1.t1.5UTR2;Parent=g1.t1NbV1Ch08 AUGUSTUS exon 27458
28250 . - . ID=g1.t1.exon13;Parent=g1.t1NbV1Ch08
AUGUSTUS five_prime_utr 29272 29794 0.08 - .
ID=g1.t1.5UTR3;Parent=g1.t1NbV1Ch08 AUGUSTUS exon 29272
29794 . - . ID=g1.t1.exon14;Parent=g1.t1NbV1Ch08
AUGUSTUS transcription_start_site 29794 29794 . -
. Parent=g1.t1*
*NbV1Ch08 AUGUSTUS gene 60876 63944 0.03 + .
ID=g2NbV1Ch08 AUGUSTUS mRNA 60876 63944 0.03
+ . ID=g2.t1;Note=B3 domain-containing protein
Os03g0120900;Parent=g2NbV1Ch08 AUGUSTUS
transcription_start_site 60876 60876 . + .
Parent=g2.t1NbV1Ch08 AUGUSTUS five_prime_utr 60876 61072
0.19 + . ID=g2.t1.5UTR1;Parent=g2.t1NbV1Ch08 AUGUSTUS
exon 60876 61072 . + .
ID=g2.t1.exon1;Parent=g2.t1NbV1Ch08 AUGUSTUS five_prime_utr
61673 61732 0.37 + .
ID=g2.t1.5UTR2;Parent=g2.t1NbV1Ch08 AUGUSTUS exon 61673
63449 . + . ID=g2.t1.exon2;Parent=g2.t1NbV1Ch08
AUGUSTUS start_codon 61733 61735 . + 0
Parent=g2.t1NbV1Ch08 AUGUSTUS CDS 61733 62974 0.54
+ 0 ID=g2.t1.CDS1;Parent=g2.t1NbV1Ch08 AUGUSTUS
stop_codon 62972 62974 . + 0
Parent=g2.t1NbV1Ch08 AUGUSTUS three_prime_utr 62975 63449
1 + . ID=g2.t1.3UTR1;Parent=g2.t1NbV1Ch08 AUGUSTUS
three_prime_utr 63565 63944 0.27 + .
ID=g2.t1.3UTR2;Parent=g2.t1NbV1Ch08 AUGUSTUS exon 63565
63944 . + . ID=g2.t1.exon3;Parent=g2.t1NbV1Ch08
AUGUSTUS transcription_end_site 63944 63944 . + .
Parent=g2.t1NbV1Ch08 AUGUSTUS gene 64722 65524
0.32 - . ID=g3NbV1Ch08 AUGUSTUS mRNA 64722
65524 0.32 - . ID=g3.t1;Parent=g3NbV1Ch08
AUGUSTUS transcription_end_site 64722 64722 . - .
Parent=g3.t1NbV1Ch08 AUGUSTUS three_prime_utr 64722
64792 0.77 - . ID=g3.t1.3UTR1;Parent=g3.t1NbV1Ch08
AUGUSTUS exon 64722 65524 . - .
ID=g3.t1.exon1;Parent=g3.t1NbV1Ch08 AUGUSTUS stop_codon
64793 64795 . - 0 Parent=g3.t1NbV1Ch08
AUGUSTUS CDS 64793 65494 0.44 - 0
ID=g3.t1.CDS1;Parent=g3.t1NbV1Ch08 AUGUSTUS start_codon
65492 65494 . - 0 Parent=g3.t1*
*Thank you in advance,*
*Best wishes,*
*Michal*
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.open-bio.org/pipermail/biopython/attachments/20190902/60031eea/attachment.htm>
More information about the Biopython
mailing list