[Biopython] bcbio-gff

Mic mictadlo at gmail.com
Mon Sep 2 07:55:23 UTC 2019


Hi all,
*The below script parsed BLAST XML output and read a GFF3 file which
contains an annotation. If mRNA's ID is in the BLAST output the script will
add 'Note=Gene description' to mRNA feature.*






















































*#!/usr/bin/python3import clickfrom Bio.Blast import NCBIXMLfrom BCBio
import GFFfrom pprint import pprintclass Hits():    def __init__(self,
hit_id, hit_def):        self.hit_id = hit_id        self.hit_def =
hit_defdef retrieve_hits_data(blast_XML_file):    with open(blast_XML_file)
as bf:        blast_records = NCBIXML.parse(bf)        hits_all = {}
for blast_record in blast_records:            query_name =
blast_record.query            try:                hit_id =
blast_record.alignments[0].hit_id.split('|')[1]                hit_def =
blast_record.alignments[0].hit_def.split("OS=")[0]
hits_all[query_name] = Hits(hit_id,hit_def)            except IndexError:
              continue    return
hits_all at click.command()@click.option('--gff3', help="Provide GFF3 file",
required=True)@click.option('--keep', help="Keep GFF3 file",
required=True)@click.option('--reject', help="Reject GFF3 file",
required=True)@click.option('--xml', help="Blast XML file",
required=True)def run(gff3, keep, reject, xml):    blast_hits =
retrieve_hits_data(xml)    # pprint(blast_hits)    with open(keep, "w") as
out_keep:        with open(reject, "w") as out_reject:            for rec
in GFF.parse(gff3):                for feature in rec.features:
        for counter, sub_feature in enumerate(feature.sub_features):
                print(sub_feature.id <http://sub_feature.id>)
          mrna_id = sub_feature.id <http://sub_feature.id>
      try:                            blast_hit = blast_hits[mrna_id]
                      # feature.qualifiers["Name"] = blast_hit.hit_id
                      sub_feature.qualifiers["Note"] = blast_hit.hit_def
                        print("accepted" + mrna_id + blast_hit.hit_def)
                        GFF.write([rec], out_keep)
except KeyError:                            print("rejected" + mrna_id)
                        GFF.write([rec], out_reject)
    continue*


*if __name__ == '__main__':    run()*

*I failed to extend the above script to only keep mRNAs which have a BLAST
hit and the rejected mRNA's features to write to a different file. Below is
an example of a GFF3 file and I would like to reject g1 and g3 because mRNA
does not have any BLAST hit. The only g2 I would like to keep. What did I
miss in the above code?*

















































*##gff-version 3##sequence-region NbV1Ch08 1 129222376NbV1Ch08
 AUGUSTUS        gene    7015    29794   0.01    -       .
ID=g1NbV1Ch08        AUGUSTUS        mRNA    7015    29794   0.01    -
  .       ID=g1.t1;Parent=g1NbV1Ch08        AUGUSTUS
 transcription_end_site  7015    7015    .       -       .
Parent=g1.t1NbV1Ch08        AUGUSTUS        three_prime_utr 7015    8531
 0.2     -       .       ID=g1.t1.3UTR1;Parent=g1.t1NbV1Ch08
 AUGUSTUS        exon    7015    8747    .       -       .
ID=g1.t1.exon1;Parent=g1.t1NbV1Ch08        AUGUSTUS        stop_codon
 8532    8534    .       -       0       Parent=g1.t1NbV1Ch08
 AUGUSTUS        CDS     8532    8747    0.31    -       0
ID=g1.t1.CDS1;Parent=g1.t1NbV1Ch08        AUGUSTUS        intron  8748
 9191    0.49    -       .       Parent=g1.t1NbV1Ch08        AUGUSTUS
 CDS     9192    9342    0.66    -       1
ID=g1.t1.CDS2;Parent=g1.t1NbV1Ch08        AUGUSTUS        exon    9192
 9342    .       -       .       ID=g1.t1.exon2;Parent=g1.t1NbV1Ch08
 AUGUSTUS        intron  9343    9915    0.58    -       .
Parent=g1.t1NbV1Ch08        AUGUSTUS        CDS     9916    10006   0.71
 -       2       ID=g1.t1.CDS3;Parent=g1.t1NbV1Ch08        AUGUSTUS
 exon    9916    10006   .       -       .
ID=g1.t1.exon3;Parent=g1.t1NbV1Ch08        AUGUSTUS        intron  10007
10101   0.74    -       .       Parent=g1.t1NbV1Ch08        AUGUSTUS
 CDS     10102   10201   0.78    -       0
ID=g1.t1.CDS4;Parent=g1.t1NbV1Ch08        AUGUSTUS        exon    10102
10201   .       -       .       ID=g1.t1.exon4;Parent=g1.t1NbV1Ch08
 AUGUSTUS        intron  10202   10712   0.8     -       .
Parent=g1.t1NbV1Ch08        AUGUSTUS        CDS     10713   11107   0.11
 -       2       ID=g1.t1.CDS5;Parent=g1.t1NbV1Ch08        AUGUSTUS
 exon    10713   11107   .       -       .
ID=g1.t1.exon5;Parent=g1.t1NbV1Ch08        AUGUSTUS        intron  11108
11569   0.07    -       .       Parent=g1.t1NbV1Ch08        AUGUSTUS
 CDS     11570   12151   0.09    -       2
ID=g1.t1.CDS6;Parent=g1.t1NbV1Ch08        AUGUSTUS        exon    11570
12151   .       -       .       ID=g1.t1.exon6;Parent=g1.t1NbV1Ch08
 AUGUSTUS        intron  12152   12588   0.34    -       .
Parent=g1.t1NbV1Ch08        AUGUSTUS        CDS     12589   12717   0.39
 -       2       ID=g1.t1.CDS7;Parent=g1.t1NbV1Ch08        AUGUSTUS
 exon    12589   12717   .       -       .
ID=g1.t1.exon7;Parent=g1.t1NbV1Ch08        AUGUSTUS        intron  12718
12789   0.42    -       .       Parent=g1.t1NbV1Ch08        AUGUSTUS
 CDS     12790   13075   0.39    -       0
ID=g1.t1.CDS8;Parent=g1.t1NbV1Ch08        AUGUSTUS        exon    12790
13075   .       -       .       ID=g1.t1.exon8;Parent=g1.t1NbV1Ch08
 AUGUSTUS        intron  13076   14832   0.51    -       .
Parent=g1.t1NbV1Ch08        AUGUSTUS        CDS     14833   15009   0.39
 -       0       ID=g1.t1.CDS9;Parent=g1.t1NbV1Ch08        AUGUSTUS
 exon    14833   15009   .       -       .
ID=g1.t1.exon9;Parent=g1.t1NbV1Ch08        AUGUSTUS        intron  15010
15278   0.59    -       .       Parent=g1.t1NbV1Ch08        AUGUSTUS
 CDS     15279   15415   0.56    -       2
ID=g1.t1.CDS10;Parent=g1.t1NbV1Ch08        AUGUSTUS        exon    15279
15415   .       -       .       ID=g1.t1.exon10;Parent=g1.t1NbV1Ch08
 AUGUSTUS        intron  15416   15487   0.58    -       .
Parent=g1.t1NbV1Ch08        AUGUSTUS        CDS     15488   15612   0.96
 -       1       ID=g1.t1.CDS11;Parent=g1.t1NbV1Ch08        AUGUSTUS
 exon    15488   15612   .       -       .
ID=g1.t1.exon11;Parent=g1.t1NbV1Ch08        AUGUSTUS        intron  15613
15706   0.96    -       .       Parent=g1.t1NbV1Ch08        AUGUSTUS
 CDS     15707   15957   0.98    -       0
ID=g1.t1.CDS12;Parent=g1.t1NbV1Ch08        AUGUSTUS        exon    15707
15958   .       -       .       ID=g1.t1.exon12;Parent=g1.t1NbV1Ch08
 AUGUSTUS        start_codon     15955   15957   .       -       0
Parent=g1.t1NbV1Ch08        AUGUSTUS        five_prime_utr  15958   15958
0.99    -       .       ID=g1.t1.5UTR1;Parent=g1.t1NbV1Ch08        AUGUSTUS
       five_prime_utr  27458   28250   0.37    -       .
ID=g1.t1.5UTR2;Parent=g1.t1NbV1Ch08        AUGUSTUS        exon    27458
28250   .       -       .       ID=g1.t1.exon13;Parent=g1.t1NbV1Ch08
 AUGUSTUS        five_prime_utr  29272   29794   0.08    -       .
ID=g1.t1.5UTR3;Parent=g1.t1NbV1Ch08        AUGUSTUS        exon    29272
29794   .       -       .       ID=g1.t1.exon14;Parent=g1.t1NbV1Ch08
 AUGUSTUS        transcription_start_site        29794   29794   .       -
      .       Parent=g1.t1*





















*NbV1Ch08        AUGUSTUS        gene    60876   63944   0.03    +       .
      ID=g2NbV1Ch08        AUGUSTUS        mRNA    60876   63944   0.03
 +       .       ID=g2.t1;Note=B3 domain-containing protein
Os03g0120900;Parent=g2NbV1Ch08        AUGUSTUS
 transcription_start_site        60876   60876   .       +       .
Parent=g2.t1NbV1Ch08        AUGUSTUS        five_prime_utr  60876   61072
0.19    +       .       ID=g2.t1.5UTR1;Parent=g2.t1NbV1Ch08        AUGUSTUS
       exon    60876   61072   .       +       .
ID=g2.t1.exon1;Parent=g2.t1NbV1Ch08        AUGUSTUS        five_prime_utr
 61673   61732   0.37    +       .
ID=g2.t1.5UTR2;Parent=g2.t1NbV1Ch08        AUGUSTUS        exon    61673
63449   .       +       .       ID=g2.t1.exon2;Parent=g2.t1NbV1Ch08
 AUGUSTUS        start_codon     61733   61735   .       +       0
Parent=g2.t1NbV1Ch08        AUGUSTUS        CDS     61733   62974   0.54
 +       0       ID=g2.t1.CDS1;Parent=g2.t1NbV1Ch08        AUGUSTUS
 stop_codon      62972   62974   .       +       0
Parent=g2.t1NbV1Ch08        AUGUSTUS        three_prime_utr 62975   63449
1       +       .       ID=g2.t1.3UTR1;Parent=g2.t1NbV1Ch08        AUGUSTUS
       three_prime_utr 63565   63944   0.27    +       .
ID=g2.t1.3UTR2;Parent=g2.t1NbV1Ch08        AUGUSTUS        exon    63565
63944   .       +       .       ID=g2.t1.exon3;Parent=g2.t1NbV1Ch08
 AUGUSTUS        transcription_end_site  63944   63944   .       +       .
      Parent=g2.t1NbV1Ch08        AUGUSTUS        gene    64722   65524
0.32    -       .       ID=g3NbV1Ch08        AUGUSTUS        mRNA    64722
  65524   0.32    -       .       ID=g3.t1;Parent=g3NbV1Ch08
 AUGUSTUS        transcription_end_site  64722   64722   .       -       .
      Parent=g3.t1NbV1Ch08        AUGUSTUS        three_prime_utr 64722
64792   0.77    -       .       ID=g3.t1.3UTR1;Parent=g3.t1NbV1Ch08
 AUGUSTUS        exon    64722   65524   .       -       .
ID=g3.t1.exon1;Parent=g3.t1NbV1Ch08        AUGUSTUS        stop_codon
 64793   64795   .       -       0       Parent=g3.t1NbV1Ch08
 AUGUSTUS        CDS     64793   65494   0.44    -       0
ID=g3.t1.CDS1;Parent=g3.t1NbV1Ch08        AUGUSTUS        start_codon
65492   65494   .       -       0       Parent=g3.t1*

*Thank you in advance,*

*Best wishes,*

*Michal*
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.open-bio.org/pipermail/biopython/attachments/20190902/60031eea/attachment.htm>


More information about the Biopython mailing list