[Bioperl-l] Location having both "before" and "after" (<x..>y)
jlklassen at wisc.edu
jlklassen at wisc.edu
Tue Jul 16 14:36:09 UTC 2013
Hello all,
I am attempting to output a CDS annotation where the CDS spans both the
beginning and ending of my contig, according to the NCBI standard:
http://www.ncbi.nlm.nih.gov/genbank/examples.wgs#partialcds. According to
http://www.bioperl.org/wiki/HOWTO:Feature-Annotation, my understanding is
that BioPerl supports both "BEFORE" and "AFTER" annotations, and indeed I
can output an appropriate gbk with either the "BEFORE" or "AFTER"
individually. However, when I have them both together I get something like
this:
LOCUS Contig_381.1 220 bp dna linear UNK
ACCESSION unknown
FEATURES Location/Qualifiers
CDS <3
/locus_id="M32_06772"
/gene="Contig_381.1_1"
/translation="AYGARAKGGVVAWEPLPVQYADYALWQQELLGGGGDAGGVVSQG
LAYWEGVLEGLPQELELPVDRPRPVRPS"
ORIGIN
1 gggcgtacgg ggcgcgtgcc aagggtgggg tggtggcgtg ggagccgttg ccggtgcagt
61 acgcggatta cgcgttgtgg cagcaggagc tgctcggggg tgggggtgat gcgggcggtg
121 tggtctcgca agggttggcg tattgggagg gggtgctgga ggggttgccg caggagctgg
181 agttgccggt ggaccggccg cgtccggtgc gtccgtccta
//
You can see that the the CDS location annotation has the appropriate 5'
annotation, but not the 3' (should be >218). Note that the translation is
parsed from the Prodigal output and not generated directly in BioPerl. I'm
a somewhat new to BioPerl, so am hoping that I'm missing something
relatively simple, but a search of the mailing list did not reveal an
obvious answer. I'm using BioPerl v1.6.901 on Ubuntu LINUX 12.04 (Biolinux
7). My code is to create a multiple gbk file from a set of bacterial gene
models, as predicted using Prodigal using multiple contigs as a template.
Many thanks in advance for the help
Jonathan Klassen
Postdoctoral Fellow
University of Wisconsin-Madison
abbreviated code:
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;
use Bio::SeqUtils;
use Bio::SeqFeature::Generic;
my $ffn_obj = Bio::SeqIO->new(-file => $ffn_infile, -format => 'fasta',
-alphabet => 'dna');
while (my $each_gene = $ffn_obj->next_seq){
# parse prodigal ffn fasta headers
my $gene_id = $each_gene->display_id;
$gene_id =~ /^(.*)_\d+$/ or die "Cannot match gene_id: $gene_id";
my $contig_id = $1;
my $gene_desc = $each_gene->desc;
$gene_desc =~ /\# (\d+) \# (\d+) # (-1|1) # .*;partial=(0|1)(0|1);/ or
die "Cannot match gene_desc: $gene_desc";
my $start = $1;
my $end = $2;
my $orientation = $3;
my $five_partial = $4;
my $three_partial = $5;
# accommodates partial genes
##############################################
# comment out these next 6 lines to reproduce the error
##############################################
if ($five_partial == 1){
$start = "<".$start;
}
if ($three_partial == 1){
$end = ">".$end;
}
# creates annotation feature for this gene
my $feature = new Bio::SeqFeature::Generic(
-start => $start,
-end => $end,
-strand => $orientation,
-primary_tag => 'CDS',
);
# hash of arrays, keyed by contig_id
push @{$genes{$contig_id}{gene_obj}}, $each_gene;
push @{$genes{$contig_id}{feature_obj}}, $feature;
push @{$genes{$contig_id}{gene_id}}, $gene_id;
push @{$genes{$contig_id}{locus_id}}, $locus_id;
}
# read through fna, annotating and outputting genes and prots
my $genome_gbk = Bio::SeqIO->new(-file => ">$output_prefix.gbk", -format
=> 'genbank', alphabet => 'dna' );
my $fna_obj = Bio::SeqIO->new(-file => $fna_infile, -format => 'fasta',
-alphabet => 'dna');
while (my $contig = $fna_obj->next_seq){
my $contig_id = $contig->display_id;
for my $a (0..$#{$genes{$contig_id}{gene_obj}}){
$contig->add_SeqFeature($genes{$contig_id}{feature_obj}[$a]);
}
$genome_gbk->write_seq($contig);
}
More information about the Bioperl-l
mailing list