[Bioperl-l] Genbank files with CONTIG lines in them.
Govind Chandra
govind.chandra at jic.ac.uk
Wed Dec 11 12:15:34 UTC 2013
Hi,
Some Genbank files have a line beginning with "CONTIG" as shown below.
.
.
.
                     /protein_id="YP_008390690.1"
                     /db_xref="GI:529229870"
                     /db_xref="GeneID:16501453"
                     /translation="MSAEATPNTGEVQRYVKGLGRAASFVAGLVVLAFAADCIPPWPF
                     VTEDGSPAKLRRLGMLRCPACGLMSNREHRRLCRGPWRAGEDVST"
CONTIG      join(CP006261.1:1..19314)
ORIGIN
        1 ggggggcaga ggccatgcgg ctacgccgcg tcacctccgg gcctgcggcc ctcacggacg
       61 gtgacggtca ctctccgcgg tcgtgcctac ggcacatccc cgccgccgtg tcaacccccg
      121 cgcgcaactt ttccccgaca acctgcggtt gtcgtccgcc gtcccgggac cgcacccccc
      181 acccgatcac cccccaccgg ccgggctacg cccacggccg gcccctcggc cgtctgtggc
      241 ccacaggttc cccccgccgc ctacggcgtc tcgtccgggc ataccccccc ctgctacgcc
      301 accccaccga acgcgccgag cccgcaaagg ccggcggcgc gtcggccgac acactccgtc
      361 tgtccccgtg aggctgcggg tatcggccat gcctggcctg ccctgcttcg ccgctcggcc
.
.
.
When the CONTIG line is present in a Genbank file, the string
returned by the Bio::Seq->seq() method is zero-length or undefined (I
haven't checked which), even though the ORIGIN section still holds the
full sequence.
To test this I made two versions of the same Genbank file, one with the
CONTIG line and one without, and ran the script pasted below on both.
### Code begins ###
use strict;
use Bio::SeqIO;

# Parse the first record of each file and compare the length Bioperl
# reports with the length of the string returned by seq().
for my $gbkfile (qw(withContigLine.gbk withoutContigLine.gbk)) {
    my $seqin  = Bio::SeqIO->new(-file => $gbkfile);
    my $seqobj = $seqin->next_seq();
    my $ntseq  = $seqobj->seq();       # sequence string
    my $strlen = length($ntseq);       # length of that string
    my $bplen  = $seqobj->length();    # length according to Bioperl
    print <<"REPORT";
$gbkfile
Bioperl reports length as $bplen.
Length of the sequence string is $strlen.
=========================================
REPORT
}

# Report the Perl and Bioperl versions in use.
print("Perl version is: $]\n");
print("Bioperl version is: ", $Bio::SeqIO::VERSION, "\n");
printf "Bioperl version again: %vd\n", $Bio::SeqIO::VERSION;
exit;
### Code Ends ###
The output from the above script is pasted below.
### Output begins ###
withContigLine.gbk
Bioperl reports length as 19314.
Length of the sequence string is .
=========================================
withoutContigLine.gbk
Bioperl reports length as 19314.
Length of the sequence string is 19314.
=========================================
Perl version is: 5.018000
Bioperl version is: 1.006001
Bioperl version again: 49.46.48.48.54.48.48.49
### Output ends ###
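The blank after "is" in the first report hints at which case applies:
since Perl 5.12, length() returns undef when its argument is undefined,
whereas an empty string would give 0. To check explicitly, a minimal
sketch follows; describe_seq is a hypothetical helper name, not part of
Bioperl.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bioperl): say whether a value, such as
# the return of $seqobj->seq(), is undefined, empty, or a non-empty string.
sub describe_seq {
    my ($seq) = @_;
    return 'undefined'    unless defined $seq;
    return 'empty string' if $seq eq '';
    return 'string of length ' . length($seq);
}

print describe_seq(undef), "\n";
print describe_seq(''), "\n";
print describe_seq('ggggggcaga'), "\n";
```

Calling describe_seq($seqobj->seq()) inside the loop of the script above
would report the case for each file.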
Do I have to do something different to get the sequence string from
Genbank files which have the CONTIG line in them?
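As a possible stopgap, the sequence could be read straight from the
ORIGIN block without going through Bioperl. This is only a sketch under
assumptions: origin_sequence is a hypothetical helper name, it handles
only the first record, and it assumes the record ends with the usual "//"
line. It would not help for records that carry only a CONTIG line and no
sequence data.

```perl
use strict;
use warnings;

# Hypothetical helper (not a Bioperl function): collect the bases between
# the ORIGIN line and the terminating "//" of the first record.
sub origin_sequence {
    my ($fh) = @_;
    my ($in_origin, $seq) = (0, '');
    while (my $line = <$fh>) {
        if ($line =~ /^ORIGIN/) { $in_origin = 1; next; }
        next unless $in_origin;
        last if $line =~ m{^//};
        $line =~ s/[\s\d]//g;    # drop position numbers and whitespace
        $seq .= $line;
    }
    return $seq;
}

# Small in-memory record standing in for a real file.
my $record = <<'GBK';
CONTIG      join(CP006261.1:1..19314)
ORIGIN
        1 ggggggcaga ggccatgcgg
       21 ctacgccgcg
//
GBK

open my $fh, '<', \$record or die "Cannot open in-memory record: $!";
my $seq = origin_sequence($fh);
close $fh;
print length($seq), "\n";
```

For a real file, the filehandle would come from open my $fh, '<', $gbkfile
instead of the in-memory string.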
Any suggestions will be most gratefully received.
Thanks
Govind
Govind Chandra
Molecular Microbiology
John Innes Centre
Norwich UK.