[Bioperl-l] Genbank files with CONTIG lines in them.

Govind Chandra govind.chandra at jic.ac.uk
Wed Dec 11 12:15:34 UTC 2013


Hi,

Some GenBank files have a line beginning with "CONTIG", as shown below.

.
.
.
                     /protein_id="YP_008390690.1"
                     /db_xref="GI:529229870"
                     /db_xref="GeneID:16501453"
                     /translation="MSAEATPNTGEVQRYVKGLGRAASFVAGLVVLAFAADCIPPWPF
                     VTEDGSPAKLRRLGMLRCPACGLMSNREHRRLCRGPWRAGEDVST"
CONTIG      join(CP006261.1:1..19314)
ORIGIN      
        1 ggggggcaga ggccatgcgg ctacgccgcg tcacctccgg gcctgcggcc ctcacggacg
       61 gtgacggtca ctctccgcgg tcgtgcctac ggcacatccc cgccgccgtg tcaacccccg
      121 cgcgcaactt ttccccgaca acctgcggtt gtcgtccgcc gtcccgggac cgcacccccc
      181 acccgatcac cccccaccgg ccgggctacg cccacggccg gcccctcggc cgtctgtggc
      241 ccacaggttc cccccgccgc ctacggcgtc tcgtccgggc ataccccccc ctgctacgcc
      301 accccaccga acgcgccgag cccgcaaagg ccggcggcgc gtcggccgac acactccgtc
      361 tgtccccgtg aggctgcggg tatcggccat gcctggcctg ccctgcttcg ccgctcggcc
.
.
.



If the CONTIG line is present in a GenBank file, then the string
returned by the Bio::Seq->seq() method is zero-length or undefined (I
haven't checked which), although the length() method still reports the
correct length.

I made two versions of the same GenBank file, one with the CONTIG line
and one without. Then I ran the script pasted below on both.
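
For reference, a CONTIG-free copy of a file can also be produced with a
few lines of plain Perl rather than by hand. This is only a sketch, not
necessarily how the two test files above were made. It assumes that a
CONTIG entry starts with "CONTIG" in column 1 and that any continuation
lines begin with whitespace; strip_contig and the two command-line
arguments are names made up for illustration.

```perl
use strict;
use warnings;

# Return a copy of GenBank flat-file text with any CONTIG entry removed.
# Assumption: the entry starts with "CONTIG" in column 1 and may continue
# on following lines that begin with whitespace.
sub strip_contig {
    my ($text) = @_;
    my @kept;
    my $in_contig = 0;
    for my $line (split /^/m, $text) {
        if ($line =~ /^CONTIG\b/) {
            $in_contig = 1;      # start of the CONTIG entry: drop it
            next;
        }
        if ($in_contig && $line =~ /^\s/) {
            next;                # continuation of the CONTIG entry
        }
        $in_contig = 0;          # a line starting in column 1 ends the entry
        push @kept, $line;
    }
    return join '', @kept;
}

# Hypothetical usage: perl strip_contig.pl in.gbk out.gbk
if (@ARGV == 2) {
    my ($in, $out) = @ARGV;
    open my $ifh, '<', $in or die "Cannot read $in: $!";
    local $/;                    # slurp the whole file
    my $text = <$ifh>;
    close $ifh;
    open my $ofh, '>', $out or die "Cannot write $out: $!";
    print {$ofh} strip_contig($text);
    close $ofh;
}
```

Because the ORIGIN keyword starts in column 1, it ends the CONTIG entry
and the sequence lines after it are kept untouched.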


### Code begins ###

use strict;
use Bio::SeqIO;


for my $gbkfile (qw(withContigLine.gbk withoutContigLine.gbk)) {

    # Read the first record from the file.
    my $seqin  = Bio::SeqIO->new(-file => $gbkfile);
    my $seqobj = $seqin->next_seq();

    # Compare the length of the sequence string with the length Bioperl reports.
    my $ntseq  = $seqobj->seq();
    my $strlen = length($ntseq);
    my $bplen  = $seqobj->length();

    print <<"REPORT";
$gbkfile

Bioperl reports length as $bplen.
Length of the sequence string is $strlen.

=========================================

REPORT

}

print("Perl version is: $]\n");
print("Bioperl version is: ", $Bio::SeqIO::VERSION, "\n");
printf "Bioperl version again: %vd\n", $Bio::SeqIO::VERSION;

exit;

### Code Ends ###

The output from the above script is pasted below.


### Output begins ###

withContigLine.gbk

Bioperl reports length as 19314.
Length of the sequence string is .

=========================================

withoutContigLine.gbk

Bioperl reports length as 19314.
Length of the sequence string is 19314.

=========================================

Perl version is: 5.018000
Bioperl version is: 1.006001
Bioperl version again: 49.46.48.48.54.48.48.49

### Output ends ###
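
The blank after "is" in the first report is itself a hint. On Perl 5.12
and later, length(undef) returns undef rather than 0, and an empty
string would have printed 0, so seq() most likely returned undef for the
file with the CONTIG line. A minimal sketch of an explicit check, in
plain Perl; describe_seq_string is a hypothetical helper, not part of
Bioperl:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bioperl): classify a value such as
# the one returned by $seqobj->seq().
sub describe_seq_string {
    my ($s) = @_;
    return 'undefined'    unless defined $s;
    return 'empty string' if $s eq '';
    return length($s) . ' characters';
}

print describe_seq_string(undef),  "\n";   # undefined
print describe_seq_string(''),     "\n";   # empty string
print describe_seq_string('ggca'), "\n";   # 4 characters
```

Inside the loop of the script above, printing
describe_seq_string($seqobj->seq()) would settle which of the two cases
applies.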


Do I have to do something different to get the sequence string from
GenBank files that have the CONTIG line in them?

Any suggestions will be most gratefully received.

Thanks

Govind

Govind Chandra
Molecular Microbiology
John Innes Centre
Norwich UK.









