[Bioperl-l] Bug in GCG SeqIO Formatting?

Tex Thompson tex at biosysadmin.com
Mon Feb 16 17:49:04 EST 2004


Hello Mailing List,

I have a user complaining that the following code isn't working on his
GCG-formatted sequence files:

#!/usr/bin/perl

use strict;
use warnings;

use Bio::SeqIO;

# Read the GCG-formatted file and write each sequence to STDOUT as FASTA.
my $io  = Bio::SeqIO->new( -file => "af317472.gbpln3", -format => "gcg" );
my $out = Bio::SeqIO->new( -fh => \*STDOUT, -format => "fasta" );

while ( my $seq = $io->next_seq ) {
   $out->write_seq( $seq );
}

Here's an example sequence file:

!!NA_SEQUENCE 1.0
LOCUS       AF317472                2679 bp    DNA     linear   PLN 07-DEC-2000
DEFINITION  Candida albicans cAMP-dependent protein kinase regulatory subunit
            (PKA-R) gene, complete cds.
ACCESSION   AF317472
VERSION     AF317472.1  GI:11596392
KEYWORDS    .
SOURCE      Candida albicans
  ORGANISM  Candida albicans
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; mitosporic Saccharomycetales; Candida.
REFERENCE   1  (bases 1 to 2679)
  AUTHORS   Giasson,L. and Parrot,M.
  TITLE     Sequence of the Candida albicans cAMP-dependent protein kinase
            regulatory subunit
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 2679)
  AUTHORS   Giasson,L. and Parrot,M.
  TITLE     Direct Submission
  JOURNAL   Submitted (27-OCT-2000) School of Dentistry, Laval University,
            GREB, Ste-Foy, Quebec G1K 7P4, Canada
FEATURES             Location/Qualifiers
     source          1. .2679
                     /organism="Candida albicans"
                     /mol_type="genomic DNA"
                     /strain="CAI4"
                     /db_xref="taxon:5476"
     gene            <977. .>2356
                     /gene="PKA-R"
     mRNA            <977. .>2356
                     /gene="PKA-R"
                     /product="cAMP-dependent protein kinase regulatory
                     subunit"
     CDS             977. .2356
                     /gene="PKA-R"
                     /codon_start=1
                     /transl_table=12
                     /product="cAMP-dependent protein kinase regulatory
                     subunit"
                     /protein_id="AAG38599.1"
                     /db_xref="GI:11596393"
                     /translation="MSNPQQQFISDELSQLQKEIISKNPQDVLQFCANYFNTKLQAQR
                     SELWSQQAKAEAAGIDLFPSVDHVNVNSSGVSIVNDRQPSFKSPFGVNDPHSNHDEDP
                     HAKDTKTDTAAAAVGGGIFKSNFDVKKSASNPPTKEVDPDDPSKPSSSSQPNQQSASA
                     SSKTPSSKIPVAFNANRRTSVSAEALNPAKLKLDSWKPPVNNLSITEEETLANNLKNN
                     FLFKQLDANSKKTVIAALQQKSFAKDTVIIQQGDEGDFFYIIETGTVDFYVNDAKVSS
                     SSEGSSFGELALMYNSPRAATAVAATDVVCWALDRLTFRRILLEGTFNKRLMYEDFLK
                     DIEVLKSLSDHARSKLADALSTEMYHKGDKIVTEGEQGENFYLIESGNCQVYNEKLGN
                     IKQLTKGDYFGELALIKDLPRQATVEALDNVIVATLGKSGFQRLLGPVVEVLKEQDPT
                     KSQDPTAGH"
ORIGIN

AF317472  Length: 2679  February 16, 2004 17:02  Type: N  Check: 9369  ..

       1  GAATTCAAAA AATCAAAAAA ATCAAAAAAA AACCGTGGAA GGTAAGTTGT 

      51  ATATTTATAA ATCAACGTGA ATAATTTTCA ACACTGTGTC AACATCTGTG 

     101  AAAAAAACCT GTGTGTACTG CATATAGGAC CTCACCTATT ACGTAGAATA 

     151  TACTAGAAAT AGTTACAACC ATAAAAAGAT TAATTGTGCT TACGTGGCAA 

     201  CTTTGAGATT TTTCTTTTTT CTGTTTCTTT CTTTCTTTTT TTGGCTTAAA 

     251  CAACAAATGT CGCAAATTAT ACAAACGACA TTTGCTGCCC ATGTCATTTT 

     301  GTCGTTATCA CGTGAAGTGT CGCAGATTTA TGTATTCTCA CTTCATTTCT 

     351  ATGGTCATCA ATTGTTCATT CATTCTCTAT CTTCAAAAAT CTGTGATTTG 

     401  ATGATTTTGA TTAAAAGAAA GCAAAGAGAA TACTGAAAAA AAGCAAAGAG 

     451  AATATAGAAA AGAAACAATA AAAGAATAGT TTCTAAGTTA CTTTGGAGTC 

     501  TGCTATTACC ATGTATCTAT GTGATTGCCC TATCAAATTG GACAATACGG 

     551  GTTTTTGTTT AGTCACGATA ATCACAAACT TCCCCCAGCA ATGACATACG 

     601  TAGCAAGTAA TATTTATATC TCTTCTATTT TTTTGATCTT ACATAATCTG 

     651  TCGTGTTTTT TTAAGTTGTT GTTATGAAGA AGTAATTTCA TAATGATCAA 

     701  GTGTGTAACT GAAATTTCAT CGCAATTTTA AACAAACAAG CTAATAATTA 

     751  TTATTATTAA TAGTTAATTT GCTAAGTTGA GTAAAATTTG CTTTTCTTGA 

     801  GAAAAAGGAG AAATTACTTT GGGAGTGAGT TTGAAGAGAG AAACTAAAGT 

     851  AAGTAAATGA GTGAGAGGGA GAGACAGAGA GCGAGAGGGG GAGTAAAAAA 

     901  AAAAGTTGCC CACAAACAAA TTGTGATACC GGTCTTTTAG CATATATCTT 

     951  CTACTCTTCA ATCAACATCT TTACCAATGT CTAATCCTCA ACAACAATTC 

    1001  ATATCTGATG AATTGTCGCA GTTACAGAAA GAAATAATTT CCAAAAACCC 

    1051  GCAAGATGTC TTACAGTTTT GCGCCAACTA TTTCAACACC AAGTTACAAG 

    1101  CTCAAAGAAG TGAGTTATGG TCGCAACAAG CTAAAGCAGA AGCCGCAGGC 

    1151  ATCGACTTAT TCCCATCTGT TGATCATGTG AATGTTAATT CTAGTGGTGT 

    1201  GAGCATTGTG AATGATAGAC AACCAAGTTT TAAATCACCT TTTGGTGTTA 

    1251  ATGATCCACA TCTGAATCAC GACGAAGATC CCCATGCCAA AGATACCAAA 

    1301  ACAGATACTG CTGCTGCTGC TGTTGGTGGG GGTATTTTCA AATCAAATTT 

    1351  TGATGTTAAA AAGAGTGCTT CTAATCCTCC AACCAAGGAA GTAGATCCAG 

    1401  ATGACCCATC AAAACCATCG TCATCGAGCC AACCAAATCA ACAATCAGCA 

    1451  TCAGCATCAT CAAAAACGCC ATCATCAAAG ATCCCAGTTG CTTTCAACGC 

    1501  TAATAGAAGA ACATCTGTAT CTGCTGAAGC CTTGAATCCA GCAAAATTGA 

    1551  AATTAGATAG TTGGAAACCT CCAGTTAATA ATTTGAGCAT TACCGAAGAA 

    1601  GAAACATTAG CCAACAATTT AAAGAACAAT TTCCTTTTCA AACAATTGGA 

    1651  CGCAAACTCT AAGAAAACTG TGATTGCTGC TTTACAACAA AAATCATTTG 

    1701  CTAAAGATAC AGTAATTATC CAACAAGGTG ATGAAGGGGA CTTTTTTTAC 

    1751  ATTATTGAAA CTGGTACAGT TGATTTCTAT GTTAATGATG CTAAAGTAAG 

    1801  TTCCAGTAGC GAAGGGTCAT CTTTTGGGGA ATTGGCTTTG ATGTATAATT 

    1851  CACCAAGAGC TGCTACGGCA GTTGCTGCCA CCGATGTTGT CTGTTGGGCA 

    1901  TTGGACCGTT TGACATTCCG TCGAATTCTT TTGGAAGGTA CTTTTAACAA 

    1951  GAGATTGATG TACGAGGATT TCTTAAAAGA TATTGAGGTT TTGAAATCTC 

    2001  TTTCGGATCA TGCACGTTCA AAATTGGCAG ATGCATTGAG CACAGAAATG 

    2051  TATCACAAGG GTGATAAAAT AGTCACTGAA GGTGAACAAG GAGAGAACTT 

    2101  TTATTTAATA GAAAGTGGAA ACTGTCAAGT TTACAATGAA AAGTTGGGCA 

    2151  ATATCAAACA ATTAACAAAA GGTGATTATT TTGGTGAGCT TGCATTAATA 

    2201  AAAGACTTAC CAAGACAAGC TACTGTGGAA GCATTGGATA ATGTAATCGT 

    2251  TGCCACATTA GGTAAATCCG GGTTCCAAAG ATTATTGGGT CCTGTTGTGG 

    2301  AGGTATTGAA AGAACAAGAC CCTACAAAGA GTCAAGACCC AACTGCTGGT 

    2351  CATTAAGTGT ACAATAAGTA GTTGTTTATT ATCTTATATT GTTTTATGTT 

    2401  AGTATATTCT ATCTTTTTTT TTTTGGCTTA CTCACCTTCT GGTGTTTTCG 

    2451  TTGCGATTTT GATAATGGAT GGTTGGTGCA AAAGTTCAAC TACATTTCTT 

    2501  GTTGTCAGGT ATATACGAGA TGGCAGCATG AACGAGCTCA CCATGGGTTG 

    2551  AACATTATTG AAGTTATCCG GCCGTGCCTT TTGCGAAACA TGGTAACTAA 

    2601  TATATTGCAA ACTTGGCTTC TACAGAAAAT ATACAATCTA ATACCTTGAG 

    2651  GAATTTCCTC TATATATAAT AGAGAATTC

I'm not a GCG expert, so my first question is whether this is a correctly
formatted GCG file in the first place. If it is, is this a bug in the SeqIO
GCG parser? The behavior is the same on Solaris 8 and on Linux, both running
BioPerl 1.4 and Perl 5.8.1.
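
To help separate "bad file" from "bad parser", here is a minimal sketch that
inspects the file without BioPerl. It assumes that the GCG header is the first
line that carries a "Length:" field and ends in "..", as in the example above,
and that everything after it is sequence. check_gcg_text is a hypothetical
helper written for this post, not a BioPerl function, and agreement between
the two numbers does not prove the file is valid GCG.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): find the GCG header line,
# assumed to contain "Length: N" and to end in "..", then count the
# residue letters on all lines after it.
# Returns (declared_length, counted_residues); declared_length is undef
# if no such header line is found.
sub check_gcg_text {
    my ($text) = @_;
    my ( $declared, $counted, $in_seq ) = ( undef, 0, 0 );

    for my $line ( split /\n/, $text ) {
        if ($in_seq) {
            # tr/// returns the number of matching characters, so the
            # position numbers and spaces are ignored.
            $counted += ( $line =~ tr/A-Za-z// );
        }
        elsif ( $line =~ /Length:\s*(\d+).*\.\.\s*$/ ) {
            $declared = $1;
            $in_seq   = 1;
        }
    }
    return ( $declared, $counted );
}

# Usage: perl check_gcg.pl af317472.gbpln3
if (@ARGV) {
    my $file = shift @ARGV;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $text = do { local $/; <$fh> };    # slurp the whole file
    close $fh;

    my ( $declared, $counted ) = check_gcg_text($text);
    if ( !defined $declared ) {
        print "No header line with 'Length:' ending in '..' was found\n";
    }
    else {
        printf "Declared length: %d, residues counted: %d\n",
            $declared, $counted;
    }
}
```

If the declared and counted lengths agree, the header and sequence block are
at least self-consistent, which would point the investigation toward how the
parser handles the annotation that precedes the header line.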

Thanks a bunch,

Tex Thompson
RIT Bioinformatics


