[EMBOSS] Sequence annotation parsing and format conversion

Daniel Rozenbaum drozenbaum at yahoo.com
Tue Aug 14 17:59:14 UTC 2012


Greetings,

My sincerest apologies if this question has already been addressed here.


I'm trying to understand how EMBOSS works with sequence annotation. Here's an example (I'm using EMBOSS 6.4.0.0):

I have a sequence in GenBank format with extensive annotation, stored in the file /tmp/W02578.genbank (sequence listing at the end of this email). I feed it through the seqret utility as follows:


seqret /abss/tmp/W02578.genbank -osformat2 genbank -feature Y -auto -osname W02578.emboss_genbank2genbank -osdirectory /tmp


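In the resulting file, parts of the sequence annotation, such as the AUTHORS, TITLE, COMMENT, and BASE COUNT fields, are omitted, and the values of some other fields are modified (for example, the molecule type and division on the LOCUS line, and the ORGANISM field, which becomes "human.").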
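
To make the comparison concrete, here is a minimal Python sketch that lists which GenBank header keywords are present in the original record but absent from the converted one. It is not part of EMBOSS; the function names header_keywords and dropped_keywords are made up for illustration, and it assumes that keywords and sub-keywords sit within the first 12 columns of a line, as they do in the listings at the end of this email.

```python
def header_keywords(text):
    """Return the set of GenBank header keywords found in `text`.

    Assumption: keywords (LOCUS, COMMENT, ...) and sub-keywords
    (AUTHORS, TITLE, ...) occupy the first 12 columns of a line.
    """
    keywords = set()
    for line in text.splitlines():
        # Feature keys, qualifiers, continuation lines, sequence lines and
        # empty lines all leave the first five columns blank; skip them.
        if line[:5].strip() == "":
            continue
        key = line[:12].strip()
        # Keep only all-uppercase alphabetic keys; this drops "//" and
        # any marker lines that are not part of the record.
        if key.replace(" ", "").isalpha() and key.isupper():
            keywords.add(key)
    return keywords


def dropped_keywords(original_text, converted_text):
    """Keywords present in the original record but not in the converted one."""
    return sorted(header_keywords(original_text) - header_keywords(converted_text))
```

Applied to the text of the two listings at the end of this email, dropped_keywords reports AUTHORS, BASE COUNT, COMMENT, JOURNAL, NID and TITLE. It only detects fields that disappear entirely; fields whose values change, such as LOCUS or ORGANISM, would need a line-by-line diff.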

I understand that entret is the tool to use when one wants the sequence record as is. What I'm trying to understand is whether the annotation is dropped in EMBOSS's parsing and internal representation of the sequence data, and whether some annotation fields will necessarily be lost or modified when converting between different formats as well.

Many thanks,
Daniel



=== start /tmp/W02578.genbank ===

LOCUS       W02578        644 bp    mRNA            EST       18-APR-1996
DEFINITION  za52e02.r1 Soares fetal liver spleen 1NFLS Homo sapiens cDNA clone
            296186 5'.
ACCESSION   W02578
NID         g1274623
KEYWORDS    EST.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 644)
  AUTHORS   Hillier,L., Clark,N., Dubuque,T., Elliston,K., Hawkins,M.,
            Holman,M., Hultman,M., Kucaba,T., Le,M., Lennon,G., Marra,M.,
            Parsons,J., Rifkin,L., Rohlfing,T., Soares,M., Tan,F.,
            Trevaskis,E., Waterston,R., Williamson,A., Wohldmann,P. and
            Wilson,R.
  TITLE     The WashU-Merck EST Project
  JOURNAL   Unpublished (1995)
COMMENT
            Contact: Wilson RK
            WashU-Merck EST Project
            Washington University School of Medicine
            4444 Forest Park Parkway, Box 8501, St. Louis, MO 63108
            Tel: 314 286 1800
            Fax: 314 286 1810
            Email: est at watson.wustl.edu
            This clone is available royalty-free through LLNL ; contact the
            IMAGE Consortium (info at image.llnl.gov) for further information.
            Seq primer: mob.REGA+ET
            High quality sequence stop: 320.
FEATURES             Location/Qualifiers
     source          1..644
                     /organism="Homo sapiens"
                     /note="Organ: Liver and Spleen; Vector: pT7T3D (Pharmacia)
                     with a modified polylinker; Site_1: Pac I; Site_2: Eco RI;
                     1st strand cDNA was primed with a Pac I - oligo(dT) primer
                     [5' AACTGGAAGAATTAATTAAAGATCTTTTTTTTTTTTTTTTTTT 3'],
                     double-stranded cDNA was ligated to Eco RI adaptors
                     (Pharmacia), digested with Pac I and cloned into the Pac I
                     and Eco RI sites of the modified pT7T3 vector.  Library
                     went through one round of normalization. Library
                     constructed by Bento Soares and M.Fatima Bonaldo."
                     /clone="296186"
                     /clone_lib="Soares fetal liver spleen 1NFLS"
                     /sex="male"
                     /dev_stage="20 week-post conception fetus"
                     /lab_host="DH10B (ampicillin resistant)"
     mRNA            <1..>644
BASE COUNT      176 a    140 c    148 g    172 t      8 others
ORIGIN
        1 acgatgatga caatgaaatt agtgcctgtt ttcttgcaaa tttagcactt ggaacattta
       61 aagaaaggtc tatgctgtca tatggggttt attgggaact atcctcctgg ccccaccctg
      121 ccccttcttt ttggttttga catcattcat ttccacctgg gaatttctgg tgccatgcca
      181 gaaagaatga ggaacctgta ttcctcttct tcgtgataat ataatctcta tttttttagg
      241 aaaacaaaaa tgaaaaacta ctccatttga ggattgtaat tcccacccct cttgcttctt
      301 ccccacctca ccatctccca gaccctcttc ccttctgtct tctcctccaa tacataaaag
      361 gacacagaca aggaactttg ctggaaaggg gnaacccatt ttcagggatc aggtcaaagg
      421 gcaagcaagc aggatagact cnaggtgtgt gaaatatgtt atacaccagg aggctggcac
      481 tggnatggtc ccaaacaaga atggtgtccg tctggggtct ggaatgtaag agttaaggga
      541 agggaangaa gggactacaa gangagtcgg agatggatga nggaaacaac acaatttccc
      601 aggccagtga tgcttgtggt gnacagntgt tcccgaggtc gggg
//
=== end /tmp/W02578.genbank ===



=== start /tmp/W02578.emboss_genbank2genbank ===
LOCUS       W02578                   644 bp    DNA     linear   UNC 14-AUG-2012
DEFINITION  za52e02.r1 Soares fetal liver spleen 1NFLS Homo sapiens cDNA clone
            296186 5'.
ACCESSION   W02578
KEYWORDS    EST.
SOURCE      human.
  ORGANISM  human.
REFERENCE   1  (bases 1 to 644)
FEATURES             Location/Qualifiers
     source          1..644
                     /organism="Homo sapiens"
                     /note="Organ: Liver and Spleen; Vector: pT7T3D (Pharmacia)
                     with a modified polylinker; Site_1: Pac I; Site_2: Eco RI;
                     1st strand cDNA was primed with a Pac I - oligo(dT) primer
                     [5' AACTGGAAGAATTAATTAAAGATCTTTTTTTTTTTTTTTTTTT 3'],
                     double-stranded cDNA was ligated to Eco RI adaptors
                     (Pharmacia), digested with Pac I and cloned into the Pac I
                     and Eco RI sites of the modified pT7T3 vector. Library
                     went through one round of normalization. Library
                     constructed by Bento Soares and M.Fatima Bonaldo."
                     /clone="296186"
                     /clone_lib="Soares fetal liver spleen 1NFLS"
                     /sex="male"
                     /dev_stage="20 week-post conception fetus"
                     /lab_host="DH10B (ampicillin resistant)"
     mRNA            <1..>644
ORIGIN
       1  acgatgatga caatgaaatt agtgcctgtt ttcttgcaaa tttagcactt ggaacattta
      61  aagaaaggtc tatgctgtca tatggggttt attgggaact atcctcctgg ccccaccctg
     121  ccccttcttt ttggttttga catcattcat ttccacctgg gaatttctgg tgccatgcca
     181  gaaagaatga ggaacctgta ttcctcttct tcgtgataat ataatctcta tttttttagg
     241  aaaacaaaaa tgaaaaacta ctccatttga ggattgtaat tcccacccct cttgcttctt
     301  ccccacctca ccatctccca gaccctcttc ccttctgtct tctcctccaa tacataaaag
     361  gacacagaca aggaactttg ctggaaaggg gnaacccatt ttcagggatc aggtcaaagg
     421  gcaagcaagc aggatagact cnaggtgtgt gaaatatgtt atacaccagg aggctggcac
     481  tggnatggtc ccaaacaaga atggtgtccg tctggggtct ggaatgtaag agttaaggga
     541  agggaangaa gggactacaa gangagtcgg agatggatga nggaaacaac acaatttccc
     601  aggccagtga tgcttgtggt gnacagntgt tcccgaggtc gggg
//

=== end /tmp/W02578.emboss_genbank2genbank ===


