[EMBOSS] Sequence annotation parsing and format conversion
Daniel Rozenbaum
drozenbaum at yahoo.com
Tue Aug 14 17:59:14 UTC 2012
Greetings,
My sincerest apologies if this question has already been addressed here:
I'm trying to understand how EMBOSS works with sequence annotation. Here's an example (I'm using EMBOSS 6.4.0.0):
I have a sequence in GENBANK format with extensive annotation, stored in a file /tmp/W02578.genbank (sequence listing at the end of this email). I feed it through the seqret utility as follows:
seqret /abss/tmp/W02578.genbank -osformat2 genbank -feature Y -auto -osname W02578.emboss_genbank2genbank -osdirectory /tmp
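In the resultant file, parts of the sequence annotation (such as the AUTHORS, TITLE, COMMENT, and BASE COUNT fields) are omitted, and the values of some of the other fields are modified. For example, the LOCUS line changes from "mRNA EST 18-APR-1996" to "DNA linear UNC 14-AUG-2012", and ORGANISM changes from "Homo sapiens" plus its lineage to "human."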
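To list the dropped fields mechanically rather than by eye, the two files can be compared by field keyword. The following is only a minimal Python sketch and is not part of EMBOSS: the function names are hypothetical, and the keyword list is an assumption covering just the fields that occur in this record. It matches a keyword as the first token of a line regardless of indentation.

```python
# Hypothetical helper, not part of EMBOSS: report which GenBank field
# keywords appear in an original record but not in a converted one.

# Assumed keyword list: only the fields that occur in the W02578 record.
KEYWORDS = [
    "LOCUS", "DEFINITION", "ACCESSION", "NID", "KEYWORDS", "SOURCE",
    "ORGANISM", "REFERENCE", "AUTHORS", "TITLE", "JOURNAL", "COMMENT",
    "FEATURES", "BASE COUNT", "ORIGIN",
]


def fields_present(text):
    """Return the keywords from KEYWORDS that start a line, in first-seen order."""
    found = []
    for line in text.splitlines():
        stripped = line.strip()
        for kw in KEYWORDS:
            # Case-sensitive whole-token match, so the feature key "source"
            # is not mistaken for the SOURCE field.
            if stripped == kw or stripped.startswith(kw + " "):
                if kw not in found:
                    found.append(kw)
                break
    return found


def dropped_fields(original, converted):
    """Return keywords present in the original text but absent from the converted text."""
    kept = set(fields_present(converted))
    return [kw for kw in fields_present(original) if kw not in kept]
```

Calling dropped_fields() with the contents of the input file and of the seqret output returns the keywords that did not survive the conversion. It does not detect fields whose values were merely changed, such as LOCUS or ORGANISM.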
I understand that entret is the tool to use when one is interested in the sequence record as is. What I'm trying to understand is whether it is in EMBOSS's parsing and internal representation of the sequence data that parts of the annotation are omitted, and whether some annotation fields will necessarily be lost or modified when converting between formats as well.
Many thanks,
Daniel
=== start /tmp/W02578.genbank ===
LOCUS W02578 644 bp mRNA EST 18-APR-1996
DEFINITION za52e02.r1 Soares fetal liver spleen 1NFLS Homo sapiens cDNA clone
296186 5'.
ACCESSION W02578
NID g1274623
KEYWORDS EST.
SOURCE human.
ORGANISM Homo sapiens
Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
Vertebrata; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE 1 (bases 1 to 644)
AUTHORS Hillier,L., Clark,N., Dubuque,T., Elliston,K., Hawkins,M.,
Holman,M., Hultman,M., Kucaba,T., Le,M., Lennon,G., Marra,M.,
Parsons,J., Rifkin,L., Rohlfing,T., Soares,M., Tan,F.,
Trevaskis,E., Waterston,R., Williamson,A., Wohldmann,P. and
Wilson,R.
TITLE The WashU-Merck EST Project
JOURNAL Unpublished (1995)
COMMENT
Contact: Wilson RK
WashU-Merck EST Project
Washington University School of Medicine
4444 Forest Park Parkway, Box 8501, St. Louis, MO 63108
Tel: 314 286 1800
Fax: 314 286 1810
Email: est at watson.wustl.edu
This clone is available royalty-free through LLNL ; contact the
IMAGE Consortium (info at image.llnl.gov) for further information.
Seq primer: mob.REGA+ET
High quality sequence stop: 320.
FEATURES Location/Qualifiers
source 1..644
/organism="Homo sapiens"
/note="Organ: Liver and Spleen; Vector: pT7T3D (Pharmacia)
with a modified polylinker; Site_1: Pac I; Site_2: Eco RI;
1st strand cDNA was primed with a Pac I - oligo(dT) primer
[5' AACTGGAAGAATTAATTAAAGATCTTTTTTTTTTTTTTTTTTT 3'],
double-stranded cDNA was ligated to Eco RI adaptors
(Pharmacia), digested with Pac I and cloned into the Pac I
and Eco RI sites of the modified pT7T3 vector. Library
went through one round of normalization. Library
constructed by Bento Soares and M.Fatima Bonaldo."
/clone="296186"
/clone_lib="Soares fetal liver spleen 1NFLS"
/sex="male"
/dev_stage="20 week-post conception fetus"
/lab_host="DH10B (ampicillin resistant)"
mRNA <1..>644
BASE COUNT 176 a 140 c 148 g 172 t 8 others
ORIGIN
1 acgatgatga caatgaaatt agtgcctgtt ttcttgcaaa tttagcactt ggaacattta
61 aagaaaggtc tatgctgtca tatggggttt attgggaact atcctcctgg ccccaccctg
121 ccccttcttt ttggttttga catcattcat ttccacctgg gaatttctgg tgccatgcca
181 gaaagaatga ggaacctgta ttcctcttct tcgtgataat ataatctcta tttttttagg
241 aaaacaaaaa tgaaaaacta ctccatttga ggattgtaat tcccacccct cttgcttctt
301 ccccacctca ccatctccca gaccctcttc ccttctgtct tctcctccaa tacataaaag
361 gacacagaca aggaactttg ctggaaaggg gnaacccatt ttcagggatc aggtcaaagg
421 gcaagcaagc aggatagact cnaggtgtgt gaaatatgtt atacaccagg aggctggcac
481 tggnatggtc ccaaacaaga atggtgtccg tctggggtct ggaatgtaag agttaaggga
541 agggaangaa gggactacaa gangagtcgg agatggatga nggaaacaac acaatttccc
601 aggccagtga tgcttgtggt gnacagntgt tcccgaggtc gggg
//
=== end /tmp/W02578.genbank ===
=== start /tmp/W02578.emboss_genbank2genbank ===
LOCUS W02578 644 bp DNA linear UNC 14-AUG-2012
DEFINITION za52e02.r1 Soares fetal liver spleen 1NFLS Homo sapiens cDNA clone
296186 5'.
ACCESSION W02578
KEYWORDS EST.
SOURCE human.
ORGANISM human.
REFERENCE 1 (bases 1 to 644)
FEATURES Location/Qualifiers
source 1..644
/organism="Homo sapiens"
/note="Organ: Liver and Spleen; Vector: pT7T3D (Pharmacia)
with a modified polylinker; Site_1: Pac I; Site_2: Eco RI;
1st strand cDNA was primed with a Pac I - oligo(dT) primer
[5' AACTGGAAGAATTAATTAAAGATCTTTTTTTTTTTTTTTTTTT 3'],
double-stranded cDNA was ligated to Eco RI adaptors
(Pharmacia), digested with Pac I and cloned into the Pac I
and Eco RI sites of the modified pT7T3 vector. Library
went through one round of normalization. Library
constructed by Bento Soares and M.Fatima Bonaldo."
/clone="296186"
/clone_lib="Soares fetal liver spleen 1NFLS"
/sex="male"
/dev_stage="20 week-post conception fetus"
/lab_host="DH10B (ampicillin resistant)"
mRNA <1..>644
ORIGIN
1 acgatgatga caatgaaatt agtgcctgtt ttcttgcaaa tttagcactt ggaacattta
61 aagaaaggtc tatgctgtca tatggggttt attgggaact atcctcctgg ccccaccctg
121 ccccttcttt ttggttttga catcattcat ttccacctgg gaatttctgg tgccatgcca
181 gaaagaatga ggaacctgta ttcctcttct tcgtgataat ataatctcta tttttttagg
241 aaaacaaaaa tgaaaaacta ctccatttga ggattgtaat tcccacccct cttgcttctt
301 ccccacctca ccatctccca gaccctcttc ccttctgtct tctcctccaa tacataaaag
361 gacacagaca aggaactttg ctggaaaggg gnaacccatt ttcagggatc aggtcaaagg
421 gcaagcaagc aggatagact cnaggtgtgt gaaatatgtt atacaccagg aggctggcac
481 tggnatggtc ccaaacaaga atggtgtccg tctggggtct ggaatgtaag agttaaggga
541 agggaangaa gggactacaa gangagtcgg agatggatga nggaaacaac acaatttccc
601 aggccagtga tgcttgtggt gnacagntgt tcccgaggtc gggg
//
=== end /tmp/W02578.emboss_genbank2genbank ===