[EMBOSS] Problems with Refseq

Enrique de Andres Saiz enrique.deandres at pcm.uam.es
Wed Nov 2 10:29:35 UTC 2005


Hi,
I am having some problems working with the RefSeq database.
I am using EMBOSS 3.0.0, and when I index the database (.g?ff
files) with the dbiflat command I get many warnings such as:

Warning: Duplicate ID skipped: 'NP_857944' All hits will point to first ID found
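
To check whether the flat files themselves contain repeated identifiers, independently of dbiflat, a small script can count the LOCUS names. This is only a sketch: it assumes that the ID reported in the warning corresponds to the name on the LOCUS line, which is not confirmed here, and the function name find_duplicate_ids is made up for illustration.

```python
import sys
from collections import Counter


def find_duplicate_ids(lines):
    """Return {locus_name: count} for LOCUS names that occur more than once.

    Assumption: the ID dbiflat complains about is the second field of the
    LOCUS line. This is a guess, not something taken from the EMBOSS docs.
    """
    counts = Counter()
    for line in lines:
        # Each GenBank-format record starts with a LOCUS line.
        if line.startswith("LOCUS"):
            fields = line.split()
            if len(fields) > 1:
                counts[fields[1]] += 1
    return {name: n for name, n in counts.items() if n > 1}


def iter_lines(paths):
    """Yield the lines of several files one after another."""
    for path in paths:
        with open(path) as handle:
            yield from handle


if __name__ == "__main__":
    # Usage: python find_dups.py file1.gbff file2.gpff ...
    duplicates = find_duplicate_ids(iter_lines(sys.argv[1:]))
    for name, n in sorted(duplicates.items()):
        print(f"{name}\t{n}")
```

Passing all the indexed files at once shows whether a name is repeated across files as well as within a single file.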

Another problem is that when I try to retrieve an entry with the seqret
command, I get a different sequence under the accession I requested. When I
try to retrieve the same entry with entret, I get several entries.

I have also tried indexing only one file of the database and then accessing
it with seqret and entret, and I get the same behaviour. For example, I have
the following definition in my emboss.default file:

DB rs_test [
    type: N
    method: emblcd
    format: genbank
    dir: $emboss_data/refseq
    file: vertebrate_mammalian2.genomic.gbff
    indexdir: /usr/users/bioadmin/opt/prueba
    comment: "RefSeq test"
]

If I open the file vertebrate_mammalian2.genomic.gbff, I can see the following entry:

LOCUS       NW_113053               1059 bp    DNA     linear   CON 09-NOV-2004
DEFINITION  Pan troglodytes chromosome 10 genomic contig, whole genome shotgun
            sequence.
ACCESSION   NW_113053
VERSION     NW_113053.1  GI:52318716
KEYWORDS    WGS.
SOURCE      Pan troglodytes (chimpanzee)
  ORGANISM  Pan troglodytes
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Catarrhini;
            Hominidae; Pan.
COMMENT     GENOME ANNOTATION REFSEQ:  Features on this sequence have been
            produced for build 1 version 1 of the NCBI's genome annotation [see
            documentation].
            The DNA sequence for this assembly was produced by the Chimpanzee
            Genome Sequencing Consortium. This assembly was produced by the
            Arachne assembler and made available in Nov. 2003.
FEATURES             Location/Qualifiers
     source          1..1059
                     /organism="Pan troglodytes"
                     /mol_type="genomic DNA"
                     /isolate="Yerkes chimp pedigree #C0471 (Clint)"
                     /db_xref="taxon:9598"
                     /chromosome="10"
CONTIG      join(AADA01324841.1:1..1059)
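
The entry as shown ends with a CONTIG line and contains no ORIGIN sequence block. To see how many records in the file have that shape, a script along the following lines could be used. It is a rough sketch that assumes records are separated by the usual "//" terminator line; the helper names split_records and summarize are invented for illustration, and whether this property is related to the seqret behaviour is not established here.

```python
import sys


def split_records(lines):
    """Yield each record as a list of lines, splitting on '//' terminators."""
    record = []
    for line in lines:
        if line.startswith("//"):
            if any(item.strip() for item in record):
                yield record
            record = []
        else:
            record.append(line)
    # Tolerate a final record that lacks a terminator; skip blank tails.
    if any(item.strip() for item in record):
        yield record


def summarize(record):
    """Return (locus_name, has_contig, has_origin) for one record."""
    name = None
    has_contig = False
    has_origin = False
    for line in record:
        if line.startswith("LOCUS") and name is None:
            fields = line.split()
            name = fields[1] if len(fields) > 1 else None
        elif line.startswith("CONTIG"):
            has_contig = True
        elif line.startswith("ORIGIN"):
            has_origin = True
    return name, has_contig, has_origin


if __name__ == "__main__":
    # Usage: python list_contig_only.py vertebrate_mammalian2.genomic.gbff
    for path in sys.argv[1:]:
        with open(path) as handle:
            for record in split_records(handle):
                name, has_contig, has_origin = summarize(record)
                if has_contig and not has_origin:
                    print(f"{path}\t{name}\tCONTIG without ORIGIN")
```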

If I run: seqret rs_test:NW_113053, I get:

$> seqret rs_test:NW_113053
Reads and writes (returns) sequences
Output sequence [nw_113053.fasta]: stdout
>NW_113053 NW_113053.1 Pan troglodytes olfactory receptor pseudogene PTOR3A5P (PTOR3A5P) onchromosome 17.
ggaacgtactgcagcccatccgttttgctgtcttccgctttgcctacatcatcatagttg
ggggcaacctcagcatcctggctgccatctttgtggaccccaaactccatactcccatgt
attacttcctggggaacttgtctctgctggacatcgggtgcatcagtcactgttcctccg
atgctggcgtgtctcctggcccaccagtgcagagttccctatgctgcctgcatttcacaa
ctcttctttttccacctcctggctggggtggactgtcacctcttaatagccacggcctat
gactgctacctggctatctgtcagcttctcaccaacagcactcgcatgagctgtgaagtc
cagggtgccctggtgggaatttgctgcactgtctccttcatcaatgctctgactcacaca
gtggctgtgtctgtgcttgacttctgtggccctaatgtggtcaaccacttctgctgtgac
ctcccacctcttttccagctctcttgctccagcatccacctcaatgggcagctgctgctt
gtgggggccaccttcataggagtgctccccatgatctttatctcagtgtcctatgcccac
gtcacagccgcaatattacgaatccgctcagctgaggggaggaagaaggctttctccacg
tgtggctcccacctcaccgtggtctgaatcttttatggaactggcttcttcagttacatg
tgtctgggctcagtctcagcctcagacaaagataaggggattgggatcctcaacactatc
ctcagtcccatgctgaacccagtcatttacagcctccagaaccctgatgtgcagggcacc
ctgaaaagggtgctgacagggaagaggcccccagcttga

If I run: entret rs_test:NW_113053, I get several entries (the first one 
is the correct one).

Any idea what is happening and how I can solve it?

Thanks in advance,
Enrique.


