[EMBOSS] problems installing/using TrEMBL

Fernan Aguero fernan at iib.unsam.edu.ar
Thu Oct 4 22:41:44 UTC 2007


George, 

thanks for your points.

| Maybe you are missing the resource record in the emboss.default file for 
| the trembl databank and you have passed the wrong arguments to dbxflat. 

I have this resource record in my emboss.default file:

RES embl [ type: Index
  idlen:  15
  acclen: 15
  svlen:  15
  keylen: 25
  deslen: 25
  orglen: 25
]

|   You should choose the emboss method in the DB entry. 

OK

| Then, the 
| emboss.default file should contain also a resource entry for trembl:
| 
| RES trembl [
|     type: Index
|     idlen:  15
|     acclen: 15
|     svlen:  20
|     keylen: 30
|     deslen: 25
|     orglen: 25
| ]

Does the name of the resource matter? Mine is named 'embl' ...

|  From your dbxflat output you quote I can see that the command points to 
| the embl resource:
| 
| [root at alfa trembl]# dbxflat -dbname trembl -idformat EMBL <--- Why EMBL?

What other options are there: SWISS? GCG? GENBANK? AFAIK this is an
EMBL-formatted file. But maybe I'm wrong ...

| -directory . -filenames uniprot_trembl.dat -release "37.0"
| -date "24/07/07" -fields sv,acc,des,key,org
| Database b+tree indexing for flat file databases
| Resource name: embl  <--- That should say trembl, Why did you choose 
| embl here?

Because the resource in my emboss.default file is named 'embl'.
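
To see at a glance which resource names a given emboss.default actually defines (and so which names can be passed to dbxflat), something like the following could be used. It is only a sketch: the function name is made up, and it assumes RES entries always start a line in the form "RES name [", as in the two records quoted in this message.

```python
import re

# Assumption: a resource record opens with "RES <name> [" at the start of a
# line, as in the emboss.default snippets quoted in this thread.
RES_HEADER = re.compile(r"^\s*RES\s+(\S+)\s*\[", re.MULTILINE)

def resource_names(config_text):
    """Return the names of all RES records found in the config text."""
    return RES_HEADER.findall(config_text)

sample = """\
RES embl [ type: Index
  idlen:  15
]
RES trembl [
    type: Index
]
"""
print(resource_names(sample))  # ['embl', 'trembl']
```

If 'trembl' is not in the returned list, there is no trembl resource record to point dbxflat at.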

| 
| When the dbxflat command asked you for a resource name, you really 
| should have a trembl RES entry and I am not sure that your idformat 
| (EMBL) is correct.
|
| GM
| --
| George Magklaras

Mmm ... maybe it's SWISS then?

From the dbxflat docs:
      EMBL : EMBL
     SWISS : Swiss-Prot, SpTrEMBL, TrEMBLnew
        GB : Genbank, DDBJ
    REFSEQ : Refseq
Entry format [SWISS]: 
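
One rough way to tell the two formats apart is the ID line itself: UniProt (Swiss-Prot/TrEMBL) ID lines end in a length in "AA.", while EMBL nucleotide ID lines end in a length in "BP.". The sketch below applies that heuristic to the ID line of the first record in my file (shown in the PS). The function name is hypothetical and this is not how dbxflat itself decides anything.

```python
import re

def guess_idformat(id_line):
    """Heuristic guess at a -idformat value from a flat-file ID line."""
    if not id_line.startswith("ID   "):
        raise ValueError("not an ID line")
    # Protein entries (Swiss-Prot / TrEMBL) give the length in amino acids.
    if re.search(r"\d+\s+AA\.\s*$", id_line):
        return "SWISS"
    # EMBL nucleotide entries give the length in base pairs.
    if re.search(r"\d+\s+BP\.\s*$", id_line):
        return "EMBL"
    return "unknown"

line = "ID   A0B532_METTP            Unreviewed;       337 AA."
print(guess_idformat(line))  # SWISS
```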

Thanks for your questions and pointers. I'm running dbxflat
overnight again to see if this makes any difference
(-idformat SWISS -resource trembl, with a new trembl RES
entry added to emboss.default). So far, though, only 6 trembl.*
files have been produced and none of them is called
trembl.pxid (the file named in the error in my original
message, see below).

[root at alfa trembl]# ls trembl.*
trembl.ent  trembl.xac  trembl.xde  trembl.xkw  trembl.xsv trembl.xtx
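
A small sketch for checking which index files are missing after a run. It assumes, going only by the directory listing in my earlier message, that each index comes as a .x?? / .px?? pair and that the suffixes are id, ac, sv, de, kw and tx; "id" is included solely because seqret asked for trembl.pxid. The function name and the suffix list are my own guesses, not taken from the EMBOSS docs.

```python
from pathlib import Path

# Assumed suffixes, inferred from the file names seen in this thread.
SUFFIXES = ("id", "ac", "sv", "de", "kw", "tx")

def missing_index_files(index_dir, dbname, suffixes=SUFFIXES):
    """List the expected <dbname>.x?? / <dbname>.px?? files that do not exist."""
    base = Path(index_dir)
    missing = []
    for suffix in suffixes:
        for prefix in ("x", "px"):
            name = f"{dbname}.{prefix}{suffix}"
            if not (base / name).exists():
                missing.append(name)
    return missing

if __name__ == "__main__":
    # Index directory taken from the DB entry quoted below.
    for name in missing_index_files("/share/bio/emboss/trembl", "trembl"):
        print("missing:", name)
```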

Fernan

PS: this is the first entry in my uniprot_trembl.dat file

[fernan at alfa trembl]$ head -45 uniprot_trembl.dat 
ID   A0B532_METTP            Unreviewed;       337 AA.
AC   A0B532;
DT   28-NOV-2006, integrated into UniProtKB/TrEMBL.
DT   28-NOV-2006, sequence version 1.
DT   24-JUL-2007, entry version 6.
DE   RNA-3'-phosphate cyclase (EC 6.5.1.4).
GN   OrderedLocusNames=Mthe_0003;
OS   Methanosaeta thermophila (strain DSM 6194 / PT) (Methanothrix
OS   thermophila (strain DSM 6194 / PT)).
OC   Archaea; Euryarchaeota; Methanomicrobia; Methanosarcinales;
OC   Methanosaetaceae; Methanosaeta.
OX   NCBI_TaxID=349307;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RG   US DOE Joint Genome Institute;
RA   Copeland A., Lucas S., Lapidus A., Barry K., Detter J.C.,
RA   Glavina del Rio T., Hammon N., Israni S., Pitluck S., Chain P.,
RA   Malfatti S., Shin M., Vergez L., Schmutz J., Larimer F., Land M.,
RA   Hauser L., Kyrpides N., Kim E., Smith K.S., Ingram-Smith C.,
RA   Richardson P.;
RT   "Complete sequence of Methanosaeta thermophila PT.";
RL   Submitted (OCT-2006) to the EMBL/GenBank/DDBJ databases.
CC   -----------------------------------------------------------------------
CC   Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
CC   Distributed under the Creative Commons Attribution-NoDerivs License
CC   -----------------------------------------------------------------------
DR   EMBL; CP000477; ABK13806.1; -; Genomic_DNA.
DR   GenomeReviews; CP000477_GR; Mthe_0003.
DR   GO; GO:0003963; F:RNA-3'-phosphate cyclase activity; IEA:InterPro.
DR   InterPro; IPR000228; RNA3'_term_phos_cycl.
DR   InterPro; IPR013796; RNA3'_term_phos_cycl_insert.
DR   PANTHER; PTHR11096; RNA3'_term_phos_cycl; 1.
DR   Pfam; PF01137; RTC; 1.
DR   Pfam; PF05189; RTC_insert; 1.
DR   PROSITE; PS01287; RTC; 1.
PE   4: Predicted;
KW   Complete proteome; Ligase.
SQ   SEQUENCE   337 AA;  36340 MW;  69F26755A1B8DA03 CRC64;
     MNKPQMIEID GSYGEGGGQI VRTSVALSTL TGIPVRIKNI RRNRPRPGLA AQHVRAIEAL
     AQISRAETRG VHLGSEEIEF IPGRISAGSY DVDIGTAGSV TLLIQCLLPA LTAAEGPVTV
     TVRGGTDVRW SPTVDYLEHV ALPAMHLFGV TATFRCERRG YYPRGGGVVV LSTRPSRLRP
     ARLELIEEGI CGISHCGSLP EHVARRQADA ALELLKEKGY DARIDIQTMS SSSPGSGITL
     WSGFRGSSAL GERGVRAEDV GREAAKALID ELKSKASVDV HLADQLIPYI ALAGGEYTTR
     EISSHTRTNI WTAQRILRCR IDIDEGEVFR IHSTGSG
//


 
 
| Fernan Aguero wrote:
| >  
| > | On 2 Oct 2007, at 18:54, Fernan Aguero wrote:
| > | 
| > | > Hi,
| > | >
| > | > I've installed TrEMBL in EMBOSS and it seems like I'm having some
| > | > problems ...
| > | >
| > | > I've run dbiflat as follows:
| > | [snip]
| > | >
| > | > Now, when using seqret, it seems like I'm not getting the
| > | > records I expect, for example if I search for the first ID
| > | > in the example above (A0B532), I get A0BDZ0 instead:
| > | 
| > | I suspect your problem is that your trembl file is >2 GB in size.
| > | Above this size dbiflat won't work properly and will give wacky
| > | results such as the ones you've shown.  This won't be a problem with
| > | uniprot_sprot.dat as this is still only about 1.1 GB.
| > | 
| > | Your choices are therefore:
| > | 
| > | 1) You could split your trembl file into multiple files, each smaller
| > | than 2 GB.  This ends up being a complete pain, and you probably don't
| > | want to do it this way.
| > | 
| > | 2) Use the newer dbx* family of indexing programs which can cope with  
| > | larger file sizes.  In your case you'd use dbxflat instead of  
| > | dbiflat.  There are some configuration differences between the two so  
| > | you should read 'tfm dbxflat' first, but they work pretty much the  
| > | same as the old versions.  We use the dbx programs for all of our  
| > | databases and they work fine.
| > | 
| > | Hope this helps
| > | 
| > | Simon.
| >  
| > Simon,
| > 
| > thanks for your suggestions. I've been waiting for dbxflat
| > to finish before replying ... thus the delay.
| > 
| > You mention that there are some configuration
| > differences between db(x|i)flat ... I guess I've run into those
| > now ... even after reading tfm for dbxflat, it seems I just
| > can't set it up right.
| > 
| > ===> Configuration
| > DB trembl [
| >         type: P
| >         comment: "TrEMBL 37.0"
| >         method: emblcd
| >         format: embl
| >         dbalias: trembl
| >         dir: /share/bio/emboss/trembl/
| >         file: uniprot_trembl.dat
| >         indexdirectory: /share/bio/emboss/trembl
| > ]
| > 
| > With this configuration, I get this error:
| > [fernan at alfa ~]$ seqret trembl:A0B532
| > Reads and writes (returns) sequences
| > Warning: Cannot open division file '<null>' for database 'trembl'
| > Warning: seqCdQry failed
| > Error: Unable to read sequence 'trembl:A0B532'
| > Died: seqret terminated: Bad value for '-sequence' and no prompt
| > 
| > If I change the 'method' to 'method: emboss'
| > as per the example in the dbxflat docs, I get this error:
| > 
| > [fernan at alfa ~]$ seqret trembl:A0B532
| > Reads and writes (returns) sequences
| > 
| >    EMBOSS An error in ajindex.c at line 3028:
| > Cannot open param file /share/bio/emboss/trembl/trembl.pxid
| > 
| > This file does not exist (see result of indexing below):
| > 
| > ===> Indexing
| > [root at alfa trembl]# dbxflat -dbname trembl -idformat EMBL
| > -directory . -filenames uniprot_trembl.dat -release "37.0"
| > -date "24/07/07" -fields sv,acc,des,key,org
| > Database b+tree indexing for flat file databases
| > Resource name: embl
| > Processing file ./uniprot_trembl.dat
| > [root at alfa trembl]# du -hc *
| > 4.0K    dbxflat.command
| > 4.0K    trembl.ent
| > 4.0K    trembl.pxac
| > 4.0K    trembl.pxde
| > 4.0K    trembl.pxkw
| > 4.0K    trembl.pxsv
| > 4.0K    trembl.pxtx
| > 572M    trembl.xac
| > 4.2G    trembl.xde
| > 381M    trembl.xkw
| > 4.0K    trembl.xsv
| > 3.0G    trembl.xtx
| > 11G     uniprot_trembl.dat
| > 19G     total
| > 
| > I've also tried other combinations of 'method' (emboss,
| > emblcd) and 'format' (swiss, embl) without success ...
| > 
| > Am I indexing the db with the right incantation for dbxflat?
| > If so, what am I missing in my configuration?
| > 
| > Thanks again for any pointer,
| > 
| > Fernan
| > 
| > PS: this is on emboss-4.0.0 running on a Rocks Cluster (4.2,
| > CentOS)
| > 
| > _______________________________________________
| > EMBOSS mailing list
| > EMBOSS at lists.open-bio.org
| > http://lists.open-bio.org/mailman/listinfo/emboss
| > 