[Biojava-dev] ParseException on db_xref of RefSeq GPFF protein files

Tjeerd Boerman twboerman at gmail.com
Thu Apr 26 14:37:06 UTC 2012


Hello,

When parsing the file at

ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/complete.104.protein.gpff.gz

with BioJava 1.8.2, an exception occurs:

---begin exception---
org.biojava.bio.seq.io.ParseException:

A Exception Has Occurred During Parsing.
Please submit the details that follow to biojava-l at biojava.org or post a 
bug report to http://bugzilla.open-bio.org/

Format_object=org.biojavax.bio.seq.io.GenbankFormat
Accession=YP_004256772
Id=325284232
Comments=Bad dbxref
Parse_block=FEATURES   Location/Qualifierssource   1..1174/organism   
"Deinococcus proteolyticus MRP"/strain   "MRP"/isolation_source   
"feces"/host   "Lama glama"/culture_collection   "DSMZ:DSM 
20540"/db_xref   "taxon:693977"/plasmid   "pDEIPR02"/collected_by   "M. 
Kobatake MRP"Protein   1..1174/product   "hypothetical 
protein"/calculated_mol_wt   129910Region   332..>674/region_name   
"COG1002"/note   "Type II restriction enzyme, methylase subunits
[Defense mechanisms]"/db_xref   "CDD:31206"CDS   1..1174/locus_tag   
"Deipr_2283"/coded_by   "complement(NC_015162.1:211..3735)"/note   
"COGs: COG1002 Type II restriction enzyme methylase
subunits;
KEGG: plm:Plim_2985 hypothetical protein;
SPTR: Type II restriction endonuclease"/transl_table   11/db_xref   
"InterPro:DNA methylase, N-6 adenine-specific,
conserved site"/db_xref   "InterPro:N6 adenine-specific DNA
methyltransferase, N12 class"/db_xref   "GeneID:10257767"
Stack trace follows ....

     at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:470)
     at loader.refseq.RefSeqTest.processFile(RefSeqTest.java:109)
     at loader.refseq.RefSeqTest.testBioJava(RefSeqTest.java:57)
     at loader.refseq.RefSeqTest.<init>(RefSeqTest.java:45)
     at loader.refseq.RefSeqTest.main(RefSeqTest.java:39)
---end exception---


Every db_xref is matched against the regular expression "^([^:]+):(\S+)$", 
which requires that the identifier after the colon contain no 
whitespace. Unfortunately, some db_xref identifiers for InterPro do 
contain whitespace, for example in CDS 1..1174 of protein YP_004256772:

                      /db_xref="InterPro:DNA methylase, N-6 adenine-specific,
                      conserved site"
                      /db_xref="InterPro:N6 adenine-specific DNA
                      methyltransferase, N12 class"
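
To make the mismatch easy to reproduce outside BioJava, here is a minimal standalone sketch using only java.util.regex. STRICT is the pattern quoted above. RELAXED, normalize() and split() are hypothetical names of my own, not BioJava code, and the relaxed pattern is only one possible way to accept free-text identifiers (anything non-empty after the first colon). I am also assuming that a wrapped qualifier value is joined into a single line before matching.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DbXrefRegexDemo {
    // Pattern quoted in the post: no whitespace allowed after the colon.
    static final Pattern STRICT = Pattern.compile("^([^:]+):(\\S+)$");

    // Hypothetical relaxed pattern (an assumption, not BioJava's code):
    // any non-empty text may follow the first colon.
    static final Pattern RELAXED = Pattern.compile("^([^:]+):(.+)$");

    // Collapse line breaks and runs of whitespace in a wrapped qualifier value.
    static String normalize(String value) {
        return value.trim().replaceAll("\\s+", " ");
    }

    // Returns {database, identifier}, or null when the pattern does not match.
    static String[] split(Pattern pattern, String value) {
        Matcher m = pattern.matcher(normalize(value));
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "taxon:693977",
            "InterPro:IPR002928",
            "InterPro:DNA methylase, N-6 adenine-specific,\nconserved site"
        };
        for (String s : samples) {
            System.out.println(normalize(s));
            System.out.println("  strict:  " + (split(STRICT, s) != null));
            System.out.println("  relaxed: " + (split(RELAXED, s) != null));
        }
    }
}
```

The strict pattern accepts the first two samples and rejects the third, which is the case that surfaces as "Bad dbxref". Whether relaxing the pattern is the right fix, as opposed to skipping or warning on such values, is a separate question.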

The Genbank format specification ( 
http://www.ncbi.nlm.nih.gov/genbank/collab/db_xref ) does not mention 
this format; it only defines the InterPro cross-reference as:

/db_xref="InterPro:IPR002928"


My guess is that either the GenbankFormat parser is not compatible with 
the GenPept format, or RefSeq is taking some liberties with the Genbank 
specification. Any help would be appreciated!

Best regards,
Tjeerd
