[Biojava-dev] ParseException on db_xref of RefSeq GPFF protein files

George Waldon gwaldon at geneinfinity.org
Fri Apr 27 13:53:37 UTC 2012


Hi Tjeerd,

This is an error in the GenBank file formatting. You should contact  
NCBI and ask them to fix it.

- George

Quoting Tjeerd Boerman <twboerman at gmail.com>:

> Hello,
>
> When parsing the file at
>
> ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/complete.104.protein.gpff.gz
>
> with BioJava 1.8.2, an exception occurs:
>
> ---begin exception---
> org.biojava.bio.seq.io.ParseException:
>
> A Exception Has Occurred During Parsing.
> Please submit the details that follow to biojava-l at biojava.org or  
> post a bug report to http://bugzilla.open-bio.org/
>
> Format_object=org.biojavax.bio.seq.io.GenbankFormat
> Accession=YP_004256772
> Id=325284232
> Comments=Bad dbxref
> Parse_block=FEATURES   Location/Qualifierssource   1..1174/organism   
>  "Deinococcus proteolyticus MRP"/strain   "MRP"/isolation_source    
> "feces"/host   "Lama glama"/culture_collection   "DSMZ:DSM  
> 20540"/db_xref   "taxon:693977"/plasmid   "pDEIPR02"/collected_by    
> "M. Kobatake MRP"Protein   1..1174/product   "hypothetical  
> protein"/calculated_mol_wt   129910Region   332..>674/region_name    
> "COG1002"/note   "Type II restriction enzyme, methylase subunits
> [Defense mechanisms]"/db_xref   "CDD:31206"CDS   1..1174/locus_tag    
> "Deipr_2283"/coded_by   "complement(NC_015162.1:211..3735)"/note    
> "COGs: COG1002 Type II restriction enzyme methylase
> subunits;
> KEGG: plm:Plim_2985 hypothetical protein;
> SPTR: Type II restriction endonuclease"/transl_table   11/db_xref    
> "InterPro:DNA methylase, N-6 adenine-specific,
> conserved site"/db_xref   "InterPro:N6 adenine-specific DNA
> methyltransferase, N12 class"/db_xref   "GeneID:10257767"
> Stack trace follows ....
>
>     at  
> org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:470)
>     at loader.refseq.RefSeqTest.processFile(RefSeqTest.java:109)
>     at loader.refseq.RefSeqTest.testBioJava(RefSeqTest.java:57)
>     at loader.refseq.RefSeqTest.<init>(RefSeqTest.java:45)
>     at loader.refseq.RefSeqTest.main(RefSeqTest.java:39)
> ---end exception---
>
>
> Every db_xref is matched against the regular expression "^([^:]+):(\S+)$",
> which requires that the identifier after the colon contain no
> whitespace. Unfortunately, some InterPro db_xref identifiers do
> contain whitespace, for example in CDS 1..1174 of protein
> YP_004256772:
>
>                      /db_xref="InterPro:DNA methylase, N-6 adenine-specific,
>                      conserved site"
>                      /db_xref="InterPro:N6 adenine-specific DNA
>                      methyltransferase, N12 class"
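To make the mismatch concrete, here is a minimal sketch that applies the quoted regular expression to three db_xref values taken from this message. The class and method names are hypothetical, and the use of Matcher.matches() is an assumption: the message only quotes the pattern, not how GenbankFormat applies it. The wrapped qualifier is joined into one line with single spaces for the test.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of BioJava.
class DbXrefRegexDemo {
    // The pattern quoted in the message: database name, colon,
    // then an identifier made only of non-whitespace characters.
    static final Pattern DBXREF = Pattern.compile("^([^:]+):(\\S+)$");

    // True if the whole db_xref value fits the pattern.
    static boolean matches(String value) {
        return DBXREF.matcher(value).matches();
    }

    public static void main(String[] args) {
        // Values copied from the record discussed above.
        System.out.println(matches("taxon:693977"));
        System.out.println(matches("InterPro:IPR002928"));
        // Free-text InterPro entry, unwrapped onto one line.
        System.out.println(matches(
            "InterPro:DNA methylase, N-6 adenine-specific, conserved site"));
    }
}
```

The first two values fit the pattern; the free-text InterPro value does not, because \S+ cannot span the spaces after the colon.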
>
> The GenBank format specification
> ( http://www.ncbi.nlm.nih.gov/genbank/collab/db_xref ) does not
> mention this format; it only defines the InterPro cross-reference as:
>
> /db_xref="InterPro:IPR002928"
>
>
> My guess is that either the GenbankFormat parser is not compatible
> with the GenPept format, or RefSeq is taking some liberties with the
> GenBank specification. Any help would be appreciated!
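Until the files are corrected upstream, one possible stopgap is to split such values leniently in your own code. The sketch below is only an illustration under assumptions: the class, method, and relaxed pattern are hypothetical, this is not how BioJava's GenbankFormat behaves, and it presumes you already have the raw qualifier value as a string. It splits on the first colon and collapses wrapped whitespace in the remainder.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of BioJava.
class LenientDbXref {
    // Relaxed variant of the quoted pattern: anything non-empty may
    // follow the first colon. DOTALL lets '.' span wrapped lines.
    static final Pattern LENIENT =
        Pattern.compile("^([^:]+):(.+)$", Pattern.DOTALL);

    // Returns {database, identifier}, or null if there is no
    // "name:value" shape at all.
    static String[] split(String value) {
        Matcher m = LENIENT.matcher(value.trim());
        if (!m.matches()) {
            return null;
        }
        // Collapse line wraps and indentation into single spaces.
        String id = m.group(2).trim().replaceAll("\\s+", " ");
        return new String[] { m.group(1), id };
    }

    public static void main(String[] args) {
        String[] parts = split(
            "InterPro:N6 adenine-specific DNA\n"
            + "                     methyltransferase, N12 class");
        System.out.println(parts[0] + " | " + parts[1]);
    }
}
```

Note that the identifier recovered this way is a description, not an InterPro accession such as IPR002928, so it is of limited use as a cross-reference.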
>
> Best regards,
> Tjeerd



--------------------------------
George Waldon




