[Biojava-dev] ParseException on db_xref of RefSeq GPFF protein files

Tue May 1 21:37:39 UTC 2012

Hey,

I just got a response from NCBI:
> Hello,
> This is a formatting error and it will be fixed.
> Best,
> Majda

Regards,
Tjeerd

On 4/27/2012 5:10 PM, Tjeerd Boerman wrote:
> Hi,
>
> I had already sent an email to NCBI, and I just got a response saying
> that they are looking into it. So I guess we'll wait and see.
>
> Regards,
> Tjeerd
>
> On 04/27/2012 03:53 PM, George Waldon wrote:
>> Hi Tjeerd,
>>
>> This is an error in the GenBank file formatting. You should contact 
>> NCBI and ask them to fix it.
>>
>> - George
>>
>> Quoting Tjeerd Boerman <twboerman at gmail.com>:
>>
>>> Hello,
>>>
>>> When parsing the file at
>>>
>>> ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/complete.104.protein.gpff.gz 
>>>
>>>
>>> with BioJava 1.8.2, an exception occurs:
>>>
>>> ---begin exception---
>>> org.biojava.bio.seq.io.ParseException:
>>>
>>> A Exception Has Occurred During Parsing.
>>> Please submit the details that follow to biojava-l at biojava.org or 
>>> post a bug report to http://bugzilla.open-bio.org/
>>>
>>> Format_object=org.biojavax.bio.seq.io.GenbankFormat
>>> Accession=YP_004256772
>>> Id=325284232
>>> Comments=Bad dbxref
>>> Parse_block=FEATURES   Location/Qualifierssource   1..1174/organism  
>>>  "Deinococcus proteolyticus MRP"/strain   "MRP"/isolation_source   
>>> "feces"/host   "Lama glama"/culture_collection   "DSMZ:DSM 
>>> 20540"/db_xref   "taxon:693977"/plasmid   "pDEIPR02"/collected_by   
>>> "M. Kobatake MRP"Protein   1..1174/product   "hypothetical 
>>> protein"/calculated_mol_wt   129910Region   332..>674/region_name   
>>> "COG1002"/note   "Type II restriction enzyme, methylase subunits
>>> [Defense mechanisms]"/db_xref   "CDD:31206"CDS   1..1174/locus_tag   
>>> "Deipr_2283"/coded_by   "complement(NC_015162.1:211..3735)"/note   
>>> "COGs: COG1002 Type II restriction enzyme methylase
>>> subunits;
>>> KEGG: plm:Plim_2985 hypothetical protein;
>>> SPTR: Type II restriction endonuclease"/transl_table   11/db_xref   
>>> "InterPro:DNA methylase, N-6 adenine-specific,
>>> conserved site"/db_xref   "InterPro:N6 adenine-specific DNA
>>> methyltransferase, N12 class"/db_xref   "GeneID:10257767"
>>> Stack trace follows ....
>>>
>>>     at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:470)
>>>     at loader.refseq.RefSeqTest.processFile(RefSeqTest.java:109)
>>>     at loader.refseq.RefSeqTest.testBioJava(RefSeqTest.java:57)
>>>     at loader.refseq.RefSeqTest.<init>(RefSeqTest.java:45)
>>>     at loader.refseq.RefSeqTest.main(RefSeqTest.java:39)
>>> ---end exception---
>>>
>>>
>>> Every db_xref is matched against the regular expression
>>> "^([^:]+):(\S+)$", which requires that the identifier after the
>>> colon contain no whitespace. Unfortunately, some InterPro db_xref
>>> identifiers do contain whitespace, for example in CDS 1..1174 of
>>> protein YP_004256772:
>>>
>>>                      /db_xref="InterPro:DNA methylase, N-6 adenine-specific,
>>>                      conserved site"
>>>                      /db_xref="InterPro:N6 adenine-specific DNA
>>>                      methyltransferase, N12 class"
>>>
>>> The GenBank format specification
>>> ( http://www.ncbi.nlm.nih.gov/genbank/collab/db_xref ) does not
>>> mention this form; it only defines the InterPro cross-reference as:
>>>
>>> /db_xref="InterPro:IPR002928"
>>>
>>>
>>> My guess is that either the GenbankFormat parser is not compatible
>>> with the GenPept format, or RefSeq is taking some liberties with the
>>> GenBank specification. Any help would be appreciated!
>>>
>>> Best regards,
>>> Tjeerd
>>> _______________________________________________
>>> biojava-dev mailing list
>>> biojava-dev at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>>
>>
>>
>>
>> --------------------------------
>> George Waldon
>>
>>
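
For anyone who wants to reproduce the mismatch without downloading the
release file, here is a minimal standalone sketch. It applies the regular
expression quoted in the original report to the db_xref values from
YP_004256772, using only java.util.regex. The class name DbXrefCheck, the
LENIENT pattern and the splitLenient helper are assumptions made for
illustration; they are not part of BioJava, and the sketch does not show how
GenbankFormat itself is implemented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DbXrefCheck {
    // Pattern as quoted in the original report: no whitespace allowed after the colon.
    static final Pattern STRICT = Pattern.compile("^([^:]+):(\\S+)$");

    // Hypothetical lenient variant (an assumption, not BioJava's pattern):
    // accepts anything after the first colon, including wrapped lines.
    static final Pattern LENIENT = Pattern.compile("^([^:]+):(.+)$", Pattern.DOTALL);

    // True when the whole db_xref value fits the given pattern.
    static boolean matches(Pattern p, String value) {
        return p.matcher(value).matches();
    }

    // Splits a value into {database, identifier} with the lenient pattern,
    // collapsing line-wrap whitespace in the identifier; null if it does not fit.
    static String[] splitLenient(String value) {
        Matcher m = LENIENT.matcher(value);
        if (!m.matches()) {
            return null;
        }
        String id = m.group(2).replaceAll("\\s+", " ").trim();
        return new String[] { m.group(1), id };
    }

    public static void main(String[] args) {
        String[] samples = {
            "taxon:693977",
            "InterPro:IPR002928",
            "InterPro:DNA methylase, N-6 adenine-specific,\nconserved site",
            "InterPro:N6 adenine-specific DNA\nmethyltransferase, N12 class",
            "GeneID:10257767"
        };
        for (String s : samples) {
            String oneLine = s.replace('\n', ' ');
            System.out.println(oneLine
                + " | strict=" + matches(STRICT, s)
                + " | lenient=" + matches(LENIENT, s));
        }
    }
}
```

The strict pattern accepts the taxon, GeneID and InterPro:IPR002928 values
and rejects the two descriptive InterPro values, which is consistent with the
"Bad dbxref" comment in the exception. Since NCBI has confirmed that this is
a formatting error on their side, the lenient pattern is only meant as a
possible stopgap for local preprocessing, not as a proposed change to the
parser.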