[Bioperl-l] Fw: RE: Whitespace in locus causes problems for parsers

Michael Muratet mam at torchconcepts.com
Tue Jun 17 18:40:13 EDT 2003


This is a follow-up to a problem I posted a week or two ago. NCBI is
going to fix this instance and, hopefully, catch others before they occur.



Begin forwarded message:

Date: Tue, 17 Jun 2003 13:49:00 -0400
From: "Messersmith, Donna (NIH/NLM/NCBI)" <messersm at ncbi.nlm.nih.gov>
To: "'mam at torchconcepts.com'" <mam at torchconcepts.com>
Cc: "Messersmith, Donna (NIH/NLM/NCBI)" <messersm at ncbi.nlm.nih.gov>,
"Romiti, Monica (NIH/NLM/NCBI)" <romiti at ncbi.nlm.nih.gov>
Subject: RE: Whitespace in locus causes problems for parsers

Dear Colleague,

We have asked our developers to replace the whitespace with an underline,
as you point out.  Thank you for bringing this problem to our attention.

Donna Messersmith
NCBI User Services

------------- Begin Forwarded Message -------------

Date: Sun, 1 Jun 2003 15:14:59 -0500
From: Michael Muratet <mam at torchconcepts.com>
To: info at ncbi.nlm.nih.gov
Cc: bioperl-l at bioperl.org
Subject: Whitespace in locus causes problems for parsers

I was parsing CDS features in RefSeq human (hs.gbff.gz) with BioPerl
when it died on the locus 'PSMAL/GCP III' (NM_153696). The BioPerl
parser picks up the sequence length from the LOCUS line, and for this
record it sees 'III' rather than '1992' bp because of the whitespace in
the locus between GCP and III. This causes the routine to fail.
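
To illustrate the failure, here is a minimal Python sketch. It is not
BioPerl's actual parser code; it only assumes that the length is taken as
the third whitespace-separated token of the LOCUS line, which reproduces
the 'III' result. The sample line is a shortened, hypothetical
reconstruction of the NM_153696 LOCUS line (the column spacing and the
'mRNA' field are assumptions), and the function names are made up for
illustration.

```python
import re

# Hypothetical, shortened LOCUS line modeled on the record described above.
# Only the locus name and the 1992 bp length come from the original report.
LOCUS_LINE = "LOCUS       PSMAL/GCP III  1992 bp    mRNA"


def naive_length(line):
    """Assumed token-counting approach: the third token is the length."""
    tokens = line.split()
    # tokens == ['LOCUS', 'PSMAL/GCP', 'III', '1992', 'bp', 'mRNA']
    return tokens[2]


def tolerant_length(line):
    """Anchor on the '<digits> bp' (or 'aa') field instead of counting tokens."""
    match = re.search(r"\s(\d+)\s+(?:bp|aa)\b", line)
    if match is None:
        raise ValueError("no length field found in LOCUS line")
    return int(match.group(1))


def tolerant_name(line):
    """Take everything between 'LOCUS' and the length field as the name."""
    match = re.match(r"LOCUS\s+(.+?)\s+\d+\s+(?:bp|aa)\b", line)
    if match is None:
        raise ValueError("not a recognizable LOCUS line")
    return match.group(1)


print(naive_length(LOCUS_LINE))     # III
print(tolerant_length(LOCUS_LINE))  # 1992
print(tolerant_name(LOCUS_LINE))    # PSMAL/GCP III
```

Anchoring on the length field tolerates a space inside the name, but it is
only a workaround on the parser side; a name that itself contained
something like '12 bp' would still confuse it.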

It's a lot to ask of BioPerl (or any other package) to figure out every
possible format for a locus, and those of us working with many
sequences must be able to parse automatically. I'd like to recommend
that RefSeq (and GenBank, UniGene, etc.) adopt (or enforce
existing) rules about whitespace and punctuation marks in gene names. In
the meantime, I'd like to suggest you change the locus for NM_153696 to

Best regards,

Mike Muratet

