[Bioperl-l] Whitespace in locus causes problems for parsers

Michael Muratet mam at torchconcepts.com
Sun Jun 1 16:14:59 EDT 2003


I was parsing CDS features in RefSeq human (hs.gbff.gz) with BioPerl
when it died on 'PSMAL/GCP III' (NM_153696). The BioPerl parser picks
up the sequence length from the LOCUS line, and for this record it sees
'III' instead of '1992' bp because of the whitespace in the locus name
between GCP and III. This causes the routine to fail.
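
To illustrate the failure mode, here is a minimal Python sketch. It is
not BioPerl's actual code. The sample LOCUS line is made up for
illustration from the locus name and length mentioned above (the
spacing and the trailing 'mRNA' field are assumptions), and the helper
names naive_length and tolerant_parse are hypothetical. It contrasts
taking the length by whitespace-split position with anchoring on the
digits that precede 'bp'.

```python
import re

# Illustrative LOCUS line only: built from the locus name and length in
# this message; the spacing and the "mRNA" field are assumptions.
SAMPLE = "LOCUS       PSMAL/GCP III   1992 bp    mRNA"

# Locus name = everything between "LOCUS" and the number followed by "bp".
LOCUS_RE = re.compile(r"^LOCUS\s+(?P<name>.+?)\s+(?P<length>\d+)\s+bp\b")


def naive_length(line):
    """Take the third whitespace-separated field as the length.

    This assumes the locus name is a single token, which is the
    assumption that breaks on 'PSMAL/GCP III'.
    """
    return line.split()[2]


def tolerant_parse(line):
    """Return (locus_name, length), allowing whitespace inside the name."""
    match = LOCUS_RE.match(line)
    if match is None:
        raise ValueError("unrecognized LOCUS line: %r" % line)
    return match.group("name"), int(match.group("length"))
```

On the sample line, naive_length returns 'III', while tolerant_parse
returns the name 'PSMAL/GCP III' with the length 1992. Anchoring on
'bp' only helps with whitespace; it would still be fooled by a locus
name that itself contained something like '12 bp', which is one reason
a naming rule at the source is the cleaner fix.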

It's a lot to ask of BioPerl (or any other package) to figure out every
possible format for a locus, and those of us working with many
sequences must be able to parse automatically. I'd like to recommend
that RefSeq (and GenBank, UniGene, etc.) adopt (or enforce existing)
rules about whitespace and punctuation marks in gene names. In the
meantime, I'd like to suggest you change the locus for NM_153696 to

Best regards,

Mike Muratet
