[Biojava-dev] Biojava can't parse Uniprot Q5ZT67

chris.morris at stfc.ac.uk chris.morris at stfc.ac.uk
Thu Jun 13 11:22:03 UTC 2013


Hi,

BioJava 1.8.2 is unable to parse:
    http://www.uniprot.org/uniprot/Q5ZT67.txt

It reports:

NCBI taxonomy names cannot embed new lines - at:23, in name: <strain Philadelphia 1 / ATCC 33152 / DSM 7513>
because of these lines:

OS   Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 /
OS   ATCC 33152 / DSM 7513).
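Read as a continuation, the two OS lines above would form a single organism name. A minimal sketch of joining them before any name object is built is below. It assumes the usual flat-file layout of a two-letter line code followed by padding; OsLineJoiner and joinOsLines are hypothetical names for illustration and are not part of BioJava.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not a BioJava class.
final class OsLineJoiner {

    // Joins the data part of every OS line into one space-separated string.
    static String joinOsLines(List<String> lines) {
        StringBuilder joined = new StringBuilder();
        for (String line : lines) {
            if (!line.startsWith("OS ")) {
                continue; // skip GN, OC and any other line codes
            }
            // Drop the two-letter line code, then trim the padding around the data.
            String data = line.substring(2).trim();
            if (data.isEmpty()) {
                continue;
            }
            if (joined.length() > 0) {
                joined.append(' ');
            }
            joined.append(data);
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        List<String> record = Arrays.asList(
                "GN   Name=legC7; OrderedLocusNames=lpg2298;",
                "OS   Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 /",
                "OS   ATCC 33152 / DSM 7513).",
                "OC   Bacteria; Proteobacteria; Gammaproteobacteria; Legionellales;");
        System.out.println(joinOsLines(record));
    }
}
```

With the lines joined this way, the name handed on contains no line break, so the check that raises the error would not fire.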

It seems to me that the record is mistaken. 

If not, BioJava needs a fix. The fix would be to replace:

public SimpleNCBITaxonName(String nameClass, String name) {
    if (nameClass==null) throw new IllegalArgumentException("Name class cannot be null");
    if (name==null) throw new IllegalArgumentException("Name cannot be null");
    if (name.indexOf('\n') >= 0) throw new IllegalArgumentException("NCBI taxonomy names cannot embed new lines - at:"+name.indexOf('\n')+", in name: <"+name+">");
    this.nameClass = nameClass;
    this.name = name;
}

With:

public SimpleNCBITaxonName(String nameClass, String name) {
    if (nameClass==null) throw new IllegalArgumentException("Name class cannot be null");
    if (name==null) throw new IllegalArgumentException("Name cannot be null");
    this.nameClass = nameClass;
    this.name = name.replaceAll("\\n", " ");
}
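To try the idea outside BioJava, here is a small self-contained sketch. TaxonNameSketch is a hypothetical stand-in, not the real SimpleNCBITaxonName, and it goes slightly beyond the replacement above by also collapsing CR and CRLF line breaks together with the whitespace around them. That extra handling is my assumption, not something the error report requires.

```java
// Hypothetical stand-in for illustration only; not the BioJava class itself.
final class TaxonNameSketch {
    private final String nameClass;
    private final String name;

    TaxonNameSketch(String nameClass, String name) {
        if (nameClass == null) throw new IllegalArgumentException("Name class cannot be null");
        if (name == null) throw new IllegalArgumentException("Name cannot be null");
        this.nameClass = nameClass;
        // Collapse each line break (LF, CRLF or CR) and the whitespace around it
        // into a single space instead of rejecting the name.
        this.name = name.replaceAll("\\s*[\\r\\n]+\\s*", " ").trim();
    }

    String getNameClass() {
        return nameClass;
    }

    String getName() {
        return name;
    }

    public static void main(String[] args) {
        // "scientific name" is only an example value for the name class.
        TaxonNameSketch taxonName = new TaxonNameSketch(
                "scientific name",
                "strain Philadelphia 1 /\nATCC 33152 / DSM 7513");
        System.out.println(taxonName.getName());
    }
}
```

Normalising in the constructor keeps callers unchanged, at the cost of silently altering the input; whether that is preferable to joining the lines in the parser is a design choice for the maintainers.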

Regards,
Chris Morris

-----Original Message-----
From: Morris, Chris (STFC,DL,SC) 
Sent: 13 June 2013 12:14
To: 'Nikos Pinotsis'
Subject: RE: error: Cannot recognise format of the record, please refer to the help pages

Hi Nikos,

Thank you for this important defect report.

The library that PiMS uses to process Uniprot files reports this problem:

NCBI taxonomy names cannot embed new lines - at:23, in name: <strain Philadelphia 1 / ATCC 33152 / DSM 7513>

In this part of the Uniprot record:

GN   Name=legC7; OrderedLocusNames=lpg2298;
OS   Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 /
OS   ATCC 33152 / DSM 7513).
OC   Bacteria; Proteobacteria; Gammaproteobacteria; Legionellales;
OC   Legionellaceae; Legionella.

I will release PiMS 4.4 next week, and I will include a workaround in it. I will also report the problem to UniProt.

Meanwhile, if you use a reference to the gene instead:
    GenBank YP_007567339.1
Then PiMS does upload the sequences successfully.

Regards,
Chris

-----Original Message-----
From: owner-pims-defects at dlmail2.dl.ac.uk [mailto:owner-pims-defects at dlmail2.dl.ac.uk] On Behalf Of Nikos Pinotsis
Sent: 12 June 2013 19:24
To: pims-defects
Subject: error: Cannot recognise format of the record, please refer to the help pages

Hi,

I am using PiMS at the http://pims.structuralbiology.eu:8080 site and am trying to download the target Q5ZT67_LEGPH (or Q5ZT67) from several databases. However, I always get the same error: that the format of the record cannot be recognised. Can you suggest a solution?

thanks
Nikos

--
Dr. Nikos Pinotsis
Professor Gabriel Waksman's Group
Crystallography , Birkbeck College
University of London
Malet Street
London WC1E 7HX, UK
T: +44 (0)207 631 6827
F: +44 (0)207 631 6803
M: +44 (0)792 384 3593




