[Biojava-dev] Biojava can't parse Uniprot Q5ZT67

Spencer Bliven sbliven at ucsd.edu
Thu Jun 13 20:49:14 UTC 2013


What if we just strip out the newline characters in OS records? That seems
better than ignoring them or throwing an exception.
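
A minimal sketch of what stripping the newlines could look like, written as
a standalone helper so it can be tried without touching the parser. The
class OsLineJoiner and both of its methods are hypothetical names, not
existing BioJava API; the sketch only assumes that OS lines start with
"OS" followed by three spaces, as in the record quoted below.

```java
// Hypothetical helper for illustration only; not part of BioJava.
public final class OsLineJoiner {

    private OsLineJoiner() {}

    /** Replaces each run of line breaks (and adjacent blanks) with one space. */
    public static String normalizeName(String name) {
        if (name == null) {
            throw new IllegalArgumentException("Name cannot be null");
        }
        return name.replaceAll("[ \\t]*[\\r\\n]+[ \\t]*", " ").trim();
    }

    /** Joins the payload of every OS line in a flat-file record with single spaces. */
    public static String joinOsLines(String record) {
        StringBuilder joined = new StringBuilder();
        for (String line : record.split("\\r?\\n")) {
            // Assumption: a two-letter line code followed by three spaces.
            if (line.startsWith("OS   ")) {
                if (joined.length() > 0) {
                    joined.append(' ');
                }
                joined.append(line.substring(5).trim());
            }
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        String record =
              "GN   Name=legC7; OrderedLocusNames=lpg2298;\n"
            + "OS   Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 /\n"
            + "OS   ATCC 33152 / DSM 7513).\n"
            + "OC   Bacteria; Proteobacteria; Gammaproteobacteria; Legionellales;\n";
        System.out.println(joinOsLines(record));
        System.out.println(normalizeName("strain Philadelphia 1 /\nATCC 33152 / DSM 7513"));
    }
}
```

Either method yields a name without embedded line breaks, so the existing
check in SimpleNCBITaxonName would no longer fire. Whether the cleanup
belongs in the parser or in the constructor is a separate question.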


On Thu, Jun 13, 2013 at 4:22 AM, <chris.morris at stfc.ac.uk> wrote:

> Hi,
>
> BioJava 1.8.2 is unable to parse:
>     http://www.uniprot.org/uniprot/Q5ZT67.txt
>
> It reports:
>
> NCBI taxonomy names cannot embed new lines - at:23, in name: <strain
> Philadelphia 1 / ATCC 33152 / DSM 7513>
> because of these lines:
>
> OS   Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 /
> OS   ATCC 33152 / DSM 7513).
>
> It seems to me that the record is mistaken.
>
> If not, biojava needs a fix. The fix would be to replace:
>
> public SimpleNCBITaxonName(String nameClass, String name) {
>     if (nameClass == null)
>         throw new IllegalArgumentException("Name class cannot be null");
>     if (name == null)
>         throw new IllegalArgumentException("Name cannot be null");
>     if (name.indexOf('\n') >= 0)
>         throw new IllegalArgumentException(
>             "NCBI taxonomy names cannot embed new lines - at:"
>             + name.indexOf('\n') + ", in name: <" + name + ">");
>     this.nameClass = nameClass;
>     this.name = name;
> }
>
> With:
>
> public SimpleNCBITaxonName(String nameClass, String name) {
>     if (nameClass == null)
>         throw new IllegalArgumentException("Name class cannot be null");
>     if (name == null)
>         throw new IllegalArgumentException("Name cannot be null");
>     this.nameClass = nameClass;
>     this.name = name.replaceAll("\\n", " ");
> }
>
> Regards,
> Chris Morris
>
> -----Original Message-----
> From: Morris, Chris (STFC,DL,SC)
> Sent: 13 June 2013 12:14
> To: 'Nikos Pinotsis'
> Subject: RE: error: Cannot recognise format of the record, please refer to
> the help pages
>
> Hi Nikos,
>
> Thank you for this important defect report.
>
> The library that PiMS uses to process Uniprot files reports this problem:
>
> NCBI taxonomy names cannot embed new lines - at:23, in name: <strain
> Philadelphia 1 / ATCC 33152 / DSM 7513>
>
> In this part of the Uniprot record:
>
> GN   Name=legC7; OrderedLocusNames=lpg2298;
> OS   Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 /
> OS   ATCC 33152 / DSM 7513).
> OC   Bacteria; Proteobacteria; Gammaproteobacteria; Legionellales;
> OC   Legionellaceae; Legionella.
>
> I will release PiMS 4.4 next week, and I will include a workaround in it. I
> will also report the problem to Uniprot.
>
> Meanwhile, if you use a reference to the gene instead:
>     GenBank YP_007567339.1
> Then PiMS does upload the sequences successfully.
>
> Regards,
> Chris
>
> -----Original Message-----
> From: owner-pims-defects at dlmail2.dl.ac.uk [mailto:
> owner-pims-defects at dlmail2.dl.ac.uk] On Behalf Of Nikos Pinotsis
> Sent: 12 June 2013 19:24
> To: pims-defects
> Subject: error: Cannot recognise format of the record, please refer to the
> help pages
>
> Hi,
>
> I am using PiMS at the http://pims.structuralbiology.eu:8080 site and
> I am trying to download the target Q5ZT67_LEGPH or Q5ZT67 from several
> databases. However, I always get the same error, saying that the format of
> the record is not recognisable. Can you suggest a solution?
>
> thanks
> Nikos
>
> --
> Dr. Nikos Pinotsis
> Professor Gabriel Waksman's Group
> Crystallography , Birkbeck College
> University of London
> Malet Street
> London WC1E 7HX, UK
> T: +44 (0)207 631 6827
> F: +44 (0)207 631 6803
> M: +44 (0)792 384 3593
>
>


