[Biojava-dev] Biojava cant parse Uniprot Q5ZT67

Simon Foote simon.foote at nrc-cnrc.gc.ca
Fri Jun 14 13:23:46 UTC 2013


I had similar issues when parsing some bacterial sequences.

I made the following change in org.biojavax.bio.seq.io.UniProtFormat at
line 317, and parsing now works fine.

    } else if (sectionKey.equals(SOURCE_TAG)) {
        // use SOURCE_TAG and TAXON_TAG values
        String sciname = null;
        String comname = null;
        List synonym = new ArrayList();
        int taxid = 0;
        for (int i = 0; i < section.size(); i++) {
            String tag = ((String[])section.get(i))[0];
            // line 317:
            String value = ((String[])section.get(i))[1].trim();
            // Replace any newlines with spaces
            value = value.replace("\n", " ");
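
For anyone who wants to try the idea outside BioJava, here is a minimal
standalone sketch of the same newline handling. The class OsValueDemo and
its two methods are hypothetical names used only for illustration; they are
not part of the BioJava API. It also assumes that a line-based parser joins
the continuation lines of one tag with "\n", which is what the error message
suggests but which I have not verified against the parser source.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, NOT part of BioJava: it only illustrates collapsing
// a multi-line OS value into a single-line name.
final class OsValueDemo {

    // Trim the value and replace each embedded newline with one space,
    // mirroring the two lines added to the parser.
    static String normalize(String value) {
        return value.trim().replace("\n", " ");
    }

    // Assumption: continuation lines of the same tag are joined with '\n'.
    // Collects the payload of every "OS   " line (two-letter code plus
    // three spaces) and ignores all other lines.
    static String collectOs(List<String> lines) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            if (line.startsWith("OS   ")) {
                if (sb.length() > 0) {
                    sb.append('\n');
                }
                sb.append(line.substring(5));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two OS lines from the Q5ZT67 record quoted in this thread.
        List<String> record = Arrays.asList(
            "OS   Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 /",
            "OS   ATCC 33152 / DSM 7513).");
        String raw = collectOs(record);          // still contains a '\n'
        System.out.println(normalize(raw));      // single-line organism name
    }
}
```

Normalizing in the parser leaves the newline check in the
SimpleNCBITaxonName constructor in place, so other callers that pass a
name with an embedded newline would still get the exception.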

I can commit the change if you like.

Cheers,
Simon

On 06/13/2013 04:49 PM, Spencer Bliven wrote:
> What if we just strip out the newline characters in OS records? That seems
> better than ignoring them or throwing an exception.
>
>
> On Thu, Jun 13, 2013 at 4:22 AM, <chris.morris at stfc.ac.uk> wrote:
>
>> Hi,
>>
>> BioJava 1.8.2 is unable to parse:
>>      http://www.uniprot.org/uniprot/Q5ZT67.txt
>>
>> It reports:
>>
>> NCBI taxonomy names cannot embed new lines - at:23, in name: <strain
>> Philadelphia 1 / ATCC 33152 / DSM 7513>
>> because of these lines:
>>
>> OS   Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 /
>> OS   ATCC 33152 / DSM 7513).
>>
>> It seems to me that the record is mistaken.
>>
>> If not, BioJava needs a fix. The fix would be to replace:
>>
>> public SimpleNCBITaxonName(String nameClass, String name) {
>>     if (nameClass==null) throw new IllegalArgumentException("Name class cannot be null");
>>     if (name==null) throw new IllegalArgumentException("Name cannot be null");
>>     if (name.indexOf('\n') >= 0) throw new IllegalArgumentException(
>>         "NCBI taxonomy names cannot embed new lines - at:"+name.indexOf('\n')+", in name: <"+name+">");
>>     this.nameClass = nameClass;
>>     this.name = name;
>> }
>>
>> With:
>>
>> public SimpleNCBITaxonName(String nameClass, String name) {
>>     if (nameClass==null) throw new IllegalArgumentException("Name class cannot be null");
>>     if (name==null) throw new IllegalArgumentException("Name cannot be null");
>>     this.nameClass = nameClass;
>>     this.name = name.replaceAll("\\n", " ");
>> }
>>
>> Regards,
>> Chris Morris
>>
>> -----Original Message-----
>> From: Morris, Chris (STFC,DL,SC)
>> Sent: 13 June 2013 12:14
>> To: 'Nikos Pinotsis'
>> Subject: RE: error: Cannot recognise format of the record, please refer to
>> the help pages
>>
>> Hi Nikos,
>>
>> Thank you for this important defect report.
>>
>> The library that PiMS uses to process Uniprot files reports this problem:
>>
>> NCBI taxonomy names cannot embed new lines - at:23, in name: <strain
>> Philadelphia 1 / ATCC 33152 / DSM 7513>
>>
>> In this part of the Uniprot record:
>>
>> GN   Name=legC7; OrderedLocusNames=lpg2298;
>> OS   Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 /
>> OS   ATCC 33152 / DSM 7513).
>> OC   Bacteria; Proteobacteria; Gammaproteobacteria; Legionellales;
>> OC   Legionellaceae; Legionella.
>>
>> I will release PiMS4.4 next week, and I will include a workaround in it. I
>> will also report the problem to Uniprot.
>>
>> Meanwhile, if you use a reference to the gene instead:
>>      GenBank YP_007567339.1
>> Then PiMS does upload the sequences successfully.
>>
>> Regards,
>> Chris
>>
>> -----Original Message-----
>> From: owner-pims-defects at dlmail2.dl.ac.uk [mailto:
>> owner-pims-defects at dlmail2.dl.ac.uk] On Behalf Of Nikos Pinotsis
>> Sent: 12 June 2013 19:24
>> To: pims-defects
>> Subject: error: Cannot recognise format of the record, please refer to the
>> help pages
>>
>> Hi,
>>
>> I am using PiMS at the http://pims.structuralbiology.eu:8080 site and
>> I am trying to download the target Q5ZT67_LEGPH or Q5ZT67 from several
>> databases, but I always get the same error: the format of the record
>> is not recognised. Can you suggest a solution?
>>
>> thanks
>> Nikos
>>
>> --
>> Dr. Nikos Pinotsis
>> Professor Gabriel Waksman's Group
>> Crystallography , Birkbeck College
>> University of London
>> Malet Street
>> London WC1E 7HX, UK
>> T: +44 (0)207 631 6827
>> F: +44 (0)207 631 6803
>> M: +44 (0)792 384 3593
>>
>>
>> _______________________________________________
>> biojava-dev mailing list
>> biojava-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>

-- 
Bioinformatics Specialist
National Research Council of Canada | Conseil national de recherches Canada
Government of Canada | Gouvernement du Canada
100 Sussex Dr, Ottawa, Canada K1A 0R6
Telephone | Téléphone 613-990-3600 / Facsimile | Télécopieur 613-952-9092



