[Biojava-dev] Biojava cant parse Uniprot Q5ZT67
Simon Foote
simon.foote at nrc-cnrc.gc.ca
Fri Jun 14 13:23:46 UTC 2013
I had similar issues when parsing some bacterial sequences.
I made the change Spencer suggested (stripping the newlines) in
org.biojavax.bio.seq.io.UniProtFormat at line 317, and those records
now parse fine.
} else if (sectionKey.equals(SOURCE_TAG)) {
    // use SOURCE_TAG and TAXON_TAG values
    String sciname = null;
    String comname = null;
    List synonym = new ArrayList();
    int taxid = 0;
    for (int i = 0; i < section.size(); i++) {
        String tag = ((String[])section.get(i))[0];
317:    String value = ((String[])section.get(i))[1].trim();
        // Replace any newlines with spaces
        value = value.replace("\n", " ");
I can commit the change if you like.
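
For anyone who wants to check the behaviour outside BioJava, here is a minimal
standalone sketch of the same idea. It assumes the parser hands over the two
OS lines joined by a "\n", which is what the error message suggests. The class
and method names (OsValueSketch, stripNewlines) are made up for illustration
and are not part of BioJava.

```java
public class OsValueSketch {

    /**
     * Collapse each line break, together with any whitespace around it,
     * into a single space. Hypothetical helper, not a BioJava method.
     */
    public static String stripNewlines(String value) {
        return value.trim().replaceAll("\\s*[\\r\\n]+\\s*", " ");
    }

    public static void main(String[] args) {
        // Assumed shape of the OS value from Q5ZT67 after the "OS" tags
        // are removed and the continuation line is appended.
        String raw = "Legionella pneumophila subsp. pneumophila"
                + " (strain Philadelphia 1 /\n"
                + "ATCC 33152 / DSM 7513).";

        String cleaned = stripNewlines(raw);

        // No newline is left, so a check like the one in
        // SimpleNCBITaxonName would no longer be triggered.
        System.out.println(cleaned.indexOf('\n') < 0);
        System.out.println(cleaned);
    }
}
```

Doing the replacement in the format parser rather than in the
SimpleNCBITaxonName constructor keeps the constructor's check in place for
names that come from other sources.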
Cheers,
Simon
On 06/13/2013 04:49 PM, Spencer Bliven wrote:
> What if we just strip out the newline characters in OS records? That seems
> better than ignoring them or throwing an exception.
>
>
> On Thu, Jun 13, 2013 at 4:22 AM, <chris.morris at stfc.ac.uk> wrote:
>
>> Hi,
>>
>> BioJava 1.8.2 is unable to parse:
>> http://www.uniprot.org/uniprot/Q5ZT67.txt
>>
>> It reports:
>>
>> NCBI taxonomy names cannot embed new lines - at:23, in name: <strain
>> Philadelphia 1 / ATCC 33152 / DSM 7513>
>> because of these lines:
>>
>> OS Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 /
>> OS ATCC 33152 / DSM 7513).
>>
>> It seems to me that the record is mistaken.
>>
>> If not, biojava needs a fix. The fix would be to replace:
>>
>> public SimpleNCBITaxonName(String nameClass, String name) {
>>     if (nameClass == null)
>>         throw new IllegalArgumentException("Name class cannot be null");
>>     if (name == null)
>>         throw new IllegalArgumentException("Name cannot be null");
>>     if (name.indexOf('\n') >= 0)
>>         throw new IllegalArgumentException(
>>             "NCBI taxonomy names cannot embed new lines - at:"
>>             + name.indexOf('\n') + ", in name: <" + name + ">");
>>     this.nameClass = nameClass;
>>     this.name = name;
>> }
>>
>> With:
>>
>> public SimpleNCBITaxonName(String nameClass, String name) {
>>     if (nameClass == null)
>>         throw new IllegalArgumentException("Name class cannot be null");
>>     if (name == null)
>>         throw new IllegalArgumentException("Name cannot be null");
>>     this.nameClass = nameClass;
>>     this.name = name.replaceAll("\\n", " ");
>> }
>>
>> Regards,
>> Chris Morris
>>
>> -----Original Message-----
>> From: Morris, Chris (STFC,DL,SC)
>> Sent: 13 June 2013 12:14
>> To: 'Nikos Pinotsis'
>> Subject: RE: error: Cannot recognise format of the record, please refer to
>> the help pages
>>
>> Hi Nikos,
>>
>> Thank you for this important defect report.
>>
>> The library that PiMS uses to process Uniprot files reports this problem:
>>
>> NCBI taxonomy names cannot embed new lines - at:23, in name: <strain
>> Philadelphia 1 / ATCC 33152 / DSM 7513>
>>
>> In this part of the Uniprot record:
>>
>> GN Name=legC7; OrderedLocusNames=lpg2298;
>> OS Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 /
>> OS ATCC 33152 / DSM 7513).
>> OC Bacteria; Proteobacteria; Gammaproteobacteria; Legionellales;
>> OC Legionellaceae; Legionella.
>>
>> I will release PiMS4.4 next week, and I will include a workaround in it. I
>> will also report the problem to Uniprot.
>>
>> Meanwhile, if you use a reference to the gene instead:
>> GenBank YP_007567339.1
>> Then PiMS does upload the sequences successfully.
>>
>> Regards,
>> Chris
>>
>> -----Original Message-----
>> From: owner-pims-defects at dlmail2.dl.ac.uk [mailto:
>> owner-pims-defects at dlmail2.dl.ac.uk] On Behalf Of Nikos Pinotsis
>> Sent: 12 June 2013 19:24
>> To: pims-defects
>> Subject: error: Cannot recognise format of the record, please refer to the
>> help pages
>>
>> Hi,
>>
>> I am using PiMS at the http://pims.structuralbiology.eu:8080 site and
>> I am trying to download the target Q5ZT67_LEGPH or Q5ZT67 from several
>> databases, but I always get the same error saying that the format of
>> the record is not recognised. Can you suggest a solution?
>>
>> thanks
>> Nikos
>>
>> --
>> Dr. Nikos Pinotsis
>> Professor Gabriel Waksman's Group
>> Crystallography , Birkbeck College
>> University of London
>> Malet Street
>> London WC1E 7HX, UK
>> T: +44 (0)207 631 6827
>> F: +44 (0)207 631 6803
>> M: +44 (0)792 384 3593
>>
>>
>> _______________________________________________
>> biojava-dev mailing list
>> biojava-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>
--
Bioinformatics Specialist
National Research Council of Canada | Conseil national de recherches Canada
Government of Canada | Gouvernement du Canada
100 Sussex Dr, Ottawa, Canada K1A 0R6
Telephone | Téléphone 613-990-3600 / Facsimile | Télécopieur 613-952-9092