[Biojava-dev] fetching obsolete/superseding files

Sun Apr 24 18:52:17 UTC 2011

Thank you both for the aid.

After adding caching and support for multiple queries per request, I spent
half a day trying to figure out why some weird behavior occurred, and finally
found the cause: the SAX parser uses the SAME object for all record tags!
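
A minimal sketch of one way around this kind of reuse, assuming the shared
object is the Attributes instance handed to startElement (SAX only guarantees
its contents for the duration of that call). The class name RecordCopyHandler
and the helper parse are hypothetical, not part of BioJava. The handler copies
each record's attribute values into its own map instead of keeping a
reference to the parser's object:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects one independent attribute map per <record>.
class RecordCopyHandler extends DefaultHandler {
    private final List<Map<String, String>> records = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if (!"record".equals(qName)) {
            return;
        }
        // Copy the values now; the Attributes object may be reused afterwards.
        Map<String, String> copy = new LinkedHashMap<>();
        for (int i = 0; i < atts.getLength(); i++) {
            copy.put(atts.getQName(i), atts.getValue(i));
        }
        records.add(copy);
    }

    List<Map<String, String>> getRecords() {
        return records;
    }

    // Convenience helper (an assumption for this sketch): parse an XML string.
    static List<Map<String, String>> parse(String xml) {
        try {
            RecordCopyHandler handler = new RecordCopyHandler();
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                handler);
            return handler.getRecords();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse idStatus XML", e);
        }
    }
}
```

Each map in the returned list then belongs to exactly one record, so later
records cannot overwrite earlier ones.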

I'll revert to my original design using a DOM parser, which is more
convenient and better suited to small XML documents.
Spencer, please don't submit any updates until I send you my code tonight.
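
For illustration only, a DOM-based reading of an idStatus response could look
roughly like the sketch below. The class IdStatusDom and its method statuses
are hypothetical names chosen for this example; the element and attribute
names (record, structureId, status) are the ones in the response quoted later
in this thread.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: maps each structureId to its status using DOM.
final class IdStatusDom {
    static Map<String, String> statuses(String xml) {
        try {
            // The whole (small) document is loaded into memory as a tree.
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                    xml.getBytes(StandardCharsets.UTF_8)));
            NodeList records = doc.getElementsByTagName("record");
            Map<String, String> result = new LinkedHashMap<>();
            for (int i = 0; i < records.getLength(); i++) {
                // Every <record> is its own Element node, so nothing is shared.
                Element rec = (Element) records.item(i);
                result.put(rec.getAttribute("structureId"),
                           rec.getAttribute("status"));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse idStatus XML", e);
        }
    }
}
```

Because DOM keeps a separate node per element, the object-reuse issue seen
with SAX does not arise, at the cost of holding the full document in memory.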

I wonder where all these statuses come from.
Are they documented somewhere that I haven't seen?

Regarding the supersedes and replaces attributes: the PDB file format
documentation says that a single new file can replace multiple old files
(one to many), but it doesn't mention the opposite.
In any case, the webservice returns at most ONE PDB ID per record (please
inspect the result returned by this query
http://www.rcsb.org/pdb/rest/idStatus?structureId=1HHB,2HHB,3HHB,4HHB ).
So I believe the best way to get the most recent ID is to read the
isReplacedBy attribute of the superseded record (e.g. from 3HHB to
1HHB and then from 1HHB to 4HHB).
Am I wrong somewhere?
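
As a rough sketch of that lookup, assuming the replacedBy values have already
been collected into a map from obsolete ID to its single replacement (the
class ReplacementChain and method mostRecent are hypothetical names), the
chain can be followed until an ID with no replacement is reached. This only
covers the replacedBy hops; the initial hop from a current ID back to the
record it replaces is not handled here.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: follows replacedBy links to the newest known ID.
final class ReplacementChain {
    static String mostRecent(String id, Map<String, String> replacedBy) {
        Set<String> seen = new HashSet<>();
        String current = id;
        // Stop when there is no replacement, or when an ID repeats
        // (a guard against cyclic data).
        while (replacedBy.containsKey(current) && seen.add(current)) {
            current = replacedBy.get(current);
        }
        return current;
    }
}
```

With the response for 1HHB to 4HHB, the map would hold the single entry
1HHB -> 4HHB, so mostRecent("1HHB", map) yields 4HHB and an ID that is not in
the map is returned unchanged.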

Regards
Amr

--------------------------------------------------
From: "Andreas Prlic" <andreas at sdsc.edu>
Sent: Saturday, April 23, 2011 6:19 AM
To: "Spencer Bliven" <sbliven at ucsd.edu>
Cc: <biojava-dev at lists.open-bio.org>; "Amr AL-Hossary" 
<amr_alhossary at hotmail.com>
Subject: Re: [Biojava-dev] fetching obsolete/superseding files

> Thanks Spencer, This looks good.
>
> Small detail:  The "is replaced by" is a one to many relationship.
> Thinking about it, this is probably also the case for "replaces".
> I'll dig out some examples and send them to you.
>
> Amr, I hope this is useful for what you were working on.
>
> Andreas
>
> On Fri, Apr 22, 2011 at 10:38 AM, Spencer Bliven <sbliven at ucsd.edu> wrote:
>> Amr-
>>
>> I made a start on the problem of obsolete records. There's still no way to
>> download them from biojava, but I added some code to check the status of a
>> PDB ID and to get the current PDB ID for obsolete versions. Hopefully this
>> complements whatever code you've been working on. See
>> org.biojava.bio.structure.PDBStatus in the biojava3-structure module. Let me
>> know if any of the documentation is unclear.
>>
>> -Spencer
>>
>>
>> On Mon, Mar 21, 2011 at 8:24 PM, Spencer Bliven <sbliven at ucsd.edu> wrote:
>>>
>>> Amr-
>>>
>>> Thanks for volunteering to fix this! I ran across the same problem a while
>>> ago, and ended up manually downloading obsolete records whenever my script
>>> broke. Clearly you have the right solution.
>>>
>>> I would consider 2HHB, 3HHB and 4HHB to all be valid IDs since they are all
>>> 'current'. 1HHB is obsolete because it is a poor interpretation of the data,
>>> not because it is redundant with the other three.
>>>
>>> -Spencer
>>>
>>> On Mon, Feb 28, 2011 at 9:59 AM, Amr AL-Hossary
>>> <amr_alhossary at hotmail.com> wrote:
>>>>
>>>> Hi Dr. Andreas,
>>>>
>>>> I was using a set of PDB files mentioned in an old paper published in
>>>> 1994. The paper is
>>>> Enlarged representative set of protein structures
>>>> by
>>>> UWE HOBOHM AND CHRIS SANDER
>>>> European Molecular Biology Laboratory, 69012 Heidelberg, Germany
>>>> (RECEIVED September 16, 1993; ACCEPTED December 23, 1993)
>>>> published in
>>>> Protein Science (1994), 3:522-524. Cambridge University Press. Printed in
>>>> the USA.
>>>>
>>>> It describes a representative standard set of protein structures that
>>>> doesn't have any redundancy.
>>>> This set was cited by a paper on cation-pi interactions as its
>>>> representative set, and I was revisiting the same set to use it as the
>>>> positive control in my research.
>>>>
>>>> Your idea (the webservice) is perfect.
>>>> I can write it this weekend. Until then, let's list all additional
>>>> features that should be there too.
>>>> I am thinking of
>>>> static String[] updateIDs(String[] IdsToUpdate)
>>>>
>>>> Generally, I agree with you about not letting the parser be aware of
>>>> versions, but I believe it should at least be aware of revisions of the
>>>> file up to the point the local copy was created, and notify the user that
>>>> the data is current only as of the date the file was created and could be
>>>> outdated, in addition to mentioning this explicitly in the documentation.
>>>>
>>>> Well,
>>>> Another point to think about:
>>>> How to fight redundancy among several files?
>>>> If we considered 1HHB, 2HHB, 3HHB, and 4HHB to be representing the same
>>>> structure;
>>>> If we initiate this request
>>>> http://www.rcsb.org/pdb/rest/idStatus?structureId=1HHB,2HHB,3HHB,4HHB
>>>> This is the response we get
>>>> <?xml version='1.0' standalone='no' ?>
>>>> <idStatus>
>>>> <record structureId="1HHB" status="OBSOLETE" replacedBy="4HHB" />
>>>> <record structureId="2HHB" status="CURRENT" replaces="1HHB" />
>>>> <record structureId="3HHB" status="CURRENT" replaces="1HHB" />
>>>> <record structureId="4HHB" status="CURRENT" replaces="1HHB" />
>>>> </idStatus>
>>>>
>>>> How do we counteract the redundancy in 2HHB and 3HHB, given that 4HHB is
>>>> already there?
>>>> This could be the next question. :-)
>>>>
>>>> Sincerely,
>>>> Amr
>>>>
>>>> --------------------------------------------------
>>>> From: "Andreas Prlic" <andreas at sdsc.edu>
>>>> Sent: Monday, February 28, 2011 8:15 AM
>>>> To: "Amr AL-Hossary" <amr_alhossary at hotmail.com>
>>>> Cc: <biojava-dev at lists.open-bio.org>
>>>> Subject: Re: fetching obsolete/superseding files
>>>>
>>>>> Hi Amr,
>>>>>
>>>>>> During my research, I met some difficulty in automatically fetching
>>>>>> some old
>>>>>> obsolete files.
>>>>>
>>>>> ok. May I ask, how did you come across them?
>>>>>
>>>>>
>>>>>> That gave me an idea:
>>>>>> I am thinking of adding 2 new features to the BioJava "structure"
>>>>>> module:
>>>>>
>>>>> Interesting idea. In terms of software design I would not rely on the
>>>>> parser for this. The local file that is parsed might be already out of
>>>>> date as well. I would try to keep the parser agnostic of particular
>>>>> versions or IDs. Instead I would provide a utility class that can give
>>>>> information on the status of a file. There is a little XML service at
>>>>> http://www.rcsb.org/pdb/software/rest.do#releaseStatus that provides
>>>>> the latest status information. That one could be used to fetch the
>>>>> information and then download any newer (or obsoleted) files...
>>>>>
>>>>> What do you think?
>>>>>
>>>>> Andreas
>>>>>
>>>>>> Suppose the PDB file reader/parser has 2 new boolean parameters,
>>>>>> <fetchObsolete> and <fetchSuperseding>.
>>>>>> The first one enables the reader to download a file from the "Obsolete"
>>>>>> archive if it wasn't found in the main repository,
>>>>>> while the latter searches the header of a file (not necessarily the same
>>>>>> one) for its newest revision or a superseding new file, fetches it, and
>>>>>> switches to that new file automatically.
>>>>>>
>>>>>> Adding these parameters will require:
>>>>>> 1) Manipulating the URL a little, to enable connecting to
>>>>>> ftp://ftp.wwpdb.org/pub/pdb/data/structures/obsolete
>>>>>> 2) Parsing the OBSLTE, REVDAT, and SPRSDE records, as well as REMARK 4 and
>>>>>> REMARK 5
>>>>>>
>>>>>> If these features are approved, I can do them.
>>>>>>
>>>>>> Any ideas or comments?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Amr
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> -----------------------------------------------------------------------
>>>>> Dr. Andreas Prlic
>>>>> Senior Scientist, RCSB PDB Protein Data Bank
>>>>> University of California, San Diego
>>>>> (+1) 858.246.0526
>>>>> -----------------------------------------------------------------------
>>>>>
>>>> _______________________________________________
>>>> biojava-dev mailing list
>>>> biojava-dev at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>>
>>
>>
>
>
>