[Biojava-dev] fetching obsolete/superseding files

Andreas Prlic andreas at sdsc.edu
Sat Apr 23 04:19:07 UTC 2011


Thanks Spencer, this looks good.

Small detail: the "is replaced by" is a one-to-many relationship.
Thinking about it, this is probably also the case for "replaces".
I'll dig out some examples and send them to you.
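
To illustrate the one-to-many point, here is a rough sketch (not the
PDBStatus code) that inverts the "replaces" attributes of the idStatus
response quoted further down, so that one obsolete ID maps to every entry
that replaces it. The class name ReplacementIndex and the method replacedBy
are made up for illustration, and splitting an attribute on commas or
whitespace is only an assumption about how several IDs might be encoded.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper, for illustration only. */
public class ReplacementIndex {

    /** Maps each replaced ID to all IDs whose "replaces" attribute names it. */
    public static Map<String, List<String>> replacedBy(String idStatusXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(idStatusXml.getBytes(StandardCharsets.UTF_8)));
        Map<String, List<String>> index = new TreeMap<>();
        NodeList records = doc.getElementsByTagName("record");
        for (int i = 0; i < records.getLength(); i++) {
            Element record = (Element) records.item(i);
            String id = record.getAttribute("structureId");
            // getAttribute returns "" when the attribute is absent.
            String replaces = record.getAttribute("replaces");
            // Assumption: multiple IDs, if any, are separated by commas or whitespace.
            for (String old : replaces.trim().split("[,\\s]+")) {
                if (!old.isEmpty()) {
                    index.computeIfAbsent(old, k -> new ArrayList<>()).add(id);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) throws Exception {
        // The response from Amr's message, pasted in as a string.
        String xml = "<?xml version='1.0' standalone='no' ?>"
                + "<idStatus>"
                + "<record structureId=\"1HHB\" status=\"OBSOLETE\" replacedBy=\"4HHB\" />"
                + "<record structureId=\"2HHB\" status=\"CURRENT\" replaces=\"1HHB\" />"
                + "<record structureId=\"3HHB\" status=\"CURRENT\" replaces=\"1HHB\" />"
                + "<record structureId=\"4HHB\" status=\"CURRENT\" replaces=\"1HHB\" />"
                + "</idStatus>";
        System.out.println(replacedBy(xml)); // {1HHB=[2HHB, 3HHB, 4HHB]}
    }
}
```

With a List on the value side, 1HHB ends up with three successors even
though its own replacedBy attribute names only 4HHB.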

Amr, I hope this is useful for what you were working on.

Andreas

On Fri, Apr 22, 2011 at 10:38 AM, Spencer Bliven <sbliven at ucsd.edu> wrote:
> Amr-
>
> I made a start on the problem of obsolete records. There's still no way to
> download them from biojava, but I added some code to check the status of a
> PDB ID and to get the current PDB ID for obsolete versions. Hopefully this
> complements whatever code you've been working on. See
> org.biojava.bio.structure.PDBStatus in the biojava3-structure module. Let me
> know if any of the documentation is unclear.
>
> -Spencer
>
>
> On Mon, Mar 21, 2011 at 8:24 PM, Spencer Bliven <sbliven at ucsd.edu> wrote:
>>
>> Amr-
>>
>> Thanks for volunteering to fix this! I ran across the same problem a while
>> ago, and ended up manually downloading obsolete records whenever my script
>> broke. Clearly you have the right solution.
>>
>> I would consider 2HHB, 3HHB, and 4HHB to all be valid IDs since they are all
>> 'current'. 1HHB is obsolete because it is a poor interpretation of the data,
>> not because it is redundant with the other three.
>>
>> -Spencer
>>
>> On Mon, Feb 28, 2011 at 9:59 AM, Amr AL-Hossary
>> <amr_alhossary at hotmail.com> wrote:
>>>
>>> Hi Dr. Andreas,
>>>
>>> I was using a set of PDB files described in an old paper published in
>>> 1994.
>>> The paper is called
>>> Enlarged representative set of protein structures
>>> by
>>> UWE HOBOHM AND CHRIS SANDER
>>> European Molecular Biology Laboratory, 69012 Heidelberg, Germany
>>> (RECEIVED September 16, 1993; ACCEPTED December 23, 1993)
>>> published in
>>> Protein Science (1994), 3:522-524. Cambridge University Press. Printed in
>>> the USA.
>>>
>>> It describes a representative standard set of protein structures that
>>> doesn't have any redundancy.
>>> This set was cited by a paper that talks about Cation-pi interactions as
>>> their representative set; and I was revisiting the same set to use it as my
>>> positive control in my research.
>>>
>>> Your idea (the webservice) is perfect.
>>> I can write it this weekend. Until then, let's list all additional
>>> features that should be there too.
>>> I am thinking of
>>> static String[] updateIDs(String[] IdsToUpdate)
>>>
>>> Generally, I agree with you in not letting the parser be aware of
>>> versions, but I believe it should be at least aware of revisions of the file
>>> up to the point the local copy was created, and let the user be notified
>>> that this data is up to the date this file was created and could be
>>> outdated; in addition to mentioning it explicitly in the documentation.
>>>
>>> Well,
>>> Another point to think about:
>>> How do we fight redundancy among several files?
>>> Suppose we consider 1HHB, 2HHB, 3HHB, and 4HHB to represent the same
>>> structure.
>>> If we issue this request:
>>> http://www.rcsb.org/pdb/rest/idStatus?structureId=1HHB,2HHB,3HHB,4HHB
>>> this is the response we get:
>>> <?xml version='1.0' standalone='no' ?>
>>> <idStatus>
>>>  <record structureId="1HHB" status="OBSOLETE" replacedBy="4HHB" />
>>>  <record structureId="2HHB" status="CURRENT" replaces="1HHB" />
>>>  <record structureId="3HHB" status="CURRENT" replaces="1HHB" />
>>>  <record structureId="4HHB" status="CURRENT" replaces="1HHB" />
>>> </idStatus>
>>>
>>> How do we counteract the redundancy in 2HHB and 3HHB, given that 4HHB is
>>> already there?
>>> This could be the next question. :-)
>>>
>>> Sincerely,
>>> Amr
>>>
>>> --------------------------------------------------
>>> From: "Andreas Prlic" <andreas at sdsc.edu>
>>> Sent: Monday, February 28, 2011 8:15 AM
>>> To: "Amr AL-Hossary" <amr_alhossary at hotmail.com>
>>> Cc: <biojava-dev at lists.open-bio.org>
>>> Subject: Re: fetching obsolete/superseding files
>>>
>>>> Hi Amr,
>>>>
>>>>> During my research, I had some difficulty automatically fetching
>>>>> some old
>>>>> obsolete files.
>>>>
>>>> ok. May I ask, how did you come across them?
>>>>
>>>>
>>>>> And that gave me an idea:
>>>>> I am thinking of adding 2 new features to the Biojava "structure"
>>>>> module:
>>>>
>>>> Interesting idea. In terms of software design I would not rely on the
>>>> parser for this. The local file that is parsed might be already out of
>>>> date as well. I would try to keep the parser agnostic of particular
>>>> versions or IDs. Instead I would provide a utility class that can give
>>>> information on the status of a file. There is a little XML service at
>>>> http://www.rcsb.org/pdb/software/rest.do#releaseStatus that provides
>>>> the latest status information. That one could be used to fetch the
>>>> information and then download any newer (or obsoleted) files...
>>>>
>>>> What do you think?
>>>>
>>>> Andreas
>>>>
>>>>> Suppose there are 2 new boolean parameters of the PDB file
>>>>> reader/parser:
>>>>> <fetchObsolete> and <fetchSuperseding>
>>>>> The first one enables the reader to download a file from the "Obsolete"
>>>>> archive if it wasn't found in the main repository,
>>>>> while the latter searches the header of a file (not necessarily the same
>>>>> one) for its newest revision or a superseding new file, fetches it, and
>>>>> switches to that new file automatically.
>>>>>
>>>>> Adding these parameters will require:
>>>>> 1) Manipulating the URL a little, to enable connecting
>>>>> to ftp://ftp.wwpdb.org/pub/pdb/data/structures/obsolete
>>>>> 2) Parsing the OBSLTE, REVDAT, and SPRSDE records, as well as REMARK 4 and
>>>>> REMARK 5
>>>>>
>>>>> If these features are approved, I can do them.
>>>>>
>>>>> Any ideas or comments?
>>>>>
>>>>>
>>>>>
>>>>> Amr
>>>>
>>>>
>>>>
>>>> --
>>>> -----------------------------------------------------------------------
>>>> Dr. Andreas Prlic
>>>> Senior Scientist, RCSB PDB Protein Data Bank
>>>> University of California, San Diego
>>>> (+1) 858.246.0526
>>>> -----------------------------------------------------------------------
>>>>
>>> _______________________________________________
>>> biojava-dev mailing list
>>> biojava-dev at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>
>
>



-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------



