[Biojava-dev] fetching obsolete/superseding files

Spencer Bliven sbliven at ucsd.edu
Mon Apr 25 23:53:49 UTC 2011


Hey all,

I think we are converging on a consistent model of PDB precedence. This was
obscured previously by the bug in how the idStatus page listed only a single
'replacedBy' entry. Andreas has fixed this and it should go live tomorrow.
I'll write some unit tests and update biojava at the same time. Here is
how things will work:

PDB supersessions form a directed acyclic graph, where edges point from an
obsolete ID to the entry that directly superseded it. Each record returned
by idStatus has a "replaces" attribute, which consists of a
space-delimited list of incoming edges, and a "replacedBy" attribute, which
consists of a space-delimited list of outgoing edges. Two examples:

<idStatus>
<record structureId="1CAT" status="OBSOLETE" replacedBy="3CAT"/>
<record structureId="3CAT" status="OBSOLETE" replaces="1CAT"
replacedBy="8CAT 7CAT"/>
<record structureId="7CAT" status="CURRENT" replaces="3CAT"/>
<record structureId="8CAT" status="CURRENT" replaces="3CAT"/>

<record structureId="1KSA" status="OBSOLETE" replacedBy="3ENI"/>
<record structureId="3ENI" status="CURRENT" replaces="1M50 1KSA"/>
<record structureId="1M50" status="OBSOLETE" replacedBy="3ENI"/>
</idStatus>
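
To make the attribute format concrete, here is a minimal sketch of reading
the incoming and outgoing edges out of an idStatus response. The class and
method names (IdStatusSketch, splitIds, edges) are made up for illustration
and are not the biojava API; the sketch assumes the response has already
been fetched into a String.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of biojava.
class IdStatusSketch {

    /** Splits a space-delimited attribute value; an empty value gives an empty list. */
    static List<String> splitIds(String attributeValue) {
        String trimmed = attributeValue.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    /** Maps each structureId to the IDs listed in "replaces" or "replacedBy". */
    static Map<String, List<String>> edges(String xml, String attribute) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse idStatus XML", e);
        }
        NodeList records = doc.getElementsByTagName("record");
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (int i = 0; i < records.getLength(); i++) {
            Element record = (Element) records.item(i);
            // Element.getAttribute returns "" when the attribute is absent.
            result.put(record.getAttribute("structureId"),
                    splitIds(record.getAttribute(attribute)));
        }
        return result;
    }
}
```

Called on the first example with "replacedBy", this maps 3CAT to
[8CAT, 7CAT] and maps 7CAT and 8CAT to empty lists.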

The non-recursive versions of getReplaces/getReplacement just get the
incoming/outgoing edges for a single node and require only a single REST
query. The recursive versions will do a depth-first search up/down the graph
and return a list of all nodes reached.
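
As a rough sketch of the difference between the two versions, the following
walks the "replacedBy" edges from the examples above. The class name and the
hard-coded map are assumptions for illustration; in the real code each
lookup would be a REST query, and the method names only mirror the ones
discussed here. The sketch also assumes the starting ID itself is not
included in the result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; names do not reflect the actual biojava API.
class SupersessionGraphSketch {

    // Outgoing "replacedBy" edges copied from the example records,
    // standing in for REST queries against idStatus.
    static final Map<String, List<String>> REPLACED_BY = new HashMap<>();
    static {
        REPLACED_BY.put("1CAT", Arrays.asList("3CAT"));
        REPLACED_BY.put("3CAT", Arrays.asList("8CAT", "7CAT"));
        REPLACED_BY.put("1KSA", Arrays.asList("3ENI"));
        REPLACED_BY.put("1M50", Arrays.asList("3ENI"));
    }

    /** Non-recursive: the direct successors of a single node. */
    static List<String> getReplacement(String id) {
        return REPLACED_BY.getOrDefault(id, new ArrayList<>());
    }

    /** Recursive: every node reachable along replacedBy edges, in depth-first order. */
    static List<String> getReplacementRecursive(String id) {
        Set<String> visited = new LinkedHashSet<>();
        visit(id, visited);
        return new ArrayList<>(visited);
    }

    private static void visit(String id, Set<String> visited) {
        for (String next : getReplacement(id)) {
            // The visited set keeps a node reachable by two paths from being listed twice.
            if (visited.add(next)) {
                visit(next, visited);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(getReplacement("1CAT"));          // [3CAT]
        System.out.println(getReplacementRecursive("1CAT")); // [3CAT, 8CAT, 7CAT]
    }
}
```

The same traversal over the "replaces" edges would give the recursive
version of getReplaces.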

Finally, the getCurrent() method should consistently return a single PDB ID
from among the results of recursive-getReplacement. To be consistent with
the old REST implementation, this will be the PDB ID that occurs last
alphabetically. Thus getCurrent(1HHB) will give 4HHB rather than 2HHB or
3HHB, getCurrent(1CAT) will give 8CAT, and getCurrent(7CAT) will give 7CAT.
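
A minimal sketch of that selection rule, assuming the recursive replacement
list has already been computed and is passed in. The class name and the
two-argument signature are made up for illustration, and falling back to the
ID itself when nothing supersedes it is my reading of the getCurrent(7CAT)
case.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the selection rule; not the biojava implementation.
class CurrentIdSketch {

    /**
     * Picks the ID that sorts last alphabetically among the recursive
     * replacements, or returns the ID itself when nothing supersedes it.
     */
    static String getCurrent(String id, List<String> recursiveReplacements) {
        if (recursiveReplacements.isEmpty()) {
            return id;
        }
        // Collections.max uses the natural (lexicographic) ordering of String.
        return Collections.max(recursiveReplacements);
    }

    public static void main(String[] args) {
        // Recursive replacements of 1CAT, taken from the example records.
        System.out.println(getCurrent("1CAT", Arrays.asList("3CAT", "8CAT", "7CAT"))); // 8CAT
        // 7CAT is current, so it has no replacements.
        System.out.println(getCurrent("7CAT", Collections.<String>emptyList()));       // 7CAT
    }
}
```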

Amr, I understand what you were thinking with the getNewestCurrent method.
It is appealing to think of 4HHB as the representative for all four
structures. However, there is a good reason that 2HHB and 3HHB are still
marked as current, and I think it is misleading to include a method that
favors 4HHB over other current IDs because it is alphabetically higher. We
should probably leave this method out of biojava.


Does anything seem wrong about this model of supersession? In particular,
does this address your question about the need for the recursion flag, Amr?
My plan is to commit the biojava changes shortly. Amr, do you mind if I
merge in your patch with the caching and PDBFileReader updates? (Do you have
write access to SVN?) Great code there!

Finally, the list of status messages comes from looking at the internals of
the PDB website. I haven't come across any examples of them myself to test
with. Many seem to be temporary statuses, for publication holds and the
like. I'm content to ignore them until someone requests something specific.

-Spencer


On Mon, Apr 25, 2011 at 2:22 PM, Andreas Prlic <andreas at sdsc.edu> wrote:

> Hi Amr,
>
> > And any way, the webservice returns only ONE PDB ID max per record
> > (please inspect the result returned by this query
> > http://www.rcsb.org/pdb/rest/idStatus?structureId=1HHB,2HHB,3HHB,4HHB ).
>
> I believe that is a bug, I just fixed this and it should become
> available with tomorrow's web site update (around 00UTC).
>
> > This way, I believe the best way to get the most recent ID is getting the
> > isReplacedBy attribute of the record of superseded record (e.g. from 3HHB
> > to 1HHB and then from 1HHB to 4HHB).
>
> hope this will be simpler with the updated URL response ...
>
>
> Andreas
>


