[Biojava-l] Bad PDB files and batch processing with PDBFileReader

Thu Oct 28 04:05:18 UTC 2010

I'm using 1.7, partly because my distro had a package for it and
partly because I was initially using the online Javadoc a lot.
PDB ID 1a02 parses with CA only but gives 4 StructureExceptions, which
I've pasted below.  Chain A exists in the PDB entry but is DNA;
polypeptide chain F appears to parse correctly.
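
To illustrate why a DNA chain can end up empty under CA-only parsing, here
is a minimal sketch that does not use BioJava at all. It scans the fixed
columns of PDB ATOM records (atom name in columns 13-16, chain ID in
column 22) and reports which chains have no CA atom. The class
CaOnlyChains, its methods, and the sample records are hypothetical and
exist only for this example; the sample lines carry no coordinates.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class CaOnlyChains {

    // Hypothetical helper: builds only the first 22 columns of an ATOM
    // record (record name, serial, atom name, altLoc, residue name, chain ID).
    // atomName must already be padded to 4 characters, e.g. " CA ".
    static String atomLine(int serial, String atomName, String resName, char chainId) {
        return String.format("ATOM  %5d %-4s %3s %c", serial, atomName, resName, chainId);
    }

    // Collects chain IDs from ATOM records; with caOnly set, only records
    // whose atom name is exactly "CA" are counted. HETATM records are
    // ignored, so calcium ions (also named CA) do not interfere.
    static Set<Character> chains(List<String> lines, boolean caOnly) {
        Set<Character> found = new TreeSet<>();
        for (String line : lines) {
            if (!line.startsWith("ATOM  ") || line.length() < 22) {
                continue;
            }
            String atomName = line.substring(12, 16).trim();
            if (caOnly && !atomName.equals("CA")) {
                continue;
            }
            found.add(line.charAt(21));
        }
        return found;
    }

    // Chains that are present in the file but contribute no CA atoms.
    static Set<Character> chainsWithoutCa(List<String> lines) {
        Set<Character> missing = new TreeSet<>(chains(lines, false));
        missing.removeAll(chains(lines, true));
        return missing;
    }

    public static void main(String[] args) {
        // Made-up records: chain A is a nucleotide, chain F an amino acid.
        List<String> lines = Arrays.asList(
                atomLine(1, " P  ", "DA", 'A'),
                atomLine(2, " C1'", "DA", 'A'),
                atomLine(3, " N  ", "MET", 'F'),
                atomLine(4, " CA ", "MET", 'F'));
        System.out.println("chains with CA:    " + chains(lines, true));
        System.out.println("chains without CA: " + chainsWithoutCa(lines));
    }
}
```

On the made-up records, chain F is reported as having CA atoms and chain A
as having none, which matches the pattern described above. Whether the
parser builds its chain list this way internally is an assumption.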

-da

org.biojava.bio.structure.StructureException: could not find chain A
       at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
       at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
       at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
       at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
       at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
       at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
       at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
       at fragalign.pair.getStructs(pair.java:42)
       at fragalign.Main.main(Main.java:40)
org.biojava.bio.structure.StructureException: could not find chain B
       at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
       at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
       at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
       at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
       at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
       at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
       at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
       at fragalign.pair.getStructs(pair.java:42)
       at fragalign.Main.main(Main.java:40)
org.biojava.bio.structure.StructureException: did not find chain with chainId >A<
       at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
       at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
       at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
       at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
       at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
       at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
       at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
       at fragalign.pair.getStructs(pair.java:42)
       at fragalign.Main.main(Main.java:40)
org.biojava.bio.structure.StructureException: did not find chain with chainId >B<
       at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
       at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
       at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
       at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
       at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
       at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
       at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
       at fragalign.pair.getStructs(pair.java:42)
       at fragalign.Main.main(Main.java:40)
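
Since traces like these are reported from inside the parser and never
reach the calling code, one workaround for a batch loop is to capture
System.err around each parse call and skip the entry if anything was
written. The sketch below assumes the library prints via
printStackTrace() to System.err (if it uses a logging framework, this
will not see the messages). SwallowedErrorProbe and parseLikeLibrary are
hypothetical stand-ins, not BioJava classes.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

class SwallowedErrorProbe {

    // Runs the task and returns everything it wrote to System.err.
    // System.setErr is process-wide, so this is not safe if other
    // threads write to stderr at the same time.
    static String captureStderr(Runnable task) {
        PrintStream original = System.err;
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream capture = new PrintStream(buffer, true);
        System.setErr(capture);
        try {
            task.run();
        } finally {
            capture.flush();
            System.setErr(original);
        }
        return buffer.toString();
    }

    // Hypothetical stand-in for a library method that catches its own
    // exception and only prints the stack trace.
    static void parseLikeLibrary(String chainId) {
        try {
            if (chainId.equals("A") || chainId.equals("B")) {
                throw new Exception("could not find chain " + chainId);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        for (String chainId : new String[] {"A", "F"}) {
            String errors = captureStderr(() -> parseLikeLibrary(chainId));
            if (!errors.isEmpty()) {
                System.out.println("skipping " + chainId + ": parser reported a problem");
                continue; // move on to the next entry in the batch
            }
            System.out.println("processing " + chainId);
        }
    }
}
```

This is a blunt instrument: it treats any stderr output as a failure, so
in practice one might match on "StructureException" in the captured text
instead of checking for emptiness.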

On Wed, Oct 27, 2010 at 17:47, Andreas Prlic <andreas at sdsc.edu> wrote:
>> I assume AtomCache is a new class in BioJava3?
>
> yes it is... http://biojava.org/wiki/BioJava:CookBook:PDB:read3.0
>
>>
>> I must give you my embarrassed apology...after a bunch of testing I
>> finally figured out that I had misunderstood where the Parser's error
>> handling returns control and started going after the wrong exceptions.
>> It does look like, if setParseCAOnly is true, the reader throws an
>> exception on chains with no CAs instead of just skipping them, though
>> the other chains are still parsed into the structure.
>
> This sounds like there might be a problem with CA only. Do you have
> an example ID? Also: are you on BioJava 1.7 or 3.0?
>
> Andreas
>
>
>
>>
>> -da
>>
>> On Tue, Oct 26, 2010 at 22:19, Andreas Prlic <andreas at sdsc.edu> wrote:
>>> Hi Daniel,
>>>
>>> PDB files are better nowadays, due to remediation, however there are
>>> still issues..
>>>
>>> It sounds like you just want to figure out how to do the try/catch
>>> block properly. You could do something like this:
>>>
>>>                boolean splitFileOrganisation = true;
>>>                AtomCache cache = new AtomCache("/path/to/your/installation/", splitFileOrganisation);
>>>
>>>                String[] pdbIDs = new String[]{"4hhb", "1cdg","5pti","1gav", "WRONGID" };
>>>
>>>                for (String pdbID : pdbIDs){
>>>
>>>                        try {
>>>                                Structure s = cache.getStructure(pdbID);
>>>                                if ( s == null) {
>>>                                        System.out.println("could not find structure " + pdbID);
>>>                                        continue;
>>>                                }
>>>                                // do something with the structure - your inner loop
>>>                                System.out.println(s);
>>>
>>>                        } catch (Exception e){
>>>                                // something crazy happened...
>>>                                System.err.println("Can't load structure " + pdbID + " reason: " + e.getMessage());
>>>                                e.printStackTrace();
>>>                        }
>>>                }
>>>
>>>
>>>
>>>
>>> On Tue, Oct 26, 2010 at 9:59 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>> Glad to hear it, who doesn't like support or clean interfaces?  No
>>>> offense intended, by the way, with respect to PDB errors - obviously
>>>> the PDB is an indispensable resource for all protein scientists.
>>>>
>>>> I am looking at many (fixed-length) pieces of protein chains and doin'
>>>> stuff with 'em.  My current code has a pair of nested while loops; the
>>>> outer iterates over PDB entries (locally rsync'd copy), parsing them
>>>> and the inner iterates over the pieces from each.  When
>>>> StructureExceptions come out of my PDBFileReader object I want to
>>>> continue the outer loop, moving on to the next set of files without
>>>> executing any of the code that depends on correct StructureImpl
>>>> objects from the reader (database updates, the inner loop).
>>>> Since the reader's methods have their own try-catch blocks, a thrown
>>>> StructureException is stopped there and never reaches my own error
>>>> handling.  I just need to know when those errors occur so I can skip
>>>> those proteins - I am presuming that the correct entries will outweigh
>>>> the problem ones by a significant factor and the overall data won't be
>>>> seriously impacted.
>>>>
>>>> -da
>>>>
>>>> On Tue, Oct 26, 2010 at 21:11, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>> Hi Daniel,
>>>>>
>>>>> can you explain a bit more what you are doing, in particular what
>>>>> errors you would like to deal with on your end?  You should not need
>>>>> to worry too much about exception handling. Are there any special
>>>>> cases you are interested in?  In this case we should support you with
>>>>> a clean interface rather than exception handling from your end...
>>>>>
>>>>> Andreas
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>>>> Hi all,
>>>>>> Let me first say thanks to all the BioJava community members for
>>>>>> delivering such a useful set of libraries, and that I'm still a newbie
>>>>>> when it comes to BioJava (and Java) so forgive me if my question is
>>>>>> too trivial.
>>>>>>
>>>>>> I am doing work on lots (at least thousands) of PDB files from RCSB.
>>>>>> As is commonly known, these are often rife with errors which can lead
>>>>>> to exceptions during parsing with PDBFileParser.  Because
>>>>>> PDBFileParser's methods contain their own try-catch blocks, exception
>>>>>> propagation stops there and my code proceeds blindly along regardless
>>>>>> of any error checking I do.  I would like to catch the exceptions up
>>>>>> in my code where the parser is called, so that I can branch to a
>>>>>> continue statement and have my batch processing loops move on to the
>>>>>> next file.
>>>>>> Should I edit out the try-catch blocks and compile my own version of
>>>>>> the library?  Or should I test the returned StructureImpl objects for
>>>>>> possession of the fields in question?  In that case, I'm not sure
>>>>>> which properties will give the most general success information...and
>>>>>> I'd rather not have to check for /every/ property being correct.
>>>>>>
>>>>>> If there is some great way to check if an exception was caught down a
>>>>>> series of nested method calls, please hit me over the head with it.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> -da
>>>>>> _______________________________________________
>>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>
>>>>>
>>>>>
>>>
>>
>
>
>
> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------
>