[Biojava-l] Bad PDB files and batch processing with PDBFileReader

Fri Oct 29 00:08:49 UTC 2010

good, I was just about to say that... ;-)

Andreas

On Thu, Oct 28, 2010 at 4:51 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
> Ahh, I suppose that is the "problem" referred to in the wiki?  I
> checked out successfully from the repository on github.
>
> -da
>
> On Thu, Oct 28, 2010 at 16:45, Daniel Asarnow <dasarnow at gmail.com> wrote:
>> It's not a big deal - after all if you use CA only, chains with no
>> CA's aren't important, and the error messages aren't that long.  But
>> I'm going to switch anyway...
>> I'm getting the dreaded "can't read line length in file" error while
>> trying to checkout biojava-live/trunk, though.
>>
>> -da
>>
>> On Thu, Oct 28, 2010 at 10:28, Andreas Prlic <andreas at sdsc.edu> wrote:
>>> Hi Daniel,
>>>
>>> I just checked, this is a bug which is already resolved in 3.0... If
>>> it is an issue for you, you might want to upgrade... (should be very
>>> easy, if you start using Maven ...)
>>>
>>> Thanks,
>>> Andreas
>>>
>>> On Wed, Oct 27, 2010 at 9:04 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>> I'm using 1.7, partially because my distro had a package for it and
>>>> partially because I was initially using the online Javadoc a lot.
>>>> PDB ID 1a02 with CA only parses but gives 4 StructureExceptions; I've
>>>> pasted them below.  Chain A exists in the PDB but is DNA, polypeptide
>>>> chain F appears to parse correctly.
>>>>
>>>> -da
>>>>
>>>> org.biojava.bio.structure.StructureException: could not find chain A
>>>>        at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
>>>>        at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
>>>>        at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
>>>>        at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>>>        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>>>        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>>>        at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>>>        at fragalign.pair.getStructs(pair.java:42)
>>>>        at fragalign.Main.main(Main.java:40)
>>>> org.biojava.bio.structure.StructureException: could not find chain B
>>>>        at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
>>>>        at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
>>>>        at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
>>>>        at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>>>        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>>>        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>>>        at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>>>        at fragalign.pair.getStructs(pair.java:42)
>>>>        at fragalign.Main.main(Main.java:40)
>>>> org.biojava.bio.structure.StructureException: did not find chain with chainId >A<
>>>>        at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
>>>>        at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
>>>>        at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
>>>>        at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>>>        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>>>        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>>>        at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>>>        at fragalign.pair.getStructs(pair.java:42)
>>>>        at fragalign.Main.main(Main.java:40)
>>>> org.biojava.bio.structure.StructureException: did not find chain with chainId >B<
>>>>        at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
>>>>        at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
>>>>        at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
>>>>        at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>>>        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>>>        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>>>        at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>>>        at fragalign.pair.getStructs(pair.java:42)
>>>>        at fragalign.Main.main(Main.java:40)
>>>>
>>>>
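One way to sidestep chains that carry no CA atoms (such as the DNA chain A in 1a02 described above) is to filter them out before any per-chain work. The following is only a toy sketch: it assumes each chain is represented as a plain map entry from chain ID to a list of atom names. That representation, the class name CaChainFilterSketch, and the sample atom names are made up for illustration and are not BioJava's data model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CaChainFilterSketch {

    /**
     * Returns the IDs of chains that contain at least one atom named "CA".
     * The input map (chain ID -> atom names) is a hypothetical stand-in
     * for a parsed structure.
     */
    public static List<String> chainsWithCA(Map<String, List<String>> atomNamesByChain) {
        List<String> keep = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : atomNamesByChain.entrySet()) {
            if (entry.getValue().contains("CA")) {
                keep.add(entry.getKey());
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        // Illustrative data only: a nucleic-acid-like chain and a protein-like chain.
        Map<String, List<String>> chains = new LinkedHashMap<>();
        chains.put("A", Arrays.asList("P", "O5'", "C5'"));
        chains.put("F", Arrays.asList("N", "CA", "C", "O"));

        for (String chainId : chainsWithCA(chains)) {
            System.out.println("processing chain " + chainId);
        }
    }
}
```

With a real structure the same check would be applied to whatever chain and atom objects the library hands back, so that chains without CA atoms are skipped quietly rather than treated as errors.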
>>>> On Wed, Oct 27, 2010 at 17:47, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>> I assume AtomCache is a new class in BioJava3?
>>>>>
>>>>> yes it is... http://biojava.org/wiki/BioJava:CookBook:PDB:read3.0
>>>>>
>>>>>>
>>>>>> I must give you my embarrassed apology...after a bunch of testing I
>>>>>> finally figured out that I had misunderstood where the Parser's error
>>>>>> handling returns control and started going after the wrong exceptions.
>>>>>> It does look like, if setParseCAOnly is true, the reader throws on
>>>>>> chains with no CAs instead of just skipping them, though the other
>>>>>> chains are still parsed into the structure.
>>>>>
>>>>> This sounds like there might be a problem with CA only... Do you have
>>>>> an example ID? Also, are you on BioJava 1.7 or 3.0?
>>>>>
>>>>> Andreas
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> -da
>>>>>>
>>>>>> On Tue, Oct 26, 2010 at 22:19, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>>> Hi Daniel,
>>>>>>>
>>>>>>> PDB files are better nowadays, due to remediation, however there are
>>>>>>> still issues..
>>>>>>>
>>>>>>> It sounds like you just want to figure out how to do the try/catch
>>>>>>> block properly. You could do something like this:
>>>>>>>
>>>>>>>     boolean splitFileOrganisation = true;
>>>>>>>     AtomCache cache = new AtomCache("/path/to/your/installation/", splitFileOrganisation);
>>>>>>>
>>>>>>>     String[] pdbIDs = new String[]{"4hhb", "1cdg", "5pti", "1gav", "WRONGID"};
>>>>>>>
>>>>>>>     for (String pdbID : pdbIDs) {
>>>>>>>         try {
>>>>>>>             Structure s = cache.getStructure(pdbID);
>>>>>>>             if (s == null) {
>>>>>>>                 System.out.println("could not find structure " + pdbID);
>>>>>>>                 continue;
>>>>>>>             }
>>>>>>>             // do something with the structure - your inner loop
>>>>>>>             System.out.println(s);
>>>>>>>         } catch (Exception e) {
>>>>>>>             // something crazy happened...
>>>>>>>             System.err.println("Can't load structure " + pdbID + " reason: " + e.getMessage());
>>>>>>>             e.printStackTrace();
>>>>>>>         }
>>>>>>>     }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
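A variation on the loop above, as a rough sketch, records which IDs failed and why, so that the batch can be summarised afterwards and the problem entries revisited. To keep it runnable without BioJava on the classpath, the loading step is hidden behind a small interface. The names BatchLoadSketch, Loader and Result are hypothetical and not part of BioJava; in real use the loader would presumably wrap a call such as cache.getStructure(pdbID). It can only record exceptions that actually propagate to the caller, so anything the parser catches internally will still not show up here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BatchLoadSketch {

    /** Hypothetical stand-in for the call that loads one entry by ID. */
    public interface Loader<T> {
        T load(String id) throws Exception;
    }

    /** Outcome of a batch run: loaded entries, plus a reason for each failure. */
    public static class Result<T> {
        public final Map<String, T> loaded = new LinkedHashMap<>();
        public final Map<String, String> failed = new LinkedHashMap<>();
    }

    public static <T> Result<T> loadAll(List<String> ids, Loader<T> loader) {
        Result<T> result = new Result<>();
        for (String id : ids) {
            try {
                T entry = loader.load(id);
                if (entry == null) {
                    // Nothing came back: note it and move on to the next ID.
                    result.failed.put(id, "not found");
                    continue;
                }
                result.loaded.put(id, entry);
            } catch (Exception e) {
                // Loading threw: remember the reason instead of aborting the batch.
                result.failed.put(id, String.valueOf(e.getMessage()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Fake loader for demonstration; it rejects one made-up ID.
        Loader<String> fake = id -> {
            if (id.equals("WRONGID")) {
                throw new IllegalArgumentException("bad id");
            }
            return "structure:" + id;
        };

        Result<String> r = loadAll(Arrays.asList("4hhb", "1cdg", "WRONGID"), fake);
        System.out.println("loaded: " + r.loaded.keySet());
        System.out.println("failed: " + r.failed);
    }
}
```

Keeping the failures in a map rather than only printing them means the inner loop runs solely over entries in loaded, and the failed IDs can be written out at the end of the run.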
>>>>>>> On Tue, Oct 26, 2010 at 9:59 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>>>>>> Glad to hear it; who doesn't like support or clean interfaces?  No
>>>>>>>> offense intended, by the way, with respect to PDB errors - obviously
>>>>>>>> the PDB is an indispensable resource for all protein scientists.
>>>>>>>>
>>>>>>>> I am looking at many (fixed-length) pieces of protein chains and doin'
>>>>>>>> stuff with 'em.  My current code has a pair of nested while loops; the
>>>>>>>> outer iterates over PDB entries (locally rsync'd copy), parsing them
>>>>>>>> and the inner iterates over the pieces from each.  When
>>>>>>>> StructureExceptions come out of my PDBFileReader object I want to
>>>>>>>> continue the outer loop, moving on to the next set of files without
>>>>>>>> executing any of the code that depends on correct StructureImpl
>>>>>>>> objects from the reader (database updates, the inner loop).
>>>>>>>> Since the reader's methods have their own try-catch blocks, a thrown
>>>>>>>> StructureException is stopped there and never reaches my own error
>>>>>>>> handling.  I just need to know when those errors occur so I can skip
>>>>>>>> those proteins - I am presuming that the correct entries will outweigh
>>>>>>>> the problem ones by a significant factor and the overall data won't be
>>>>>>>> seriously impacted.
>>>>>>>>
>>>>>>>> -da
>>>>>>>>
>>>>>>>> On Tue, Oct 26, 2010 at 21:11, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>>>>> Hi Daniel,
>>>>>>>>>
>>>>>>>>> can you explain a bit more what you are doing, in particular what
>>>>>>>>> errors you would like to deal with on your end?  You should not need
>>>>>>>>> to worry too much about exception handling. Are there any special
>>>>>>>>> cases you are interested in?  In this case we should support you with
>>>>>>>>> a clean interface rather than exception handling from your end...
>>>>>>>>>
>>>>>>>>> Andreas
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>>>>>>>> Hi all,
>>>>>>>>>> Let me first say thanks to all the BioJava community members for
>>>>>>>>>> delivering such a useful set of libraries, and that I'm still a newbie
>>>>>>>>>> when it comes to BioJava (and Java) so forgive me if my question is
>>>>>>>>>> too trivial.
>>>>>>>>>>
>>>>>>>>>> I am doing work on lots (at least thousands) of PDB files from RCSB.
>>>>>>>>>> As is commonly known, these are often rife with errors which can lead
>>>>>>>>>> to exceptions during parsing with PDBFileParser.  Because
>>>>>>>>>> PDBFileParser's methods contain their own try-catch blocks, exception
>>>>>>>>>> propagation stops there and my code proceeds blindly along regardless
>>>>>>>>>> of any error checking I do.  I would like to catch the exceptions up
>>>>>>>>>> in my code where the parser is called, so that I can branch to a
>>>>>>>>>> continue statement and have my batch processing loops move on to the
>>>>>>>>>> next file.
>>>>>>>>>> Should I edit out the try-catch blocks and compile my own version of
>>>>>>>>>> the library?  Or should I test the returned StructureImpl objects for
>>>>>>>>>> possession of the fields in question?  In that case, I'm not sure
>>>>>>>>>> which properties will give the most general success information...and
>>>>>>>>>> I'd rather not have to check for /every/ property being correct.
>>>>>>>>>>
>>>>>>>>>> If there is some great way to check if an exception was caught down a
>>>>>>>>>> series of nested method calls, please hit me over the head with it.
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>> -da
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> -----------------------------------------------------------------------
>>>>> Dr. Andreas Prlic
>>>>> Senior Scientist, RCSB PDB Protein Data Bank
>>>>> University of California, San Diego
>>>>> (+1) 858.246.0526
>>>>> -----------------------------------------------------------------------
>>>>>
>>>>
>>>
>>>
>>
>
