[Biojava-l] Bad PDB files and batch processing with PDBFileReader
Andreas Prlic
andreas at sdsc.edu
Fri Oct 29 00:08:49 UTC 2010
good, I was just about to say that... ;-)
Andreas
On Thu, Oct 28, 2010 at 4:51 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
> Ahh, I suppose that is the "problem" referred to in the wiki? I
> checked out successfully from the repository on github.
>
> -da
>
> On Thu, Oct 28, 2010 at 16:45, Daniel Asarnow <dasarnow at gmail.com> wrote:
>> It's not a big deal - after all if you use CA only, chains with no
>> CA's aren't important, and the error messages aren't that long. But
>> I'm going to switch anyway...
>> I'm getting the dreaded "can't read line length in file" error while
>> trying to checkout biojava-live/trunk, though.
>>
>> -da
>>
>> On Thu, Oct 28, 2010 at 10:28, Andreas Prlic <andreas at sdsc.edu> wrote:
>>> Hi Daniel,
>>>
>>> I just checked, this is a bug which is already resolved in 3.0... If
>>> it is an issue for you, you might want to upgrade... (should be very
>>> easy, if you start using Maven ...)
>>>
>>> Thanks,
>>> Andreas
>>>
>>> On Wed, Oct 27, 2010 at 9:04 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>> I'm using 1.7, partially because my distro had a package for it and
>>>> partially because I was initially using the online Javadoc a lot.
>>>> PDB ID 1a02 with CA only parses but gives 4 StructureExceptions; I've
>>>> pasted them below. Chain A exists in the PDB but is DNA, polypeptide
>>>> chain F appears to parse correctly.
>>>>
>>>> -da
>>>>
>>>> org.biojava.bio.structure.StructureException: could not find chain A
>>>> at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
>>>> at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
>>>> at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
>>>> at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>>> at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>>> at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>>> at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>>> at fragalign.pair.getStructs(pair.java:42)
>>>> at fragalign.Main.main(Main.java:40)
>>>> org.biojava.bio.structure.StructureException: could not find chain B
>>>> at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
>>>> at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
>>>> at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
>>>> at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>>> at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>>> at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>>> at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>>> at fragalign.pair.getStructs(pair.java:42)
>>>> at fragalign.Main.main(Main.java:40)
>>>> org.biojava.bio.structure.StructureException: did not find chain with chainId >A<
>>>> at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
>>>> at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
>>>> at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
>>>> at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>>> at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>>> at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>>> at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>>> at fragalign.pair.getStructs(pair.java:42)
>>>> at fragalign.Main.main(Main.java:40)
>>>> org.biojava.bio.structure.StructureException: did not find chain with chainId >B<
>>>> at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
>>>> at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
>>>> at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
>>>> at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>>> at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>>> at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>>> at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>>> at fragalign.pair.getStructs(pair.java:42)
>>>> at fragalign.Main.main(Main.java:40)
>>>>
>>>>
>>>> On Wed, Oct 27, 2010 at 17:47, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>> I assume AtomCache is a new class in BioJava3?
>>>>>
>>>>> yes it is... http://biojava.org/wiki/BioJava:CookBook:PDB:read3.0
>>>>>
>>>>>>
>>>>>> I must give you my embarrassed apology...after a bunch of testing I
>>>>>> finally figured out that I had misunderstood where the Parser's error
>>>>>> handling returns control and started going after the wrong exceptions.
>>>>>> It does look like if setParseCAOnly is true, the reader throws an
>>>>>> exception on chains with no CA atoms instead of just skipping them,
>>>>>> though the other chains are still parsed into the structure.
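The behaviour described above (chains without CA atoms being dropped while
the remaining chains are still parsed) can be checked from the caller's side
by inspecting what actually came back. The following is only a minimal
sketch under stated assumptions: it does not use the BioJava API, the class
CaOnlyFilter and the method chainsWithCa are hypothetical names, and the
atom names are illustrative rather than taken from any real PDB entry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of BioJava.
class CaOnlyFilter {

    // Returns the IDs of chains that contain at least one atom named "CA",
    // in the iteration order of the input map.
    static List<String> chainsWithCa(Map<String, List<String>> atomNamesByChain) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : atomNamesByChain.entrySet()) {
            if (entry.getValue().contains("CA")) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative atom names only (assumption), keyed by chain ID.
        Map<String, List<String>> chains = new LinkedHashMap<>();
        chains.put("A", Arrays.asList("P", "O5'", "C5'"));   // nucleic acid: no CA
        chains.put("F", Arrays.asList("N", "CA", "C", "O")); // polypeptide: has CA
        System.out.println(chainsWithCa(chains)); // prints [F]
    }
}
```

Comparing the kept chain IDs against the ones a downstream step expects is
one way to decide whether an entry should be skipped, without relying on an
exception reaching the calling code.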
>>>>>
>>>>> This sounds like there might be a problem with CA only.. do you have
>>>>> an example ID? also: are you on biojava 1.7 or 3.0 ?
>>>>>
>>>>> Andreas
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> -da
>>>>>>
>>>>>> On Tue, Oct 26, 2010 at 22:19, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>>> Hi Daniel,
>>>>>>>
>>>>>>> PDB files are better nowadays, due to remediation, however there are
>>>>>>> still issues..
>>>>>>>
>>>>>>> it sounds like you just want to figure out how to do the try/catch
>>>>>>> block properly. You could do something like this:
>>>>>>>
>>>>>>> boolean splitFileOrganisation = true;
>>>>>>> AtomCache cache = new AtomCache("/path/to/your/installation/", splitFileOrganisation);
>>>>>>>
>>>>>>> String[] pdbIDs = new String[]{"4hhb", "1cdg", "5pti", "1gav", "WRONGID"};
>>>>>>>
>>>>>>> for (String pdbID : pdbIDs) {
>>>>>>>     try {
>>>>>>>         Structure s = cache.getStructure(pdbID);
>>>>>>>         if (s == null) {
>>>>>>>             System.out.println("could not find structure " + pdbID);
>>>>>>>             continue;
>>>>>>>         }
>>>>>>>         // do something with the structure - your inner loop
>>>>>>>         System.out.println(s);
>>>>>>>     } catch (Exception e) {
>>>>>>>         // something crazy happened...
>>>>>>>         System.err.println("Can't load structure " + pdbID + " reason: " + e.getMessage());
>>>>>>>         e.printStackTrace();
>>>>>>>     }
>>>>>>> }
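The loop above needs BioJava on the classpath. The same skip-on-failure
pattern, including an inner loop over fixed-length pieces, can be sketched
without any library. Everything here is an assumption made for illustration:
BatchSkip, Loader, fragments and process are hypothetical names, the pieces
are taken as consecutive and non-overlapping (the thread does not say), and
the "sequence" is a placeholder string rather than real structure data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not part of BioJava.
class BatchSkip {

    // Stand-in for whatever loads an entry; may throw or return null on failure.
    interface Loader {
        String load(String id) throws Exception;
    }

    // Splits seq into consecutive, non-overlapping pieces of the given length.
    // A trailing remainder shorter than length is dropped.
    static List<String> fragments(String seq, int length) {
        List<String> pieces = new ArrayList<>();
        for (int i = 0; i + length <= seq.length(); i += length) {
            pieces.add(seq.substring(i, i + length));
        }
        return pieces;
    }

    // Outer loop over IDs, inner loop over pieces. An entry that fails to
    // load is recorded in skipped and none of its inner-loop work runs.
    static List<String> process(List<String> ids, Loader loader, int length,
                                List<String> skipped) {
        List<String> results = new ArrayList<>();
        for (String id : ids) {
            String seq;
            try {
                seq = loader.load(id);
            } catch (Exception e) {
                skipped.add(id);
                continue; // move on to the next entry
            }
            if (seq == null) {
                skipped.add(id);
                continue;
            }
            for (String piece : fragments(seq, length)) {
                results.add(id + ":" + piece);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Fake loader: fails for one ID, returns a placeholder string otherwise.
        Loader loader = id -> {
            if (id.equals("WRONGID")) {
                throw new Exception("no such entry");
            }
            return "ABCDEFG";
        };
        List<String> skipped = new ArrayList<>();
        List<String> results =
                process(Arrays.asList("entry1", "WRONGID", "entry2"), loader, 3, skipped);
        System.out.println("results: " + results);
        System.out.println("skipped: " + skipped);
    }
}
```

Collecting the skipped IDs in a list, rather than only printing them, makes
it possible to report after the batch how many entries were left out.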
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Oct 26, 2010 at 9:59 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>>>>>> Glad to hear it; who doesn't like support or clean interfaces? No
>>>>>>>> offense intended, by the way, with respect to PDB errors - obviously
>>>>>>>> the PDB is an indispensable resource for all protein scientists.
>>>>>>>>
>>>>>>>> I am looking at many (fixed-length) pieces of protein chains and doin'
>>>>>>>> stuff with 'em. My current code has a pair of nested while loops; the
>>>>>>>> outer iterates over PDB entries (locally rsync'd copy), parsing them
>>>>>>>> and the inner iterates over the pieces from each. When
>>>>>>>> StructureExceptions come out of my PDBFileReader object I want to
>>>>>>>> continue the outer loop, moving on to the next set of files without
>>>>>>>> executing any of the code that depends on correct StructureImpl
>>>>>>>> objects from the reader (database updates, the inner loop).
>>>>>>>> Since the reader's methods have their own try-catch blocks, a thrown
>>>>>>>> StructureException is stopped there and never reaches my own error
>>>>>>>> handling. I just need to know when those errors occur so I can skip
>>>>>>>> those proteins - I am presuming that the correct entries will outweigh
>>>>>>>> the problem ones by a significant factor and the overall data won't be
>>>>>>>> seriously impacted.
>>>>>>>>
>>>>>>>> -da
>>>>>>>>
>>>>>>>> On Tue, Oct 26, 2010 at 21:11, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>>>>> Hi Daniel,
>>>>>>>>>
>>>>>>>>> can you explain a bit more what you are doing, in particular what
>>>>>>>>> errors you would like to deal with on your end? You should not need
>>>>>>>>> to worry too much about exception handling. Are there any special
>>>>>>>>> cases you are interested in? In this case we should support you with
>>>>>>>>> a clean interface rather than exception handling from your end...
>>>>>>>>>
>>>>>>>>> Andreas
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>>>>>>>> Hi all,
>>>>>>>>>> Let me first say thanks to all the BioJava community members for
>>>>>>>>>> delivering such a useful set of libraries, and that I'm still a newbie
>>>>>>>>>> when it comes to BioJava (and Java) so forgive me if my question is
>>>>>>>>>> too trivial.
>>>>>>>>>>
>>>>>>>>>> I am doing work on lots (at least thousands) of PDB files from RCSB.
>>>>>>>>>> As is commonly known, these are often rife with errors which can lead
>>>>>>>>>> to exceptions during parsing with PDBFileParser. Because
>>>>>>>>>> PDBFileParser's methods contain their own try-catch blocks, exception
>>>>>>>>>> propagation stops there and my code proceeds blindly along regardless
>>>>>>>>>> of any error checking I do. I would like to catch the exceptions up
>>>>>>>>>> in my code where the parser is called, so that I can branch to a
>>>>>>>>>> continue statement and have my batch processing loops move on to the
>>>>>>>>>> next file.
>>>>>>>>>> Should I edit out the try-catch blocks and compile my own version of
>>>>>>>>>> the library? Or should I test the returned StructureImpl objects for
>>>>>>>>>> possession of the fields in question? In that case, I'm not sure
>>>>>>>>>> which properties will give the most general success information...and
>>>>>>>>>> I'd rather not have to check for /every/ property being correct.
>>>>>>>>>>
>>>>>>>>>> If there is some great way to check if an exception was caught down a
>>>>>>>>>> series of nested method calls, please hit me over the head with it.
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>> -da
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>>>>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> -----------------------------------------------------------------------
>>>>> Dr. Andreas Prlic
>>>>> Senior Scientist, RCSB PDB Protein Data Bank
>>>>> University of California, San Diego
>>>>> (+1) 858.246.0526
>>>>> -----------------------------------------------------------------------
>>>>>
>>>>
>>>
>>>
>>
>