[Biojava-l] Bad PDB files and batch processing with PDBFileReader
Daniel Asarnow
dasarnow at gmail.com
Thu Oct 28 23:51:25 UTC 2010
Ahh, I suppose that is the "problem" referred to in the wiki? I
checked out successfully from the repository on github.
-da
On Thu, Oct 28, 2010 at 16:45, Daniel Asarnow <dasarnow at gmail.com> wrote:
> It's not a big deal - after all if you use CA only, chains with no
> CA's aren't important, and the error messages aren't that long. But
> I'm going to switch anyway...
> I'm getting the dreaded "can't read line length in file" error while
> trying to checkout biojava-live/trunk, though.
>
> -da
>
> On Thu, Oct 28, 2010 at 10:28, Andreas Prlic <andreas at sdsc.edu> wrote:
>> Hi Daniel,
>>
>> I just checked, this is a bug which is already resolved in 3.0... If
>> it is an issue for you, you might want to upgrade... (should be very
>> easy, if you start using Maven ...)
>>
>> Thanks,
>> Andreas
>>
>> On Wed, Oct 27, 2010 at 9:04 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>> I'm using 1.7, partially because my distro had a package for it and
>>> partially because I was initially using the online Javadoc a lot.
>>> PDB ID 1a02 with CA only parses but gives 4 StructureExceptions; I've
>>> pasted them below. Chain A exists in the PDB but is DNA; polypeptide
>>> chain F appears to parse correctly.
>>>
>>> -da
>>>
>>> org.biojava.bio.structure.StructureException: could not find chain A
>>> at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
>>> at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
>>> at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
>>> at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>> at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>> at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>> at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>> at fragalign.pair.getStructs(pair.java:42)
>>> at fragalign.Main.main(Main.java:40)
>>> org.biojava.bio.structure.StructureException: could not find chain B
>>> at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
>>> at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
>>> at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
>>> at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>> at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>> at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>> at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>> at fragalign.pair.getStructs(pair.java:42)
>>> at fragalign.Main.main(Main.java:40)
>>> org.biojava.bio.structure.StructureException: did not find chain with chainId >A<
>>> at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
>>> at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
>>> at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
>>> at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>> at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>> at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>> at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>> at fragalign.pair.getStructs(pair.java:42)
>>> at fragalign.Main.main(Main.java:40)
>>> org.biojava.bio.structure.StructureException: did not find chain with chainId >B<
>>> at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
>>> at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
>>> at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
>>> at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>> at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>> at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>> at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>> at fragalign.pair.getStructs(pair.java:42)
>>> at fragalign.Main.main(Main.java:40)
>>>
>>>
>>> On Wed, Oct 27, 2010 at 17:47, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>> I assume AtomCache is a new class in BioJava3?
>>>>
>>>> yes it is... http://biojava.org/wiki/BioJava:CookBook:PDB:read3.0
>>>>
>>>>>
>>>>> I must give you my embarrassed apology...after a bunch of testing I
>>>>> finally figured out that I had misunderstood where the Parser's error
>>>>> handling returns control and started going after the wrong exceptions.
>>>>> It does look like, if setParseCAOnly is true, the reader raises exceptions
>>>>> on chains with no CAs instead of just skipping them, though the other
>>>>> chains are still parsed into the structure.
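A minimal sketch of the "skip instead of throw" behaviour discussed above. It is not BioJava's parser code: AtomRecord and caAtomsByChain are hypothetical names made up for illustration. The only point is that a chain contributing no CA atom (a DNA chain, say) can drop out of the result quietly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CaOnlyFilterSketch {

    /** Hypothetical minimal atom record: chain id plus PDB atom name. */
    static class AtomRecord {
        final String chainId;
        final String atomName;

        AtomRecord(String chainId, String atomName) {
            this.chainId = chainId;
            this.atomName = atomName;
        }
    }

    /**
     * Groups CA atoms by chain id, preserving first-seen chain order.
     * A chain with no CA atom never gets an entry, so nothing has to
     * be thrown or caught for it.
     */
    static Map<String, List<AtomRecord>> caAtomsByChain(List<AtomRecord> atoms) {
        Map<String, List<AtomRecord>> byChain = new LinkedHashMap<>();
        for (AtomRecord atom : atoms) {
            if (!"CA".equals(atom.atomName)) {
                continue; // not an alpha carbon, ignore
            }
            byChain.computeIfAbsent(atom.chainId, k -> new ArrayList<>()).add(atom);
        }
        return byChain;
    }

    public static void main(String[] args) {
        List<AtomRecord> atoms = Arrays.asList(
            new AtomRecord("A", "P"),    // nucleotide atoms: chain A has no CA
            new AtomRecord("A", "C1'"),
            new AtomRecord("F", "N"),
            new AtomRecord("F", "CA"),
            new AtomRecord("F", "CA"));
        // Only chain F survives the filter.
        System.out.println(caAtomsByChain(atoms).keySet());
    }
}
```

A real filter would also have to tell alpha carbons apart from calcium ions, which carry the atom name CA in PDB files as well.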
>>>>
>>>> This sounds like there might be a problem with CA only.. do you have
>>>> an example ID? also: are you on biojava 1.7 or 3.0 ?
>>>>
>>>> Andreas
>>>>
>>>>
>>>>
>>>>>
>>>>> -da
>>>>>
>>>>> On Tue, Oct 26, 2010 at 22:19, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>> Hi Daniel,
>>>>>>
>>>>>> PDB files are better nowadays, due to remediation, however there are
>>>>>> still issues..
>>>>>>
>>>>>> It sounds like you just want to figure out how to do the try/catch
>>>>>> block properly. You could do something like this:
>>>>>>
>>>>>> // imports, assuming the BioJava 3.0 package layout
>>>>>> import org.biojava.bio.structure.Structure;
>>>>>> import org.biojava.bio.structure.align.util.AtomCache;
>>>>>>
>>>>>> boolean splitFileOrganisation = true;
>>>>>> AtomCache cache =
>>>>>>     new AtomCache("/path/to/your/installation/", splitFileOrganisation);
>>>>>>
>>>>>> String[] pdbIDs = new String[]{"4hhb", "1cdg", "5pti", "1gav", "WRONGID"};
>>>>>>
>>>>>> for (String pdbID : pdbIDs) {
>>>>>>     try {
>>>>>>         Structure s = cache.getStructure(pdbID);
>>>>>>         if (s == null) {
>>>>>>             System.out.println("could not find structure " + pdbID);
>>>>>>             continue;
>>>>>>         }
>>>>>>         // do something with the structure - your inner loop
>>>>>>         System.out.println(s);
>>>>>>     } catch (Exception e) {
>>>>>>         // something crazy happened...
>>>>>>         System.err.println("Can't load structure " + pdbID
>>>>>>             + " reason: " + e.getMessage());
>>>>>>         e.printStackTrace();
>>>>>>     }
>>>>>> }
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 26, 2010 at 9:59 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>>>>> Glad to hear it; who doesn't like support or clean interfaces? No
>>>>>>> offense intended, by the way, with respect to PDB errors - obviously
>>>>>>> the PDB is an indispensable resource for all protein scientists.
>>>>>>>
>>>>>>> I am looking at many (fixed-length) pieces of protein chains and doin'
>>>>>>> stuff with 'em. My current code has a pair of nested while loops; the
>>>>>>> outer iterates over PDB entries (locally rsync'd copy), parsing them
>>>>>>> and the inner iterates over the pieces from each. When
>>>>>>> StructureExceptions come out of my PDBFileReader object I want to
>>>>>>> continue the outer loop, moving on to the next set of files without
>>>>>>> executing any of the code that depends on correct StructureImpl
>>>>>>> objects from the reader (database updates, the inner loop).
>>>>>>> Since the reader's methods have their own try-catch blocks, a thrown
>>>>>>> StructureException is stopped there and never reaches my own error
>>>>>>> handling. I just need to know when those errors occur so I can skip
>>>>>>> those proteins - I am presuming that the correct entries will outweigh
>>>>>>> the problem ones by a significant factor and the overall data won't be
>>>>>>> seriously impacted.
>>>>>>>
>>>>>>> -da
>>>>>>>
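One way to get the skip-on-error behaviour described above without recompiling the library, sketched under stated assumptions. ParsedEntry, QuietReader, UnusableEntryException and readStrict are hypothetical stand-ins, not BioJava classes, and "no chains in the result" is only one possible sign that a parse failed quietly. The wrapper checks what the reader returned and raises its own exception, so the outer loop can `continue` before any dependent code runs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BatchSkipSketch {

    /** Hypothetical stand-in for a parsed structure. */
    static class ParsedEntry {
        final String id;
        final List<String> chainIds = new ArrayList<>();

        ParsedEntry(String id) {
            this.id = id;
        }
    }

    /** Hypothetical reader that catches its own errors, as described above. */
    static class QuietReader {
        ParsedEntry read(String id) {
            ParsedEntry entry = new ParsedEntry(id);
            try {
                if (id.startsWith("bad")) {
                    throw new IllegalStateException("simulated parse failure for " + id);
                }
                entry.chainIds.add("A");
            } catch (IllegalStateException e) {
                // Swallowed here: the caller only ever sees an empty entry.
                System.err.println(e.getMessage());
            }
            return entry;
        }
    }

    /** Raised by the wrapper when the returned entry looks incomplete. */
    static class UnusableEntryException extends Exception {
        UnusableEntryException(String message) {
            super(message);
        }
    }

    /** Turns a silent failure back into an exception the caller can catch. */
    static ParsedEntry readStrict(QuietReader reader, String id)
            throws UnusableEntryException {
        ParsedEntry entry = reader.read(id);
        if (entry == null || entry.chainIds.isEmpty()) {
            throw new UnusableEntryException("no usable chains in " + id);
        }
        return entry;
    }

    /** Outer batch loop: returns the ids that were fully processed. */
    static List<String> processAll(List<String> ids) {
        QuietReader reader = new QuietReader();
        List<String> processed = new ArrayList<>();
        for (String id : ids) {
            ParsedEntry entry;
            try {
                entry = readStrict(reader, id);
            } catch (UnusableEntryException e) {
                continue; // skip the inner loop and any database update
            }
            for (String chainId : entry.chainIds) {
                // inner loop over the pieces of each chain would go here
            }
            processed.add(entry.id);
        }
        return processed;
    }

    public static void main(String[] args) {
        // "bad1" is skipped; the other two ids are processed.
        System.out.println(processAll(Arrays.asList("1abc", "bad1", "2xyz")));
    }
}
```

Which check counts as "usable" depends on what the inner loop needs, so the condition in readStrict is the part to adapt.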
>>>>>>> On Tue, Oct 26, 2010 at 21:11, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>>>> Hi Daniel,
>>>>>>>>
>>>>>>>> can you explain a bit more what you are doing, in particular what
>>>>>>>> errors you would like to deal with on your end? You should not need
>>>>>>>> to worry too much about exception handling. Are there any special
>>>>>>>> cases you are interested in? In this case we should support you with
>>>>>>>> a clean interface rather than exception handling from your end...
>>>>>>>>
>>>>>>>> Andreas
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>>>>>>> Hi all,
>>>>>>>>> Let me first say thanks to all the BioJava community members for
>>>>>>>>> delivering such a useful set of libraries, and that I'm still a newbie
>>>>>>>>> when it comes to BioJava (and Java) so forgive me if my question is
>>>>>>>>> too trivial.
>>>>>>>>>
>>>>>>>>> I am doing work on lots (at least thousands) of PDB files from RCSB.
>>>>>>>>> As is commonly known, these are often rife with errors which can lead
>>>>>>>>> to exceptions during parsing with PDBFileParser. Because
>>>>>>>>> PDBFileParser's methods contain their own try-catch blocks, exception
>>>>>>>>> propagation stops there and my code proceeds blindly along regardless
>>>>>>>>> of any error checking I do. I would like to catch the exceptions up
>>>>>>>>> in my code where the parser is called, so that I can branch to a
>>>>>>>>> continue statement and have my batch processing loops move on to the
>>>>>>>>> next file.
>>>>>>>>> Should I edit out the try-catch blocks and compile my own version of
>>>>>>>>> the library? Or should I test the returned StructureImpl objects for
>>>>>>>>> possession of the fields in question? In that case, I'm not sure
>>>>>>>>> which properties will give the most general success information...and
>>>>>>>>> I'd rather not have to check for /every/ property being correct.
>>>>>>>>>
>>>>>>>>> If there is some great way to check if an exception was caught down a
>>>>>>>>> series of nested method calls, please hit me over the head with it.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> -da
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> -----------------------------------------------------------------------
>>>> Dr. Andreas Prlic
>>>> Senior Scientist, RCSB PDB Protein Data Bank
>>>> University of California, San Diego
>>>> (+1) 858.246.0526
>>>> -----------------------------------------------------------------------
>>>>
>>>
>>
>>
>>
>>
>