[Biojava-dev] Should SEQRES/_pdbx_poly_seq_scheme records be part of headerOnly?

Tue Nov 24 00:08:51 UTC 2015

Greetings,

My company is preparing to submit a PR for Issue 353, "mmCIF parsing support for missing SEQRES information." The PR passes the existing integration tests, which expect an empty SEQRES component list when FileParsingParameters.headerOnly is set to true.

I suggest that SEQRES (PDB format) and _pdbx_poly_seq_scheme (mmCIF format) be considered part of the header. This would allow a user to extract the chain sequences from a file without the full, heavyweight parsing of the atom coordinate records. That is a valuable computational saving for people who mine header records across the PDB. Examples include building custom sequence collections for PDB-based BLAST databases and quickly converting local PDB/mmCIF structure files to sequences for calculating multiple sequence alignments.
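
To illustrate why the sequence records are cheap to read on their own, here is a minimal plain-Java sketch. It is not BioJava code, and the names SeqresSketch and readSeqres are hypothetical. It collects SEQRES residue names per chain from PDB-format lines and stops at the first ATOM/HETATM record, so the coordinate section is never touched. It assumes the standard fixed-column SEQRES layout: chain ID in column 12 and residue names in 3-character fields starting at column 20.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of BioJava.
public class SeqresSketch {

    /**
     * Collects SEQRES residue names per chain ID, in file order,
     * and stops reading at the first coordinate record.
     */
    public static Map<String, List<String>> readSeqres(List<String> pdbLines) {
        Map<String, List<String>> chains = new LinkedHashMap<>();
        for (String line : pdbLines) {
            // SEQRES records precede the coordinate section in PDB-format files,
            // so nothing after the first ATOM/HETATM line needs to be read.
            if (line.startsWith("ATOM") || line.startsWith("HETATM")) {
                break;
            }
            if (!line.startsWith("SEQRES") || line.length() < 12) {
                continue;
            }
            // Chain ID is column 12 (index 11).
            String chainId = line.substring(11, 12);
            List<String> residues = chains.computeIfAbsent(chainId, k -> new ArrayList<>());
            // Up to 13 residue names per line: columns 20-22, 24-26, ..., 68-70.
            for (int i = 19; i < 70 && i + 3 <= line.length(); i += 4) {
                String name = line.substring(i, i + 3).trim();
                if (!name.isEmpty()) {
                    residues.add(name);
                }
            }
        }
        return chains;
    }
}
```

A header-only parse that included these records would give the same per-chain residue lists without building any atom objects.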

I am asking the BioJava community for their thoughts on these questions:

1. Is it acceptable to elevate this sequence information to "the header"?
2. If so, is it acceptable to include this feature as part of Issue 353?
3. If not, is it acceptable to add a new FileParsingParameters option (e.g., "setParseSeqRes") that allows extracting the sequence information without the atomic coordinates?

Best regards,
Steve

--
Steve Darnell
DNASTAR, Inc.
Madison, WI USA