Welcome. Here we give a brief introduction to the current research at Conformational Search Solutions.

In the past decade, the research has focused on the mechanics of protein folding. It culminated in the hypothesis that protein structure can be characterized as a stable static equilibrium. We propose that the protein structure is fixed by interlocking between protein secondary structures, mainly through interactions among interior nonpolar sidechains.

This working hypothesis is described in detail in the publication “Modeling protein structure as a stable static equilibrium” (Physical Review E, Vol. 106, No. 2, August 2022). The abstract and figures of the paper can be viewed here (Physical Review E website). The paper has 23 pages and the Supplemental Material has 49 pages. The tables of contents are provided below:

CONTENTS

I. Introduction 1

II. The Mechanisms of Interlocking 4

A. Molecular forces in a static model of protein structure

B. Blocking, double blocking and mutual blocking

C. Interlocking between two substructures

D. Assembly of substructures on the basis of interlocking

III. Truss Representation of Core Assemblies 8

A. Interlocking represented in a truss

B. Load distribution problem for protein core assemblies

C. Structural strength of core assemblies: failure load

IV. Comparing Interlocking Features of Core Assemblies 10

A. Redundancy in core assembly: duplicate and circular interlocking

B. Concentrated interlocking assembly

C. A longer helix vs. two short helices from the same chain segment

D. Implementations

V. Results in Comparing Core Assemblies 12

A. Various interlocking types and assembly patterns in native structures

B. Comparing assembly features between native structures and decoys

VI. Discussions 17

A. A distinct characteristic of protein structural stability: Compressive support

B. Buckling load of a blocking interaction

C. Stability and determinacy of a core assembly viewed through truss representation

VII. Conclusion 20

CONTENTS Supplemental Material

I. Strength of compressive support: Repulsions between interior nonpolar sidechains 2

II. Gaps on substructures and sidechain sizes 8

III. Instability of blocking action: Simulations of sidechain motions 16

A. Fluctuations of the angles between the vector connecting centroids of two interacting sidechains and a substructure axis 16

B. Fluctuations of the orientations of a sidechain relative to the axis of the substructure 16

IV. Solving load distribution by resolving indeterminacy 17

A. Load distribution at a 2-bar joint of a triangle truss 17

B. Load distribution in an interlocking: Solving a truss with indeterminacy of third degree 20

C. Load distribution in a cross interlocking: Solving a tetrahedron truss 27

D. Load distribution in a core assembly with three substructures 30

E. A demonstration for why a distant bar in a truss may receive less load 34

V. The significance in restricting the axial translational motion 36

VI. The reduction of interlocking force due to a buried unneutralized charged group 38

VII. Comparison of core assembly features 41

A. Core assembly features of beta sheet proteins 41

B. Sensitivity of the core assembly results to parameter value choices 42

C. Pruning decoys on the basis of energetical properties 47

References 49

Presently two lines of research are ongoing: (1) a search program is being developed to enumerate likely secondary structure packing patterns based on the interlockings favored by the particular nonpolar-polar composition of the sequence; (2) a program for calculating the structural strength of protein core assemblies is being optimized so that it can be used in practical screening of the above-mentioned packing patterns.

In the decade prior (2002–2009), the research was partially funded by NIGMS under the project titled “Prioritized Assembly in Protein Conformational Search”. The following is a brief summary of the final report.

Efficient Enumeration of Assemblies of Protein Secondary Structures

While the apparent simplicity of some protein structures, such as 4-helix bundles and α-β barrels, suggests there might be a simple formulation of the physics involved, the vast diversity in the structural patterns and stabilities hints otherwise.

We propose to investigate protein structures by directly applying the ensemble approach of statistical mechanics. Such an approach could be practical if the partition function can be approximated by enumerating a sufficiently large set of low energy conformations.

A program has been developed for such enumeration. The enumeration scheme is designed to be fine in resolution and highly regular for efficient mathematical manipulation, yet consistent with the inherent protein geometry.

The program has produced conformations for α, β and α/β proteins with RMSDs around 3.2 Å for 100 residues. Singular value decomposition calculations show that the conformations are diversely populated.

The Structural Model

United-atom representation plus polar hydrogens.
Model strands and helices are defined using statistically obtained ϕ/ψ parameters. Model β-strands are refined to consider β bulges (J. Richardson 1989), gaps and strand breaks. β-strand ϕ/ψ’s are further refined to improve H-bonding potential.
We assume each helix or strand packs against at least one helix or strand.
Helix-helix knob-into-hole packing, as shown in fig. 1, is well known. We generalize it into a unified packing and indexing scheme. When two secondary structures pack, each contributes a pair of “knobs” or “holes” as the anchors for packing. Each C-α locates a knob. The void in the middle of knobs, defined by the delineating C-α’s, is a hole. The knobs or holes on a secondary structure surface form varied patterns, called streaks.
The angular distance between C-α atoms of residues i and i+1 on a helix is almost exactly 100°. This implies that the angular distance between any two C-α’s is a multiple of 20°, which confers a strong regularity on the streaks and makes mathematical manipulation easier.
When each party of a packing pair selects two anchors, knobs or holes, the packing angle is fully determined. Thus, the space of secondary structure assemblies can be fully indexed by knob-hole matching, assuming the packing distance is determined by an optimality requirement.
Sheet conformations are generated by enumerating strand layouts as if on a 2D lattice. The conformation of a sheet-sheet pair is produced through knob/hole matching, and likewise for helix-strand and helix-helix pairs.
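The 20° regularity stated above follows from simple modular arithmetic: stepping by 100° around a 360° circle can only land on multiples of gcd(100, 360) = 20. A minimal sketch that checks this, assuming an idealized helix with exactly 100° per residue (the function name is illustrative only and is not part of the enumeration program):

```python
from math import gcd

RISE_DEG = 100  # angular step between consecutive C-alpha atoms, as stated in the text


def angular_offsets(max_separation):
    """Angular distance in degrees (modulo 360) between residues i and i+k, for k = 1..max_separation."""
    return [(k * RISE_DEG) % 360 for k in range(1, max_separation + 1)]


offsets = angular_offsets(18)
distinct = sorted(set(offsets))

# Every offset lies on a 20-degree grid because gcd(100, 360) == 20.
print(gcd(RISE_DEG, 360))
print(distinct)
```

With 18 consecutive separations the offsets visit every grid position once, which is why the streak patterns can be indexed on a regular 18-position angular grid under this idealization.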

The Enumeration Scheme

A conformation is constructed through a “build-up” process: the core of the conformation, an assembly of secondary structures, is built first.
The enumeration starts with the multiple secondary structure assignment for a given chain, which usually generates hundreds to thousands of assignments for a chain of about 100 residues.
The assembly is priority ordered, with the h-bonding pairs, i.e., intra-sheet strand-strand packing, first.
Strand-strand h-bonded packing can be optimized before helices are packed against the strands.
All possible pairings of strands and helices are pre-calculated for each strand or helix pair based on the streak patterns and are reusable.
To further speed up computation, an extra layer of enumeration, the topological map, is introduced.
A loop closure procedure for 2-3 residue loops has been implemented. It is capable of handling loops that need a cis-peptide. The rest of the loop is enumerated by single residues. The loop residues on a chain are enumerated in a prioritized way so that the residue with the fewest moves is considered first.
In the open space, for a loop length of n_l, we can expect nearly n_l⁴ possible loop configurations. However, when secondary structures and some loops have been placed, the number of configurations is severely reduced.

Figure 1: Matching and rotation of helix “knob into hole” patterns.

Figure 2: Topological map as an abstraction of knob-hole packing patterns. Generating the topological map is an auxiliary step in the enumeration. In comparison with the four distinct packing positions illustrated in fig. 1, the relative orientations of cylinders at this stage are only marked as lying in the regions of parallel, antiparallel, orthogonal, or “anti-orthogonal”. A helix is considered to be positioned to the left, right, front or back side relative to another helix. Here, in a.2, if helix C1 is to the front of Cb, C2 will be to the back of Cb. In b.2, if helix C1 is to the front of Cb, then depending on the axial directions, C2 is to the left or right of Cb. Both the orientation and the packing side will be expanded in the subsequent knob-into-hole packing.

Topological map

Geometric constraints, such as excluded volume, compactness and connectability between secondary structures, are powerful in pruning the search tree. These can be applied at a coarser-grained level of structural assembly.
While the knob-hole packing scheme is sufficient to define a refined conformational space, an additional layer of enumeration is imposed purely for efficiency.
A packing sequence, or topological map, is a sequence of packing pairs in which packing positions are simply described as on the left, right, front or back side and oriented parallel, antiparallel, orthogonal (90°) or “antiorthogonal” (270°). This essentially describes the topological relations of the structural elements. The axis direction of each helix or strand is resolved when making connections, based on the loop length and connection distance.
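One way to picture such a packing sequence is as a list of labelled pairs. The sketch below is only an illustration of the coarse labelling described above. The tuple layout, the element names H1 to H3, and the numeric angles for parallel (0) and antiparallel (180) are assumptions, not taken from the enumeration program:

```python
from itertools import product

# Coarse labels taken from the text. The angles for "parallel" and
# "antiparallel" are assumptions added for illustration.
SIDES = ("left", "right", "front", "back")
ORIENTATIONS = {"parallel": 0, "orthogonal": 90, "antiparallel": 180, "antiorthogonal": 270}


def coarse_relations():
    """All (side, orientation) labels available for one packing pair."""
    return list(product(SIDES, ORIENTATIONS))


def is_valid_map(packing_sequence):
    """Check that every entry uses a known side and orientation label."""
    return all(side in SIDES and orient in ORIENTATIONS
               for _, _, side, orient in packing_sequence)


# Hypothetical three-helix map: (element, partner, side, orientation).
example_map = [
    ("H1", "H2", "left", "antiparallel"),
    ("H2", "H3", "front", "orthogonal"),
]

print(len(coarse_relations()))  # 4 sides x 4 orientations
print(is_valid_map(example_map))
```

Under this labelling each packing pair has only 16 coarse relations, which is what makes geometric pruning cheap before the finer knob-hole expansion.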

Figure 3: Cascade of mappings. An overview of the formalism and its corresponding enumeration program. The rectangular boxes are data and the round-cornered boxes are mappings. In each rectangular box, F indicates the estimated branching factor for the mapping step. This factor is far smaller than the size of a full combinatorial enumeration because the mapping result is ranked by the potential for optimality and then truncated: if a particular element in the combinatorial set will not entail conformations of sufficiently low potential, it is discarded.
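The rank-then-truncate idea in the caption can be sketched generically. This is a minimal illustration, assuming each candidate can be given an estimated potential by some scoring function; `truncated_mapping`, `estimated_potential` and the toy values are hypothetical and do not come from Upbuild:

```python
import heapq


def truncated_mapping(candidates, estimated_potential, branching_factor):
    """Rank candidates by estimated potential (lower is better) and keep the best few."""
    return heapq.nsmallest(branching_factor, candidates, key=estimated_potential)


# Toy candidates with made-up potentials, purely for illustration.
toy_potentials = {"a": -12.0, "b": -3.5, "c": -20.1, "d": -7.2, "e": -15.8}
kept = truncated_mapping(toy_potentials, toy_potentials.get, branching_factor=3)
print(kept)
```

Applying such a step at every mapping keeps the number of branches near F per step instead of the full combinatorial count.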

Mapping a sequence to assemblies of secondary structures

In fig. 3, box 2, each PDB sequence is segmented into helix and strand intervals according to the residue propensities for α, β or coil. Alternative ways of segmenting are applied, yielding multiple secondary structure assignments. Each segmentation is scored by the number of residues whose secondary structure designations are consistent with their propensities.
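The scoring rule above amounts to counting agreements between an assignment and the per-residue propensities. A minimal sketch, assuming propensities are given as one label per residue ('H' helix, 'E' strand, 'C' coil) and segments as 1-based inclusive intervals; the function name and the toy propensity string are hypothetical:

```python
def score_segmentation(propensity, segments):
    """Count residues whose assigned state matches their propensity label.

    propensity: string of 'H', 'E' or 'C', one character per residue (assumed encoding).
    segments: list of (start, end, kind) with 1-based inclusive residue numbers.
    Residues outside every segment are treated as coil ('C').
    """
    assigned = ["C"] * len(propensity)
    for start, end, kind in segments:
        for i in range(start - 1, end):
            assigned[i] = kind
    return sum(a == p for a, p in zip(assigned, propensity))


toy_propensity = "CCHHHHCCEEEC"  # made-up 12-residue example
segmentation_a = [(3, 6, "H"), (9, 11, "E")]
segmentation_b = [(2, 6, "H"), (9, 11, "E")]  # helix boundary shifted by one residue

print(score_segmentation(toy_propensity, segmentation_a))
print(score_segmentation(toy_propensity, segmentation_b))
```

Shifting a boundary by one residue changes the score by at most one, so near-native alternatives score close to each other and several of them can be retained.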

An all-helix assignment goes straight to box 6. An α-β assignment first goes to boxes 3 and 4. After the strand layouts are enumerated and the h-bonds optimized, each resulting sheet is sent to box 6. At this step, there is not enough geometric detail for ranking by exact potential, but geometric constraints with energetic consequences can be applied.

At box 7 the topological map is expanded into refined knob-hole packing. In box 8, the loops are added.


PDB ID	Chain length	No. of core residues	No. of helices & strands	Core RMSD w. native param.	Core RMSD	Conf RMSD	Reference RMSD
2MHR	118	73	(4 0)	0.910	1.361	2.300	–
1NKL	78	56	(5 0)	–	1.777	2.522	3.836
1ECA	136	112	(8 0)	–	3.010	3.236	–
1MBC	153	119	(8 0)	–	2.122	3.283	–
1CTF	68	57	(3 3)	1.900	3.145	3.787	4.438
4FXN	138	89	(4 5)	1.890	3.477	4.700	–
8DFR	186	99	(5 8)	–	4.557	5.408	–
1PLC	99	48	(0 8)	–	3.050	4.550	–
1REI	107	49	(0 9)	–	3.650	4.750	–

Table 1: RMSDs of full conformations generated by Upbuild with respect to PDB structures

Results

Nine PDB sequences, representing α, β and α-β structures, were selected for experimenting with Upbuild, the enumeration program. To get a more definite comparison, we use both RMSDs and potentials as the criteria for program performance.

Table 1 shows the closest RMSDs of generated conformations for near-native secondary structure assignments. All RMSDs are achieved with model strands and helices, including the column labeled “Core RMSD w. native param.”, where the packing parameters, i.e., the translation and the rotations, are extracted from PDB structures. The column “Core RMSD” shows the result with the KH-packing (knob-hole packing) enumeration. The column “Conf RMSD” indicates the RMSD for the full conformation. The “Reference RMSD” column gives the minimum RMSD values reported in the decoy database from Levitt’s lab (“Decoys ‘R’ Us: a database of incorrect conformations to improve protein structure prediction”, Ram Samudrala and Michael Levitt, Protein Sci., 2000, vol. 9, 1399–1401).


PDB ID	Chain length	Native	Native cutf=9	KH-Enum	KH-Enum cutf=9	KH-Enum MD	Ref.	Ref. cutf=9
1NKL	78	-2404.51	-2417.85	-2479.80	-2542.20	-2581.56	-2448.40	-2460.90
1CTF	68	-2079.14	-2096.01	-2100*	-2140*	-2223*	-2120.00	-2142.22
4FXN	138	-4285.20	-4321.20	-4315*	-4382*	-4405*	–	–
1PLC	99	-2950.23	-2984.06	-2979.90	-2991.04	–	–	–

Table 2: Effective energy values for Upbuild-generated conformations. The table shows the lowest potential achieved by Upbuild in comparison with other sources. All conformations are minimized with a standard L-BFGS quasi-Newton procedure, using the EEF1 potential. “cutf=9” indicates that a cutoff of 9 Å is used. Columns “Native” and “Native cutf=9” give the effective energy of minimized PDB structures. “KH-Enum” and “KH-Enum cutf=9” are for the Upbuild conformations; for these, a 15 ps molecular dynamics run is done before minimization to relax the conformations. “KH-Enum MD” is for values obtained through a more extended, 60 ps MD run. “Ref.” and “Ref. cutf=9” are for the decoy conformations from Levitt’s lab.

Diversity of the Conformations

Consider C ∈ R^m: C-α coordinates of a conformation, m = 3L, L: chain-length.

Each C represents a distinct conformation, separated by C-α RMSD ≥ 3.0 Å. Each conformation is minimized and has acceptable effective energy.

C₀: C-α coordinates of the reference conformation.

P = C − C₀: a point in the conformational space.

A^∗ = [P₁,P₂,...P_n]: Sampling space determined by the set of distinct conformations.

Using Singular Value Decomposition to Evaluate the Diversity

A = U Σ V^∗

A, Σ ∈ R^{n×m}; U ∈ R^{n×n}; V ∈ R^{m×m}

Σ = D(σ_1, σ_2, ..., σ_j, ..., σ_m); V = [v_1, v_2, ..., v_m]

σ_1 ≥ σ_2 ≥ ... ≥ σ_j ≥ ... ≥ σ_m ≥ 0

(A^∗)_i = Σ_{j=1}^{m} (σ_j u_{ij}) v_j

The columns of V are the right singular vectors (RSVs), representing the composition of the space. Each column vector of U is a principal component (PC) scaled by σ in principal component analysis (PCA) under the “natural scaling”.

A singular value σ_j is significant if σ_j/σ_1 ≥ δ. We choose δ = 0.02.

Let J be such that σ_J/σ_1 ≥ δ ∧ σ_{J+1}/σ_1 < δ.

We call J the effective dimension, a measure of diversity.
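Given the singular values, J is a simple count. A minimal sketch, assuming the singular values have already been computed by some SVD routine (for example `numpy.linalg.svd(A, compute_uv=False)`); the function name and the toy values below are illustrative only:

```python
def effective_dimension(singular_values, delta=0.02):
    """Number of singular values sigma_j with sigma_j / sigma_1 >= delta.

    Because the values are sorted in descending order, this count equals the
    index J with sigma_J / sigma_1 >= delta and sigma_{J+1} / sigma_1 < delta.
    """
    sigma = sorted(singular_values, reverse=True)
    if not sigma or sigma[0] <= 0:
        return 0
    return sum(1 for s in sigma if s / sigma[0] >= delta)


# Made-up singular values; the ratios to the largest are 1, 0.5, 0.1, 0.03, 0.01.
toy_sigma = [10.0, 5.0, 1.0, 0.3, 0.1]
print(effective_dimension(toy_sigma))
```

The threshold is relative to σ_1, so rescaling all coordinates by a constant leaves J unchanged.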

Because of the oversampling or undersampling of regions of the space, an iterative procedure needs to be applied to maximize J.

The result is shown in table 3 for 1NKL, chain length L = 78, full dimension m = 234.


Structure type of conformations	No. of distinct conformations	Effective dimension
Native secondary structure	1200	159
Helix only	+3675	179
α-β	+2995	205
2-sheet	+550	206
α-β	+1772	208
Reference	11660*	195

Table 3: SVD results for various conformations for 1NKL. The experiment is conducted for different structural types of conformations that are collected when enumerating for different secondary structure assignments (cf. fig. 3). All of the types are enumerated selectively. The row “Native secondary structure” indicates a secondary structure assignment identical to the sequence of intervals ((3 18 H) (24 37 H) (42 51 H) (53 61 H) (66 72 H)) taken directly from the PDB file, or a sequence with slightly changed interval boundaries, e.g., from (3 18 H) to (2 18 H). “α-β” indicates a single β-sheet with two or three strands plus helices. “2-sheet” indicates two β-sheets only. Reference conformations are taken from the decoy library mentioned earlier. In the column “Number of distinct conformations”, starting from the second row, the number indicated is added to the numbers of the previous rows. Thus, the total for the experiment is 10192. In contrast, the number given for the reference conformations is the net count of decoys; a rough estimate of their corresponding distinct conformations is about 3000 to 4000. The change of diversity shown in the last column indicates that the basic characteristics of the conformation (or conformational changes relative to the native conformation) are dominant in the decomposition: the value immediately reaches 159 and then increases ever more slowly as conformations are added.
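The cumulative reading of the conformation counts can be checked with a running sum. The numbers below are copied from Table 3; the variable names are illustrative:

```python
from itertools import accumulate

# (row label, conformations added by that row), copied from Table 3.
increments = [
    ("Native secondary structure", 1200),
    ("Helix only", 3675),
    ("α-β", 2995),
    ("2-sheet", 550),
    ("α-β", 1772),
]

running_totals = list(accumulate(count for _, count in increments))
for (label, _), total in zip(increments, running_totals):
    print(f"{label}: {total}")
```

The final running total reproduces the 10192 conformations quoted in the caption.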

Feasibility of Approximating the Partition Function

Partition function for an ensemble with a quantized interaction potential: Q = Σg(h)e^{−𝜖h∕kT}.

g(h) decreases exponentially with h. But the weight e^{−𝜖h∕kT} increases exponentially with h.

If a sequence has a unique structure, then the most probable species, with highest h, dominates Q. Even if it does not, several high h levels combined may still dominate Q.

Question: How many levels down from h_max should we collect conformations to approximate Q?

Using a residue grain-size resolution, quantized potential, lattice geometry and with simplifying assumptions about g(h) behavior, it can be shown that only two more levels of conformations need to be collected to approximate Q to within 2% error. This may have implications for ensembles of realistic protein conformations.
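The dominance of the top levels can be explored with a toy model. The sketch below is not the calculation from the report: it assumes a purely geometric degeneracy, g(h) growing by a constant factor `growth` for each level below h_max, and a negative 𝜖∕kT so that the Boltzmann weight increases with h. All parameter values are arbitrary assumptions chosen for illustration:

```python
import math


def top_levels_fraction(h_max, growth, eps_over_kT, levels_down):
    """Fraction of Q captured by levels h_max, h_max - 1, ..., h_max - levels_down.

    Terms are expressed relative to the top level: one step down multiplies
    g(h) by `growth` and the weight e^{-eps*h/kT} by e^{eps/kT} (eps < 0).
    """
    ratio = growth * math.exp(eps_over_kT)
    terms = [ratio ** d for d in range(h_max + 1)]
    return sum(terms[: levels_down + 1]) / sum(terms)


# Arbitrary toy parameters: 21 levels, degeneracy x5 per level down, eps/kT = -3.
for k in range(4):
    print(k, round(top_levels_fraction(20, 5.0, -3.0, k), 4))
```

In this toy setting the captured fraction depends only on the product growth · e^{𝜖∕kT}: when that product is well below 1, a few levels below h_max account for nearly all of Q, and when it approaches or exceeds 1 the top levels no longer dominate.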