cgDNA+: A sequence-dependent coarse-grain model of DNA with explicit description of bases and phosphate groups.

We introduce a sequence-dependent coarse-grain model of double-stranded DNA with an explicit description of both the bases and the phosphate groups as interacting rigid bodies. The model parameters are trained on extensive, state-of-the-art, large-scale molecular dynamics (MD) simulations. The model paradigm relies on three main approximations: 1) nucleic acid bases and phosphate groups are rigid, 2) interactions are nearest-neighbour and can be modelled with a quadratic energy, 3) model parameters have dimer sequence dependence. For an arbitrary sequence, the model predicts a sequence-dependent Gaussian equilibrium probability distribution. The parameter set comprises dimer-based elements, which are used to reconstruct, for any sequence of any length, both the mean configuration, called the ground state, which can have strong non-local sequence dependence, and the precision, or stiffness, matrix. This prediction step is sufficiently efficient that it is straightforward to construct probability density functions for millions of fragments, each a few hundred base pairs long. The estimation of a parameter set consists in minimising the sum of Kullback–Leibler divergences between Gaussians predicted by the model and analogous Gaussians estimated directly from MD simulations of a training library of sequences. The training library comprises a short list of short palindromic DNA sequences. We designed the palindromic library using an ad hoc algorithm to include multiple instances of all independent tetramer sub-sequences. We exploit palindromic symmetry properties to study the convergence of the statistics extracted from MD simulations of palindromes and to define palindromically symmetrised estimators of first and second centred moments. The computation of the parameter set is delicate and requires sophisticated numerics.
We present an efficient and reliable procedure for estimating a complete parameter set, which involves a generalisation of the classic Fisher information matrix and its relationship to the relative entropy, or Kullback–Leibler divergence. The model is a computationally efficient tool that allows the study of the mechanical properties of double-stranded DNA of arbitrary length and sequence. We use the model to study the sequence-dependent rigidity of DNA, and we compute sequence-dependent apparent and dynamic persistence lengths. The explicit treatment of the phosphate groups also allows computation of sequence-dependent groove widths. Moreover, with a fine-grained representation of the predicted ground states, we can also study the sequence dependence of sugar puckering modes and BI–BII backbone conformations.
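
For orientation, the quadratic energy, the Gaussian density it induces, and the Kullback–Leibler divergence between two such Gaussians can be written out explicitly. This is a sketch in assumed notation, not taken from the text above: w is a configuration vector of dimension n, \hat{w} the ground state, K the stiffness (precision) matrix, and the energy is taken in units of k_B T. Which of the two densities plays the role of the model prediction and which the MD estimate is not specified here, so the subscripts 1 and 2 are left generic.

```latex
% Assumed notation: w configuration, \hat{w} ground state,
% K stiffness (precision) matrix, n = \dim w, energy in units of k_B T.
U(w) = \tfrac{1}{2}\,(w-\hat{w})^{T} K\,(w-\hat{w}),
\qquad
\rho(w) = \sqrt{\frac{\det K}{(2\pi)^{n}}}\; e^{-U(w)}

% Standard closed form for two Gaussians \rho_1, \rho_2 with
% means \hat{w}_1, \hat{w}_2 and precision matrices K_1, K_2:
D_{\mathrm{KL}}(\rho_1 \,\|\, \rho_2)
  = \tfrac{1}{2}\Big[ \operatorname{tr}\!\big(K_2 K_1^{-1}\big) - n
  + (\hat{w}_1-\hat{w}_2)^{T} K_2\,(\hat{w}_1-\hat{w}_2)
  + \ln\frac{\det K_1}{\det K_2} \Big]
```

The divergence is zero exactly when the two ground states and the two stiffness matrices coincide, and it is not symmetric in its two arguments, so the order in which the predicted and estimated Gaussians are compared matters.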