Mathematical modelling of DNA
General information
Instructor:
Schedule:
Lectures [starting 20.09.2022]: Tuesday 16h15 to 18h00, room MAA330
Exercises [starting 21.09.2022]: Wednesday 10h15 to 12h00, room CM1100
Assistants:
Raushan Singh, Rahul Sharma (rahul.sharma@epfl.ch)
Course
Requirements
First- and second-year courses in mathematics or physics (or the teacher's permission).
Helpful although not required
Differential Geometry of Framed Curves (MATH-423).
Contents
This course is designed to be an introduction, within the particular context of DNA, to the interplay between analysis, computation and experiment that makes up the process called mathematical modelling. In addition to students whose primary interest is in DNA, the syllabus is intended for students wishing an introduction to the modelling process in general, and the course will describe a number of widely encountered mathematical and computational techniques.
The course will be a detailed introduction to the cgDNA and cgDNA+ sequence-dependent coarse-grain models of DNA, including both how to use them to predict various biologically pertinent sequence-dependent expectations with an associated Monte Carlo code, and all the extensive underlying applied mathematics necessary to estimate their parameter sets from a library of Molecular Dynamics simulations. The cgDNA models are a research tool with their own web page. The course will work through the details of the publications described on that page, specifically [1], [2], [3] and [6] below.
The course has four chapters.
0) Introduction to DNA and a brief overview of its coarse-grain models.
1) The sequence-dependent cgDNA models.
2) Monte Carlo methods for sampling cgDNA model equilibrium distributions and application to DNA persistence lengths.
3) Parameter estimation for the cgDNA models from Molecular Dynamics time series.
Note: the login ID to access the lecture notes is "moddna"; the password will be communicated by email.
Information for the exam.
The exam is an oral exam. We will communicate by email the room and the starting time for each student. The exam will last 20 to 25 minutes per student and no preparation time is provided. The questions cover all the exercises (theoretical and computational) plus the surrounding theory from the course. The exam can be taken in French or English.
Week-by-week correspondence
Week 1 (20.09) 
Description of the basic structure of DNA, and multiscaling (or coarse-graining) approaches. The need for a tertiary structure model of DNA, i.e. a sequence-dependent coarse-grain model. Overview of the cgDNA coarse-grain model to predict a Gaussian pdf for the configuration distribution of a DNA fragment of given sequence. Here are the lecture notes and video of this week. Here you have supplementary material for this first lecture, e.g. some videos describing atomistic Molecular Dynamics (or MD) very fine-grain simulations of DNA.
Week 2 (27.09) 
Definition of the Inputs (configuration coordinate w, sequence S and parameter set P) and Output (a Gaussian pdf, with its ground state and stiffness matrix, on the configuration space coordinates w) of a cgDNA model. Motivation of why it is reasonable to expect an equilibrium configuration space pdf to exist, and why it is reasonable to estimate coarse-grain parameter sets P with a few intensive fine-grain atomistic Molecular Dynamics simulations of dsDNA plus solvent bath. Here are the lecture notes and video and audio of this week. Note that in the lecture of this week the symbol "N" is mistakenly used to denote two different things, namely the number of basepairs in the sequence S, and also the dimension of the configuration space coordinate vector w (and of the ground state \mu and the size of the stiffness matrix K). These should have been two different symbols. The length of the sequence is certainly related to the dimension of the configuration space coordinate vector (see future lectures) but they are never the same.
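As a sketch of the Output described above, and assuming the standard multivariate Gaussian form, the pdf can be written as follows. The symbol n for the dimension of w is introduced here only to avoid the clash of N's noted above, and any scaling or normalisation conventions used in the lecture notes may differ.

```latex
\rho(w; S, P) = \frac{1}{Z}\,
  \exp\left\{ -\tfrac{1}{2}\,\bigl(w - \mu(S,P)\bigr)^{T} K(S,P)\,\bigl(w - \mu(S,P)\bigr) \right\},
\qquad
Z = \sqrt{\frac{(2\pi)^{n}}{\det K(S,P)}} .
```

Here \mu(S,P) is the ground state and K(S,P) the symmetric positive definite stiffness matrix, both of which depend on the sequence S and the parameter set P.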
Week 3 (04.10) 
Two lectures preparing for the specification of our dsDNA coordinate system using chains of rigid bodies. Lecture 1, the matrix groups SO(3) and SE(3). Lecture 2, interpretation of SO(3) and SE(3) as rigid body configurations and relative rigid body displacements between pairs of rigid body configurations. Here are the lecture notes and video and audio of Lecture 1. Here are the lecture notes and video and audio of Lecture 2. 
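The following is a minimal NumPy sketch of the two ideas in these lectures, not the course's own code: a rigid body configuration is stored as a 4x4 homogeneous matrix in SE(3), and the relative displacement between two bodies is unchanged when both are moved by the same rigid motion. All function names and numerical values are hypothetical and chosen only for illustration.

```python
import numpy as np

def rotation_about_axis(axis, angle):
    """Rodrigues formula: the rotation matrix in SO(3) about a given axis."""
    k = np.asarray(axis, dtype=float)
    k = k / np.linalg.norm(k)
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def se3(R, r):
    """Pack an orientation R and an origin r into a 4x4 homogeneous matrix."""
    g = np.eye(4)
    g[:3, :3] = R
    g[:3, 3] = r
    return g

def se3_inverse(g):
    """Inverse of a homogeneous matrix, using R^{-1} = R^T."""
    R, r = g[:3, :3], g[:3, 3]
    return se3(R.T, -R.T @ r)

def relative_displacement(g_a, g_b):
    """Displacement from body a to body b, with components in the frame of body a."""
    return se3_inverse(g_a) @ g_b

# Two hypothetical rigid body configurations.
R_a = rotation_about_axis([0.0, 0.0, 1.0], 0.3)
R_b = rotation_about_axis([1.0, 1.0, 0.0], 0.7)
g_a = se3(R_a, np.array([0.0, 0.0, 0.0]))
g_b = se3(R_b, np.array([0.2, -0.1, 3.4]))
rel = relative_displacement(g_a, g_b)

# Moving both bodies by the same rigid motion h leaves the relative displacement unchanged.
h = se3(rotation_about_axis([0.0, 1.0, 0.0], 1.1), np.array([5.0, -2.0, 1.0]))
rel_moved = relative_displacement(h @ g_a, h @ g_b)
```

The invariance of `rel` under the common motion `h` is what makes relative displacements natural candidates for internal coordinates of a chain of rigid bodies.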
Week 4 (11.10)
Symmetric coordinates for relative SE(3) displacements between a pair of rigid bodies, and the importance of introducing a midpoint frame. The definition of the mid frame involves the (principal) square root of a rotation matrix, but the Euclidean average of the origins (more detail on these choices in Exercise Session 4). With the components of the relative translation expressed in the mid frame, and the components of the Cayley vector of the relative rotation expressed in any of the three frames R, R^+, R^-, the transformation on the coordinates corresponding to reversing the roles of the + and - bodies is u <-> -u, v <-> -v. A related transformation for the Crick-Watson strand symmetry will be used in our coordinate system for double-stranded DNA, but it is different, because for the moment no account is taken of the additional feature that the Crick and Watson embedding rules for frame into base are different. See Qu. 2 in Series 4. Overview of material on Cayley transforms and Cayley vectors already introduced in the exercise sets. Geometrical view of the relations between axis-angle, Euler-parameter/quaternion and Cayley vector parametrisations of SO(3). Here are the lecture notes and video and audio of this week. In the following lectures and exercises we will make use of various matrix factorisations. A brief summary of the results we will use is provided in this PDF. Most or indeed all of the factorisations should be familiar to you.
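The sign change under swapping the two bodies can be checked numerically. The sketch below is an illustration under stated assumptions rather than the course's implementation: the Cayley vector is scaled so that its norm is tan(theta/2) (the course may use a different scaling), the mid frame orientation is R^- times the principal square root of the relative rotation, and all function names and test values are hypothetical.

```python
import numpy as np

def skew(u):
    """Skew-symmetric matrix [u x] of a 3-vector."""
    return np.array([[0.0, -u[2], u[1]],
                     [u[2], 0.0, -u[0]],
                     [-u[1], u[0], 0.0]])

def cay(u):
    """Cayley transform: rotation by 2*arctan(|u|) about the axis u/|u|."""
    U = skew(np.asarray(u, dtype=float))
    I = np.eye(3)
    return (I + U) @ np.linalg.inv(I - U)

def cayley_vector(R):
    """Inverse Cayley transform, valid for rotation angles strictly below pi."""
    A = (R - R.T) / (1.0 + np.trace(R))
    return np.array([A[2, 1], A[0, 2], A[1, 0]])

def principal_sqrt(R):
    """Rotation about the same axis as R through half the angle."""
    u = cayley_vector(R)
    t = np.linalg.norm(u)                 # t = tan(theta / 2)
    if t == 0.0:
        return np.eye(3)
    half = np.tan(0.5 * np.arctan(t))     # tan(theta / 4)
    return cay(half * u / t)

def symmetric_coordinates(R_minus, r_minus, R_plus, r_plus):
    """Cayley vector u of the relative rotation and translation v in the mid frame."""
    L = R_minus.T @ R_plus                # relative rotation
    R_mid = R_minus @ principal_sqrt(L)   # mid frame orientation
    u = cayley_vector(L)
    v = R_mid.T @ (r_plus - r_minus)
    return u, v

# Hypothetical pair of rigid bodies.
R_minus, r_minus = cay([0.10, -0.20, 0.05]), np.array([0.0, 0.0, 0.0])
R_plus, r_plus = cay([0.30, 0.10, -0.20]), np.array([0.4, -0.3, 3.3])

u, v = symmetric_coordinates(R_minus, r_minus, R_plus, r_plus)
u_swapped, v_swapped = symmetric_coordinates(R_plus, r_plus, R_minus, r_minus)
```

With these definitions `u_swapped` equals `-u` and `v_swapped` equals `-v`, because the mid frame is the same whichever body is labelled +. The mid frame origin (the Euclidean average of the two origins) is not needed to compute v and is therefore omitted.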
Week 5 (18.10)
Generalisation of the symmetrised relative SE(3) coordinates between a pair of rigid bodies to the cgDNA+ model internal coordinates and the associated tree structure for a quadruple chain of rigid bodies, with intra, inter and 5'-phosphate coordinates. The Watson or reading strand, and the re-embedding of frames on the Crick strand to avoid rotations through angles close to \pi. Definition of basepair and junction frames as midframes. cgDNA+ model intra and inter configuration coordinates: translations expressed in midframes (basepair frame between two base frames for intras, junction frames between two basepair frames for inters) and Cayley vectors of relative rotations for both intra and inter relative rotations (with matrix multiplication on the right). 5'-phosphate Cayley vectors and translations in the associated base frame (before re-embedding for the Crick bases). Here are the lecture notes and video and audio of this week.
Week 6 (25.10)
Further (and final) details on cgDNA+ coordinate vectors w. Non-dimensionalisation and scaling of cgDNA+ coordinates. Transformation under Crick-Watson change of reading strand for cgDNA+ coordinates (more detailed treatment in the exercise session). Short introduction to using the cgDNAweb server. Here are the lecture notes and video and audio of this week. Here are the lecture notes and video and audio of the short introduction to cgDNAweb+.
Week 7 (01.11)
Definitions and assumptions underlying the cgDNA and cgDNA+ rigid base and double nucleotide chain coarse-grain model free energies and their associated Gaussian pdfs: a) (five) nearest-neighbour base/nucleotide interactions, plus b) dimer sequence-dependence of parameter set blocks. This leads to a Gaussian model where the stiffness matrix has a banded structure with overlapping blocks. Description of the Crick-Watson symmetry properties of predicted sequence-dependent ground states and stiffness matrices, and the (sufficient) Crick-Watson symmetry properties of the elements of the parameter set. The latter properties reduce the total number of independent blocks in the parameter set to 16 independent 5'-XY cases and 10 independent XY dimer steps: 6 (of 12) non-palindromic steps and 4 palindromic steps that have to satisfy self-symmetry conditions. Here are the lecture notes (of two lectures) of this week. Here are the video and audio of Lecture 1. Here are the video and audio of Lecture 2.
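The banded, overlapping-block structure can be illustrated with a short assembly routine. This is a sketch only: it uses the cgDNA block sizes quoted in the Week 10 entry (18x18 dimer blocks with 6x6 overlaps), fills the parameter set with random positive definite placeholder blocks that do not satisfy the Crick-Watson symmetries, and ignores any end blocks. The names `assemble_stiffness` and `random_spd_block` are hypothetical.

```python
import numpy as np

BLOCK = 18             # size of one dimer block (cgDNA value quoted in the Week 10 entry)
OVERLAP = 6            # size of the overlap between consecutive blocks
STEP = BLOCK - OVERLAP

def random_spd_block(rng, size=BLOCK):
    """Placeholder symmetric positive definite block (not a real parameter set)."""
    A = rng.standard_normal((size, size))
    return A @ A.T + size * np.eye(size)

def assemble_stiffness(sequence, blocks):
    """Sum dimer blocks along the sequence, each shifted by STEP along the diagonal."""
    n_steps = len(sequence) - 1
    dim = STEP * n_steps + OVERLAP
    K = np.zeros((dim, dim))
    for i in range(n_steps):
        dimer = sequence[i:i + 2]
        start = STEP * i
        K[start:start + BLOCK, start:start + BLOCK] += blocks[dimer]
    return K

rng = np.random.default_rng(0)
blocks = {x + y: random_spd_block(rng) for x in "ACGT" for y in "ACGT"}
K = assemble_stiffness("ACGTTGCA", blocks)   # 8 basepairs, 7 dimer steps
```

For this 8-basepair example the assembled matrix is 90x90, all entries further than one block width from the diagonal are zero, and the sum of overlapping positive definite blocks is again positive definite.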
Week 8 (08.11)
Start of Chapter 2: What can be done with the cgDNA model? Brief discussion of i) probabilities and looping experiments, and longer discussion of ii) expectations, specifically correlations along a polymer chain. Numerical approximations of both from an ensemble of configurations generated by an appropriate Monte Carlo code, e.g. cgDNApmc. For our multivariate Gaussian pdf this is direct sampling, as opposed to the Metropolis or Markov chain sampling that is necessary for more complicated pdfs. Counting hits and misses for i), and averaging over an ensemble as a simple quadrature rule for ii). First mention of expectations leading to persistence lengths. Importance for the efficiency of MC of the bandedness of the stiffness matrix and use of the associated banded Cholesky factorisation to diagonalise. Here are the lecture notes and video and audio of this week.
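Direct sampling of a Gaussian pdf from a Cholesky factor of its stiffness matrix can be sketched as follows. This is a generic illustration and not the course's Monte Carlo code: the small tridiagonal matrix K and the vector mu are made-up test data, and a dense factorisation is used for brevity where an efficient implementation would exploit the band structure (for example with SciPy's `cholesky_banded`).

```python
import numpy as np

def sample_gaussian(mu, K, n_samples, rng):
    """Draw samples with mean mu and covariance K^{-1} from a stiffness matrix K."""
    L = np.linalg.cholesky(K)                     # K = L @ L.T, L lower triangular
    z = rng.standard_normal((mu.size, n_samples)) # independent standard normals
    x = np.linalg.solve(L.T, z)                   # x = L^{-T} z has covariance K^{-1}
    return mu + x.T                               # one configuration per row

rng = np.random.default_rng(1)
n = 6
K = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # banded test stiffness
mu = np.linspace(-1.0, 1.0, n)                           # test ground state

samples = sample_gaussian(mu, K, 200_000, rng)
sample_mean = samples.mean(axis=0)
sample_cov = np.cov(samples, rowvar=False)

# An expectation approximated by an ensemble average: E|w - mu|^2 = trace(K^{-1}).
mean_sq_dist = np.mean(np.sum((samples - mu) ** 2, axis=1))
```

Each sample costs one triangular solve and the draws are independent, which is why no Markov chain is needed in the Gaussian case.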
Week 9 (15.11)
End of Chapter 2: More detailed discussion on expectations leading to tangent-tangent and Flory persistence lengths. Correlations of relative frame rotations and translations along a chain using homogeneous coordinates in SE(3) and the associated matrix multiplication. Simplifications when junction statistics are independent (the I.D. case), and when the chain is uniform (the I.I.D. case). Exponential decay of frame rotation correlations as the index difference grows, and convergence of the translation block to the Flory persistence vector. In the exercise session, comparison is made with cgDNAmc data for poly(XY) sequences, i.e. sequences with close to intrinsically straight ground states, and with cgDNAmc data for lambda-phage sequences, some with significantly bent ground states, for which shape factorised semi-log tan-tan plots remain close to linear. Here are the lecture notes and video and audio of this week.
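In the I.I.D. case the expected relative rotation across n junctions is the n-th power of the expected single-junction rotation, which gives the exponential decay. The toy check below illustrates this under loud assumptions: junction rotations are drawn from an isotropic Gaussian on Cayley vectors with a made-up spread, the tangent is taken to be the third frame vector, and translations are ignored. It is not cgDNAmc and does not use a cgDNA parameter set.

```python
import numpy as np

def cay_batch(u):
    """Cayley transform of each row of u, shape (m, 3), giving rotations (m, 3, 3)."""
    m = u.shape[0]
    U = np.zeros((m, 3, 3))
    U[:, 0, 1], U[:, 0, 2] = -u[:, 2], u[:, 1]
    U[:, 1, 0], U[:, 1, 2] = u[:, 2], -u[:, 0]
    U[:, 2, 0], U[:, 2, 1] = -u[:, 1], u[:, 0]
    I = np.eye(3)
    return np.linalg.solve(I - U, I + U)   # (I - U)^{-1} (I + U) for every row

rng = np.random.default_rng(0)
SPREAD = 0.1        # hypothetical standard deviation of junction Cayley vectors
N_STEPS = 50
N_CHAINS = 40_000

# Monte Carlo estimate of the expected junction rotation E[Q].
M = cay_batch(SPREAD * rng.standard_normal((200_000, 3))).mean(axis=0)

# Accumulate R_0^T R_n = Q_1 ... Q_n along each chain, averaging the tangent-tangent entry.
R = np.tile(np.eye(3), (N_CHAINS, 1, 1))
tan_tan_mc = np.empty(N_STEPS)
for n in range(N_STEPS):
    R = R @ cay_batch(SPREAD * rng.standard_normal((N_CHAINS, 3)))
    tan_tan_mc[n] = R[:, 2, 2].mean()

# Prediction from independence: E[Q_1 ... Q_n] = (E[Q])^n.
tan_tan_power = np.array(
    [np.linalg.matrix_power(M, n + 1)[2, 2] for n in range(N_STEPS)]
)

# Slope of the semi-log plot gives a persistence length in units of junctions.
slope = np.polyfit(np.arange(1, N_STEPS + 1), np.log(tan_tan_power), 1)[0]
persistence_in_junctions = -1.0 / slope
```

The Monte Carlo averages `tan_tan_mc` follow the matrix-power prediction `tan_tan_power` to within sampling error, and the semi-log plot of either against n is close to a straight line.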
Week 10 (22.11)
Start of Chapter 3: Parameter Estimation in the cgDNA/+ model. Start of the estimation of oligomer-based mean and centred covariance from MD time series data for the cgDNA coarse-grain variables. This week describes the Maximum Likelihood approach to obtain estimates of oligomer-based Gaussian pdfs from an ensemble of configuration snapshots. Next week we consider an alternative Maximum Entropy approach. Cases both with and without an imposed banded sparsity pattern in the stiffness matrix. Note: this week's video lecture and notes are precisely the same as in last year's course, in which the cgDNA model was discussed but the cgDNA+ model was not mentioned. Reusing the old video works because the material of this week and next is quite generally applicable to estimating parameters in any Gaussian pdf with an overlapping block structure in the stiffness matrix. The only difference between cgDNA and cgDNA+ for the purpose of this week's lecture is that cgDNA has 18x18 overlapping blocks with 6x6 overlaps, while cgDNA+ has 42x42 overlapping blocks with 18x18 overlaps (and two end blocks of different size), and these differences in block size make no difference to the general mathematical approach that is discussed. The 2021 exercises are adapted from the 2020 ones to make explicit mention of both cgDNA+ and cgDNA cases. Here are the lecture notes and video and audio of this week.
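For the case without an imposed sparsity pattern, the Maximum Likelihood estimates for a Gaussian are the sample mean and the centred covariance normalised by the number of snapshots. The sketch below assumes this standard result and uses synthetic snapshots in place of an MD time series; the function name and test data are hypothetical, and the banded case is not covered.

```python
import numpy as np

def max_likelihood_gaussian(snapshots):
    """ML estimates of mean, covariance and stiffness from rows of snapshots."""
    n_snapshots = snapshots.shape[0]
    mu = snapshots.mean(axis=0)
    centred = snapshots - mu
    C = centred.T @ centred / n_snapshots   # ML uses 1/N, not 1/(N - 1)
    K = np.linalg.inv(C)                    # stiffness is the inverse covariance
    return mu, C, K

# Synthetic "time series": 5000 snapshots of a 4-dimensional coordinate vector.
rng = np.random.default_rng(2)
mixing = rng.standard_normal((4, 4))
snapshots = rng.standard_normal((5000, 4)) @ mixing.T + np.array([1.0, -2.0, 0.5, 3.0])

mu_hat, C_hat, K_hat = max_likelihood_gaussian(snapshots)
```

The estimated stiffness `K_hat` is dense in general, which is one motivation for the separate treatment of the case with an imposed banded sparsity pattern.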
Week 11 (29.11)
Introduction of entropy for a continuous pdf with respect to an associated measure, and relative entropy (or Kullback-Leibler divergence) between two continuous pdfs with associated measure. Contrast between discrete probability distributions and continuous probability density functions or pdfs, and differences between the cases of bounded and unbounded domains. Jensen inequality to prove that the entropy minimising (or maximising, depending on sign convention) pdf on a bounded domain is uniform with respect to the measure. Jensen inequality to prove that relative entropy is always non-negative. Jaynes max entropy principle to characterise Boltzmann pdfs as pdfs maximising (or minimising in our sign convention) entropy subject to constrained values of some moments, and sufficiency of the associated first-order necessary conditions. In the case of constrained first and second moments on R^n, characterisation of Gaussian pdfs as maximum entropy distributions. Here are the lecture notes and video and audio of this week.
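As a sketch of the Jensen argument for non-negativity of relative entropy, assume two pdfs \rho_1 and \rho_2 with respect to a common measure \nu (a symbol introduced here for illustration), with \rho_2 > 0 wherever \rho_1 > 0, and write entropy with the sign convention in which it is minimised. The notation in the lecture notes may differ.

```latex
H(\rho) = \int \rho \ln \rho \, d\nu ,
\qquad
D(\rho_1 \,\|\, \rho_2) = \int \rho_1 \ln \frac{\rho_1}{\rho_2} \, d\nu ,
\\[4pt]
-D(\rho_1 \,\|\, \rho_2)
  = \int \rho_1 \ln \frac{\rho_2}{\rho_1} \, d\nu
  \;\le\; \ln \int_{\{\rho_1 > 0\}} \rho_1 \, \frac{\rho_2}{\rho_1} \, d\nu
  \;\le\; \ln 1 = 0 .
```

The first inequality is Jensen's inequality for the concave function \ln, applied to the expectation with respect to \rho_1 \, d\nu, and the second holds because \rho_2 integrates to at most 1 over the set where \rho_1 > 0.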
Week 12 (06.12)
The topic of the single (short) lecture of this week is further general properties of the Kullback-Leibler divergence. Specifically a) the KL divergence between two factorisable pdfs is the sum of the KL divergences between each of the factors, which motivates the scaling of KL per degree of freedom, b) the closed-form expression for the KL divergence between two Gaussians (which makes the use of KL in high-dimensional model parameter estimation feasible), and c) the two ways KL can be used in parameter estimation. Here are the lecture notes and video and audio for the lecture of this week.
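Properties a) and b) can be checked together with the standard closed-form expression for the KL divergence between two Gaussians, written here in terms of ground states and stiffness matrices. The sketch is a generic illustration with made-up test matrices; the function name `kl_gaussian` is hypothetical, and the argument order (first pdf relative to second) is a convention chosen here.

```python
import numpy as np

def kl_gaussian(mu1, K1, mu2, K2):
    """KL divergence D(rho1 || rho2) for Gaussians with stiffness matrices K1, K2."""
    n = mu1.size
    d = mu2 - mu1
    trace_term = np.trace(np.linalg.solve(K1, K2))   # tr(K1^{-1} K2)
    _, logdet1 = np.linalg.slogdet(K1)
    _, logdet2 = np.linalg.slogdet(K2)
    return 0.5 * (trace_term - n + d @ K2 @ d + logdet1 - logdet2)

# Two pairs of small test Gaussians.
mu_a, K_a = np.array([0.0, 1.0]), np.array([[2.0, 0.5], [0.5, 1.0]])
mu_b, K_b = np.array([0.3, 0.8]), np.array([[1.5, -0.2], [-0.2, 2.0]])
mu_c, K_c = np.array([1.0]), np.array([[3.0]])
mu_d, K_d = np.array([0.5]), np.array([[1.0]])

kl_ab = kl_gaussian(mu_a, K_a, mu_b, K_b)
kl_cd = kl_gaussian(mu_c, K_c, mu_d, K_d)

# Factorisable pdfs: block-diagonal stiffness and stacked ground states.
Z = np.zeros((2, 1))
kl_joint = kl_gaussian(
    np.concatenate([mu_a, mu_c]), np.block([[K_a, Z], [Z.T, K_c]]),
    np.concatenate([mu_b, mu_d]), np.block([[K_b, Z], [Z.T, K_d]]),
)
```

Here `kl_joint` equals `kl_ab + kl_cd`, which is the additivity in a), and only traces, determinants and one linear solve are needed, which is what keeps the evaluation feasible in high dimension.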
Week 13 (13.12)
The final two lectures in the course (both long, partially compensated by the single short lecture of last week). Lecture 1: designing a training library for the cgDNA+ model, some aspects of Molecular Dynamics simulations, and using palindromy to assess convergence of time series of snapshots. Lecture 2: assessing the validity of cgDNA+ model assumptions on palindromic training set data. Using Fisher information (or second variation of KL divergence) in the numerics of cgDNA+ parameter set estimation. Constructing positive definite parameter set blocks using the non-uniqueness of optimal parameter sets. Here are the lecture notes and video and audio for Lecture 1 of this week. Here are the lecture notes and video and audio for Lecture 2 of this week.
Week 14 (20.12)
Spare/review session
Summary and description of the exercises
This document contains an overview and a description of all the exercises given so far.
Exercises
Exercise sets  Solutions

Bibliography
The following references for the cgDNA model are available on the cgDNA web page or on the LCVMM publications page.
 [1] A DNA Coarse-Grain Rigid Base Model and Parameter Estimation from Molecular Dynamics Simulations, D. Petkevičiūtė, Thesis #5520, EPFL (2012).
 [2] cgDNA: a software package for the prediction of sequence-dependent coarse-grain free energies of B-form DNA, D. Petkevičiūtė, M. Pasi, O. Gonzalez and J. H. Maddocks, Nucleic Acids Research 42, no. 20 (2014), p. e153.
 [3] A sequence-dependent rigid-base model of DNA, O. Gonzalez, D. Petkevičiūtė and J. H. Maddocks, Journal of Chemical Physics 138, no. 5 (2013), p. 055122 128.
 [4] Sequence-dependent persistence lengths of DNA, J. S. Mitchell, J. Glowacki, A. E. Grandchamp, R. S. Manning and J. H. Maddocks, Journal of Chemical Theory and Computation, no. 13 (2017), p. 1539-1555.
 [5] Absolute versus relative entropy parameter estimation in a coarse-grain model of DNA, O. Gonzalez, M. Pasi, D. Petkevičiūtė, J. Glowacki and J. H. Maddocks, Multiscale Modeling and Simulation 15, no. 3 (2017), p. 1073-1107.
 [6] A sequence-dependent coarse-grain model of B-DNA with explicit description of bases and phosphate groups parametrised from large scale Molecular Dynamics simulations, A. Patelli, Thesis #9552, EPFL (2019).
 [7] On the stochastic modeling of rigid body systems with application to polymer dynamics, J. Walter, O. Gonzalez and J. H. Maddocks, SIAM Multiscale Modeling and Simulation 8, no. 3 (2010), p. 1018-1053.
References for general books on DNA.

[8] Understanding DNA, The molecule & how it works, C. R. Calladine, H. R. Drew, B. F. Luisi, A. A. Travers, Third Edition, 2004, Academic Press, ISBN 9780121550893.
Summary: Understanding DNA explains, step by step, why DNA forms specific structures, the form of these structures and how they fundamentally affect the biological processes of transcription and replication.
[9] Unraveling DNA: The Most Important Molecule of Life, M. D. Frank-Kamenetskii, Revised and Updated Edition, 1997, Perseus Publishing, ISBN 9780201155846.
Summary: A curious blend of history and biographical details, covering the development of molecular biology from the influence of physicists earlier in the century, through the central dogma of molecular biology, to a discussion of social issues raised by genetic engineering.
[10] DNA Topology, A. D. Bates & A. Maxwell, 2005, Oxford University Press, ISBN 9780198506553.
Summary: A clear, concise explanation of the relevance of supercoiling and catenation in the context of biological activity of the DNA molecule.
[11] DNA Structure and Function, R. R. Sinden, 1994, Academic Press, ISBN 9780126457506.
Summary: A timely and comprehensive resource that provides a simple yet comprehensive introduction to nearly all aspects of DNA structure. It also explains current ideas on the biological significance of classic and alternative DNA conformations.