Mathematical modelling of DNA
General information
Lecturer:
Schedule:
Lectures: Fridays from 15h15 to 17h00, room MAB111
Exercises: Mondays from 15h15 to 17h00, room MAA331
Assistant:
Course
Requirements
1st and 2nd year courses in math or physics (or the teacher's permission).
Helpful although not required
Differential Geometry of Framed Curves (MATH-423).
Contents
This course is designed to be an introduction, within the particular context of DNA, to the interplay between analysis, computation and experiment that makes up the process called mathematical modelling. In addition to students whose primary interest is in DNA, the syllabus is intended for students wishing an introduction to the modelling process in general, and the course will describe a number of widely encountered mathematical and computational techniques.
The course will be a detailed introduction to the cgDNA sequence-dependent coarse grain model of DNA, including both how to use it to predict various biologically pertinent sequence-dependent expectations with an associated Monte Carlo code, and all the extensive underlying applied mathematics necessary to estimate cgDNA parameter sets from a library of Molecular Dynamics simulations. The cgDNA model is a research tool that has its own web page. The course will work through the details of publications described on that page, specifically [1], [2], and [3] below.
The course has five chapters.
0) Introduction to DNA and a brief overview of its coarse grain models.
1) The sequence-dependent, rigid-base cgDNA model.
2) Monte Carlo methods for sampling cgDNA model equilibrium distributions and application to DNA persistence lengths.
3) Parameter estimation for the cgDNA model from Molecular Dynamics time series.
4) Equality constrained nonlinear optimisation with application to computing cgDNA equilibria.
Information for the exam.
The exam is oral and will take place on the 24th of January in MA A3 31. We will communicate each student's starting time by email. The exam will last 20-25 minutes per student, and no preparation time is provided. The questions cover all the exercises (theoretical and computational) plus the surrounding theory from the course. The exam can be taken in French or English.
REVISION SESSION: a revision session is organised on Monday the 22nd of January from 15h to 17h in MA A1 10 (updated on 17.01.2018).
Week-by-week correspondence
Week 1 (22.9) 
Description of the basic structure of DNA, and multiscaling (or coarse graining) approaches. The need for a tertiary structure model of DNA, i.e. a sequence-dependent coarse grain model. Overview of the cgDNA coarse grain model to predict a Gaussian PDF for the configuration distribution of a DNA fragment of given sequence. (Three periods of lecture, one period of exercises.) Supplementary material for this first lecture is available here.
Week 2 (29.9)  Coarse graining groups of atoms (in our case the atoms forming a base) to a rigid body or frame (R,r), with the data structure R ∈ SO(3), r ∈ R^3. Start of the description of the group SO(3) of proper rotation matrices, i.e. 3x3 matrices R such that R^{-1} = R^T and det R = +1. Interpretation of elements of SO(3) as direction cosine matrices.
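The defining conditions of SO(3) can be checked numerically. The following is a minimal sketch in Python with NumPy; the function names and the tolerance are illustrative choices, not part of any course code.

```python
import numpy as np

def is_rotation(R, tol=1e-10):
    """Return True if R is a proper rotation: R^T R = I and det R = +1."""
    R = np.asarray(R, dtype=float)
    if R.shape != (3, 3):
        return False
    orthogonal = np.allclose(R.T @ R, np.eye(3), atol=tol)
    proper = abs(np.linalg.det(R) - 1.0) < tol
    return bool(orthogonal and proper)

def rotation_about_z(angle):
    """Rotation through `angle` (radians) about the third coordinate axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])

R = rotation_about_z(0.3)
reflection = np.diag([1.0, 1.0, -1.0])  # orthogonal, but det = -1

print(is_rotation(R))           # a proper rotation
print(is_rotation(reflection))  # orthogonal but improper
```

The reflection satisfies R^{-1} = R^T but fails the determinant condition, which is why both tests are needed.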
Week 3 (6.10) 
Definition of Watson (or reading) and Crick strands. The Lie group SE(3) of rigid body displacements and its 4x4 matrix representation, both algebraic definition and geometrical interpretation.
Relative coordinates of a double chain of rigid bodies.
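The 4x4 matrix representation of SE(3) mentioned above can be sketched as follows. This is a minimal illustration assuming the common convention that a frame (R,r) is stored as the homogeneous matrix with R in the upper left 3x3 block and r in the last column; the helper names are hypothetical.

```python
import numpy as np

def se3(R, r):
    """Assemble the 4x4 homogeneous matrix of the rigid body frame (R, r)."""
    g = np.eye(4)
    g[:3, :3] = R
    g[:3, 3] = r
    return g

def se3_inverse(g):
    """Inverse of a homogeneous matrix: (R, r)^{-1} = (R^T, -R^T r)."""
    R, r = g[:3, :3], g[:3, 3]
    return se3(R.T, -R.T @ r)

def relative_displacement(g1, g2):
    """Displacement taking frame 1 to frame 2, expressed in frame 1."""
    return se3_inverse(g1) @ g2

def rotation_about_z(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

g1 = se3(rotation_about_z(0.2), np.array([0.0, 0.0, 1.0]))
g2 = se3(rotation_about_z(0.8), np.array([0.1, 0.0, 4.4]))
a = relative_displacement(g1, g2)  # g2 = g1 @ a
```

With this convention, composition of rigid body displacements is plain matrix multiplication, and the relative displacement multiplies on the right.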
Given certain events of this week in Stockholm, we are adding links to some older LCVMM group publications concerning coarse grain modelling of cryo-EM imaging data of DNA. Publications 110, 121 and 48 have a certain, eminent co-author, and articles 124 and 142 use coarse grain DNA modelling to further analyse experimental data taken in that lab. The notion of persistence length as described in 142 will be further discussed later in the course.
Week 4 (13.10)  Complements on the Cayley transform and its inverse, connections to the matrix exponential and logarithm via Taylor series, and the reason for the factor 1/2 in the relation between Cayley vector norm and rotation angle. First mention of the additional scaling factor 1/5 in cgDNA coordinates. Symmetric coordinates for relative SE(3) displacements between a pair of rigid bodies, and the importance of introducing a midpoint frame. cgDNA internal coordinates and the associated tree structure for a double chain of rigid bodies. Watson or reading strand, and the re-embedding of frames on the Crick strand to avoid rotations through angles close to \pi. Definition of basepair and junction frames as midframes. cgDNA model configuration coordinates: translations expressed in midframes (basepair frame between two base frames for intras, junction frames between two basepair frames for inters) and Cayley vectors of relative rotations for both intra and inter relative rotations (with matrix multiplication on the right). Appropriate figures can be found here, in the supplementary material for article [2] in the Bibliography at the bottom of the page. This week we covered the material up to Figures S3 and S4. Next week's exercises give further examples.
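As an illustration of the Cayley transform and its inverse, here is a minimal sketch assuming the convention R = (I + [η×]/2)(I - [η×]/2)^{-1}, for which the Cayley vector η of a rotation through angle θ has norm 2 tan(θ/2). The additional cgDNA scaling factor is deliberately left out, and the function names are illustrative.

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix [v x] such that skew(v) @ w = cross(v, w)."""
    x, y, z = v
    return np.array([[0.0, -z, y],
                     [z, 0.0, -x],
                     [-y, x, 0.0]])

def cayley_to_rotation(eta):
    """Cayley transform: R = (I + A)(I - A)^{-1} with A = skew(eta) / 2."""
    A = 0.5 * skew(eta)
    I = np.eye(3)
    # (I - A)^{-1} and (I + A) commute, so a left solve gives the same product.
    return np.linalg.solve(I - A, I + A)

def rotation_to_cayley(R):
    """Inverse transform: A = (R + I)^{-1}(R - I), defined for angles below pi."""
    I = np.eye(3)
    A = np.linalg.solve(R + I, R - I)
    return 2.0 * np.array([A[2, 1], A[0, 2], A[1, 0]])

theta = 0.7
axis = np.array([1.0, 2.0, 2.0]) / 3.0        # unit rotation axis
eta = 2.0 * np.tan(theta / 2.0) * axis        # Cayley vector
R = cayley_to_rotation(eta)
```

The inverse transform breaks down as θ approaches \pi, where R + I becomes singular and the Cayley vector norm diverges, which is the numerical reason for avoiding relative rotations through angles close to \pi.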
Week 5 (20.10)  Finish of cgDNA internal coordinates. Transformation of frames under Crick-Watson change of reading strand and associated transformation rules for cgDNA coordinates (detailed treatment in exercise session on Monday). Indications of transformation of PDFs for a sequence S and the complementary sequence \bar S, and the importance of palindromes. Odd and even coordinates for palindromic sequences.
Week 6 (27.10)  Description of classic rigid base pair coarse grain models, where the assumptions of a Gaussian model with a) (two) nearest-neighbour rigid base pair interactions, plus b) dimer sequence-dependence of parameter set blocks, imply a 6x6 block diagonal stiffness matrix for the inter variables. Described the count of ten independent dimer-step parameter set blocks that respect the Crick-Watson reading strand transformation. For such rigid base pair models the ground, or expected, shape must have local sequence dependence. This sequence locality is not a good fit with Molecular Dynamics simulation data. Definitions and assumptions underlying the cgDNA rigid base coarse grain model free energy and its associated Gaussian PDF: a) (five) nearest-neighbour base interactions, plus b) dimer sequence-dependence of parameter set blocks. This leads to a Gaussian model where the stiffness matrix has a banded structure with overlapping 18x18 blocks. The assumption of localised sequence-dependence of stiffness matrices and sigma vectors in the cgDNA model does not imply local sequence dependence of the ground state, because the inverse of a banded matrix is dense. End of Chapter 1. Much of the material of these lectures is covered in pages 2-5 of the PDF linked to under the Week 4 summary. We will also return to assess the accuracy of each assumption in the cgDNA model as part of Chapter 3 concerning parameter set estimation.
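The remark that the inverse of a banded matrix is dense can be seen on a toy example. The sketch below uses a small tridiagonal positive definite matrix as a stand-in for a banded stiffness matrix (not the overlapping 18x18 block structure of cgDNA), and assumes the ground state is obtained by solving K w = sigma; all numerical values are made up for illustration.

```python
import numpy as np

n = 12
# Toy banded (tridiagonal) symmetric positive definite "stiffness" matrix.
K = 2.0 * np.eye(n) - 0.9 * (np.eye(n, k=1) + np.eye(n, k=-1))

covariance = np.linalg.inv(K)  # dense, although K is banded

# A purely local change in sigma (one entry) ...
sigma = np.ones(n)
sigma_perturbed = sigma.copy()
sigma_perturbed[0] += 1.0

# ... changes every entry of the ground state w = K^{-1} sigma.
w = np.linalg.solve(K, sigma)
w_perturbed = np.linalg.solve(K, sigma_perturbed)
change = w_perturbed - w

print(np.count_nonzero(K), "nonzero entries in K out of", n * n)
print(np.count_nonzero(np.abs(covariance) > 1e-12), "nonzero entries in its inverse")
```

The entries of the inverse decay away from the diagonal but do not vanish, so the ground state at one position depends, with decreasing weight, on the parameters along the whole chain.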
Week 7 (3.11)  Start Chapter 2: What can be done with the cgDNA model? Brief discussion of i) probabilities and looping experiments, and longer discussion of ii) expectations, specifically correlations along a polymer chain. Numerical approximations of both from an ensemble of configurations generated by an appropriate Monte Carlo code e.g. cgDNAmc, counting hits and misses for i), and averaging over an ensemble as a simple quadrature rule for ii). Correlations of relative frame rotations and translations along a chain using homogeneous coordinates in SE(3) and the associated matrix multiplication. Simplifications when junction statistics are independent (the I.D. case), and when the chain is uniform (the I.I.D. case). Exponential decay of frame rotation correlations as the index difference grows, and convergence of the translation block to the Flory persistence vector. 
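The two numerical approximations described above (counting hits and misses for a probability, and averaging over an ensemble for an expectation) can be sketched on a small Gaussian with a prescribed stiffness matrix. This is a generic illustration, not the cgDNAmc code; the mean, the stiffness matrix and the event are made-up values.

```python
import numpy as np

def sample_gaussian(mu, K, n_samples, rng):
    """Draw samples from the Gaussian with mean mu and stiffness (inverse covariance) K."""
    L = np.linalg.cholesky(K)                      # K = L L^T, L lower triangular
    z = rng.standard_normal((mu.size, n_samples))  # independent N(0, 1) entries
    x = np.linalg.solve(L.T, z)                    # L^{-T} z has covariance K^{-1}
    return (mu[:, None] + x).T                     # shape (n_samples, n)

rng = np.random.default_rng(0)
mu = np.array([1.0, -0.5, 0.2])
K = np.array([[2.0, -0.5, 0.0],
              [-0.5, 2.0, -0.5],
              [0.0, -0.5, 2.0]])
samples = sample_gaussian(mu, K, 100_000, rng)

# ii) Expectation: the ensemble average acts as a simple quadrature rule.
mean_estimate = samples.mean(axis=0)

# i) Probability: fraction of configurations that "hit" a chosen event.
hit_fraction = np.mean(samples[:, 0] > mu[0])
```

Sampling through a Cholesky factor of the stiffness matrix avoids forming the dense covariance matrix explicitly, which matters when K is banded and large.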
Week 8 (10.11)  Definitions of persistence lengths, and their analytical computation in a simplified model (a version of the Helical Worm Like Chain or HWLC model). Relation to numerics for the cgDNA model, and the need for shape factorisation. Lecture notes (polycopiés) are available for the material in weeks 7 and 8 and for the Monte Carlo method. The shape factorised persistence length was introduced and is treated in Exercise Session 7.
Week 9 (17.11)  First hour: finish of discussion of tangent-tangent and Flory persistence lengths for a) a simplified I.I.D. rigid base pair uniform model with fixed translations, and comparison with b) cgDNAmc data for poly(XY), i.e. sequences with close to intrinsically straight ground states, and c) cgDNAmc data for lambda-phage sequences, some with significantly bent ground states, where shape factorised semi-log tan-tan plots remain close to linear. End of Chapter 2. Second hour: start of Chapter 3, Parameter Estimation for the cgDNA model. Discussion of Molecular Dynamics (MD) simulations of a training set of oligomers, and orders of magnitude of the MD simulations for duration and size of data set files. Start of estimation of mean and centred covariance from MD time series data for the cgDNA coarse grain variables.
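Estimating a mean and centred covariance from a series of snapshots can be sketched as below. The synthetic, independent data stand in for an MD time series (real MD snapshots are correlated in time), and the array layout of one snapshot per row is an assumption of this example. Dividing by the number of snapshots N, rather than N - 1, gives the maximum likelihood estimate for a Gaussian.

```python
import numpy as np

def estimate_mean_and_covariance(snapshots):
    """Mean and centred covariance of an array of shape (n_snapshots, n_coords)."""
    snapshots = np.asarray(snapshots, dtype=float)
    mean = snapshots.mean(axis=0)
    centred = snapshots - mean
    covariance = centred.T @ centred / snapshots.shape[0]  # normalised by N
    return mean, covariance

# Synthetic stand-in for a time series of coarse grain coordinates.
rng = np.random.default_rng(1)
scales = np.diag([1.0, 2.0, 0.5, 1.5])
offset = np.array([0.0, 1.0, -1.0, 3.0])
snapshots = rng.standard_normal((5000, 4)) @ scales + offset

mean, covariance = estimate_mean_and_covariance(snapshots)
stiffness = np.linalg.inv(covariance)  # stiffness of the fitted Gaussian
```

Inverting the estimated covariance yields a dense stiffness matrix in general; imposing a banded sparsity pattern requires a separate truncation step.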
Week 10 (24.11)  Class replaced by exercise session due to illness. 
Week 11 (1.12 and 4.12)  Maximum likelihood and maximum entropy approaches to obtaining estimates of oligomer-based Gaussian PDFs from an ensemble of configuration snapshots. Cases both with and without an imposed banded sparsity pattern in the stiffness matrix. (Extra lectures on the 4th of December to replace the missed lectures in Week 10.)
Week 12 (8.12)  Final remarks about estimating oligomer-based, banded Gaussian models from MD simulation data. Consequences of fraying, large rotations and large Cayley vectors, sensitivity of covariance estimates to outliers in the ensemble, desirability of sequences with stable ends, and filtering of outlying snapshots with broken hydrogen bonds. Benefits of estimating means and covariances from palindromic sequence oligomers. Designing training set sequence libraries: with current computational resources, libraries with multiple instances of all 136 tetranucleotides are possible, but pentanucleotide and hexanucleotide sequence libraries are too large to obtain microsecond duration simulations. Reminder of the parameter dependence of the cgDNA (predictive) model. Start of parameter set estimation from Kullback-Leibler divergence (or relative entropy).
Week 13 (15.12)  Parameter estimation using (a sum of) Kullback-Leibler divergences (or relative entropies) as objective fitting functional for matching a model PDF to an observed PDF, with the two possible orderings of the arguments in the KL divergence. When used for truncating an oligomer-based observed stiffness matrix to a prescribed sparsity pattern, one ordering provides a third argument that yields the same answer as the maximum likelihood and maximum entropy truncations. The other ordering gives a different truncation that implies a significantly different truncated stiffness matrix, as shown by spectra of both apparent and dynamic persistence lengths computed with the cgDNAmc code. Some remarks regarding the extraction of a cgDNA paramset from a library of truncated, training set oligomer-based Gaussians. Derivation of first order fitting conditions, and differentiation of log det and the inverse of a matrix with respect to the matrix entries. Differences between cgDNAweb paramsets, both for ground states and stiffnesses, as visualised in spectra of persistence lengths. Units of cgDNA internal coordinates and the internal rescaling chosen in the cgDNA model between rotation and translation coordinates; relative entropies are however invariant under different scalings (exercise 1, session 10).
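For two Gaussians the Kullback-Leibler divergence has a closed form, which makes both the asymmetry in its arguments and its invariance under rescaling of coordinates easy to check numerically. The sketch below uses the standard formula written in terms of covariances; the test matrices and the diagonal rescaling are made-up values.

```python
import numpy as np

def kl_gaussian(mu0, cov0, mu1, cov1):
    """KL divergence D(N(mu0, cov0) || N(mu1, cov1)) between two Gaussians."""
    n = mu0.size
    d = mu1 - mu0
    trace_term = np.trace(np.linalg.solve(cov1, cov0))  # tr(cov1^{-1} cov0)
    mean_term = d @ np.linalg.solve(cov1, d)            # d^T cov1^{-1} d
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (trace_term + mean_term - n + logdet1 - logdet0)

mu_a, cov_a = np.array([0.0, 1.0]), np.array([[2.0, 0.5], [0.5, 1.0]])
mu_b, cov_b = np.array([0.5, 0.0]), np.array([[1.0, -0.2], [-0.2, 3.0]])

forward = kl_gaussian(mu_a, cov_a, mu_b, cov_b)
backward = kl_gaussian(mu_b, cov_b, mu_a, cov_a)  # differs from forward

# Rescale the coordinates x -> D x (e.g. a change of units for one coordinate).
D = np.diag([5.0, 0.1])
rescaled = kl_gaussian(D @ mu_a, D @ cov_a @ D.T, D @ mu_b, D @ cov_b @ D.T)
```

The rescaled value agrees with the original one because the Jacobian factors of the change of variables cancel in the log ratio of the two densities.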
Summary and description of the exercises
This document contains an overview and a description of all the exercises given so far.
Exercises
Exercise sheets  Solutions

Bibliography
The following references for the cgDNA model are available on the cgDNA web page.
 [1] A DNA Coarse-Grain Rigid Base Model and Parameter Estimation from Molecular Dynamics Simulations, D. Petkevičiūtė, Thesis #5520, EPFL (2012).
 [2] cgDNA: a software package for the prediction of sequence-dependent coarse-grain free energies of B-form DNA, D. Petkevičiūtė, M. Pasi, O. Gonzalez and J. H. Maddocks, Nucleic Acids Research 42, no. 20 (2014), p. e153.
 [3] A sequence-dependent rigid-base model of DNA, O. Gonzalez, D. Petkevičiūtė and J. H. Maddocks, Journal of Chemical Physics 138, no. 5 (2013), p. 055122 1-28.
 [4] Sequence-dependent persistence lengths of DNA, J. S. Mitchell, J. Glowacki, A. E. Grandchamp, R. S. Manning and J. H. Maddocks, Journal of Chemical Theory and Computation, no. 13 (2017), p. 1539-1555.
 [5] Absolute versus relative entropy parameter estimation in a coarse-grain model of DNA, O. Gonzalez, M. Pasi, D. Petkevičiūtė, J. Glowacki and J. H. Maddocks, Multiscale Modeling and Simulation 15, no. 3 (2017), p. 1073-1107.
References for general books on DNA.

[6] Understanding DNA: The Molecule & How It Works, C. R. Calladine, H. R. Drew, B. F. Luisi, A. A. Travers, Third Edition, 2004, Academic Press, ISBN 9780121550893.
Summary: Understanding DNA explains, step by step, why DNA forms specific structures, the form of these structures and how they fundamentally affect the biological processes of transcription and replication.
[7] Unraveling DNA: The Most Important Molecule of Life, M. D. Frank-Kamenetskii, Revised and Updated Edition, 1997, Perseus Publishing, ISBN 9780201155846.
Summary: A curious blend of history and biographical detail, covering the development of molecular biology from the influence of physicists earlier in the century, through the central dogma of molecular biology, to a discussion of social issues raised by genetic engineering.
[8] DNA Topology, A. D. Bates & A. Maxwell, 2005, Oxford University Press, ISBN 9780198506553.
Summary: A clear, concise explanation of the relevance of supercoiling and catenation in the context of biological activity of the DNA molecule.
[9] DNA Structure and Function, R. R. Sinden, 1994, Academic Press, ISBN 9780126457506.
Summary: A timely resource that provides a simple yet comprehensive introduction to nearly all aspects of DNA structure. It also explains current ideas on the biological significance of classic and alternative DNA conformations.