The remarkable development of computing power and other technology now allows scientists and businesses to routinely collect datasets of immense size and complexity. Most classical statistical methods were designed for situations with many observations and a few, carefully chosen variables. However, we now often gather data where we have huge numbers of variables, in an attempt to capture as much information as we can about anything which might conceivably have an influence on the phenomenon of interest. This dramatic increase in the number of variables makes modern datasets strikingly different, as well-established traditional methods either perform very poorly or do not work at all.

Developing methods that can extract meaningful information from these large and challenging datasets has recently been an area of intense research in statistics, machine learning and computer science. In this course, we will study some of the methods that have been developed to analyse such datasets.

- Course notes which will be continually updated as the term progresses.

- Old course notes from an earlier version of the course (these will differ in places from what I intend to lecture this year).

- The Elements of Statistical Learning (T. Hastie, R. Tibshirani and J. Friedman) has excellent background material for large parts of this course, presented in a less mathematical style.

- Statistics for High-Dimensional Data (P. Bühlmann and S. van de Geer) covers much of our course and in many places goes into much greater depth than we do.

- High-Dimensional Statistics (M. J. Wainwright) covers most of our course in greater depth, and is a great reference if you are continuing studies in this area.

- Statistical Learning with Sparsity (T. Hastie, R. Tibshirani and M. Wainwright) is excellent for the part of the course on the Lasso and its generalisations.

- Notes on the theory of RKHS (D. Sejdinovic, A. Gretton) gives an excellent detailed treatment of the theory of reproducing kernel Hilbert spaces (RKHSs).

- Some preliminary material prepared for another course may be helpful as a source of basic background material on linear algebra.

- Review of conditional expectations (Section 1.1)

The code for the demonstrations is written in R, and RStudio is a useful editor for R. Here are some introductory worksheets on R: Sheet 1 (solutions); Sheet 2 (solutions). The code for the demonstrations is given below.

© 2021 the Statistical Laboratory, University of Cambridge
