The Minimum Description Length (MDL) principle states that good learning can be achieved by selecting the model that provides the shortest description of the observed data. It is a key concept bridging information theory and machine learning, and it allows a growing range of important machine learning problems to be understood from an information-theoretic viewpoint. In this talk, we first review methods for efficient lossless compression of data generated from an unknown probability distribution (universal coding), with a particular focus on two-stage (two-part) coding. We then introduce the MDL estimator based on two-stage codes and explain how it relates to standard learning formulations. Finally, we present a theorem by Barron and Cover that provides a generalization guarantee for this MDL estimator, thereby offering a rigorous mathematical justification for applying the MDL principle in machine learning.
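To make the objects mentioned above concrete, the following is a minimal sketch in our own notation (not necessarily the notation used in the talk). Assume data $x^n = (x_1, \dots, x_n)$ and a countable model class $\mathcal{Q}$ equipped with codelengths $L(q)$ satisfying Kraft's inequality. A two-stage code first describes a model and then the data under that model, so its total codelength is
\[
  L_{\text{two-stage}}(x^n) \;=\; \min_{q \in \mathcal{Q}} \Bigl\{ L(q) + \log_2 \frac{1}{q(x^n)} \Bigr\},
  \qquad \sum_{q \in \mathcal{Q}} 2^{-L(q)} \le 1,
\]
and the corresponding MDL estimator selects the minimizing model,
\[
  \hat{q}_n \;=\; \operatorname*{arg\,min}_{q \in \mathcal{Q}} \Bigl\{ L(q) + \log_2 \frac{1}{q(x^n)} \Bigr\}.
\]
The Barron-Cover guarantee referred to above can then be stated schematically: under regularity conditions on the codelengths (omitted here), the risk of $\hat{q}_n$ with respect to the data-generating distribution $p$, measured for instance in squared Hellinger distance, is bounded up to constants by the index of resolvability
\[
  R_n(p) \;=\; \min_{q \in \mathcal{Q}} \Bigl\{ \frac{L(q)}{n} + D(p \,\|\, q) \Bigr\},
\]
which balances model complexity against how well the best model in the class approximates $p$.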