Hi All,

Here are the averages for the final HWs, in the format: Average / Total (Std. Dev)

HW 9: 37.92 / 50 (10.02)

HW10: 39.29 / 50 (9.04)

HW11: 41.08 / 60 (13.53)

–TAs

Just a repeat announcement: You are allowed to bring a **hand-written**, **single-sided** sheet of notes to the final exam on Tuesday for reference.

All the best!

Today in the final class, we discussed game theory (slides posted in lecture-notes panel). We discussed general-sum games, zero-sum games, bluffing in poker, and we sketched how our bounds for the Randomized Weighted Majority algorithm in fact give a proof of the minimax theorem. We did not go into the proof of existence of Nash equilibria, but I left a proof sketch in the slides posted in case you are interested (don’t worry, you won’t be tested on that).

BTW, on 12/6 my (Avrim’s) office hours will be 10-11 instead of 4-5.

This lecture served as an introduction to information theory, one of the most important branches of applied mathematics invented in the 20th century. We began with some introductory remarks about the birth of this subject in Claude Shannon’s 1948 masterpiece “A Mathematical Theory of Communication,” mentioning the two key problems studied by the theory: data compression (*removing* redundancy in data) and coding for reliable communication (*adding* judicious redundancy to the data that gives it error-resilience).

We discussed some axioms that the “surprise” $S(p)$ associated with an event of probability $p$ should satisfy, and deduced the formula $S(p) = \log_2(1/p)$ for this function. We then defined an(other) important quantity associated with a (discrete) random variable $X$, namely its *entropy* $H(X)$, which is the expected surprise of the outcome of the random variable. More formally,

$$H(X) = \sum_{x} \Pr[X = x] \, \log_2 \frac{1}{\Pr[X = x]}.$$

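For $X \sim \mathrm{Bernoulli}(p)$, $H(X) = h(p)$, where $h(p) = p \log_2 \frac{1}{p} + (1-p) \log_2 \frac{1}{1-p}$ is the binary entropy function. This function plays an important role in asymptotic combinatorics, since $\binom{n}{pn} = 2^{(h(p) + o(1)) n}$ for fixed $p$ as $n \to \infty$. We argued that if $X$ takes at most $N$ values, then $H(X) \le \log_2 N$, with equality attained when $X$ is uniformly distributed over its range of $N$ elements. Along the way, we saw the very useful Jensen’s inequality, which says that for a convex function $f$, $\mathbb{E}[f(X)] \ge f(\mathbb{E}[X])$ (the inequality $\mathbb{E}[X^2] \ge \mathbb{E}[X]^2$ is the special case $f(x) = x^2$), and for concave functions, the inequality is reversed: $\mathbb{E}[f(X)] \le f(\mathbb{E}[X])$.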

Here are some notes from the Fall 2009 offering for further details about this. (We only covered Section 1 of the notes.)
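The entropy quantities from the first half of the lecture are easy to check numerically. The following is only a rough sketch, not something we did in class: the function names are made up for illustration, and it assumes logarithms are taken base 2 so that entropy is measured in bits.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution given as a list of probabilities."""
    # Outcomes with probability 0 contribute nothing to the expected surprise.
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def binary_entropy(p):
    """h(p): entropy of a single coin flip that comes up heads with probability p."""
    return entropy([p, 1.0 - p])

def log2_binomial_rate(n, p):
    """(1/n) * log2 of C(n, round(p*n)); this should approach h(p) as n grows."""
    k = round(p * n)
    return math.log2(math.comb(n, k)) / n

if __name__ == "__main__":
    print(entropy([1 / 8] * 8))           # uniform over 8 values: log2(8) bits
    print(binary_entropy(0.25))           # strictly less than 1 bit
    print(log2_binomial_rate(2000, 0.25)) # close to binary_entropy(0.25)
```

Comparing the last two printed values for growing `n` gives a feel for how quickly the binomial coefficient approaches its entropy estimate.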

In the second half of the lecture, we discussed the Shannon coding theorem for noisy channels, focusing on a simple channel called the binary symmetric channel, which flips each input bit independently with probability $p$. (In other words, in $n$ transmissions, it xors an error vector of $n$ independent $\mathrm{Bernoulli}(p)$ bits to the actual transmitted string.) This second part is *not* included for the final, but if you are interested in learning more or just having a reference for what we covered, here are some notes from the coding theory course I taught in Spring 2010. (The same notes in pdf.)
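For anyone who wants to play with the binary symmetric channel, here is a minimal simulation sketch. It only models the channel itself (no coding scheme), and the function name `bsc_transmit` and the explicit `rng` argument are illustrative choices rather than anything from the lecture.

```python
import random

def bsc_transmit(bits, p, rng):
    """Send a list of 0/1 bits through a binary symmetric channel with flip probability p."""
    # Error vector: each coordinate is 1 independently with probability p.
    noise = [1 if rng.random() < p else 0 for _ in bits]
    # The channel output is the transmitted string xored with the error vector.
    return [b ^ e for b, e in zip(bits, noise)]

if __name__ == "__main__":
    rng = random.Random(0)
    sent = [0] * 20000
    received = bsc_transmit(sent, 0.1, rng)
    # Fraction of flipped bits; this should be near p = 0.1.
    print(sum(received) / len(received))
```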

- The final exam is Tues Dec 7, 1:00-4:00pm, in GHC 4307.
- There will be a review session on Sunday (Dec 5) 11 am – 12:30 pm in GHC 4211. We will go through last year’s final exam. Please attempt the exam beforehand, and also please bring any questions you may have.
- In this week’s recitation, we will give a quick overview of the topics covered since the midterm.
- Please fill out the faculty course evaluation form by December 14.
- Please also fill out evaluation forms for your TAs by December 10: [Varun] [Ravi]

Since HW 11 is due on Tuesday, I (Varun) will have this week’s office hour on Monday (Nov 29) 5-6 pm.

In this lecture, we began by completing our “elementary” analysis of PAC learning, showing that for any class of functions $H$, if we see $m \ge \frac{1}{\epsilon}\left(\ln |H| + \ln \frac{1}{\delta}\right)$ examples, then with probability at least $1 - \delta$, all hypotheses $h \in H$ with $\mathrm{err}(h) \ge \epsilon$ will make at least one mistake on the sample. This means that we can be confident in any rules in $H$ we find that are consistent with the data. The analysis of this was just a direct probability argument for a single function (fix a high error function $h$ and *then* draw the sample) followed by a union bound.

We then saw that one could interpret this as a mathematical justification of Occam’s razor. In particular, for any way of describing functions that you have in your head, there can be at most $2^s$ functions of size (in bits) $s$. So this means that (plugging in $|H| \le 2^s$) if you draw $m$ examples, you can be confident in any rule you find that can be described in less than $m/10$ bits. In particular, even if each of us has a different way of describing things and a different notion of what is “simpler”, we are each justified in following the principle of Occam’s razor.
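To get a feel for the numbers, here is a small calculator sketch. It assumes the standard form of the bound for consistent hypotheses, $m \ge \frac{1}{\epsilon}(\ln|H| + \ln\frac{1}{\delta})$, and for the Occam version it assumes there are at most $2^s$ rules describable in $s$ bits; the function names are hypothetical.

```python
import math

def pac_sample_bound(ln_num_hypotheses, eps, delta):
    """Smallest integer m with m >= (1/eps) * (ln|H| + ln(1/delta)).

    Assumption: this is the standard bound for a finite class H and a
    hypothesis that is consistent with the sample.
    """
    return math.ceil((ln_num_hypotheses + math.log(1.0 / delta)) / eps)

def occam_sample_bound(bits, eps, delta):
    """Same bound when H is 'all rules describable in `bits` bits', so |H| <= 2**bits."""
    return pac_sample_bound(bits * math.log(2), eps, delta)

if __name__ == "__main__":
    # 1000 hypotheses, error 10%, confidence 95%.
    print(pac_sample_bound(math.log(1000), 0.1, 0.05))
    # Rules of at most 10 bits, same error and confidence.
    print(occam_sample_bound(10, 0.1, 0.05))
```

Note how weakly the bound depends on $\delta$ and on $|H|$ (both enter through a logarithm), while the dependence on $\epsilon$ is linear in $1/\epsilon$.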

We then asked the question: what about functions like “linear separators in the plane”, where there are infinitely many functions in the class? To analyze this, we discussed the idea of shatter coefficients and went into a tricky “double-sample” argument to show that you could replace “log of the total number of functions in H” with “log of the number of ways you can use functions in H to label a set of 2m points”. E.g., for the case of linear separators in the plane, the former quantity is infinite but the latter quantity is at most $O(\log m)$.

We then discussed a different model where rather than assuming examples are drawn from a probability distribution and labeled by a target function, we just have an arbitrary sequence of examples, and our aim is just to do “nearly as well as” the best of some set of predictors. This is often called the problem of *combining expert advice*, and we analyzed the deterministic and randomized Weighted Majority algorithms for this problem. What is especially nice is that the randomized version gets a particularly strong bound on the expected number of mistakes over *any* adversarially-chosen sequence of examples. For instance, with $n$ predictors and any $\epsilon \in (0, 1/2]$, we can set parameters to make at most $(1+\epsilon)\,\mathrm{OPT} + O\!\left(\frac{\log n}{\epsilon}\right)$ mistakes in expectation, where $\mathrm{OPT}$ is the number of mistakes of the best predictor in hindsight.
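Here is a minimal sketch of the randomized Weighted Majority algorithm for binary predictions. It assumes the usual multiplicative update (multiply the weight of each mistaken expert by $1 - \epsilon$) and follows a random expert chosen with probability proportional to its weight; the data layout and names are illustrative only.

```python
import random

def randomized_weighted_majority(expert_predictions, outcomes, eps, rng):
    """Run randomized Weighted Majority on a fixed sequence.

    expert_predictions[t][i] is expert i's prediction at round t,
    outcomes[t] is the correct answer at round t.
    Returns (mistakes made by the algorithm, mistakes of the best expert in hindsight).
    """
    n = len(expert_predictions[0])
    weights = [1.0] * n
    expert_mistakes = [0] * n
    alg_mistakes = 0
    for preds, truth in zip(expert_predictions, outcomes):
        # Follow one expert, chosen with probability proportional to its current weight.
        chosen = rng.choices(range(n), weights=weights)[0]
        if preds[chosen] != truth:
            alg_mistakes += 1
        # Penalize every expert that was wrong this round.
        for j in range(n):
            if preds[j] != truth:
                weights[j] *= (1.0 - eps)
                expert_mistakes[j] += 1
    return alg_mistakes, min(expert_mistakes)

if __name__ == "__main__":
    rng = random.Random(1)
    rounds, n = 200, 5
    outcomes = [rng.randint(0, 1) for _ in range(rounds)]
    # Expert 0 is always right; the others guess at random.
    preds = [[outcomes[t]] + [rng.randint(0, 1) for _ in range(n - 1)]
             for t in range(rounds)]
    print(randomized_weighted_majority(preds, outcomes, 0.5, rng))
```

For very long sequences the weights can underflow; renormalizing them every so often would fix that without changing the sampling probabilities.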

Today we looked at the following topics:

(1) Sampling techniques, in particular the Box-Muller method for generating samples from two independent N(0,1) distributions. We began by trying the invert-the-cdf approach for sampling, and noticed that we get “stuck” at the same place where we got stuck in class while computing $\int_{-\infty}^{\infty} e^{-x^2/2}\,dx$. The trick to handle this was to somehow compute $\left(\int_{-\infty}^{\infty} e^{-x^2/2}\,dx\right)^2 = \int\!\!\int e^{-(x^2+y^2)/2}\,dx\,dy$, and use its radial symmetry. We borrow the same trick for our problem as well. Instead of trying to sample from one Gaussian, what if we sample from it twice?

We then saw that if $(x, y)$ is such that $x$ and $y$ are sampled independently from N(0,1) Gaussians, then the random variable $r^2 = x^2 + y^2$ behaves as an exponential R.V. (with mean 2), and the angle $\theta$ behaves as a uniform R.V. on $[0, 2\pi)$. From this, we saw that if we “sample” $r$ and $\theta$ from the appropriate distributions, we get $x = r\cos\theta$ and $y = r\sin\theta$ such that their pdfs are both from the Gaussian distribution, and furthermore, they are independent!
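The Box-Muller recipe fits in a few lines. This is a sketch under the assumption that we target standard N(0,1) samples, so that $r^2$ is exponential with mean 2 and can be generated as $-2\ln U_1$ by inverting its cdf; the function name and the `rng` argument are illustrative.

```python
import math
import random

def box_muller(rng):
    """Return a pair of independent N(0,1) samples built from two uniform samples."""
    u1 = 1.0 - rng.random()  # lies in (0, 1], which avoids taking log(0)
    u2 = rng.random()
    # r^2 = -2 ln(u1) is exponential with mean 2 (inverse-cdf sampling).
    r = math.sqrt(-2.0 * math.log(u1))
    # The angle is uniform on [0, 2*pi).
    theta = 2.0 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)

if __name__ == "__main__":
    rng = random.Random(0)
    pairs = [box_muller(rng) for _ in range(20000)]
    xs = [x for x, _ in pairs]
    # Empirical mean and second moment of x; these should be near 0 and 1.
    print(sum(xs) / len(xs), sum(x * x for x in xs) / len(xs))
```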

(2) We then moved on to solving the “opposite” problem, of estimating the Gaussian given a collection of samples $x_1, \dots, x_n$. For this, we calculated the (parameters) mean and variance that will *maximize the likelihood* of the occurrence of this sample given to us, and some calculus gave us that the best value for the estimated mean is $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and estimated variance is $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})^2$.
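In code, the maximum-likelihood estimates are just the sample mean and the (biased, divide-by-$n$) sample variance. A minimal sketch, with a made-up function name:

```python
def gaussian_mle(samples):
    """Maximum-likelihood estimates (mean, variance) for a Gaussian fit to the samples."""
    n = len(samples)
    mu = sum(samples) / n
    # The MLE divides by n, not n - 1, so it is slightly biased for small samples.
    var = sum((x - mu) ** 2 for x in samples) / n
    return mu, var

if __name__ == "__main__":
    print(gaussian_mle([1.0, 2.0, 3.0, 4.0]))
```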

(3) Finally we saw the difference between convergence in probability, and convergence “almost surely”, and how this gives the different forms of the Law of Large Numbers (Weak / Strong).

We defined the moment generating function (MGF) $M_X(t) = \mathbb{E}[e^{tX}]$ of a random variable $X$. We stated results that the MGF, when it exists, *characterizes* the random variable. The reason for the name moment generating function is that the MGF is the exponential generating function of the moments of $X$. We calculated the MGF of the standard Gaussian to be $e^{t^2/2}$. We saw that the MGF behaves very nicely under convolution and scaling: for independent $X$ and $Y$ and a constant $c$,

$$M_{X+Y}(t) = M_X(t)\, M_Y(t), \qquad M_{cX}(t) = M_X(ct).$$

Using this we proved that if $X_1, \dots, X_n$ are i.i.d. copies of a random variable $X$ with mean 0, variance 1, and bounded moments, then the MGF of $S_n = \frac{X_1 + \cdots + X_n}{\sqrt{n}}$ tends to $e^{t^2/2}$ for large $n$, i.e.,

$$\lim_{n \to \infty} \mathbb{E}\left[e^{t S_n}\right] = e^{t^2/2}.$$

We concluded that the $S_n$’s converge to the standard Gaussian distribution, in the sense stated by the central limit theorem. We also stated the Berry-Esseen theorem, which gives a quantitative bound on the error term (or “Kolmogorov distance”) in the CLT.
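The MGF convergence can be watched numerically in one simple case. As an illustrative assumption (not the example from class), take $X$ to be a uniformly random sign $\pm 1$, which has mean 0, variance 1, and MGF $\cosh(t)$. The product rule for independent sums and the scaling rule then give the MGF of the normalized sum as $\cosh(t/\sqrt{n})^n$.

```python
import math

def mgf_random_sign(t):
    """MGF of a uniformly random +1/-1 variable: (e^t + e^-t) / 2 = cosh(t)."""
    return math.cosh(t)

def mgf_normalized_sum(t, n):
    """MGF of (X_1 + ... + X_n) / sqrt(n) for i.i.d. random signs.

    Scaling by 1/sqrt(n) replaces t with t/sqrt(n); summing n independent
    copies raises the MGF to the n-th power.
    """
    return mgf_random_sign(t / math.sqrt(n)) ** n

def mgf_standard_gaussian(t):
    """MGF of N(0,1)."""
    return math.exp(t * t / 2.0)

if __name__ == "__main__":
    t = 1.0
    for n in (1, 10, 100, 10000):
        # The gap to the Gaussian MGF shrinks as n grows.
        print(n, mgf_normalized_sum(t, n), mgf_standard_gaussian(t))
```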

With this, we moved to the final special topics segment of the course. We began a discussion of the first special topic: Introduction to Machine Learning. Today we saw the concept learning setting and an algorithm to learn decision lists and its analysis.