This thesis is based on three publications [1, 2, 3].
The first publication [1] concerns a type of machine learning called “boosting.” Informally, boosting algorithms construct an accurate “boosted” classifier by combining several less accurate classifiers. Experimentally, the generalization error of boosted classifiers keeps improving during training, even after the training error reaches zero. Margin theory [8] attributes this desirable generalization phenomenon to large “margins.” This led to a line of research with a common goal [10, 12, 24, 43, 44]: maximize the smallest margin of a boosted classifier using as few “less accurate classifiers” as possible. This line of research culminated with AdaBoostV [44], which was later conjectured to yield an optimal trade-off between the number of “less accurate classifiers” and the smallest margin [40]. Our main contribution is an algorithm, SparsiBoost, which achieves a better trade-off and thus refutes the conjecture. Finally, we present a matching lower bound, which implies that SparsiBoost is optimal.
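To make the notion of margins concrete, the sketch below computes the normalized margin of a voting classifier on each training example: the margin of an example (x, y) is y times the weighted vote, divided by the total weight, so it lies in [-1, 1] and is positive exactly when the example is classified correctly. The base classifiers and weights here are hypothetical placeholders, not the ones used in [1].

```python
import numpy as np

def margins(X, y, base_classifiers, alphas):
    """Normalized margins of the voting classifier sum_t alpha_t * h_t(x).

    X: (m, d) array of examples; y: (m,) array of labels in {-1, +1};
    base_classifiers: functions mapping X to predictions in {-1, +1};
    alphas: non-negative weights of the base classifiers.
    """
    votes = sum(a * h(X) for a, h in zip(alphas, base_classifiers))
    return y * votes / np.sum(alphas)  # in [-1, 1]; > 0 means correctly classified

# Hypothetical example: three decision stumps h(x) = sign(x_j - 0.5).
stumps = [lambda X, j=j: np.sign(X[:, j] - 0.5) for j in range(3)]
alphas = np.array([0.5, 0.3, 0.2])
X = np.random.randn(100, 3)
y = np.sign(X[:, 0] - 0.5)
print(margins(X, y, stumps, alphas).min())  # the smallest margin
```

The quantity printed last, the smallest margin over the training sample, is exactly what the line of research above seeks to maximize with as few base classifiers as possible.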
The second publication [2] concerns generalization error upper bounds from margin theory [8]. Despite numerous margin-based generalization upper bounds [8, 12, 21, 50], nothing was previously known about the tightness of these bounds. To address this, we present the first margin-based generalization lower bound. Our lower bound matches Breiman’s generalization upper bound [12], which depends on the smallest margin, ruling out any further improvement of that bound. Furthermore, our lower bound almost matches the k’th margin bound [21], the strongest known margin-based generalization upper bound, which depends on all margins, ruling out any improvement beyond a multiplicative log factor.
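To give a sense of the form these bounds take, the display below sketches the rough shape of a minimum-margin bound such as Breiman’s, with constants and logarithmic factors suppressed; here θ is the smallest margin on the m training examples and d measures the complexity of the class of base classifiers. This is only an illustrative shape; the precise statements in [12, 21] differ in their logarithmic terms and in how they account for margins other than the smallest.

```latex
% Rough shape of a minimum-margin generalization bound (constants and
% logarithmic factors suppressed): with high probability over the sample,
% every voting classifier f whose smallest training margin is \theta satisfies
\[
  \Pr_{(x,y)\sim\mathcal{D}}\bigl[ y f(x) \le 0 \bigr]
  \;\le\;
  \widetilde{O}\!\left( \frac{d}{\theta^{2} m} \right),
\]
% where m is the number of training examples and d measures the complexity
% of the class of base classifiers.
```

Our lower bound shows that, for the smallest-margin dependence, bounds of this shape cannot be improved.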
The third publication [3] concerns neural networks. Training neural networks sometimes entails computing time-consuming operations on their weight matrices, e.g., matrix determinants [31]. Operations like matrix determinants are often faster to compute given the Singular Value Decomposition (SVD). Previous work implicitly maintains the SVD of the weight matrices in a neural network without explicitly computing the SVD [52]. In theory, their technique allows faster determinant computation; in practice, however, we find no speedup. We present an algorithm, FastH, which is faster in practice (around 5 times faster than [52]). FastH has the same time complexity as [52]; however, for a d×d weight matrix and a batch size b, FastH reduces the number of sequential stages from O(d) to O(d/b + b).
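As a minimal illustration of why an explicit SVD helps, the sketch below compares computing the log-determinant of a weight matrix directly with reading it off the singular values: for W = UΣVᵀ with U and V orthogonal, |det(W)| is simply the product of the singular values. This is a generic NumPy sketch, not the parameterization used in [3, 52].

```python
import numpy as np

d = 512
W = np.random.randn(d, d)

# Direct route: a full log-determinant computation on W (cubic in d).
sign, logdet_direct = np.linalg.slogdet(W)

# SVD route: if U, S, Vt are already maintained (as in an SVD parameterization),
# |det(W)| = prod(S) because U and V are orthogonal, so log|det(W)| = sum(log(S)).
U, S, Vt = np.linalg.svd(W)          # computed once here purely for illustration
logdet_from_svd = np.sum(np.log(S))  # linear in d given the singular values

print(np.allclose(logdet_direct, logdet_from_svd))  # True, up to floating-point error
```

Once the SVD is available alongside the weights, the determinant becomes cheap; the difficulty addressed by [52] and FastH is maintaining it implicitly during training, and doing so with few sequential stages.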