2. Overview
• Email data from the textbook website
• 4601 emails (1813 spam)
• 58 features
– 57 predictors, e.g. number of consecutive capital letters, number of times a particular word appears
– 1 response: classified as spam/not spam
• Randomly split into training (3000) and testing
(1601) sets
Data Source: http://statweb.stanford.edu/~tibs/ElemStatLearn/. Creators: Mark Hopkins, Erik Reeber,
George Forman, and Jaap Suermondt, Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304
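A random 3000/1601 split like the one described above might be sketched as follows. The function name and the seed are assumptions made for illustration; the slides do not say how the split was drawn.

```python
import random

def split_indices(n_total=4601, n_train=3000, seed=0):
    """Return (train, test) lists of row indices from one random shuffle."""
    # seed=0 is an arbitrary, hypothetical choice so the split is reproducible
    rng = random.Random(seed)
    indices = list(range(n_total))
    rng.shuffle(indices)
    # The first n_train shuffled rows form the training set, the rest the test set
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = split_indices()
print(len(train_idx), len(test_idx))  # 3000 1601
```

Shuffling once and slicing guarantees that the two sets are disjoint and together cover all 4601 emails.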
3. Methods
• Linear and nonlinear methods
• Variable transformation and standardization
• Feature space modifications:
– No basis expansion
– Raw Polynomials (degree 3)
– Orthogonal Polynomials (degree 3)
– Natural Splines (df = 4, chosen by cross-validation using LR)
• Misclassification is highly undesirable
• Definition of a “maybe spam” class
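Two of the items above, the raw degree-3 polynomial expansion and the “maybe spam” class, could be sketched as below. Everything named here is hypothetical: the function names, the per-feature expansion without cross terms, and the 0.2/0.8 probability cutoffs are assumptions for illustration, not the settings used in this analysis.

```python
def raw_poly_features(row, degree=3):
    """Expand each feature x into x, x**2, ..., x**degree (no cross terms)."""
    return [x ** d for x in row for d in range(1, degree + 1)]

def three_way_label(p_spam, lower=0.2, upper=0.8):
    """Map a predicted spam probability to one of three classes.

    lower and upper are hypothetical cutoffs; anything strictly between
    them is treated as too uncertain to call either way.
    """
    if p_spam >= upper:
        return "spam"
    if p_spam <= lower:
        return "not spam"
    return "maybe spam"

print(raw_poly_features([2.0, 3.0]))
# [2.0, 4.0, 8.0, 3.0, 9.0, 27.0]
print([three_way_label(p) for p in (0.05, 0.5, 0.95)])
# ['not spam', 'maybe spam', 'spam']
```

Widening the gap between the two cutoffs sends more emails to “maybe spam” and leaves fewer confident misclassifications, at the cost of more emails needing a second look.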