ML Assignment 3: LASSO-ing the Sexual Guilt Pixie
Assignment 3
In this assignment I continued to work with the Add Health dataset from the previous two assignments (one, and two). I had previously examined which factors were good predictors of a teen having had sex, and the degree to which someone agreed with “If you had sexual intercourse, afterward, you would feel guilty” was the most important predictor. Since LASSO is a form of linear regression, it needs a quantitative response variable, so I decided to examine which factors predict the degree to which someone agreed with the sexual guilt statement. Lasso regression was performed to determine which of the 19 quantitative or binary predictors were important.
The data were randomly split into a 70% training set (N=1696) and a 30% test set (N=727). The least angle regression algorithm with 10-fold cross-validation was used to estimate the lasso regression model on the training set, and the model was validated on the test set. The change in the cross-validation mean squared error at each step was used to identify the best subset of predictor variables.
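The procedure described above can be sketched with scikit-learn. This is a minimal sketch under stated assumptions, not the code used for the assignment: the data are synthetic stand-ins for Add Health (19 random predictors, of which three are made to matter), and the random seeds and variable names are my own choices.

```python
import numpy as np
from sklearn.linear_model import LassoLarsCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

# Synthetic stand-in for the Add Health data (assumption): 19 predictors,
# only the first three are truly related to the response.
rng = np.random.default_rng(0)
n_rows, n_predictors = 2423, 19
X = rng.normal(size=(n_rows, n_predictors))
true_coef = np.zeros(n_predictors)
true_coef[:3] = [0.4, 0.27, 0.2]
y = X @ true_coef + rng.normal(scale=0.5, size=n_rows)

# Standardize predictors so the penalty treats them on the same scale.
X = scale(X)

# 70% training / 30% test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123
)

# Lasso fitted by least angle regression; the penalty is chosen by
# 10-fold cross-validation on the training set.
model = LassoLarsCV(cv=10, precompute=False).fit(X_train, y_train)

# Predictors whose coefficients were not shrunk to exactly zero.
retained = np.flatnonzero(model.coef_)
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)

print("training rows:", len(X_train), "test rows:", len(X_test))
print("retained predictor indices:", retained.tolist())
print("training R-square:", round(train_r2, 3))
print("test R-square:", round(test_r2, 3))
```

With real data, `X` and `y` would come from the cleaned Add Health columns instead of the random generator, and `model.coef_` would be paired with the variable names to build a coefficient table.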
Of the 19 variables tested, 16 were retained in the selected model. The five predictors with the largest coefficients appear in the table below. This doesn’t determine whether any of these variables are correlated with religious strength. But it does suggest that anticipated guilt is most strongly associated with a loss of respect from a partner or peers, upsetting the mother, and the social stigma around accidental pregnancy.
Coefficient | Variable | Question
---|---|---
0.410922 | H1MO2 | If you had sexual intercourse, your partner would lose respect for you |
0.266673 | H1MO4 | If you had sexual intercourse, it would upset {NAME OF MOTHER} |
0.196177 | H1MO10 | If you got {someone} pregnant, it would be embarrassing |
-0.085100 | H1MO1 | If you had sexual intercourse, your friends would respect you more |
-0.079051 | H1MO5 | If you had sexual intercourse, it would give you a great deal of physical pleasure |
The training data R-square was 0.406 and the test data R-square was 0.461, so the model explains roughly 40 to 46% of the variation in sexual guilt. Because the test R-square is no lower than the training R-square, there is no sign that the model is overfitted.
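For reference, R-square here is presumably the usual coefficient of determination, one minus the ratio of residual to total sum of squares. A small sketch of that calculation follows; the scores and predictions are hypothetical values made up for illustration, not Add Health data.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_observed = sum(observed) / len(observed)
    ss_total = sum((y - mean_observed) ** 2 for y in observed)
    ss_residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1 - ss_residual / ss_total


# Hypothetical agreement scores and model predictions (illustration only).
observed = [1, 2, 3, 4, 5]
predicted = [1.5, 2.5, 3.0, 3.5, 4.5]
print(round(r_squared(observed, predicted), 3))  # 0.9
```

A model that always predicts the mean scores 0 on this measure, and a perfect model scores 1.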
Code
Notes
LASSO (least absolute shrinkage and selection operator) regression is a shrinkage & variable selection method for linear regression.
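One common way to write the lasso objective is shown below. The notation is my own assumption rather than something from the course material: $n$ observations, $p$ predictors $x_{ij}$, response $y_i$, intercept $\beta_0$, and a penalty weight $\lambda \ge 0$.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}
  \; \frac{1}{2n} \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^2
  \; + \; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

The absolute-value penalty is what shrinks some coefficients to exactly zero, which is why the method doubles as variable selection. Setting $\lambda = 0$ recovers ordinary least squares.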
Why Lasso?
- Better prediction accuracy if small number of observations relative to number of predictors
- Can increase model interpretability
- Simpler model that selects only most important predictors
Limitations
- Selection of variables is 100% statistically driven, rather than guided by intuition or human knowledge
- If predictors are correlated, one will be arbitrarily selected
- Estimating p-values is not straightforward
- Different methods & software can produce different results
- No guarantee that selected model is not overfitted, nor that it’s the “best model”
Let me know what you think of this article on twitter @dumasraphael!