Problem 1 (there is only one problem, in several parts)
The file CommunityCrime.csv is a dataset containing 319 observations on 123 variables.
The observations are communities within the United States. The data combines
socioeconomic data from the 1990 US Census, law enforcement data from the 1990 US
LEMAS survey, and crime data from the 1995 FBI Uniform Crime Reporting program. A
detailed description of all variables is available at
http://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.names.
We seek to predict the variable
ViolentCrimesPerPop, the total number of violent crimes per 100,000 people.
Note: when asked to use cross-validation to select a tuning parameter, conduct the
cross-validation on the training data only, then evaluate the model fitted with the
selected tuning parameter on the test data.
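The workflow in this note can be sketched as follows. This is a minimal illustration only, assuming the glmnet package is installed and using simulated stand-in data rather than CommunityCrime.csv; the object names (x, y, train, test, cv_fit) are hypothetical.

```r
library(glmnet)  # assumption: glmnet is installed

set.seed(1)

# Simulated stand-in data (NOT the CommunityCrime data)
n <- 200
p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- drop(x[, 1:3] %*% c(2, -1, 0.5)) + rnorm(n)

# 90% of the rows for training, the remaining 10% for testing
train <- sample(n, size = floor(0.9 * n))
test <- setdiff(seq_len(n), train)

# Cross-validation sees the training rows only (alpha = 0 gives ridge)
cv_fit <- cv.glmnet(x[train, ], y[train], alpha = 0)
best_lambda <- cv_fit$lambda.min

# The test rows are used once, to score the selected tuning parameter
pred <- predict(cv_fit, newx = x[test, ], s = best_lambda)
test_mse <- mean((y[test] - pred)^2)
test_mse
```

Because the test rows never enter cv.glmnet(), the reported test error is not influenced by the choice of tuning parameter.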
a) Set a seed of 1 and split the data into a 90% training set and a 10% test set.
b) Fit a linear model using least squares on the training set. Report the test error
obtained.
c) Fit a ridge regression model on the training set with λ chosen by cross-validation.
Report the test error obtained.
d) Fit a lasso model on the training set with λ chosen by cross-validation. Report the
test error obtained, along with the number of non-zero coefficient estimates.
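As a rough illustration of the mechanics behind parts a), b), and d), the sketch below computes a least-squares test error and counts non-zero lasso coefficients. It assumes the glmnet package is installed and uses simulated data with hypothetical names (dat, X1, X2, ...), so it shows the steps, not the answer for CommunityCrime.csv.

```r
library(glmnet)  # assumption: glmnet is installed

set.seed(1)

# Simulated stand-in data with columns X1..X20 and a response y
n <- 300
p <- 20
dat <- data.frame(matrix(rnorm(n * p), n, p))
dat$y <- 3 * dat$X1 - 2 * dat$X2 + rnorm(n)

# 90% / 10% split by row index
train <- sample(n, size = floor(0.9 * n))

# Least squares on the training rows, mean squared error on the test rows
ols <- lm(y ~ ., data = dat[train, ])
ols_pred <- predict(ols, newdata = dat[-train, ])
ols_mse <- mean((dat$y[-train] - ols_pred)^2)

# glmnet needs a numeric predictor matrix; drop the intercept column
x <- model.matrix(y ~ ., data = dat)[, -1]

# Lasso (alpha = 1) with lambda chosen by cross-validation on training rows
cv_lasso <- cv.glmnet(x[train, ], dat$y[train], alpha = 1)

# Coefficients at the selected lambda; row 1 is the intercept, so exclude it
lasso_coef <- coef(cv_lasso, s = "lambda.min")
n_nonzero <- sum(lasso_coef[-1, 1] != 0)

ols_mse
n_nonzero
```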
Note for parts e) and f) below: when you are identifying the value of M from the validation
plot, you may need to zoom in to see what is happening with a small number of components.
In your command, you can add in options like xlim=c(0,20) or ylim=c(0,1), adjusting the
numeric values as needed to provide a useful view.
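For instance, zooming in on a validation plot might look like the sketch below. It assumes the pls package is installed and uses simulated data with hypothetical names; the axis limits are placeholders to adjust. Reading the minimum off numerically with MSEP() is shown only as a cross-check on the plot, not as a required step.

```r
library(pls)  # assumption: pls is installed

set.seed(1)

# Simulated stand-in data with columns X1..X30 and a response y
n <- 200
p <- 30
dat <- data.frame(matrix(rnorm(n * p), n, p))
dat$y <- dat$X1 + 0.5 * dat$X2 + rnorm(n)
train <- sample(n, size = floor(0.9 * n))

# PCR on the training rows with built-in cross-validation
pcr_fit <- pcr(y ~ ., data = dat[train, ], scale = TRUE, validation = "CV")

# Full validation plot, then a zoomed view of the first few components
validationplot(pcr_fit, val.type = "MSEP")
validationplot(pcr_fit, val.type = "MSEP", xlim = c(0, 10))

# Cross-check: CV error for 1..p components (intercept-only model excluded)
cv_msep <- MSEP(pcr_fit, estimate = "CV", intercept = FALSE)$val[1, 1, ]
best_M <- unname(which.min(cv_msep))

# Test error with the selected number of components
pred <- predict(pcr_fit, newdata = dat[-train, ], ncomp = best_M)
test_mse <- mean((dat$y[-train] - drop(pred))^2)

best_M
test_mse
```

The same pattern should apply to a PLS fit by replacing pcr() with plsr().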
e) Fit a PCR model on the training set with M chosen by cross-validation. Report the
test error obtained, along with the value of M selected by cross-validation.
f) Fit a PLS model on the training set with M chosen by cross-validation. Report the
test error obtained, along with the value of M selected by cross-validation.
g) Comment on the above parts and how well you believe we can predict violent crime
rate using these methods.