In this assignment, we will perform several linear regression analyses on the
Boston dataset. Load the dataset as follows and read the description (this requires
the scikit-learn library, version 1.1 or earlier, because load_boston was removed in
version 1.2):
from sklearn import datasets
boston = datasets.load_boston()
print(boston.DESCR)
The dataset contains 506 observations of 13 features, with the target value being
the median value of homes. Load the observations and targets into separate
numpy arrays:
data = boston.data
target = boston.target
We will use the first 450 observations as training data and the remaining 56 as
testing data:
X_train = data[:450]
y_train = target[:450]
X_test = data[450:]
y_test = target[450:]
Note: Please use torch to complete the problems in this assignment.
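Since the split above produces numpy arrays, they need to be converted to torch tensors before use. The following is a minimal sketch of one way to do this. It uses random arrays with the same shapes as the Boston data (an assumption made so the snippet runs on its own); in the assignment, data and target would come from boston.data and boston.target.

```python
import numpy as np
import torch

# Synthetic stand-ins with the same shapes as the Boston arrays
# (506 x 13 features and 506 targets). These are NOT the real data.
rng = np.random.default_rng(0)
data = rng.normal(size=(506, 13))
target = rng.normal(size=506)

# torch.from_numpy shares memory with the numpy array;
# .float() then returns a float32 copy, torch's default floating dtype.
X_train = torch.from_numpy(data[:450]).float()
y_train = torch.from_numpy(target[:450]).float()
X_test = torch.from_numpy(data[450:]).float()
y_test = torch.from_numpy(target[450:]).float()

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
```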
Problem 1
(10 points) Explore some of the relationships between the features of the data.
Which features appear to have the strongest relationship with the target? Which
features have the weakest relationship? Use a few plots to describe the data and
these relationships.
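One possible way to quantify these relationships before plotting is the Pearson correlation between each feature and the target. The sketch below is only an illustration on synthetic data; the helper name feature_target_correlation is hypothetical, and correlation captures only linear relationships, so it should complement the plots rather than replace them.

```python
import torch

def feature_target_correlation(X, y):
    """Pearson correlation between each column of X and the vector y."""
    Xc = X - X.mean(dim=0)          # center each feature
    yc = y - y.mean()               # center the target
    cov = (Xc * yc.unsqueeze(1)).sum(dim=0)
    denom = torch.sqrt((Xc ** 2).sum(dim=0) * (yc ** 2).sum())
    return cov / denom

# Synthetic example: the target depends strongly on feature 0 only.
torch.manual_seed(0)
X = torch.randn(200, 3)
y = 2.0 * X[:, 0] + 0.1 * torch.randn(200)

r = feature_target_correlation(X, y)
# Feature indices ordered from strongest to weakest absolute correlation.
ranking = torch.argsort(r.abs(), descending=True)
print(r, ranking)
```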
Problem 2
(20 points) Perform a multivariate linear regression on the Boston dataset
without regularization. Report the coefficients of your trained model. Report the
following testing error metrics: RMSE, MAPE, MAE, MBE, R^2. Use plots to
show how your model performs.
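The error metrics listed above can be sketched in torch as follows. This is a minimal sketch, not a reference implementation: the function name regression_metrics is hypothetical, MAPE is reported as a percentage, and MBE is assumed to be the mean of (prediction minus target), so a positive value means over-prediction. Sign conventions for MBE vary, so check the one used in class.

```python
import torch

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAPE (in percent), MAE, MBE, and R^2 as Python floats."""
    err = y_pred - y_true  # assumed convention: positive = over-prediction
    ss_res = (err ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return {
        "RMSE": torch.sqrt((err ** 2).mean()).item(),
        # MAPE is undefined when y_true contains zeros.
        "MAPE": (100.0 * (err / y_true).abs().mean()).item(),
        "MAE": err.abs().mean().item(),
        "MBE": err.mean().item(),
        "R2": (1.0 - ss_res / ss_tot).item(),
    }

# Small hand-checkable example.
y_true = torch.tensor([10.0, 20.0, 30.0, 40.0])
y_pred = torch.tensor([12.0, 18.0, 33.0, 37.0])
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```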
Problem 3
(20 points) Perform a linear regression on the Boston dataset with l2-norm
regularization (i.e., ridge regression). Report the results as before.
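For reference, ridge regression has a closed-form solution, w = (A^T A + lam * D)^(-1) A^T y, where A is the feature matrix with a column of ones prepended. The sketch below illustrates this on synthetic data. The function name fit_ridge, the value of lam, and the choice to leave the intercept unpenalized (D is the identity with its first entry set to zero) are assumptions for illustration. Setting lam to 0 recovers ordinary least squares.

```python
import torch

def fit_ridge(X, y, lam):
    """Closed-form ridge regression; returns (intercept, coefficients)."""
    n, d = X.shape
    A = torch.cat([torch.ones(n, 1, dtype=X.dtype), X], dim=1)
    D = torch.eye(d + 1, dtype=X.dtype)
    D[0, 0] = 0.0  # assumption: the intercept is not penalized
    w = torch.linalg.solve(A.T @ A + lam * D, A.T @ y)
    return w[0], w[1:]

# Synthetic, noise-free data so the lam = 0 fit can be checked exactly.
torch.manual_seed(0)
X = torch.randn(100, 3, dtype=torch.float64)
true_w = torch.tensor([1.5, -2.0, 0.5], dtype=torch.float64)
y = 4.0 + X @ true_w

b0, w0 = fit_ridge(X, y, lam=0.0)    # ordinary least squares
b1, w1 = fit_ridge(X, y, lam=50.0)   # coefficients shrink toward zero
print(w0, w1)
```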
Problem 4
(20 points) Perform a linear regression on the Boston dataset with l1-norm
regularization (i.e., lasso). Report the results as before. Compare the performances
of the three models from Problems 2, 3, and 4 and comment on the results.
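Unlike ridge, lasso has no closed-form solution, so it is usually fit iteratively. One option (not necessarily the one expected here) is proximal gradient descent, also called ISTA: take a gradient step on the squared-error term, then apply soft-thresholding to the coefficients. The sketch below assumes the objective (1/(2n)) * ||Xw + b - y||^2 + lam * ||w||_1 with an unpenalized intercept; the names fit_lasso_ista and soft_threshold, the value of lam, and the iteration count are illustrative choices.

```python
import torch

def soft_threshold(v, t):
    """Proximal operator of t * ||v||_1: shrink each entry toward zero by t."""
    return torch.sign(v) * torch.clamp(v.abs() - t, min=0.0)

def fit_lasso_ista(X, y, lam, n_iter=2000):
    """Lasso via proximal gradient descent; returns (intercept, coefficients)."""
    n, d = X.shape
    A = torch.cat([torch.ones(n, 1, dtype=X.dtype), X], dim=1)
    # Step size 1/L, where L is the largest eigenvalue of A^T A / n.
    step = n / torch.linalg.svdvals(A).max() ** 2
    theta = torch.zeros(d + 1, dtype=X.dtype)
    for _ in range(n_iter):
        grad = A.T @ (A @ theta - y) / n
        theta = theta - step * grad
        # Assumption: only the coefficients are penalized, not the intercept.
        theta[1:] = soft_threshold(theta[1:], step * lam)
    return theta[0], theta[1:]

# Synthetic data: only the first two of four features affect the target.
torch.manual_seed(0)
X = torch.randn(200, 4, dtype=torch.float64)
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1]
y = y + 0.01 * torch.randn(200, dtype=torch.float64)

b, w = fit_lasso_ista(X, y, lam=0.1)
print(b, w)
```

On data like this, the coefficients of the irrelevant features are driven exactly to zero, which is the sparsity property that distinguishes lasso from ridge.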
Problem 5
(30 points) Perform a non-regularized linear regression on the Boston dataset
using 5-fold cross-validation. Report the results as before. Does the
non-regularized model perform better with or without cross-validation? Does this
agree with your expectations?
Note: If the size of the dataset is not evenly divisible by the number of folds
k, you may need to either (1) choose a different value for k or (2) exclude some
observations from the dataset in order to use np.split() as we discussed in
class.
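As an illustration of option (2), the sketch below drops the trailing remainder so that np.split() can divide the indices into k equal folds. With 506 observations and k = 5, one observation is excluded, leaving 5 folds of 101. The generator name kfold_indices is hypothetical, and which observations to exclude (here, the last ones) is an arbitrary choice.

```python
import numpy as np

def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k equal-sized folds."""
    n_used = n_samples - (n_samples % k)   # drop the trailing remainder
    folds = np.split(np.arange(n_used), k)  # requires an even division
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(kfold_indices(506, 5))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

Each index array can then be used to select rows of the data and target arrays for one round of training and validation.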