This week begins by moving from regression into machine learning, followed by several assignments on reproducing the results of published research papers.
- See this notebook on model evaluation
- See if you can reproduce the table in ISRS 5.29 using the original dataset in body.dat.txt, taken from here, in regression-pt2.R
- Do Labs 3.6.3 through 3.6.6 of Intro to Statistical Learning to get practice with linear models in R in ISL-3.6-exercises.Rmd
- Read Sections 6.1 through 6.3 of ISRS on regression with multiple features
- Do Exercises 6.1, 6.2, and 6.3, and use the original data set in babyweights.txt, taken from here, to reproduce the results from the book in regression-pt2.R
- Read Sections 3.2 and 3.3 of Intro to Statistical Learning (R version) on regression with multiple features
- See the slides and notebook on overfitting and cross-validation
- Read Section 5.1 of An Introduction to Statistical Learning (R version) on cross-validation and do Labs 5.3.1, 5.3.2, and 5.3.3 in ISL-5.3-exercises.Rmd
- See this notebook on confounding_and_collinearity.Rmd (rendered here)
- Take a look at The Anatomy of the Long Tail and think about how to generate Figures 1 and 2 (you can ignore the null model in Figure 2)
- Use the download_movielens.sh script to download the MovieLens data
- Fill in code in the movielens.Rmd file to reproduce plots from lecture slides and Figures 1 and 2 from the paper
- Replicate and extend the results of the Google n-grams "culturomics" paper (pdf) using the template here
- Consider the last part of this exercise, on creating a Makefile, to be "extra credit". Here are some references for using GNU Make / Makefiles:
  - Why Use Make? by Mike Bostock
  - GNU Make for Reproducible Data Analysis by Zach Jones
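The cross-validation idea from the readings above can be sketched in a few lines of base R. This is a minimal illustration on synthetic data, not code from the course notebooks: the data frame `df`, the number of folds `k`, and the helper `rmse` are all made up for the example.

```r
# Minimal k-fold cross-validation sketch in base R.
# The data are synthetic (a noisy quadratic), not one of the course datasets.
set.seed(42)

n <- 200
x <- runif(n, -2, 2)
y <- 1 + 2 * x - 1.5 * x^2 + rnorm(n, sd = 0.5)
df <- data.frame(x = x, y = y)

# Assign each row to one of k folds at random
k <- 5
df$fold <- sample(rep(1:k, length.out = n))

# Root mean-squared error between actual and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# For each polynomial degree, fit on k - 1 folds, score on the held-out fold,
# and average the held-out RMSE across all k folds
degrees <- 1:6
cv_rmse <- sapply(degrees, function(d) {
  fold_errors <- sapply(1:k, function(f) {
    train <- df[df$fold != f, ]
    validate <- df[df$fold == f, ]
    model <- lm(y ~ poly(x, d), data = train)
    rmse(validate$y, predict(model, newdata = validate))
  })
  mean(fold_errors)
})

cv_results <- data.frame(degree = degrees, rmse = cv_rmse)
print(cv_results)

# Degree with the lowest average held-out error
best_degree <- cv_results$degree[which.min(cv_results$rmse)]
```

Averaging over folds gives a less noisy estimate than a single train/validation split, at the cost of fitting each candidate model `k` times.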
The point of this exercise is to get experience with an open-ended prediction problem: predicting the total number of Citibike trips taken on a given day. Create an RMarkdown file named predict_citibike.Rmd and do all of your work in it.
Here are the rules of the game:
- Use the `trips_per_day.tsv` file, which has one row for each day, with the number of trips taken on that day and the minimum temperature on that day.
- Split the data into randomly selected training, validation, and test sets, with 90% of the data for training and validating the model and 10% for a final test set (to be used once and only once, towards the end of this exercise). You can adapt the code from last week's complexity control notebook to do this. When comparing possible models, you can use a single validation fold, or k-fold cross-validation if you'd like a more robust estimate.
- Start out with the model in that notebook, which uses only the minimum temperature on each day to predict the number of trips taken that day. Try different polynomial degrees in the minimum temperature and check that you get results similar to what's in that notebook, although they likely won't be identical due to shuffling of which days end up in the training and validation splits. Quantify your performance using root mean-squared error.
- Now get creative and extend the model to improve it. You can use any features you like that are available prior to the day in question, ranging from the weather, to the time of year and day of week, to activity in previous days or weeks, but don't cheat and use features from the future (e.g., the next day's trips). You can even try adding holiday effects. You might want to look at feature distributions to get a sense of what transformations (e.g., `log`, or manually created factors such as weekday vs. weekend) might improve model performance. You can also interact features with each other. This formula syntax in R reference might be useful.
- Try a bunch of different models and ideas, documenting them in your RMarkdown file. Inspect the models to figure out what the highly predictive features are, and see if you can prune away any negligible features. Report the model with the best performance on the validation data. Watch out for overfitting.
- Plot your final best fit model in two different ways. First with the date on the x-axis and the number of trips on the y-axis, showing the actual values as points and predicted values as a line. Second as a plot where the x-axis is the predicted value and the y-axis is the actual value, with each point representing one day.
- When you're convinced that you have your best model, clean up all your code so that it saves your best model in a `.RData` file using the `save` function.
- Commit all of your changes to git, using `git add -f` to add the model `.RData` file if needed, and push to your GitHub repository.
- Finally, use the model you just developed and pushed to GitHub to make predictions on the 10% of data you kept aside as a test set. Do this only once, and record the performance in your RMarkdown file. Use this number to make a guess as to how your model will perform on future data (which we'll test it on!). Do you think it will do better, worse, or the same as it did on the 10% test set you used here? Write your answer in your RMarkdown notebook. Render the notebook and push the final result to GitHub.
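The split, fit, evaluate, and save steps above could look roughly like the following sketch in base R. It runs on synthetic stand-in data rather than `trips_per_day.tsv`, so the column names (`ymd`, `num_trips`, `tmin`), the 80/20 train/validation split within the 90%, and the output path are all assumptions made for illustration; it is not the code from the complexity control notebook.

```r
# Sketch of the split / fit / evaluate / save loop.
# Synthetic stand-in data; column names are assumptions, not read from the real file.
set.seed(2024)

n_days <- 365
trips <- data.frame(ymd = seq(as.Date("2014-01-01"), by = "day", length.out = n_days))
day_of_year <- as.numeric(format(trips$ymd, "%j"))
trips$tmin <- 50 - 25 * cos(2 * pi * day_of_year / 365) + rnorm(n_days, sd = 5)
trips$num_trips <- round(pmax(0, 5000 + 400 * trips$tmin - 1.5 * trips$tmin^2 +
                                 rnorm(n_days, sd = 3000)))

# Hold out 10% as the test set, then split the rest into train and validation
shuffled <- sample(n_days)
n_test <- round(0.10 * n_days)
test_idx <- shuffled[1:n_test]
rest_idx <- shuffled[-(1:n_test)]
n_validate <- round(0.20 * length(rest_idx))  # assumed 80/20 split of the 90%
validate_idx <- rest_idx[1:n_validate]
train_idx <- rest_idx[-(1:n_validate)]

train <- trips[train_idx, ]
validate <- trips[validate_idx, ]
test <- trips[test_idx, ]

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Build the formula with a literal degree so the saved model is self-contained
poly_formula <- function(d) as.formula(sprintf("num_trips ~ poly(tmin, %d)", d))

# Compare polynomial degrees on the validation set only
degrees <- 1:8
results <- data.frame(degree = degrees, train_rmse = NA_real_, validate_rmse = NA_real_)
for (i in seq_along(degrees)) {
  model <- lm(poly_formula(degrees[i]), data = train)
  results$train_rmse[i] <- rmse(train$num_trips, predict(model, newdata = train))
  results$validate_rmse[i] <- rmse(validate$num_trips, predict(model, newdata = validate))
}
print(results)

best_degree <- results$degree[which.min(results$validate_rmse)]
best_model <- lm(poly_formula(best_degree), data = train)

# Save the chosen model (written to a temporary directory in this sketch)
model_file <- file.path(tempdir(), "best_model.RData")
save(best_model, file = model_file)

# Only at the very end: reload the saved model and score the test set once
load(model_file)
test_rmse <- rmse(test$num_trips, predict(best_model, newdata = test))
```

Selecting the degree on the validation set and touching the test set only once keeps `test_rmse` an honest estimate of performance on new data. Training RMSE is reported alongside validation RMSE because a widening gap between the two is a sign of overfitting.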