This week starts with a discussion of regression as a bridge into machine learning, followed by several assignments on reproducing the results of published research papers.

Day 1

Regression (cont'd)

See this notebook on model evaluation
See if you can reproduce the table in ISRS 5.29 using the original dataset in body.dat.txt, taken from here, in regression-pt2.R
Do Labs 3.6.3 through 3.6.6 of Intro to Statistical Learning to get practice with linear models in R in ISL-3.6-exercises.Rmd
Read Sections 6.1 through 6.3 of ISRS on regression with multiple features
Do Exercises 6.1, 6.2, and 6.3, and use the original data set in babyweights.txt, taken from here, to reproduce the results from the book in regression-pt2.R
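
As a warm-up for the multiple-regression exercises above, here is a minimal sketch of fitting a linear model with several features in base R. It does not use babyweights.txt or body.dat.txt; the data are simulated and every column name (`x1`, `x2`, `group`, `y`) is a hypothetical placeholder, so treat it as an illustration of the `lm()` workflow rather than a solution.

```r
# Simulated data standing in for a real dataset; all column names are hypothetical.
set.seed(42)
n <- 200
df <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  group = factor(sample(c("a", "b"), n, replace = TRUE))
)
df$y <- 1 + 2 * df$x1 - 0.5 * df$x2 + 3 * (df$group == "b") + rnorm(n, sd = 0.5)

# Fit a linear model with several features.
# lm() automatically encodes the factor `group` as a dummy variable.
fit <- lm(y ~ x1 + x2 + group, data = df)

# Coefficient table: estimates, standard errors, t values, and p-values.
coef_table <- summary(fit)$coefficients
print(round(coef_table, 3))

# R-squared and adjusted R-squared for the fit.
r2 <- summary(fit)$r.squared
adj_r2 <- summary(fit)$adj.r.squared
print(c(r2 = r2, adj_r2 = adj_r2))
```

The coefficient table has the same layout as the regression tables printed in the textbooks, which makes it a convenient object to compare against when reproducing published results.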

References

Sections 3.2 and 3.3 of Intro to Statistical Learning (R version) on regression with multiple features

Day 2

Overfitting, generalization, and model complexity

See the slides and notebook on overfitting and cross-validation
Read section 5.1 of An Introduction to Statistical Learning (R version) on cross-validation and do labs 5.3.1, 5.3.2, and 5.3.3 in ISL-5.3-exercises.Rmd
See this notebook on confounding_and_collinearity.Rmd (rendered here)
Investigating link between coffee and cancer
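
To make the cross-validation idea concrete, the following is a rough sketch of k-fold cross-validation for choosing a polynomial degree, written in base R. The data are simulated from a quadratic curve, and the names `df`, `fold`, `rmse`, and `cv_rmse` are illustrative assumptions, not taken from the course notebook or the ISL labs.

```r
# Simulated data: the true relationship is quadratic in x (an assumption for this sketch).
set.seed(1)
n <- 300
x <- runif(n, -2, 2)
df <- data.frame(x = x, y = 1 + x - 2 * x^2 + rnorm(n, sd = 1))

# Root mean-squared error between actual and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Randomly assign each row to one of k folds of (nearly) equal size.
k <- 5
fold <- sample(rep(1:k, length.out = n))

# For each degree, fit on k - 1 folds, evaluate on the held-out fold, and average.
degrees <- 1:6
cv_rmse <- sapply(degrees, function(d) {
  fold_errors <- sapply(1:k, function(f) {
    train <- df[fold != f, ]
    held_out <- df[fold == f, ]
    fit <- lm(y ~ poly(x, degree = d), data = train)
    rmse(held_out$y, predict(fit, newdata = held_out))
  })
  mean(fold_errors)
})

results <- data.frame(degree = degrees, cv_rmse = cv_rmse)
print(results)

# Degree with the lowest cross-validated error.
best_degree <- results$degree[which.min(results$cv_rmse)]
print(best_degree)
```

Because every row is held out exactly once, the averaged error is less sensitive to one lucky or unlucky split than a single validation fold would be.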

Day 3

The long tail

Take a look at The Anatomy of the Long Tail and think about how to generate Figures 1 and 2 (you can ignore the null model in Figure 2)
Use the download_movielens.sh script to download the MovieLens data
Fill in code in the movielens.Rmd file to reproduce plots from lecture slides and Figures 1 and 2 from the paper
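
One way to start thinking about long-tail plots is to rank items by popularity and look at the cumulative share of activity they capture. The sketch below does this in base R on simulated ratings; it does not read the MovieLens files, the column names `user_id` and `movie_id` are hypothetical, and it is not claimed to match the paper's figures exactly.

```r
# Simulated ratings with a long-tailed popularity distribution (an assumption):
# the probability of rating a movie is proportional to 1 / its index.
set.seed(7)
n_movies <- 1000
n_ratings <- 50000
p <- 1 / seq_len(n_movies)
ratings <- data.frame(
  user_id = sample(1:2000, n_ratings, replace = TRUE),
  movie_id = sample(seq_len(n_movies), n_ratings, replace = TRUE, prob = p)
)

# Count ratings per movie and sort from most to least popular.
counts <- sort(table(ratings$movie_id), decreasing = TRUE)
popularity <- data.frame(
  movie_id = as.integer(names(counts)),
  rank = seq_along(counts),
  num_ratings = as.integer(counts)
)

# Cumulative fraction of all ratings covered by the top-ranked movies.
popularity$cum_frac <- cumsum(popularity$num_ratings) / sum(popularity$num_ratings)

# Share of all ratings captured by the 100 most popular movies.
top100_share <- popularity$cum_frac[100]
print(top100_share)

# Cumulative popularity curve: rank on the x-axis, share of ratings on the y-axis.
plot(popularity$rank, popularity$cum_frac, type = "l",
     xlab = "Movie rank (most popular first)",
     ylab = "Cumulative fraction of ratings")
```

A curve that rises steeply and then flattens indicates that a small head of popular items accounts for most of the activity, with the remainder spread thinly over the tail.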

Day 4

N-gram data and "Culturomics"

Replicate and extend the results of the Google n-grams "culturomics" paper (pdf) using the template here
Consider the last bit of this exercise, on creating a Makefile, to be "extra credit". Here are some references for using GNU Make / Makefiles:
- Why Use Make? by Mike Bostock
- GNU Make for Reproducible Data Analysis by Zach Jones

Day 5

Predicting daily Citibike trips (open-ended)

The point of this exercise is to get experience in an open-ended prediction exercise: predicting the total number of Citibike trips taken on a given day. Create an RMarkdown file named predict_citibike.Rmd and do all of your work in it.

Here are the rules of the game:

Use the trips_per_day.tsv file that has one row for each day, the number of trips taken on that day, and the minimum temperature on that day.
Split the data into randomly selected training, validation, and test sets, with 90% of the data for training and validating the model, and 10% for a final test set (to be used once and only once towards the end of this exercise). You can adapt the code from last week's complexity control notebook to do this. When comparing possible models, you can use a single validation fold or k-fold cross-validation if you'd like a more robust estimate.
Start out with the model in that notebook, which uses only the minimum temperature on each day to predict the number of trips taken that day. Try different polynomial degrees in the minimum temperature and check that you get results similar to what's in that notebook, although they likely won't be identical due to shuffling of which days end up in the training and validation splits. Quantify your performance using root mean-squared error.
Now get creative and extend the model to improve it. You can use any features you like that are available prior to the day in question, ranging from the weather, to the time of year and day of week, to activity in previous days or weeks, but don't cheat and use features from the future (e.g., the next day's trips). You can even try adding holiday effects. You might want to look at feature distributions to get a sense of what transformations (e.g., log or manually created factors such as weekday vs. weekend) might improve model performance. You can also interact features with each other. This formula syntax in R reference might be useful.
Try a bunch of different models and ideas, documenting them in your RMarkdown file. Inspect the models to figure out what the highly predictive features are, and see if you can prune away any features that don't matter much. Report the model with the best performance on the validation data. Watch out for overfitting.
Plot your final best fit model in two different ways. First with the date on the x-axis and the number of trips on the y-axis, showing the actual values as points and predicted values as a line. Second as a plot where the x-axis is the predicted value and the y-axis is the actual value, with each point representing one day.
When you're convinced that you have your best model, clean up all your code so that it saves your best model in a .RData file using the save function.
Commit all of your changes to git, using git add -f to add the model .RData file if needed, and push to your GitHub repository.
Finally, use the model you just developed and pushed to GitHub to make predictions on the 10% of data you kept aside as a test set. Do this only once, and record the performance in your RMarkdown file. Use this number to make a guess as to how your model will perform on future data (which we'll test it on!). Do you think it will do better, worse, or the same as it did on the 10% test set you used here? Write your answer in your RMarkdown notebook. Render the notebook and push the final result to GitHub.
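
The workflow in the rules above (split, compare polynomial degrees by validation RMSE, save the model, then test once) might be sketched as follows. This is only an outline under stated assumptions: the data are simulated rather than read from trips_per_day.tsv, the column names `date`, `tmin`, and `num_trips` are hypothetical, the 80/20 training/validation split is an arbitrary choice, and the model is saved to a temporary path instead of your repository.

```r
# Simulated stand-in for trips_per_day.tsv; column names are hypothetical.
set.seed(2024)
n <- 365
trips <- data.frame(date = seq(as.Date("2014-01-01"), by = "day", length.out = n))
trips$tmin <- 50 + 25 * sin(2 * pi * (seq_len(n) - 110) / 365) + rnorm(n, sd = 5)
trips$num_trips <- pmax(0, 20000 + 300 * (trips$tmin - 50) -
                           5 * (trips$tmin - 50)^2 + rnorm(n, sd = 3000))

# Hold out 10% as the test set, then split the rest into training and validation.
shuffled <- sample(n)
n_test <- round(0.1 * n)
test_idx <- shuffled[seq_len(n_test)]
rest <- shuffled[-seq_len(n_test)]
n_valid <- round(0.2 * length(rest))  # assumed 80/20 split of the remaining 90%
valid_idx <- rest[seq_len(n_valid)]
train_idx <- rest[-seq_len(n_valid)]
train <- trips[train_idx, ]
validation <- trips[valid_idx, ]
test <- trips[test_idx, ]

# Root mean-squared error between actual and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Build the formula with the degree written in as a literal, so the saved model
# does not depend on a variable that may not exist when it is loaded later.
poly_formula <- function(d) as.formula(sprintf("num_trips ~ poly(tmin, %d)", d))

# Compare polynomial degrees on the validation set.
degrees <- 1:8
valid_rmse <- sapply(degrees, function(d) {
  fit <- lm(poly_formula(d), data = train)
  rmse(validation$num_trips, predict(fit, newdata = validation))
})
print(data.frame(degree = degrees, valid_rmse = valid_rmse))
best_degree <- degrees[which.min(valid_rmse)]

# Fit the chosen model and save it with save().
model <- lm(poly_formula(best_degree), data = train)
model_file <- file.path(tempdir(), "citibike_model.RData")  # hypothetical location
save(model, file = model_file)

# Later, and only once: load the saved model and evaluate it on the test set.
loaded <- new.env()
load(model_file, envir = loaded)
test_rmse <- rmse(test$num_trips, predict(loaded$model, newdata = test))
print(test_rmse)
```

Loading the model into a separate environment mimics what happens when someone else evaluates your saved file in a fresh R session, which is a quick check that the .RData file is self-sufficient.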
