Project 3

Q1: Download the dataset charleston_ask.csv and import it into your PyCharm project workspace. Specify and train a model that designates the asking price as your target variable and beds, baths and area (in square feet) as your features. Train and test your target and features using a linear regression model. Describe how your model performed. What were the training and testing scores you produced? How many folds did you assign when partitioning your training and testing data? Interpret and assess your output.

After importing the data and extracting the features and target, I trained a linear regression model using K-fold cross-validation with 10 folds. The average training score is 0.019 and the average testing score is -0.058. Both scores are very low, indicating that the model has poor predictive power. If the score is the coefficient of determination (R²), a negative testing value means the model predicts held-out prices worse than simply predicting the mean price. The training score is higher than the testing score, which suggests some overfitting, although with a training score this close to zero the model mainly fails to fit the data at all. There are three possible reasons for the poor scores. First, the chosen features (beds, baths and area) may not be good predictors of the asking price; zip code, which indicates the geographical location of the house, could be added to the features. Second, the three features are not on the same scale, so standardizing them might help. Third, a linear regression model might not be appropriate for this data, and a different model may be needed. To improve the model’s predictive power, we should try all three.
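The procedure described above could be sketched roughly as follows. This is a minimal illustration, not the exact script used for the reported scores: the helper name kfold_scores, the random seed, and the synthetic data standing in for charleston_ask.csv are all assumptions, and the scores it prints will not match the numbers above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate


def kfold_scores(model, X, y, n_splits=10, seed=0):
    """Return (mean train R^2, mean test R^2) over shuffled K-fold splits."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    result = cross_validate(model, X, y, cv=cv, return_train_score=True)
    return result["train_score"].mean(), result["test_score"].mean()


# Synthetic stand-in for the real data (assumption: NOT charleston_ask.csv).
rng = np.random.default_rng(0)
n = 700
beds = rng.integers(1, 6, size=n)
baths = rng.integers(1, 4, size=n)
area = rng.normal(2000, 500, size=n)
X = np.column_stack([beds, baths, area])

# Price is dominated by noise, mimicking features that carry little signal.
y = 300_000 + 20 * area + rng.normal(0, 200_000, size=n)

train_r2, test_r2 = kfold_scores(LinearRegression(), X, y)
print(f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

With the real dataset, X and y would instead be taken from the beds, baths, area and asking price columns of the CSV file.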

Q2: Now standardize your features (again beds, baths and area) prior to training and testing with a linear regression model (also again with asking price as your target). Now how did your model perform? What were the training and testing scores you produced? How many folds did you assign when partitioning your training and testing data? Interpret and assess your output.

After standardizing all the features with a standard scaler and training a linear regression model with 10-fold cross-validation again, both scores remained low. The average training score is 0.019 and the average testing score is -0.003. Standardization does not appear to improve the predictive power of the model, so the next step is to try a ridge regression model instead of a plain linear regression.
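One way to standardize inside the cross-validation loop is to put the scaler and the model in a single pipeline, so the scaler is fit only on each training fold. The sketch below uses synthetic data (an assumption, not the Charleston dataset) and compares the raw and standardized models on identical folds. Because rescaling the features does not change the predictions of an ordinary least squares fit with an intercept, the two sets of testing scores should agree up to numerical precision when the folds are the same.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for beds, baths and area (assumption, not the real data).
rng = np.random.default_rng(1)
n = 700
X = np.column_stack([
    rng.integers(1, 6, size=n),     # beds
    rng.integers(1, 4, size=n),     # baths
    rng.normal(2000, 500, size=n),  # area in square feet
]).astype(float)
y = 300_000 + 20 * X[:, 2] + rng.normal(0, 200_000, size=n)

# A fixed random_state gives both models exactly the same 10 folds.
cv = KFold(n_splits=10, shuffle=True, random_state=0)

raw_scores = cross_val_score(LinearRegression(), X, y, cv=cv)

# The scaler is refit on the training part of every fold, avoiding leakage.
scaled_model = make_pipeline(StandardScaler(), LinearRegression())
scaled_scores = cross_val_score(scaled_model, X, y, cv=cv)

print("raw mean test R^2:   ", raw_scores.mean())
print("scaled mean test R^2:", scaled_scores.mean())
```

If the two runs use different fold assignments, the average testing scores can differ slightly even though the fitted model is equivalent.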

Q3: Then train your dataset with the asking price as your target using a ridge regression model. Now how did your model perform? What were the training and testing scores you produced? Did you standardize the data? Interpret and assess your output.

Unfortunately, the ridge regression model with 10-fold cross-validation does not meaningfully improve either score. At the optimal alpha value, the average training score is 0.017 and the average testing score is 0.011. I standardized the data as before. Although the average testing score became positive, it is still poor, which further suggests that the chosen features may not be good predictors of the asking price. We can therefore either modify the selected features or use the actual sale price as the target instead of the asking price, since there may be a gap between the two prices.
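The search for an optimal alpha could look something like the sketch below. The helper name ridge_alpha_search, the alpha grid, and the synthetic data are assumptions made for illustration; the original analysis may have selected alpha differently.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def ridge_alpha_search(X, y, alphas, n_splits=10, seed=0):
    """Score a standardized ridge model for each alpha with K-fold CV.

    Returns the alpha with the highest mean test R^2 and a dict mapping
    each alpha to (mean train R^2, mean test R^2).
    """
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    results = {}
    for alpha in alphas:
        model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
        scores = cross_validate(model, X, y, cv=cv, return_train_score=True)
        results[alpha] = (scores["train_score"].mean(),
                          scores["test_score"].mean())
    best_alpha = max(results, key=lambda a: results[a][1])
    return best_alpha, results


# Synthetic stand-in for the real data (assumption).
rng = np.random.default_rng(2)
n = 700
X = np.column_stack([
    rng.integers(1, 6, size=n),
    rng.integers(1, 4, size=n),
    rng.normal(2000, 500, size=n),
]).astype(float)
y = 300_000 + 20 * X[:, 2] + rng.normal(0, 200_000, size=n)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]  # hypothetical grid
best_alpha, results = ridge_alpha_search(X, y, alphas)
for alpha, (train_r2, test_r2) in results.items():
    print(f"alpha={alpha:>7}: train R^2={train_r2:.3f}, test R^2={test_r2:.3f}")
print("best alpha:", best_alpha)
```

Choosing alpha by the mean testing score, as done here, tends to make that score slightly optimistic; a separate held-out set would give a cleaner estimate.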

Q4: Next, go back, train and test each of the three previous model types/specifications, but this time use the dataset charleston_act.csv (actual sale prices). How did each of these three models perform after using the dataset that replaced asking price with the actual sale price? What were the training and testing scores you produced? Interpret and assess your output.

After importing the dataset that replaces asking prices with actual sale prices, I tested the three models on it with 10-fold cross-validation. However, the results did not improve. For the linear regression model without standardization, the average training score is 0.004 and the average testing score is -0.019. For the linear regression model with standardization, the average training score is 0.004 and the average testing score is -0.024. For the ridge regression model, the average training score is 0.002 and the average testing score is -0.004. The predictive power of all three models on the actual sale price is still very low, which indicates that the features we chose are not good predictors of the actual sale price either. Thus, we should add the zip code variables to our features and see whether that makes a difference.

Q5: Go back and also add the variables that indicate the zip code where each individual home is located within Charleston County, South Carolina. Train and test each of the three previous model types/specifications. What was the predictive power of each model? Interpret and assess your output.

After adding the zip code variables to the features and using the actual sale price as the target, I tested all three models again with 10-fold cross-validation. For the linear regression model without standardization, the results improved substantially, with an average training score of 0.339 and an average testing score of 0.255. For the linear regression model with standardization, the average training score also improved to 0.339. However, its average testing score is an extremely large negative number: -597511427936917942960128.000. I tried excluding outlying testing scores when calculating the mean, but this made little difference and the average testing score remained a very large negative number. For the ridge regression model, both results improved substantially, with an average training score of 0.331 and an average testing score of 0.278. Adding the zip code therefore improved the predictive power of both the linear regression model without standardization and the ridge regression model. This indicates that the geographical location of the house is a much stronger predictor of the actual sale price than beds, baths and area alone. The extremely large negative testing score for the standardized linear regression model may result from the binary nature of the zip code variables: a rarely occurring zip code can be constant or nearly collinear with other columns within a training fold, and an unregularized fit on such columns can produce enormous coefficients that give wildly wrong predictions on the testing fold.
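One possible way to avoid this kind of instability is to standardize only the numeric features, pass the binary zip code columns through unchanged, and use a regularized model. The sketch below illustrates that idea on synthetic data. The column layout (three numeric columns followed by eight zip code indicator columns), the alpha value, and the price formula are all assumptions, and this has not been verified as the cause or the fix for the score reported above.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption, not charleston_act.csv).
rng = np.random.default_rng(3)
n, n_zips = 700, 8
beds = rng.integers(1, 6, size=n)
baths = rng.integers(1, 4, size=n)
area = rng.normal(2000, 500, size=n)
zip_idx = rng.integers(0, n_zips, size=n)
zip_dummies = np.eye(n_zips)[zip_idx]  # one binary column per zip code

# Columns 0-2 are numeric; columns 3 onward are zip code indicators.
X = np.column_stack([beds, baths, area, zip_dummies]).astype(float)

# Location drives most of the price in this toy example.
zip_effect = np.linspace(-200_000, 200_000, n_zips)
y = 400_000 + 20 * area + zip_effect[zip_idx] + rng.normal(0, 100_000, size=n)

# Scale only the numeric columns; leave the binary columns as 0/1.
preprocess = ColumnTransformer(
    [("num", StandardScaler(), [0, 1, 2])],
    remainder="passthrough",
)
model = make_pipeline(preprocess, Ridge(alpha=1.0))  # alpha is a placeholder

cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(model, X, y, cv=cv, return_train_score=True)
train_r2 = scores["train_score"].mean()
test_r2 = scores["test_score"].mean()
print(f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

The ridge penalty keeps the coefficients bounded even when the indicator columns are collinear, which is why the full set of zip code columns can be kept here without dropping one.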

Q6: Finally, consider the model that produced the best results. Would you estimate this model as being overfit or underfit? If you were working for Zillow as their chief data scientist, what action would you recommend in order to improve the predictive power of the model that produced your best results from the approximately 700 observations (716 asking / 660 actual)?

The ridge regression model using the zip code features and the actual sale price produced the best results, because it has the highest average testing score of all the models (0.278). Since this is still lower than its average training score (0.331), the model is somewhat overfit. If I were working for Zillow, I would recommend identifying other relevant variables, such as distance to the subway or the type of building, and adding them to the features in order to further improve the predictive power of the model.