Homework Discussions for the Real World Classification Workshop

Discuss the homework here!

Here is the homework submission form: https://forms.gle/G16sGFQLjtt1jKAx8

If you’d like to get a certificate for this workshop, you must attempt the quiz questions in the Google Form and upload your homework solution notebook (as a .ipynb file).

[Screenshot: accuracy and AUC scores for the xgb, xgb_imp, and xgb_ohe models]
The xgb_imp model has slightly higher accuracy, while the xgb model has slightly higher AUC. So how can we decide which is the “better” model in this case? Also, is there a reason xgb has a higher AUC than xgb_imp, or is that just random, so that with different data xgb_imp might end up with the higher AUC for the same set of features?
Lastly, xgb_ohe does a considerably worse job here. Why is that? Is it overfitting? I would have thought one-hot encoding handles categorical data such as states and cities better than label encoding, and that combining the two should work even better. But I don’t think either method deals well with the ‘city’ column: one-hot encoding would create far too many features, and label encoding imposes an order that doesn’t really exist. Maybe frequency encoding could help here? Can you explain?

@tejas55

  1. XGBoost is a stochastic model, so performance can vary every time you train it.
  2. XGBoost handles missing values natively at split time, so an explicit imputation step (as in xgb_imp) may not add anything.
  3. One-hot encoding produces a very wide, sparse matrix, especially for high-cardinality columns like ‘city’.
  4. A frequency-encoding approach could work; you can try it and see (a sketch is given after this list).
  5. For metrics, go with AUC and AUC-PR rather than accuracy; accuracy is not the right metric here (see the metrics sketch below).
  6. Use cross-validation and multiple out-of-sample sets to get low-bias, low-variance models (see the CV sketch below).
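
On point 4, here is a minimal sketch of frequency encoding for a high-cardinality column. Only the ‘city’ column name comes from the discussion above; the data frame and values are made up for illustration, and in the actual notebook you would fit the frequencies on the training split only and map them onto the validation split.

```python
import pandas as pd

# Toy frame standing in for the workshop data (hypothetical values).
df = pd.DataFrame({"city": ["Austin", "Dallas", "Austin", "Houston", "Austin", "Dallas"]})

# Frequency encoding: replace each category by how often it occurs.
# Unlike one-hot encoding this adds a single numeric column, and unlike
# label encoding it does not impose an arbitrary ordering.
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)

print(df)
```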
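On point 5, a quick sketch of computing ROC AUC and AUC-PR with scikit-learn. The labels and probabilities below are synthetic placeholders; in the workshop notebook you would pass the held-out labels and something like xgb.predict_proba(X_valid)[:, 1] instead.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Synthetic stand-in for held-out labels and predicted positive-class probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
proba  = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.65, 0.5, 0.9])

print("ROC AUC:", roc_auc_score(y_true, proba))
print("AUC-PR (average precision):", average_precision_score(y_true, proba))
```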
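And on point 6, a sketch of stratified cross-validation with an XGBoost classifier, scored on AUC rather than accuracy. The data here is generated with make_classification purely as a stand-in for the workshop features and target.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

# Synthetic, imbalanced stand-in for the workshop data.
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.8, 0.2], random_state=0)

# XGBoost handles missing values natively, so NaNs in X could be passed as-is.
model = XGBClassifier(n_estimators=200, eval_metric="logloss", random_state=0)

# Stratified k-fold keeps the class ratio in every out-of-sample fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print("AUC per fold:", np.round(scores, 3), " mean:", scores.mean().round(3))
```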