update poetry; fix issues #776 #778
Yury Kashnitsky committed Jan 6, 2025
1 parent a6e3a6f commit 6670e35
Showing 4 changed files with 1,529 additions and 1,493 deletions.
4 changes: 2 additions & 2 deletions mlcourse_ai_jupyter_book/book/topic04/topic04_intro.md
@@ -27,8 +27,8 @@ The following 5 articles may form a small brochure, and that's for a good reason
- the [theory](https://youtu.be/ne-MfRfYs_c) behind linear models, an intuitive explanation;
- [business case](https://youtu.be/B8yIaIEMyIc), where we discuss a real regression task – predicting customer Life-Time Value;

-4\. Complete [demo assignment 4](assignment04) (same as a [Kaggle Notebook](https://www.kaggle.com/kashnitsky/a4-demo-sarcasm-detection-with-logit)) on sarcasm detection;
+4\. Complete [demo assignment 4](assignment04) (same as a [Kaggle Notebook](https://www.kaggle.com/code/kashnitsky/a6-demo-linear-models-and-rf-for-regression)) where you explore OLS, Lasso and Random Forest in a regression task;

-5\. Check out the [solution](assignment04_solution) (same as a [Kaggle Notebook](https://www.kaggle.com/kashnitsky/a4-demo-sarcasm-detection-with-logit-solution)) to the demo assignment (optional);
+5\. Check out the [solution](assignment04_solution) (same as a [Kaggle Notebook](https://www.kaggle.com/code/kashnitsky/a6-demo-regression-solution)) to the demo assignment (optional);

6\. Complete [Bonus Assignment 4](https://www.patreon.com/ods_mlcourse) where you'll be guided through working with sparse data, feature engineering, model validation, and the process of competing on Kaggle. The task will be to beat baselines in that ["Alice" Kaggle competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). That's a very useful assignment for anyone starting to practice with Machine Learning, regardless of the desire to compete on Kaggle (optional, available under Patreon ["Bonus Assignments" tier](https://www.patreon.com/ods_mlcourse)).
@@ -78,21 +78,21 @@ y = data["Churn"].astype("int").values
X = data.drop("Churn", axis=1).values
```

-**We will train logistic regression with stochastic gradient descent. Later in the course, we will have a separate article on this topic.**
+**We will train an SVM with stochastic gradient descent. Later in the course, we will have a separate article on this topic.**


```{code-cell} ipython3
-alphas = np.logspace(-2, 0, 20)
-sgd_logit = SGDClassifier(loss="log", n_jobs=-1, random_state=17, max_iter=5)
+alphas = np.logspace(-4, 0, 20)
+sgd_model = SGDClassifier(loss="hinge", n_jobs=-1, random_state=17)
 logit_pipe = Pipeline(
     [
         ("scaler", StandardScaler()),
         ("poly", PolynomialFeatures(degree=2)),
-        ("sgd_logit", sgd_logit),
+        ("sgd_model", sgd_model),
     ]
 )
 val_train, val_test = validation_curve(
-    estimator=logit_pipe, X=X, y=y, param_name="sgd_logit__alpha", param_range=alphas, cv=5, scoring="roc_auc"
+    estimator=logit_pipe, X=X, y=y, param_name="sgd_model__alpha", param_range=alphas, cv=5, scoring="roc_auc"
)
```
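For readers who want to try the updated hunk outside the book, here is a self-contained sketch. It keeps the pipeline and `validation_curve` call from the new code, but substitutes `make_classification` data for the churn dataframe used in the article — an assumption made purely so the snippet runs on its own.

```python
# Self-contained sketch of the validation-curve setup from the updated hunk.
# Synthetic data stands in for the churn dataset (assumption for runnability).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import validation_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=17)

alphas = np.logspace(-4, 0, 20)
pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("poly", PolynomialFeatures(degree=2)),
        ("sgd_model", SGDClassifier(loss="hinge", random_state=17)),
    ]
)

# validation_curve refits the pipeline for every alpha and CV fold,
# returning score arrays of shape (n_alphas, n_folds)
val_train, val_test = validation_curve(
    pipe, X, y, param_name="sgd_model__alpha",
    param_range=alphas, cv=5, scoring="roc_auc",
)
print(val_train.shape, val_test.shape)  # (20, 5) (20, 5)
```

With a hinge loss there is no `predict_proba`, so the `roc_auc` scorer falls back to `decision_function`, which is why the metric still works here.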

@@ -123,12 +123,10 @@ plt.grid(True);

The trend is quite visible and is very common.

-- For simple models, training and validation errors are close and large. This suggests that the model **underfitted**, meaning it does not have a sufficient number of parameters.
+- For simple models, training and validation errors are close and large (conversely, metrics like ROC AUC or accuracy are low). This suggests that the model **underfitted**, meaning it does not have a sufficient number of parameters.

- For highly sophisticated models, training and validation errors differ significantly. This can be explained by **overfitting**. When there are too many parameters or regularization is not strict enough, the algorithm can be "distracted" by the noise in the data and lose track of the overall trend.
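As a hypothetical illustration of the two bullet points above (none of this appears in the commit), a quick synthetic-regression experiment makes the pattern concrete: a low-degree polynomial underfits, while a high-degree one drives training error down and typically widens the gap to validation error.

```python
# Hypothetical illustration (not from the commit): under- vs overfitting
# on synthetic 1-D regression data, varying polynomial degree.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(17)
X = np.sort(rng.uniform(-3, 3, size=200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=17)

results = {}
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    results[degree] = (
        mean_squared_error(y_tr, model.predict(X_tr)),  # training error
        mean_squared_error(y_te, model.predict(X_te)),  # validation error
    )

# degree 1: both errors large -> underfitting;
# high degree: training error keeps shrinking while the gap to the
# validation error typically widens -> overfitting
for d, (tr, te) in results.items():
    print(f"degree={d:2d}  train MSE={tr:.3f}  test MSE={te:.3f}")
```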



### How much data is needed?

The more data the model uses, the better. But how do we understand whether new data will be helpful in any given situation? For example, is it rational to spend $N$ on assessors to double the dataset?
@@ -146,7 +144,7 @@ def plot_learning_curve(degree=2, alpha=0.01):
             ("scaler", StandardScaler()),
             ("poly", PolynomialFeatures(degree=degree)),
             (
-                "sgd_logit",
+                "sgd_model",
                 SGDClassifier(n_jobs=-1, random_state=17, alpha=alpha, max_iter=5),
             ),
         ]
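The hunk above shows only the renamed pipeline step inside the article's `plot_learning_curve`; the rest of the function is truncated in this view. One way to produce such a curve — presumably what the article does with the pipeline shown, though that is an assumption here — is scikit-learn's `learning_curve`, which refits a fixed-complexity model on growing subsets of the training data. A minimal self-contained sketch, again with synthetic stand-in data:

```python
# Sketch of the learning-curve mechanism (assumed, since the function body
# is truncated in the diff): fixed model complexity, growing training size.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import learning_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=17)

pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("poly", PolynomialFeatures(degree=2)),
        ("sgd_model", SGDClassifier(random_state=17, alpha=1e-2)),
    ]
)

# learning_curve returns the absolute training-set sizes actually used
# (up to 800 samples here, given 5-fold CV on 1000 points) plus
# per-size, per-fold train and validation scores
sizes, tr_scores, te_scores = learning_curve(
    pipe, X, y, train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5, scoring="roc_auc",
)
print(sizes)
print(tr_scores.shape, te_scores.shape)  # (5, 5) (5, 5)
```

If the validation score keeps climbing as the training size grows, more data should help; if both curves have plateaued close together, extra labeling budget is unlikely to pay off — which is the question the section poses.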
