update poetry; fix issues #776 #778 #786

Merged 1 commit on Jan 6, 2025
4 changes: 2 additions & 2 deletions mlcourse_ai_jupyter_book/book/topic04/topic04_intro.md
@@ -27,8 +27,8 @@ The following 5 articles may form a small brochure, and that's for a good reason
- the [theory](https://youtu.be/ne-MfRfYs_c) behind linear models, an intuitive explanation;
- [business case](https://youtu.be/B8yIaIEMyIc), where we discuss a real regression task – predicting customer Life-Time Value;

- 4\. Complete [demo assignment 4](assignment04) (same as a [Kaggle Notebook](https://www.kaggle.com/kashnitsky/a4-demo-sarcasm-detection-with-logit)) on sarcasm detection;
+ 4\. Complete [demo assignment 4](assignment04) (same as a [Kaggle Notebook](https://www.kaggle.com/code/kashnitsky/a6-demo-linear-models-and-rf-for-regression)) where you explore OLS, Lasso and Random Forest in a regression task;

- 5\. Check out the [solution](assignment04_solution) (same as a [Kaggle Notebook](https://www.kaggle.com/kashnitsky/a4-demo-sarcasm-detection-with-logit-solution)) to the demo assignment (optional);
+ 5\. Check out the [solution](assignment04_solution) (same as a [Kaggle Notebook](https://www.kaggle.com/code/kashnitsky/a6-demo-regression-solution)) to the demo assignment (optional);

6\. Complete [Bonus Assignment 4](https://www.patreon.com/ods_mlcourse) where you'll be guided through working with sparse data, feature engineering, model validation, and the process of competing on Kaggle. The task will be to beat baselines in that ["Alice" Kaggle competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). That's a very useful assignment for anyone starting to practice with Machine Learning, regardless of the desire to compete on Kaggle (optional, available under Patreon ["Bonus Assignments" tier](https://www.patreon.com/ods_mlcourse)).
@@ -78,21 +78,21 @@ y = data["Churn"].astype("int").values
X = data.drop("Churn", axis=1).values
```

- **We will train logistic regression with stochastic gradient descent. Later in the course, we will have a separate article on this topic.**
+ **We will train an SVM with stochastic gradient descent. Later in the course, we will have a separate article on this topic.**


```{code-cell} ipython3
- alphas = np.logspace(-2, 0, 20)
- sgd_logit = SGDClassifier(loss="log", n_jobs=-1, random_state=17, max_iter=5)
+ alphas = np.logspace(-4, 0, 20)
+ sgd_model = SGDClassifier(loss="hinge", n_jobs=-1, random_state=17)
logit_pipe = Pipeline(
[
("scaler", StandardScaler()),
("poly", PolynomialFeatures(degree=2)),
- ("sgd_logit", sgd_logit),
+ ("sgd_model", sgd_model),
]
)
val_train, val_test = validation_curve(
- estimator=logit_pipe, X=X, y=y, param_name="sgd_logit__alpha", param_range=alphas, cv=5, scoring="roc_auc"
+ estimator=logit_pipe, X=X, y=y, param_name="sgd_model__alpha", param_range=alphas, cv=5, scoring="roc_auc"
)
```
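The changed cell above can be reproduced end-to-end as a self-contained sketch. This uses `make_classification` as a stand-in for the churn dataset (an assumption for illustration; the PR's actual data is `data["Churn"]`), with the post-change settings: hinge loss, `alpha` swept over `np.logspace(-4, 0, 20)`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import validation_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the churn data used in the book
X, y = make_classification(n_samples=500, n_features=10, random_state=17)

alphas = np.logspace(-4, 0, 20)
sgd_model = SGDClassifier(loss="hinge", n_jobs=-1, random_state=17)
logit_pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("poly", PolynomialFeatures(degree=2)),
        ("sgd_model", sgd_model),
    ]
)
# One row per alpha value, one column per CV fold
val_train, val_test = validation_curve(
    estimator=logit_pipe, X=X, y=y,
    param_name="sgd_model__alpha", param_range=alphas,
    cv=5, scoring="roc_auc",
)
print(val_train.shape, val_test.shape)  # (20, 5) (20, 5)
```

Note that `scoring="roc_auc"` still works with `loss="hinge"` because scikit-learn falls back to `decision_function` when the classifier has no `predict_proba`.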

@@ -123,12 +123,10 @@ plt.grid(True);

The trend is quite visible and is very common.

- \- For simple models, training and validation errors are close and large. This suggests that the model **underfitted**, meaning it does not have a sufficient number of parameters.
+ \- For simple models, training and validation errors are close and large (conversely, metrics like ROC AUC or accuracy are low). This suggests that the model **underfitted**, meaning it does not have a sufficient number of parameters.

- For highly sophisticated models, training and validation errors differ significantly. This can be explained by **overfitting**. When there are too many parameters or regularization is not strict enough, the algorithm can be "distracted" by the noise in the data and lose track of the overall trend.
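This diagnosis can be read straight off the validation-curve arrays. The helper below is a hypothetical sketch (not from the book), with illustrative thresholds `gap_tol` and `low_score` chosen arbitrarily:

```python
import numpy as np

def diagnose(val_train, val_test, gap_tol=0.05, low_score=0.7):
    """Rough heuristic (illustrative only): label each parameter value by
    comparing mean cross-validation scores. Higher score = better here."""
    train_mean = val_train.mean(axis=1)
    test_mean = val_test.mean(axis=1)
    labels = []
    for tr, te in zip(train_mean, test_mean):
        if tr - te > gap_tol:
            labels.append("overfit")   # large train/validation gap
        elif te < low_score:
            labels.append("underfit")  # both scores low and close
        else:
            labels.append("ok")
    return labels

# Toy scores: rows = parameter values, columns = CV folds
val_train = np.array([[0.99, 0.98], [0.85, 0.84], [0.65, 0.66]])
val_test = np.array([[0.80, 0.79], [0.83, 0.82], [0.64, 0.65]])
result = diagnose(val_train, val_test)
print(result)  # ['overfit', 'ok', 'underfit']
```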



### How much data is needed?

The more data the model uses, the better. But how do we know whether new data will be helpful in a given situation? For example, is it worth spending $N$ on assessors to double the dataset?
@@ -146,7 +144,7 @@ def plot_learning_curve(degree=2, alpha=0.01):
("scaler", StandardScaler()),
("poly", PolynomialFeatures(degree=degree)),
(
- "sgd_logit",
+ "sgd_model",
SGDClassifier(n_jobs=-1, random_state=17, alpha=alpha, max_iter=5),
),
]
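The `plot_learning_curve` function is truncated in this diff; a self-contained sketch of what it computes (scores for the same SGD pipeline on growing training-set sizes, again on synthetic stand-in data) might look like this:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import learning_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the churn data
X, y = make_classification(n_samples=600, n_features=10, random_state=17)

pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("poly", PolynomialFeatures(degree=2)),
        ("sgd_model", SGDClassifier(n_jobs=-1, random_state=17,
                                    alpha=0.01, max_iter=5)),
    ]
)
# Evaluate at 5 training-set sizes, from 10% to 100% of the CV train split
train_sizes, train_scores, val_scores = learning_curve(
    pipe, X, y, train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5, scoring="roc_auc",
)
print(train_sizes)               # absolute training-set sizes actually used
print(val_scores.mean(axis=1))   # mean validation AUC per size
```

If the validation score keeps climbing as `train_sizes` grows, more data is likely to help; if it has plateaued, spending on more labels is harder to justify.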