update poetry; fix issues #776 #778
Yury Kashnitsky committed Jan 6, 2025
1 parent a6e3a6f commit 6670e35
Showing 4 changed files with 1,529 additions and 1,493 deletions.
4 changes: 2 additions & 2 deletions mlcourse_ai_jupyter_book/book/topic04/topic04_intro.md
@@ -27,8 +27,8 @@ The following 5 articles may form a small brochure, and that's for a good reason
- the [theory](https://youtu.be/ne-MfRfYs_c) behind linear models, an intuitive explanation;
- [business case](https://youtu.be/B8yIaIEMyIc), where we discuss a real regression task – predicting customer Life-Time Value;

-4\. Complete [demo assignment 4](assignment04) (same as a [Kaggle Notebook](https://www.kaggle.com/kashnitsky/a4-demo-sarcasm-detection-with-logit)) on sarcasm detection;
+4\. Complete [demo assignment 4](assignment04) (same as a [Kaggle Notebook](https://www.kaggle.com/code/kashnitsky/a6-demo-linear-models-and-rf-for-regression)) where you explore OLS, Lasso and Random Forest in a regression task;

-5\. Check out the [solution](assignment04_solution) (same as a [Kaggle Notebook](https://www.kaggle.com/kashnitsky/a4-demo-sarcasm-detection-with-logit-solution)) to the demo assignment (optional);
+5\. Check out the [solution](assignment04_solution) (same as a [Kaggle Notebook](https://www.kaggle.com/code/kashnitsky/a6-demo-regression-solution)) to the demo assignment (optional);

6\. Complete [Bonus Assignment 4](https://www.patreon.com/ods_mlcourse) where you'll be guided through working with sparse data, feature engineering, model validation, and the process of competing on Kaggle. The task will be to beat baselines in that ["Alice" Kaggle competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). That's a very useful assignment for anyone starting to practice with Machine Learning, regardless of the desire to compete on Kaggle (optional, available under Patreon ["Bonus Assignments" tier](https://www.patreon.com/ods_mlcourse)).
@@ -78,21 +78,21 @@ y = data["Churn"].astype("int").values
X = data.drop("Churn", axis=1).values
```

-**We will train logistic regression with stochastic gradient descent. Later in the course, we will have a separate article on this topic.**
+**We will train an SVM with stochastic gradient descent. Later in the course, we will have a separate article on this topic.**


```{code-cell} ipython3
-alphas = np.logspace(-2, 0, 20)
-sgd_logit = SGDClassifier(loss="log", n_jobs=-1, random_state=17, max_iter=5)
+alphas = np.logspace(-4, 0, 20)
+sgd_model = SGDClassifier(loss="hinge", n_jobs=-1, random_state=17)
 logit_pipe = Pipeline(
     [
         ("scaler", StandardScaler()),
         ("poly", PolynomialFeatures(degree=2)),
-        ("sgd_logit", sgd_logit),
+        ("sgd_model", sgd_model),
     ]
 )
 val_train, val_test = validation_curve(
-    estimator=logit_pipe, X=X, y=y, param_name="sgd_logit__alpha", param_range=alphas, cv=5, scoring="roc_auc"
+    estimator=logit_pipe, X=X, y=y, param_name="sgd_model__alpha", param_range=alphas, cv=5, scoring="roc_auc"
)
```
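For readers who want to try the updated hunk outside the book, here is a self-contained sketch. It keeps the pipeline and `validation_curve` call from the new code, but substitutes `make_classification` data for the churn dataframe used in the article — an assumption made purely so the snippet runs on its own.

```python
# Self-contained sketch of the validation-curve setup from the updated hunk.
# Synthetic data stands in for the churn dataset (assumption for runnability).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import validation_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=17)

alphas = np.logspace(-4, 0, 20)
pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("poly", PolynomialFeatures(degree=2)),
        ("sgd_model", SGDClassifier(loss="hinge", random_state=17)),
    ]
)

# validation_curve refits the pipeline for every alpha and CV fold,
# returning score arrays of shape (n_alphas, n_folds)
val_train, val_test = validation_curve(
    pipe, X, y, param_name="sgd_model__alpha",
    param_range=alphas, cv=5, scoring="roc_auc",
)
print(val_train.shape, val_test.shape)  # (20, 5) (20, 5)
```

With a hinge loss there is no `predict_proba`, so the `roc_auc` scorer falls back to `decision_function`, which is why the metric still works here.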

@@ -123,12 +123,10 @@ plt.grid(True);

The trend is quite visible and is very common.

-- For simple models, training and validation errors are close and large. This suggests that the model **underfitted**, meaning it does not have a sufficient number of parameters.
+- For simple models, training and validation errors are close and large (conversely, metrics like ROC AUC or accuracy are low). This suggests that the model **underfitted**, meaning it does not have a sufficient number of parameters.

- For highly sophisticated models, training and validation errors differ significantly. This can be explained by **overfitting**. When there are too many parameters or regularization is not strict enough, the algorithm can be "distracted" by the noise in the data and lose track of the overall trend.
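As a hypothetical illustration of the two bullet points above (none of this appears in the commit), a quick synthetic-regression experiment makes the pattern concrete: a low-degree polynomial underfits, while a high-degree one drives training error down and typically widens the gap to validation error.

```python
# Hypothetical illustration (not from the commit): under- vs overfitting
# on synthetic 1-D regression data, varying polynomial degree.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(17)
X = np.sort(rng.uniform(-3, 3, size=200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=17)

results = {}
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    results[degree] = (
        mean_squared_error(y_tr, model.predict(X_tr)),  # training error
        mean_squared_error(y_te, model.predict(X_te)),  # validation error
    )

# degree 1: both errors large -> underfitting;
# high degree: training error keeps shrinking while the gap to the
# validation error typically widens -> overfitting
for d, (tr, te) in results.items():
    print(f"degree={d:2d}  train MSE={tr:.3f}  test MSE={te:.3f}")
```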



### How much data is needed?

The more data the model uses, the better. But how do we understand whether new data will be helpful in any given situation? For example, is it rational to spend $N$ on assessors to double the dataset?
@@ -146,7 +144,7 @@ def plot_learning_curve(degree=2, alpha=0.01):
             ("scaler", StandardScaler()),
             ("poly", PolynomialFeatures(degree=degree)),
             (
-                "sgd_logit",
+                "sgd_model",
                 SGDClassifier(n_jobs=-1, random_state=17, alpha=alpha, max_iter=5),
             ),
         ]
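The hunk above shows only the renamed pipeline step inside the article's `plot_learning_curve`; the rest of the function is truncated in this view. One way to produce such a curve — presumably what the article does with the pipeline shown, though that is an assumption here — is scikit-learn's `learning_curve`, which refits a fixed-complexity model on growing subsets of the training data. A minimal self-contained sketch, again with synthetic stand-in data:

```python
# Sketch of the learning-curve mechanism (assumed, since the function body
# is truncated in the diff): fixed model complexity, growing training size.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import learning_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=17)

pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("poly", PolynomialFeatures(degree=2)),
        ("sgd_model", SGDClassifier(random_state=17, alpha=1e-2)),
    ]
)

# learning_curve returns the absolute training-set sizes actually used
# (up to 800 samples here, given 5-fold CV on 1000 points) plus
# per-size, per-fold train and validation scores
sizes, tr_scores, te_scores = learning_curve(
    pipe, X, y, train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5, scoring="roc_auc",
)
print(sizes)
print(tr_scores.shape, te_scores.shape)  # (5, 5) (5, 5)
```

If the validation score keeps climbing as the training size grows, more data should help; if both curves have plateaued close together, extra labeling budget is unlikely to pay off — which is the question the section poses.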
