Release v0.13.0 (#688)

mlr-org · Nov 16, 2021 · 96008d0
1 parent 680636c

Showing 74 changed files with 307 additions and 213 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: mlr3
 Title: Machine Learning in R - Next Generation
-Version: 0.12.0-9000
+Version: 0.13.0
 Authors@R:
     c(person(given = "Michel",
              family = "Lang",

diff --git a/NEWS.md b/NEWS.md
@@ -1,9 +1,23 @@
 # mlr3 0.13.0
 
+* Learners which are capable of resuming/continuing (e.g.,
+  learner `(classif|regr|surv).xgboost` with hyperparameter `nrounds` updated)
+  can now optionally store a stack of trained learners to be used to hotstart
+  their training. Note that this feature is still somewhat experimental.
+  See `HotstartStack` and #719.
+* New measures to score similarity of selected feature sets:
+  `sim.jaccard` (Jaccard Index) and `sim.phi` (Phi coefficient) (#690).
+* `predict_newdata()` now also supports `DataBackend` as input.
 * New function `install_pkgs()` to install required packages. This generic works
   for all objects with a `packages` field as well as `ResampleResult` and
-  `BenchmarkResult`.
-
+  `BenchmarkResult` (#728).
+* New learner `regr.debug` for debugging.
+* New `Task` method `$set_levels()` to control how data with factor columns
+  is returned, independent of the used `DataBackend`.
+* Measures now return `NA` if their prerequisites are not met (#699).
+  This lets you conveniently score experiments with multiple measures
+  that have different requirements.
+* Feature names may no longer contain the special character `%`.
 
 # mlr3 0.12.0
 

diff --git a/R/DataBackend.R b/R/DataBackend.R
@@ -52,7 +52,7 @@ DataBackend = R6Class("DataBackend", cloneable = FALSE,
     #' [DataBackendDataTable] or [DataBackendMatrix], or via the S3 method
     #' [as_data_backend()].
     #'
-    #' @param data (`any`)\cr
+    #' @param data (any)\cr
     #'   The format of the input data depends on the specialization. E.g.,
     #'   [DataBackendDataTable] expects a [data.table::data.table()] and
     #'   [DataBackendMatrix] expects a [Matrix::Matrix()] from \CRANpkg{Matrix}.

diff --git a/R/HotstartStack.R b/R/HotstartStack.R
@@ -1,21 +1,30 @@
 #' @title Stack for Hot Start Learners
 #'
 #' @description
-#' This class stores learners for hot starting. When fitting a learner
-#' repeatedly on the same task but with a different fidelity, hot starting
-#' accelerates model fitting by reusing previously fitted models. For example,
-#' add more trees to a fitted random forest model.
+#' This class stores learners for hot starting training, i.e. resuming or
+#' continuing from an already fitted model.
+#' We assume that hot starting is only possible if a single hyperparameter
+#' (also called the fidelity parameter, usually controlling the complexity or
+#' expensiveness) is altered and all other hyperparameters are identical.
 #'
 #' The `HotstartStack` stores trained learners which can be potentially used to
 #' hot start a learner. Learner automatically hot start while training if a
 #' stack is attached to the `$hotstart_stack` field and the stack contains a
-#' suitable learner (see examples).
+#' suitable learner.
+#'
+#' For example, if you want to train a random forest learner with 1000 trees but
+#' already have a random forest learner with 500 trees (hot start learner),
+#' you can add the hot start learner to the `HotstartStack` of the expensive learner
+#' with 1000 trees. If you now call the `train()` method (or [resample()] or
+#' [benchmark()]), a random forest with 500 trees will be fitted and combined
+#' with the 500 trees of the hotstart learner, effectively saving you from
+#' fitting 500 trees.
 #'
 #' Hot starting is only supported by learners which have the property
-#' `"hotstart_forward"` or `"hotstart_backward"`. For example, an xgboost model
-#' can hot start forward by adding more boosting iterations and a random forest
-#' can go backwards by removing trees. The fidelity parameters are tagged with
-#' `"hotstart"` in learner's parameter set.
+#' `"hotstart_forward"` or `"hotstart_backward"`. For example, an `xgboost` model
+#' (in \CRANpkg{mlr3learners}) can hot start forward by adding more boosting
+#' iterations, and a random forest can go backwards by removing trees.
+#' The fidelity parameters are tagged with `"hotstart"` in the learner's parameter set.
 #'
 #' @export
 #' @examples
@@ -118,7 +127,7 @@ HotstartStack = R6Class("HotstartStack",
     #' Printer.
     #'
     #' @param ... (ignored).
-    print = function() {
+    print = function(...) {
       catf(format(self))
       print(self$stack, digits = 2)
     }
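
Note on the hot start description above: the random forest example (500 fitted trees reused towards a target of 1000) comes down to picking the closest usable fidelity and paying only for the difference. The base R sketch below illustrates that arithmetic for forward hot starting only. `pick_hotstart()` and its arguments are hypothetical names made up for this note. They are not part of the `HotstartStack` API, and the real class may select learners differently.

```r
# Hypothetical helper, not part of mlr3: pick the cheapest forward hot start
# for a target fidelity (e.g. number of trees) among already fitted fidelities.
pick_hotstart = function(target, fitted) {
  # forward hot starting can only add to a model, so keep fidelities <= target
  usable = fitted[fitted <= target]
  if (length(usable) == 0L) {
    # nothing to reuse: the full target has to be trained from scratch
    return(list(start = NA_integer_, cost = target))
  }
  start = max(usable)
  # cost = number of fidelity units (e.g. trees) that still have to be fitted
  list(start = start, cost = target - start)
}

# stack holds models with 100, 500 and 2000 trees; we want 1000 trees
res = pick_hotstart(target = 1000L, fitted = c(100L, 500L, 2000L))
print(res$start)
print(res$cost)
```

Here the model with 500 trees is chosen and 500 trees remain to be fitted. The model with 2000 trees is ignored because this sketch does not cover backward hot starting.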

diff --git a/R/Learner.R b/R/Learner.R
@@ -182,7 +182,7 @@ Learner = R6Class("Learner",
     #' @description
     #' Printer.
     #' @param ... (ignored).
-    print = function() {
+    print = function(...) {
       catn(format(self))
       catn(str_indent("* Model:", if (is.null(self$model)) "-" else class(self$model)[1L]))
       catn(str_indent("* Parameters:", as_short_string(self$param_set$values, 1000L)))
@@ -371,7 +371,7 @@ Learner = R6Class("Learner",
   ),
 
   active = list(
-    #' @field model (`any`)\cr
+    #' @field model (any)\cr
     #' The fitted model. Only available after `$train()` has been called.
     model = function(rhs) {
       assert_ro_binding(rhs)

diff --git a/R/Measure.R b/R/Measure.R
@@ -104,13 +104,16 @@ Measure = R6Class("Measure",
 
       if (!is_scalar_na(task_type)) {
         assert_choice(task_type, mlr_reflections$task_types$type)
+        assert_subset(properties, mlr_reflections$measure_properties[[task_type]])
         assert_choice(predict_type, names(mlr_reflections$learner_predict_types[[task_type]]))
         assert_subset(properties, mlr_reflections$measure_properties[[task_type]])
+        assert_subset(task_properties, mlr_reflections$task_properties[[task_type]])
       }
-      self$properties = properties
+
+      self$properties = unique(properties)
       self$predict_type = predict_type
       self$predict_sets = assert_subset(predict_sets, mlr_reflections$predict_sets, empty.ok = FALSE)
-      self$task_properties = assert_subset(task_properties, mlr_reflections$task_properties[[task_type]])
+      self$task_properties = task_properties
       self$packages = union("mlr3", assert_character(packages, any.missing = FALSE, min.chars = 1L))
       self$man = assert_string(man, na.ok = TRUE)
 
@@ -126,7 +129,7 @@ Measure = R6Class("Measure",
     #' @description
     #' Printer.
     #' @param ... (ignored).
-    print = function() {
+    print = function(...) {
       catn(format(self))
       catn(str_indent("* Packages:", self$packages))
       catn(str_indent("* Range:", sprintf("[%g, %g]", self$range[1L], self$range[2L])))

diff --git a/R/MeasureSimilarity.R b/R/MeasureSimilarity.R
@@ -5,12 +5,14 @@
 #' @description
 #' This measure specializes [Measure] for measures quantifying the similarity of
 #' sets of selected features.
+#' To calculate similarity measures, the [Learner] must have the property
+#' `"selected_features"`.
 #'
 #' * `task_type` is set to `NA_character_`.
 #' * `average` is set to `"custom"`.
 #'
-#' Predefined measures can be found in the [dictionary][mlr3misc::Dictionary] [mlr_measures].
-#' The default measure for regression is [`regr.mse`][mlr_measures_regr.mse].
+#' Predefined measures can be found in the [dictionary][mlr3misc::Dictionary]
+#' [mlr_measures], prefixed with `"sim."`.
 #'
 #' @template param_id
 #' @template param_param_set
@@ -27,14 +29,24 @@
 #'
 #' @template seealso_measure
 #' @export
+#' @examples
+#' task = tsk("penguins")
+#' learners = list(
+#'   lrn("classif.rpart", maxdepth = 1, id = "r1"),
+#'   lrn("classif.rpart", maxdepth = 2, id = "r2")
+#' )
+#' resampling = rsmp("cv", folds = 3)
+#' grid = benchmark_grid(task, learners, resampling)
+#' bmr = benchmark(grid, store_models = TRUE)
+#' bmr$aggregate(msrs(c("classif.ce", "sim.jaccard")))
 MeasureSimilarity = R6Class("MeasureSimilarity", inherit = Measure, cloneable = FALSE,
   public = list(
     #' @description
     #' Creates a new instance of this [R6][R6::R6Class] class.
     initialize = function(id, param_set = ps(), range, minimize = NA, average = "macro", aggregator = NULL, properties = character(), predict_type = "response",
       predict_sets = "test", task_properties = character(), packages = character(), man = NA_character_) {
       super$initialize(id, task_type = NA_character_, param_set = param_set, range = range, minimize = minimize, average = "custom", aggregator = aggregator,
-        properties = properties, predict_type = predict_type, predict_sets = predict_sets,
+        properties = c("requires_model", properties), predict_type = predict_type, predict_sets = predict_sets,
         task_properties = task_properties, packages = packages, man = man)
     }
   ),
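
Note on the similarity measures: for readers unfamiliar with the Jaccard Index behind `sim.jaccard`, the base R sketch below computes it for two sets of selected feature names as the size of the intersection divided by the size of the union. This is the textbook two-set definition, not the code used by the measure. The function name `jaccard()`, the feature names, and the `NA` result for two empty sets are assumptions made for illustration.

```r
# Illustrative only: textbook Jaccard index of two feature sets.
jaccard = function(a, b) {
  a = unique(a)
  b = unique(b)
  u = union(a, b)
  # assumption: the index is treated as undefined when both sets are empty
  if (length(u) == 0L) {
    return(NA_real_)
  }
  length(intersect(a, b)) / length(u)
}

# made-up feature sets, e.g. as selected in two resampling iterations
s1 = c("bill_length", "flipper_length")
s2 = c("flipper_length", "body_mass")
print(jaccard(s1, s2))
```

The two sets share one feature out of three distinct ones, so the index is 1/3. Identical sets score 1 and disjoint sets score 0.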

diff --git a/R/Prediction.R b/R/Prediction.R
@@ -116,7 +116,7 @@ Prediction = R6Class("Prediction",
       self$data$row_ids
     },
 
-    #' @field truth (`any`)\cr
+    #' @field truth (any)\cr
     #'   True (observed) outcome.
     truth = function(rhs) {
       assert_ro_binding(rhs)

diff --git a/R/ResampleResult.R b/R/ResampleResult.R
@@ -60,7 +60,7 @@ ResampleResult = R6Class("ResampleResult",
     #' @description
     #' Printer.
     #' @param ... (ignored).
-    print = function() {
+    print = function(...) {
       catf("%s of %i iterations", format(self), self$iters)
       catn(str_indent("* Task:", self$task$id))
       catn(str_indent("* Learner:", self$learner$id))
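
Note on the `print = function(...)` changes in this commit: the roxygen blocks already document `...` as ignored, and the new signatures now match that. The practical effect, presumably the motivation, is that extra arguments forwarded by a caller no longer raise an error. The sketch below uses two plain functions as stand-ins for the old and new R6 methods. `print_old()` and `print_new()` are hypothetical names, and no R6 class is involved.

```r
# Plain functions standing in for the R6 print methods (illustration only).
print_old = function() cat("<ResampleResult>\n")
print_new = function(...) cat("<ResampleResult>\n")

# An extra argument is rejected by the old signature ("unused argument") ...
old = tryCatch({
  print_old(digits = 2)
  "ok"
}, error = function(e) "error")

# ... and silently ignored by the new one.
new = tryCatch({
  print_new(digits = 2)
  "ok"
}, error = function(e) "error")

print(c(old = old, new = new))
```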

diff --git a/R/Resampling.R b/R/Resampling.R
@@ -87,7 +87,7 @@ Resampling = R6Class("Resampling",
     #' @template field_param_set
     param_set = NULL,
 
-    #' @field instance (`any`)\cr
+    #' @field instance (any)\cr
     #'   During `instantiate()`, the instance is stored in this slot in an arbitrary format.
     #'   Note that if a grouping variable is present in the [Task], a [Resampling] may operate on the
     #'   group ids internally instead of the row ids (which may lead to confusion).

diff --git a/R/Task.R b/R/Task.R
@@ -466,13 +466,15 @@ Task = R6Class("Task",
         assert_set_equal(self$row_ids, data$rownames)
       }
 
+      # update col_info for existing columns
       ci = col_info(data)
-      ci$label = NA_character_
-      ci$fix_factor_levels = FALSE
-
-      # update col info
       self$col_info = ujoin(self$col_info, ci, key = "id")
-      self$col_info = rbindlist(list(self$col_info, ci[!list(self$col_info), on = "id"]), use.names = TRUE, fill = TRUE)
+
+      # add rows to col_info for new columns
+      self$col_info = rbindlist(list(
+        self$col_info,
+        insert_named(ci[!list(self$col_info), on = "id"], list(label = NA_character_, fix_factor_levels = FALSE))
+      ), use.names = TRUE)
       setkeyv(self$col_info, "id")
 
       # add new features

diff --git a/R/as_data_backend.R b/R/as_data_backend.R
@@ -8,7 +8,7 @@
 #' Additional methods are implemented in the package \CRANpkg{mlr3db}, e.g. to connect
 #' to real DBMS like PostgreSQL (via \CRANpkg{dbplyr}) or DuckDB (via \CRANpkg{DBI}/\CRANpkg{duckdb}).
 #'
-#' @param data `any`\cr
+#' @param data (any)\cr
 #'   Data to create a [DataBackend] from.
 #'   For a `data.frame()` (this includes `tibble()` from \CRANpkg{tibble} and [data.table::data.table()]),
 #'   a [DataBackendDataTable] is created.
@@ -17,7 +17,7 @@
 #'
 #' @template param_primary_key
 #'
-#' @param ... (`any`)\cr
+#' @param ... (any)\cr
 #'   Additional arguments passed to the respective [DataBackend] method.
 #'
 #' @return [DataBackend].

diff --git a/R/as_resample_result.R b/R/as_resample_result.R
@@ -3,9 +3,9 @@
 #' @description
 #' Convert object to a [ResampleResult].
 #'
-#' @param x (`any`)\cr
+#' @param x (any)\cr
 #'  Object to convert.
-#' @param ... (`any`)\cr
+#' @param ... (any)\cr
 #'  Currently not used.
 #'
 #' @return ([ResampleResult]).

diff --git a/R/as_task.R b/R/as_task.R
@@ -3,37 +3,37 @@
 #' @description
 #' Convert object to a [Task] or a list of [Task].
 #'
-#' @param x (`any`)\cr
+#' @param x (any)\cr
 #'   Object to convert.
-#' @param ... (`any`)\cr
+#' @param ... (any)\cr
 #'   Additional arguments.
-#' @param clone (`logical(1)`)\cr
-#'   If `TRUE`, ensures that the returned object is not the same as the input `x`.
 #' @export
 as_task = function(x, ...) {
   UseMethod("as_task")
 }
 
-#' @export
 #' @rdname as_task
+#' @export
 as_task.Task = function(x, clone = FALSE, ...) { # nolint
   if (clone) x$clone() else x
 }
 
-#' @export
 #' @rdname as_task
+#' @export
 as_tasks = function(x, ...) {
   UseMethod("as_tasks")
 }
 
-#' @export
 #' @rdname as_task
+#' @param clone (`logical(1)`)\cr
+#'   If `TRUE`, ensures that the returned object is not the same as the input `x`.
+#' @export
 as_tasks.list = function(x, clone = FALSE, ...) { # nolint
   lapply(x, as_task, clone = clone, ...)
 }
 
-#' @export
 #' @rdname as_task
+#' @export
 as_tasks.Task = function(x, clone = FALSE, ...) { # nolint
   list(if (clone) x$clone() else x)
 }
diff --git a/R/auto_convert.R b/R/auto_convert.R
@@ -122,7 +122,7 @@ rm(ee)
 #'
 #' All rules are stored as functions in [mlr_reflections$auto_converters][mlr_reflections].
 #'
-#' @param value (`any`)\cr
+#' @param value (any)\cr
 #'   New values to convert in order to match `type`.
 #' @param id (`character(1)`)\cr
 #'   Name of the column, used in error messages.

diff --git a/R/install_pkgs.R b/R/install_pkgs.R
@@ -17,7 +17,7 @@
 #'
 #' @param x (any)\cr
 #'   Object with package information (or a list of such objects).
-#' @param ... \cr
+#' @param ... (any)\cr
 #'   Additional arguments passed down to [remotes::install_cran()] or
 #'   [remotes::install_github()].
 #'   Arguments `force` and `upgrade` are often important in this context.

diff --git a/R/mlr_reflections.R b/R/mlr_reflections.R
@@ -126,7 +126,7 @@ local({
 
 
   ### Measures
-  tmp = c("na_score", "requires_task", "requires_learner", "requires_train_set")
+  tmp = c("na_score", "requires_task", "requires_learner", "requires_model", "requires_train_set")
   mlr_reflections$measure_properties = list(
     classif = tmp,
     regr = tmp

diff --git a/R/predict.R b/R/predict.R
@@ -20,7 +20,7 @@
 #'   Set to `<Prediction>` to retrieve the complete [Prediction] object.
 #'   If set to `NULL` (default), the first predict type for the respective class of the [Learner]
 #'   as stored in [mlr_reflections] is used.
-#' @param ... (`any`)\cr
+#' @param ... (any)\cr
 #'   Hyperparameters to pass down to the [Learner].
 #'
 #' @export

diff --git a/R/set_threads.R b/R/set_threads.R
@@ -16,7 +16,7 @@
 #' via the [future::plan] [future::multicore]. For this reason all learners connected to \CRANpkg{mlr3}
 #' have threading disabled in their defaults.
 #'
-#' @param x (`any`)\cr
+#' @param x (any)\cr
 #'   Object to set threads for, e.g. a [Learner].
 #'   This object is modified in-place.
 #' @param n (`integer(1)`)\cr

diff --git a/README.Rmd b/README.Rmd
@@ -47,7 +47,7 @@ Successor of [mlr](https://github.com/mlr-org/mlr).
   - [useR2019 talk on mlr3pipelines and mlr3tuning](https://www.youtube.com/watch?v=gEW5RxkbQuQ)
   - [useR2020 tutorial on mlr3, mlr3tuning and mlr3pipelines](https://www.youtube.com/watch?v=T43hO2o_nZw)
 * **Courses/Lectures**
-  - The course [Introduction to Machine learning (I2ML)](https://compstat-lmu.github.io/lecture_i2ml/) is a free and open flipped classroom course on the basics of machine learning. `mlr3` is used in the [demos](https://github.com/compstat-lmu/lecture_i2ml/tree/master/code-demos-pdf) and [exercises](https://github.com/compstat-lmu/lecture_i2ml/tree/master/exercises).
+  - The course [Introduction to Machine learning (I2ML)](https://introduction-to-machine-learning.netlify.app/) is a free and open flipped classroom course on the basics of machine learning. `mlr3` is used in the [demos](https://github.com/slds-lmu/lecture_i2ml/tree/master/code-demos-pdf) and [exercises](https://github.com/slds-lmu/lecture_i2ml/tree/master/exercises).
 * **Templates/Tutorials**
   - [mlr3-learndrake](https://github.com/mlr-org/mlr3-learndrake): Shows how to use mlr3 with [drake](https://docs.ropensci.org/drake/) for reproducible ML workflow automation.
 * [List of extension packages](https://github.com/mlr-org/mlr3/wiki/Extension-Packages)