organizing the filter/ranking methods #8

topepo opened this issue Jul 10, 2023 · 5 comments

topepo commented Jul 10, 2023

Can we design some common infrastructure across filter methods? In other words, each underlying filtering method would describe properties of its inputs (e.g., the types of variables allowed) and of its outputs (e.g., whether the score should be minimized or maximized), and so on. This is not unlike how yardstick organizes performance metrics.

One of the goals is to be able to make composite filters (e.g. maximize ROC AUC and pick the three largest importance scores). I have a private package that I've been kicking around for a while (ironically called colander - I'll send you an invite) that was a prototype for these types of filters.

If we had modular methods with controlled filter names, we could also reduce the total number of steps and have new ones that are flexible and work across all filter methods:

step_rank_predictors(
    all_predictors(),
    method = "rf_imp",
    top_p = 5
  )

# or 

step_filter_predictors(
    all_predictors(),
    filter = rf_imp > 3 & roc_auc >= .8
  )

I also have some working code to use desirability functions:

step_rank_desirability(
    all_predictors(),
    eqn = d_max(rf_imp, 0, 5) + d_min(pval_anova, -10, -1, scale = 1/2),
    top_p = 2
  )

The organizational parts in colander are not all that great right now but I think that the idea is a good one.

topepo commented Jul 10, 2023

Here's a straw man constructor for new methods:

new_filter_method <- function(name, label, goal = "maximize", 
                              inputs = "all", outputs = "all", pkgs) {
  # name: a keyword used in other steps (e.g. rf_imp or similar)
  # label: for printing ("random forest variable importance")

  goal <- rlang::arg_match0(goal, c("maximize", "minimize", "zero", "target"))
  
  # Specifications for inputs and output variables
  # Maybe these should be more specific (e.g. "factor", "numeric", etc). 
  # Should also specify max levels for factor inputs or outputs? 
  inputs  <- rlang::arg_match0(inputs,  c("all", "qualitative", "quantitative"))
  outputs <- rlang::arg_match0(outputs, c("all", "qualitative", "quantitative"))
  
  # pkgs: character vector of external packages used to compute the filter

  # maybe also set default arguments and a list that can't be altered by the user? 
  res <- 
    list(
      name = name,
      label = label,
      goal = goal,
      inputs = inputs,
      outputs = outputs,
      pkgs = pkgs
    )
  class(res) <- c(paste0("filter_method_", outputs), "filter_method")
  res
}
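
For example, a ROC AUC filter might be registered like this (just a sketch using the constructor above; the field values and the pkgs entry are illustrative, not actual colino definitions):

filter_method_roc_auc <- new_filter_method(
  name = "roc_auc",
  label = "area under the ROC curve",
  goal = "maximize",         # larger AUC values are better
  inputs = "quantitative",   # scores numeric predictors
  outputs = "qualitative",   # requires a classification outcome
  pkgs = "pROC"              # hypothetical: package used to compute the score
)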

topepo commented Jul 11, 2023

I had a long train ride and did a draft implementation (in the topepo/colino fork) for a few methods:

library(tidymodels)
library(colino)     # remotes::install_github("topepo/colino")
tidymodels_prefer()
theme_set(theme_bw())
options(pillar.advice = FALSE, pillar.min_title_chars = Inf)
data(cells)
cells$case <- NULL

fit_xy(
  colino:::filter_roc_auc,
  x = cells %>% select(-class),
  y = cells %>% select(class)
)
#> # A tibble: 56 × 2
#>    variable                     score
#>    <chr>                        <dbl>
#>  1 fiber_width_ch_1             0.833
#>  2 total_inten_ch_2             0.805
#>  3 total_inten_ch_1             0.790
#>  4 shape_p_2_a_ch_1             0.786
#>  5 avg_inten_ch_2               0.777
#>  6 convex_hull_area_ratio_ch_1  0.772
#>  7 avg_inten_ch_1               0.760
#>  8 entropy_inten_ch_1           0.759
#>  9 convex_hull_perim_ratio_ch_1 0.747
#> 10 var_inten_ch_1               0.727
#> # ℹ 46 more rows

fit_xy(
  colino:::filter_mrmr,
  x = cells %>% select(-class),
  y = cells %>% select(class)
)
#> # A tibble: 56 × 2
#>    variable                       score
#>    <chr>                          <dbl>
#>  1 total_inten_ch_4              0.644 
#>  2 entropy_inten_ch_1           -0.0736
#>  3 avg_inten_ch_2               -0.0740
#>  4 skew_inten_ch_4              -0.0754
#>  5 convex_hull_perim_ratio_ch_1 -0.0761
#>  6 shape_bfr_ch_1               -0.0764
#>  7 inten_cooc_contrast_ch_3     -0.0772
#>  8 eq_sphere_vol_ch_1           -0.0779
#>  9 spot_fiber_count_ch_4        -0.0783
#> 10 diff_inten_density_ch_1      -0.0801
#> # ℹ 46 more rows

fit_xy(
  colino:::filter_info_gain,
  x = cells %>% select(-class),
  y = cells %>% select(class)
)
#> # A tibble: 56 × 2
#>    variable                      score
#>    <chr>                         <dbl>
#>  1 total_inten_ch_2             0.189 
#>  2 fiber_width_ch_1             0.174 
#>  3 avg_inten_ch_2               0.137 
#>  4 shape_p_2_a_ch_1             0.130 
#>  5 total_inten_ch_1             0.130 
#>  6 convex_hull_area_ratio_ch_1  0.112 
#>  7 avg_inten_ch_1               0.109 
#>  8 entropy_inten_ch_1           0.103 
#>  9 skew_inten_ch_1              0.0922
#> 10 convex_hull_perim_ratio_ch_1 0.0898
#> # ℹ 46 more rows

fit_xy(
  colino:::filter_info_gain_ratio,
  x = cells %>% select(-class),
  y = cells %>% select(class)
)
#> # A tibble: 56 × 2
#>    variable                     score
#>    <chr>                        <dbl>
#>  1 total_inten_ch_2            0.158 
#>  2 fiber_width_ch_1            0.126 
#>  3 avg_inten_ch_2              0.106 
#>  4 total_inten_ch_1            0.0982
#>  5 shape_p_2_a_ch_1            0.0978
#>  6 convex_hull_area_ratio_ch_1 0.0855
#>  7 avg_inten_ch_1              0.0828
#>  8 fiber_length_ch_1           0.0823
#>  9 skew_inten_ch_1             0.0767
#> 10 entropy_inten_ch_1          0.0754
#> # ℹ 46 more rows

data(ames)
ames$Sale_Price <- log10(ames$Sale_Price)

num_col <- c("Longitude", "Latitude", "Year_Built", "Lot_Area", "Gr_Liv_Area")
fac_col <- c("MS_Zoning", "Central_Air", "Neighborhood")

fit_xy(
  colino:::filter_corr,
  x = ames %>% select(all_of(num_col)),
  y = ames %>% select(Sale_Price)
)
#> # A tibble: 5 × 2
#>   variable    score
#>   <chr>       <dbl>
#> 1 Gr_Liv_Area 0.696
#> 2 Year_Built  0.615
#> 3 Longitude   0.292
#> 4 Latitude    0.286
#> 5 Lot_Area    0.255

fit_xy(
  colino:::filter_max_diff,
  x = ames %>% select(all_of(fac_col)),
  y = ames %>% select(Sale_Price)
)
#> # A tibble: 3 × 2
#>   variable     score
#>   <chr>        <dbl>
#> 1 MS_Zoning    0.814
#> 2 Neighborhood 0.531
#> 3 Central_Air  0.262

fit_xy(
  colino:::filter_rf_imp,
  x = ames %>% select(all_of(c(fac_col, num_col))),
  y = ames %>% select(Sale_Price)
)
#> # A tibble: 8 × 2
#>   variable     score
#>   <chr>        <dbl>
#> 1 Gr_Liv_Area  18.4 
#> 2 Year_Built   15.4 
#> 3 Longitude     7.86
#> 4 Latitude      6.25
#> 5 Lot_Area      5.53
#> 6 Central_Air   4.05
#> 7 Neighborhood  3.97
#> 8 MS_Zoning     3.91

fit_xy(
  colino:::filter_mic,
  x = ames %>% select(all_of(num_col)),
  y = ames %>% select(Sale_Price)
)
#> # A tibble: 5 × 2
#>   variable    score
#>   <chr>       <dbl>
#> 1 Longitude   0.463
#> 2 Gr_Liv_Area 0.441
#> 3 Year_Built  0.436
#> 4 Latitude    0.420
#> 5 Lot_Area    0.234

Created on 2023-07-10 with reprex v2.0.2

stevenpawley commented

Hi Max, many thanks for this - I wish my train rides were as productive! I've just started working through this - actually, I wasn't aware of your desirability2 package - I'll definitely be looking at it, particularly for those MRMR types of cases.

Back to the filter/ranking methods - I can definitely add these (and the remaining filter methods) so that the fit_xy generic can be used on any supplied filter, which could then be built into something like a step_filter_supervised, for example.

Some other thoughts/ramblings are:

  1. How to specify / supply arguments to the methods in the same way as when they are called in their recipe steps, for example, mtry in a rf_imp filter (see the sketch after this list)? Most other ML libraries like sklearn or mlr3 allow tuning of almost everything, even if it creates some awkward syntax with those 'pipelinestep__modelname__parameter' sorts of keys.
  2. There is also the idea of considering the choice of filtering method as a hyperparameter, i.e., choosing rf_imp vs. something else during tuning, but I guess that is a completely different issue and currently that could be performed via a workflowset (although more computationally expensive).
  3. How to reuse some of those components - currently each fit_xy method is essentially reimplementing each recipe step. I guess I should look at reversing this, so that each specific recipe step, like step_filter_infgain, uses the fit_xy generic internally to avoid duplication.
  4. Trying to think about how many steps this would be applicable to. For example, are there methods that don't make sense to use with this approach, maybe MRMR or Boruta? Or maybe that's fine - it is for the user to decide.
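
For point 1, one possibility (purely a sketch - the options argument and its contents are hypothetical) would be to forward engine-specific arguments through a single list while the top-level arguments stay tunable:

# Hypothetical: engine-specific arguments (e.g. mtry for a random forest
# importance filter) passed through a single options list, while top_p
# remains a regular tunable argument.
step_rank_predictors(
  all_predictors(),
  method = "rf_imp",
  top_p = tune(),
  options = list(mtry = 3, trees = 500)
)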

topepo commented Jul 17, 2023

How to specify / supply arguments to the methods...

The underlying argument names are an open question for me. We can parameterize them for the individual filter functions. For a multi-method filter, I'm not sure of the best way to specify them.

There is also the idea of considering the choice of filtering method as a hyperparameter,...

That's a great idea.

How to reuse some of those components...

I think that the prep() methods for the steps can call fit_xy().
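
Roughly something like this (a sketch only - step_filter_supervised and its slot names are hypothetical, and a real prep() method would go through the usual step constructor):

# Sketch: a step's prep() method delegates the scoring to the shared
# fit_xy() generic so each filter is implemented in exactly one place.
prep.step_filter_supervised <- function(x, training, info = NULL, ...) {
  predictors <- recipes::recipes_eval_select(x$terms, training, info)
  x$scores <- fit_xy(
    x$method,                       # a filter_method object
    x = training[, predictors],
    y = training[, x$outcome, drop = FALSE]
  )
  x$trained <- TRUE                 # bake() then keeps the top-scoring columns
  x
}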

Trying to think about how many steps this would be applicable to?...

I would try to do all of them (within reason).

I think that we can use this in different packages too. I plan on adding a recursive feature engineering function (maybe to finetune) and these would be useful.

topepo commented Jul 17, 2023

I wish my train rides were as productive!

On the train ride back, I put these into a side package that we could all use: https://github.com/topepo/filterdb

Some of these methods are based on what you first added, so I planned on adding you as a contributor (if you want that).

I added some open questions/to-dos in the package too. I'll convert these to issues this week.
