organizing the filter/ranking methods #8
Here's a straw man constructor for new methods:

new_filter_method <- function(name, label, goal = "maximize",
                              inputs = "all", outputs = "all", pkgs) {
  # name: a keyword used in other steps (e.g. rf_imp or similar)
  # label: for printing ("random forest variable importance")
  goal <- rlang::arg_match0(goal, c("maximize", "minimize", "zero", "target"))

  # Specifications for inputs and output variables
  # Maybe these should be more specific (e.g. "factor", "numeric", etc).
  # Should also specify max levels for factor inputs or outputs?
  inputs <- rlang::arg_match0(inputs, c("all", "qualitative", "quantitative"))
  outputs <- rlang::arg_match0(outputs, c("all", "qualitative", "quantitative"))

  # pkgs: character string of external packages used to compute the filter
  # maybe also set default arguments and a list that can't be altered by the user?
  res <-
    list(
      name = name,
      label = label,
      goal = goal,
      inputs = inputs,
      outputs = outputs,
      pkgs = pkgs
    )
  class(res) <- c(paste0("filter_method_", outputs), "filter_method")
  res
}
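To make the constructor a little more concrete, here is a minimal usage sketch. It is not part of the draft: the block redefines a compact stand-in for `new_filter_method()` with base R's `match.arg()` so it runs without rlang, the `print()` method is hypothetical, and the `pkgs = "pROC"` value is only an illustrative guess at a dependency.

```r
# Compact stand-in for the constructor above (base R only; an assumption
# made so that this block runs by itself).
new_filter_method <- function(name, label, goal = "maximize",
                              inputs = "all", outputs = "all",
                              pkgs = character(0)) {
  goal    <- match.arg(goal, c("maximize", "minimize", "zero", "target"))
  inputs  <- match.arg(inputs, c("all", "qualitative", "quantitative"))
  outputs <- match.arg(outputs, c("all", "qualitative", "quantitative"))
  res <- list(name = name, label = label, goal = goal,
              inputs = inputs, outputs = outputs, pkgs = pkgs)
  class(res) <- c(paste0("filter_method_", outputs), "filter_method")
  res
}

# Hypothetical print method that dispatches on the shared parent class.
print.filter_method <- function(x, ...) {
  cat("<filter method> ", x$label, " (", x$name, ")\n", sep = "")
  cat("  goal: ", x$goal, "; inputs: ", x$inputs,
      "; outputs: ", x$outputs, "\n", sep = "")
  invisible(x)
}

# An illustrative specification; the package name is an assumption.
roc <- new_filter_method(
  name = "roc_auc",
  label = "area under the ROC curve",
  goal = "maximize",
  inputs = "quantitative",
  outputs = "qualitative",
  pkgs = "pROC"
)

print(roc)
class(roc)

# An unknown goal is rejected when the object is built.
bad <- try(new_filter_method("x", "x", goal = "bigger"), silent = TRUE)
inherits(bad, "try-error")
```

The outcome-specific subclass (here `filter_method_qualitative`) would let later code dispatch on the outcome type, while `filter_method` carries the shared behavior such as printing.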
I had a long train ride and did a draft implementation (in

library(tidymodels)
library(colino) # remotes::install_github("topepo/colino")

tidymodels_prefer()
theme_set(theme_bw())
options(pillar.advice = FALSE, pillar.min_title_chars = Inf)

data(cells)
cells$case <- NULL
fit_xy(
  colino:::filter_roc_auc,
  x = cells %>% select(-class),
  y = cells %>% select(class)
)
#> # A tibble: 56 × 2
#> variable score
#> <chr> <dbl>
#> 1 fiber_width_ch_1 0.833
#> 2 total_inten_ch_2 0.805
#> 3 total_inten_ch_1 0.790
#> 4 shape_p_2_a_ch_1 0.786
#> 5 avg_inten_ch_2 0.777
#> 6 convex_hull_area_ratio_ch_1 0.772
#> 7 avg_inten_ch_1 0.760
#> 8 entropy_inten_ch_1 0.759
#> 9 convex_hull_perim_ratio_ch_1 0.747
#> 10 var_inten_ch_1 0.727
#> # ℹ 46 more rows
fit_xy(
  colino:::filter_mrmr,
  x = cells %>% select(-class),
  y = cells %>% select(class)
)
#> # A tibble: 56 × 2
#> variable score
#> <chr> <dbl>
#> 1 total_inten_ch_4 0.644
#> 2 entropy_inten_ch_1 -0.0736
#> 3 avg_inten_ch_2 -0.0740
#> 4 skew_inten_ch_4 -0.0754
#> 5 convex_hull_perim_ratio_ch_1 -0.0761
#> 6 shape_bfr_ch_1 -0.0764
#> 7 inten_cooc_contrast_ch_3 -0.0772
#> 8 eq_sphere_vol_ch_1 -0.0779
#> 9 spot_fiber_count_ch_4 -0.0783
#> 10 diff_inten_density_ch_1 -0.0801
#> # ℹ 46 more rows
fit_xy(
  colino:::filter_info_gain,
  x = cells %>% select(-class),
  y = cells %>% select(class)
)
#> # A tibble: 56 × 2
#> variable score
#> <chr> <dbl>
#> 1 total_inten_ch_2 0.189
#> 2 fiber_width_ch_1 0.174
#> 3 avg_inten_ch_2 0.137
#> 4 shape_p_2_a_ch_1 0.130
#> 5 total_inten_ch_1 0.130
#> 6 convex_hull_area_ratio_ch_1 0.112
#> 7 avg_inten_ch_1 0.109
#> 8 entropy_inten_ch_1 0.103
#> 9 skew_inten_ch_1 0.0922
#> 10 convex_hull_perim_ratio_ch_1 0.0898
#> # ℹ 46 more rows
fit_xy(
  colino:::filter_info_gain_ratio,
  x = cells %>% select(-class),
  y = cells %>% select(class)
)
#> # A tibble: 56 × 2
#> variable score
#> <chr> <dbl>
#> 1 total_inten_ch_2 0.158
#> 2 fiber_width_ch_1 0.126
#> 3 avg_inten_ch_2 0.106
#> 4 total_inten_ch_1 0.0982
#> 5 shape_p_2_a_ch_1 0.0978
#> 6 convex_hull_area_ratio_ch_1 0.0855
#> 7 avg_inten_ch_1 0.0828
#> 8 fiber_length_ch_1 0.0823
#> 9 skew_inten_ch_1 0.0767
#> 10 entropy_inten_ch_1 0.0754
#> # ℹ 46 more rows

data(ames)
ames$Sale_Price <- log10(ames$Sale_Price)
num_col <- c("Longitude", "Latitude", "Year_Built", "Lot_Area", "Gr_Liv_Area")
fac_col <- c("MS_Zoning", "Central_Air", "Neighborhood")
fit_xy(
  colino:::filter_corr,
  x = ames %>% select(all_of(num_col)),
  y = ames %>% select(Sale_Price)
)
#> # A tibble: 5 × 2
#> variable score
#> <chr> <dbl>
#> 1 Gr_Liv_Area 0.696
#> 2 Year_Built 0.615
#> 3 Longitude 0.292
#> 4 Latitude 0.286
#> 5 Lot_Area 0.255
fit_xy(
  colino:::filter_max_diff,
  x = ames %>% select(all_of(fac_col)),
  y = ames %>% select(Sale_Price)
)
#> # A tibble: 3 × 2
#> variable score
#> <chr> <dbl>
#> 1 MS_Zoning 0.814
#> 2 Neighborhood 0.531
#> 3 Central_Air 0.262
fit_xy(
  colino:::filter_rf_imp,
  x = ames %>% select(all_of(c(fac_col, num_col))),
  y = ames %>% select(Sale_Price)
)
#> # A tibble: 8 × 2
#> variable score
#> <chr> <dbl>
#> 1 Gr_Liv_Area 18.4
#> 2 Year_Built 15.4
#> 3 Longitude 7.86
#> 4 Latitude 6.25
#> 5 Lot_Area 5.53
#> 6 Central_Air 4.05
#> 7 Neighborhood 3.97
#> 8 MS_Zoning 3.91
fit_xy(
  colino:::filter_mic,
  x = ames %>% select(all_of(num_col)),
  y = ames %>% select(Sale_Price)
)
#> # A tibble: 5 × 2
#> variable score
#> <chr> <dbl>
#> 1 Longitude 0.463
#> 2 Gr_Liv_Area 0.441
#> 3 Year_Built 0.436
#> 4 Latitude 0.420
#> 5 Lot_Area 0.234

Created on 2023-07-10 with reprex v2.0.2
Hi Max, many thanks for this - I wish my train rides were as productive! I've just started working through this. I wasn't aware of your desirability2 package, and I will definitely be looking at it, particularly for those MRMR-type cases. Back to the filter/ranking methods - I can definitely add these (and the remaining filter methods), so that the
Some other thoughts/ramblings are:
The underlying argument names are an open question for me. We can parameterize them for the individual filter functions. For a multi-method filter, I'm not sure of the best way to specify them.
That's a great idea.
I think that the
I would try to do all of them (within reason). I think that we can use this in different packages too. I plan on adding a recursive feature engineering function (maybe to finetune) and these would be useful.
On the train ride back, I put them into a side package that we could all use: https://github.com/topepo/filterdb
Some of these methods are based on what you first added, so I planned on adding you as a contributor (if you want that). I added some open questions/todos in the package too. I'll convert these to issues this week.
Can we design some common infrastructure across filter methods? In other words, the underlying filtering methods would carry specifications for their inputs (e.g. types of variables allowed), their outputs (such as minimize/maximize), and so on. This is not unlike how yardstick organizes performance metrics.
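As a rough illustration of what shared metadata could buy us, the sketch below checks whether a data set is compatible with a method's declared input and output types. Everything here is hypothetical: `var_type()`, `is_compatible()`, the plain-list specifications, and the toy data are assumptions for illustration, not code from colino or yardstick.

```r
# Hypothetical helper: classify a column as qualitative or quantitative.
var_type <- function(col) {
  if (is.numeric(col)) "quantitative" else "qualitative"
}

# Hypothetical check: do predictors `x` and outcome `y` satisfy the
# declared `inputs`/`outputs` of a method specification?
is_compatible <- function(method, x, y) {
  x_ok <- method$inputs == "all" ||
    all(vapply(x, var_type, character(1)) == method$inputs)
  y_ok <- method$outputs == "all" || var_type(y) == method$outputs
  x_ok && y_ok
}

# Two illustrative specifications stored as plain lists.
corr <- list(name = "corr", goal = "maximize",
             inputs = "quantitative", outputs = "quantitative")
roc_auc <- list(name = "roc_auc", goal = "maximize",
                inputs = "quantitative", outputs = "qualitative")

# Toy data, made up for this example.
dat <- data.frame(
  a = c(1.2, 3.4, 2.2, 0.7),
  b = c(10, 12, 9, 15),
  y_num = c(0.1, 0.9, 0.4, 0.3),
  grp = factor(c("u", "v", "u", "v"))
)

ok_corr  <- is_compatible(corr, dat[c("a", "b")], dat$y_num)    # numeric outcome
ok_roc   <- is_compatible(roc_auc, dat[c("a", "b")], dat$grp)   # factor outcome
bad_corr <- is_compatible(corr, dat[c("a", "grp")], dat$y_num)  # factor predictor
c(ok_corr, ok_roc, bad_corr)
```

With checks like this living in one place, each step would only need to look up the method by name instead of re-implementing its own type validation.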
One of the goals is to be able to make composite filters (e.g. maximize ROC AUC and pick the three largest importance scores). I have a private package that I've been kicking around for a while (ironically called colander - I'll send you an invite) that was a prototype for these types of filters.
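One possible reading of the composite example, sketched in base R: require a minimum ROC AUC first, then keep the three largest importance scores among the survivors. The 0.75 cutoff, the `top_n_by()` helper, and the score table are all made up for illustration; this is not how colander does it.

```r
# Toy score table with invented values, one row per predictor.
scores <- data.frame(
  variable = paste0("x", 1:6),
  roc_auc  = c(0.83, 0.80, 0.62, 0.79, 0.55, 0.76),
  rf_imp   = c(18.4, 3.9, 15.4, 7.9, 6.2, 5.5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the `n` best rows for one score, honoring the goal.
top_n_by <- function(tbl, column, n, goal = c("maximize", "minimize")) {
  goal <- match.arg(goal)
  ord <- order(tbl[[column]], decreasing = goal == "maximize")
  tbl[head(ord, n), , drop = FALSE]
}

# Composite rule: ROC AUC of at least 0.75 (an assumed cutoff),
# then the three largest importance scores among those rows.
passed <- scores[scores$roc_auc >= 0.75, , drop = FALSE]
kept <- top_n_by(passed, "rf_imp", n = 3)
kept$variable
```

Because the helper takes the goal as an argument, the same rule would work for a score that should be minimized without writing a second step.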
If we had modular methods with control filter names, we could also reduce the total number of steps and have new ones that are flexible and work across all filter methods:
I also have some working code to use desirability functions:
The organizational parts in colander are not all that great right now but I think that the idea is a good one.
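For readers unfamiliar with desirability functions, here is a hand-rolled base R sketch of the general idea: map each score to a 0-1 scale and combine the scales with a geometric mean. It does not use the desirability2 or colander APIs; `d_larger()`, the limits, and the toy scores are assumptions made for illustration.

```r
# Larger-is-better desirability: 0 at or below `low`, 1 at or above `high`,
# linear in between when `scale = 1`.
d_larger <- function(x, low, high, scale = 1) {
  d <- (x - low) / (high - low)
  pmin(pmax(d, 0), 1)^scale
}

# Toy scores with invented values.
scores <- data.frame(
  variable = paste0("x", 1:4),
  roc_auc  = c(0.9, 0.6, 0.5, 0.8),
  rf_imp   = c(10, 20, 15, 5),
  stringsAsFactors = FALSE
)

d_roc <- d_larger(scores$roc_auc, low = 0.5, high = 1.0)
d_imp <- d_larger(scores$rf_imp, low = 0, high = 20)

# Overall desirability as the geometric mean of the two components;
# a zero on any component forces the overall value to zero.
scores$overall <- sqrt(d_roc * d_imp)

ranked <- scores[order(scores$overall, decreasing = TRUE), ]
ranked$variable
```

The geometric mean is what makes this differ from a simple average: a predictor that is unacceptable on one criterion (here `x3`, with a ROC AUC at the lower limit) cannot be rescued by a high value on another.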