- H2O now supports passing background data for model agnostic SHAP. This is now more visible in {shapviz}, see h2oai/h2o-3#16463.
- H2O random forests (regression and binary classification) now support TreeSHAP as well #163.
- Adapt for upcoming {shapr} version, thanks @martinju for the fix #162.
- Fixed wrong link vignette #158.
- `sv_waterfall()` and `sv_force()`: The x label has been changed from "SHAP value" to "Prediction".
- Add vignette for Tidymodels.
- Update vignettes.
- Update README.
- Support both XGBoost 1.x.x as well as XGBoost 2.x.x, implemented in #144.
- New argument `sort_features = TRUE` in `sv_importance()` and `sv_interaction()`. Set to `FALSE` to show the features as they appear in your SHAP matrix. In that case, the plots will show the first `max_display` features, not the most important features. Implements #137.
- `shapviz.xgboost()` would fail if a single row is passed. This has been fixed in #142. Thanks @sebsilas for reporting.
If no SHAP interaction values are available, by default, the color feature `v'` is selected by the heuristic `potential_interaction()`, which works as follows:
- If the feature `v` (the one on the x-axis) is numeric, it is binned into `nbins` bins.
- Per bin, the SHAP values of `v` are regressed onto `v'` and the R-squared is calculated. Rows with missing `v'` are discarded.
- The R-squared values are averaged over bins, weighted by the number of non-missing `v'` values.

This measures how much variability in the SHAP values of `v` is explained by `v'`, after accounting for `v`.
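
As a rough illustration of the steps above, here is a base R sketch for a single numeric candidate `v'`. It is not the package's implementation: the function name `heuristic_sketch()` and the toy data are made up for this example, and it relies on the fact that the R-squared of a simple linear regression equals the squared correlation.

```r
# Hypothetical helper (not part of {shapviz}): average within-bin R-squared
# of the SHAP values `s` of feature `v`, regressed on a numeric candidate `v_prime`.
heuristic_sketch <- function(s, v, v_prime, nbins) {
  # Quantile bins of v; duplicated break points are dropped
  probs <- seq(0, 1, length.out = nbins + 1)
  breaks <- unique(quantile(v, probs = probs, names = FALSE, na.rm = TRUE))
  bin <- cut(v, breaks = breaks, include.lowest = TRUE)

  r2 <- n_ok <- numeric(nlevels(bin))
  for (i in seq_len(nlevels(bin))) {
    # Rows of the current bin with non-missing v'
    idx <- which(bin == levels(bin)[i] & !is.na(v_prime))
    n_ok[i] <- length(idx)
    if (length(idx) >= 3 && sd(s[idx]) > 0 && sd(v_prime[idx]) > 0) {
      # R-squared of the simple linear regression s ~ v' within the bin
      r2[i] <- cor(s[idx], v_prime[idx])^2
    }
  }
  # Average over bins, weighted by the number of non-missing v' values
  weighted.mean(r2, w = n_ok)
}

set.seed(1)
n <- 500
v <- runif(n)
z1 <- rnorm(n)      # interacts with v in the toy SHAP values below
z2 <- rnorm(n)      # unrelated to the SHAP values of v
s <- v + v * z1     # toy "SHAP values" of v

score_z1 <- heuristic_sketch(s, v, z1, nbins = 5)
score_z2 <- heuristic_sketch(s, v, z2, nbins = 5)
score_z1 > score_z2
```

In this toy setup, `z1` should receive the clearly higher score, so it would be the preferred color feature.
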
We have introduced four parameters to control the heuristic. Their defaults are in line with the old behaviour.
- `nbins = NULL`: Into how many quantile bins should a numeric `v` be binned? The default `NULL` equals the smaller of $n/20$ and $\sqrt n$ (rounded up), where $n$ is the sample size.
- `color_num`: Should color features be converted to numeric, even if they are factors/characters? Default is `TRUE`.
- `scale = FALSE`: Should R-squared be multiplied with the sample variance of within-bin SHAP values? If `TRUE`, bins with stronger vertical scatter will get higher weight. The default is `FALSE`.
- `adjusted = FALSE`: Should adjusted R-squared be calculated?

If SHAP interaction values are available, these parameters have no effect. In `sv_dependence()` they are called `ih_nbins` etc.
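
For concreteness, the default number of bins described above can be computed as follows. This is a sketch of the stated rule only; `default_nbins()` is a hypothetical name, not a {shapviz} function.

```r
# Smaller of n/20 and sqrt(n), rounded up (n = sample size)
default_nbins <- function(n) ceiling(min(n / 20, sqrt(n)))

default_nbins(100)   # min(5, 10) = 5
default_nbins(1000)  # min(50, 31.62...) rounded up = 32
```
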
This partly implements the ideas in #119 of Roel Verbelen, thanks a lot for your patient explanations!
We will continue to experiment with the defaults, which might change in the future. A good alternative to the current (naive) defaults could be:
- `nbins = 7`: Smaller than now to not overfit too strongly with factor/character color features.
- `color_num = FALSE`: To not naively integer encode factors/characters.
- `scale = TRUE`: To account for non-equal spread in bins.
- `adjusted = TRUE`: To not put too much weight on factors with many categories.
- `sv_dependence()`: If `color_var = "auto"` (default) and no color feature seems to be relevant (SHAP interaction is `NULL`, or heuristic returns no positive value), there won't be any color scale. Furthermore, in some edge cases, a different color feature might be selected.
- `mshapviz()` objects can now be row-bound via `rbind()` or `+`. Implemented by @jmaspons in #110.
- `mshapviz()` is more strict when combining multiple "shapviz" objects. These now need to have identical column names, see #114.
- The README is shorter and easier.
- Updated vignettes.
- `print.shapviz()` now shows top two rows of SHAP matrix.
- Re-activate all unit tests.
- Setting `nthread = 1` in all calls to `xgb.DMatrix()` as suggested by @jmaspons in #109.
- Added "How to contribute" to README.
- `permshap()` connector is now part of {kernelshap} #122.
- `sv_dependence2D()`: In case `add_vars` are passed, `x` and/or `y` are removed from it in order to not use any variable twice. #116.
- `split.shapviz()` now drops empty levels. They caused an error because empty "shapviz" objects are currently not supported. #117, #118
- `sv_importance()` of a "mshapviz" object now returns a dodged barplot instead of separate barplots via {patchwork}. Use the new argument `bar_type` to switch to a stacked barplot (`bar_type = "stack"`), to "facets" (via {ggplot2}), or "separate" for the old behaviour.
- Added connector to permshap, a package calculating permutation SHAP values for regression and (probabilistic) classification.
- Revised vignette on "mshapviz".
- Commenting out most unit tests as they would not pass timings measured on Debian.
- `dimnames.shapviz()` has received a replacement method. You can thus change the column names of SHAP matrix and feature data (as well as SHAP interactions) by `colnames(x) <- ...`, see #98
- Fix for #100 (`package_version()` applied to numeric value will be deprecated in the future)
- New plot function `sv_dependence2D()`: x and y coordinates are two features, while their summed SHAP values are shown on the color scale. If `interaction = TRUE`, SHAP interaction values are shown on the color scale instead. The function is vectorized in `x` and/or `y`. This visualization is especially useful for models with geographic components.
- `split(x, f)` splits a "shapviz" object `x` into a "mshapviz" object.
- Slight improvements in help/docu.
- New vignette on models with geographic components.
- Added a fantastic house price dataset with about 14,000 houses sold in Miami-Dade County, thanks Steven C. Bourassa.
- "mshapviz" object created from multioutput "kernelshap" object retains names.
- For (upcoming) {fastshap} version >0.0.7, `fastshap::explain()` offers the option `shap_only`. To conveniently construct the "shapviz" object, use `shapviz(fastshap::explain(..., shap_only = FALSE))`. This not only passes the SHAP matrix but also the feature data and the baseline. Thanks, Brandon Greenwell!
- Better help files
- Switched from "import ggplot2" to "ggplot2::function" code style
- Vignette "Multiple 'shapviz' objects": Fixed mistake in Random Forest + Kernel SHAP example
Sometimes, you will find it necessary to work with several "shapviz" objects at the same time:
- To visualize SHAP values of a multiclass or multi-output model.
- To compare SHAP plots of different models.
- To compare SHAP plots between subgroups.
To simplify the workflow, {shapviz} introduces the "mshapviz" object ("m" like "multi"). You can create it in different ways:
- Use `shapviz()` on multiclass XGBoost or LightGBM models.
- Use `shapviz()` on "kernelshap" objects created from multiclass/multioutput models.
- Use `c(Mod_1 = s1, Mod_2 = s2, ...)` on "shapviz" objects `s1`, `s2`, ...
- Or `mshapviz(list(Mod_1 = s1, Mod_2 = s2, ...))`

The `sv_*()` functions use the {patchwork} package to glue the individual plots together.

See the new vignette for more info and specific examples.
- `sv_dependence()` now allows multiple `v` and/or `color_var` to be plotted (glued via {patchwork}).
- {DALEX}: Support for "predict_parts" objects from {DALEX}, thanks to Adrian Stando.
- Aggregated SHAP values: The argument `row_id` of `sv_waterfall()` and `sv_force()` now also allows a vector of integers or a logical vector. If more than one row is selected, SHAP values and predictions are averaged before plotting (aggregated SHAP values in {DALEX}).
- Row bind: "shapviz" objects `x1`, `x2` can now be concatenated in rowwise manner using `x1 + x2` or `rbind(x1, x2)`, again thanks to Adrian.
- `colnames()`: "shapviz" objects `x` have received a `dimnames()` function, so you can now, e.g., use `colnames(x)` to see the feature names.
- Subsetting: "shapviz" `x` can now be subsetted using `x[cond, features]`.
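
The averaging behind aggregated SHAP values can be sketched in base R. This only illustrates the arithmetic on a made-up SHAP matrix and baseline; it does not show how `sv_waterfall()` or `sv_force()` implement it internally.

```r
# Toy SHAP matrix (3 rows, 2 features) and baseline; all values are made up
S <- matrix(
  c(0.5, -0.2,
    0.1,  0.3,
    0.3, -0.1),
  nrow = 3, byrow = TRUE, dimnames = list(NULL, c("x1", "x2"))
)
baseline <- 2
row_id <- c(1, 3)  # rows to aggregate

# Average SHAP values per feature and average prediction over the selected rows
agg_shap <- colMeans(S[row_id, , drop = FALSE])
agg_pred <- mean(baseline + rowSums(S[row_id, , drop = FALSE]))

# Additivity still holds for the averages
all.equal(baseline + sum(agg_shap), agg_pred)
```

Because averaging is linear, the aggregated SHAP values still sum (together with the baseline) to the averaged prediction, which is why they can be drawn as an ordinary waterfall or force plot.
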
- We have a new contributor: Adrian Stando - welcome on the SHAP board.
- To be close to my sister package {kernelshap}, I have moved to https://github.com/ModelOriented/shapviz
- Webpage created with "pkgdown"
- New dependency: {patchwork}
- Color guides are closer to the plot area. This affects `sv_dependence()`, `sv_importance(kind = "bee")`, and `sv_interaction()`.
- The lengthy y axis title "SHAP interaction value" in `sv_dependence()` has been shortened to "SHAP interaction".
- As announced, the argument `show_other` of `sv_importance()` has been removed.
- Slightly less picky checks on `S_inter`.
- `print.shapviz()` is much more compact, use `summary.shapviz()` for more info.
- `sv_waterfall()`: Using `order_fun()` would not work as expected with `max_display`. This has been fixed.
- `sv_dependence()`: Passing `viridis_args = NULL` would hide the color guide title. This has been fixed. But please pass `viridis_args = list()` instead.
- `sv_dependence()` now uses `color_var = "auto"` instead of `color_var = NULL`.
- `sv_dependence()` now uses "SHAP value" as y label (instead of the more verbose "SHAP value of [feature]").
- Introduced API for SHAP interaction values `S_inter` (3D array):
  - Matrix method: `shapviz(object, ..., S_inter = NULL)`
  - XGBoost method: `shapviz(object, ..., interactions = TRUE)`
  - treeshap method: `shapviz(object, ...)`
- `sv_interaction(x)` shows matrix of beeswarm plots.
- `sv_dependence(x, v = "x1", color_var = "x2", interactions = TRUE)` plots SHAP interaction values.
- `sv_dependence(x, v = "x1", interactions = TRUE)` plots pure main effects of "x1".
- If SHAP interaction values are available, `sv_dependence(..., color_var = "auto")` uses those to determine the most interacting color variable.
- `collapse_shap()` also works for SHAP interaction arrays.
- SHAP interaction values can be extracted by `get_shap_interactions()`.
- `sv_importance()`: In case of too many features, `sv_importance()` used to collapse the remaining features into an additional bar/beeswarm. This logic has been removed, and the `show_other` argument has been deprecated.
- By default, `sv_dependence()` automatically adds horizontal jitter for discrete `v`. This now also works if `v` is numeric with at most seven unique values, not only for logicals, factors, and character `v`.
- "ggplot2" 3.4 has replaced the "size" aesthetic in line-based geoms by "linewidth". This has been adapted. "shapviz" now depends on ggplot2 >= 3.4.
- `sv_importance()` does not use a flipped coordinate system anymore.
- Hide "other": `sv_importance()` has received a new argument `show_other = TRUE`. Set to `FALSE` to hide the "other" bar/beeswarm.
The following dependencies have been removed:
- "ggbeeswarm"
- "vipor"
- "beeswarm"
- New argument `bee_width`: Relative width of the beeswarms. The default is 0.4. It replaces the `width` argument passed via `...`.
- New argument `bee_adjust`: Relative adjustment factor of the bandwidth used in estimating the density of the beeswarms. Default is 0.5.
- In case a beeswarm is shown: the `...` arguments are now passed to `geom_point()`.
- `plotly::ggplotly()` now works for most functionalities of `sv_importance()`, including beeswarms.
- The argument `X` of the constructor of `shapviz()` is now less picky. If it contains columns not present in the SHAP matrix, they are silently dropped. Furthermore, the column order of the SHAP matrix and `X` is now determined by the SHAP matrix.
- Functions `shapviz_from_lgb_predict()` and `shapviz_from_xgb_predict()`
- `format_fun` argument in `sv_force()` and `sv_waterfall()`
- `sort_fun` argument in `sv_waterfall()`
- `collapse_shap()` is no longer an S3 method. It is just a normal function that can be applied to a matrix.
- For R versions < 4.1, `sv_importance()` would return an error.
- kernelshap wrapper now also can deal with multioutput models.
- Added kernelshap wrapper.
- Removed unnecessary conversion of `X_pred` from `matrix` to `xgb.DMatrix` in `shapviz.xgb.Booster()`.
- Vignette: Added a CatBoost wrapper to the vignette and changed the `treeshap()` example to a `ranger()` model.
- Fixed CRAN notes on html5.
- Added H2O wrapper.
- Added shapr wrapper.
- Added an optional `collapse` argument in `shapviz()`. This is a named list specifying which columns in the SHAP matrix are to be collapsed by rowwise summation. A typical application will be to combine the SHAP values of one-hot-encoded dummies and explain them by the corresponding factor variable.
- Major rework of `sv_importance()`, see next section.
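
To make the `collapse` idea above concrete, here is a base R sketch of rowwise summation over groups of SHAP columns. The helper `collapse_sketch()`, the column names, and the values are hypothetical and only illustrate the arithmetic; the package's own collapsing logic may differ in details such as column order.

```r
# Toy SHAP matrix with two one-hot-encoded dummies of a factor "color"
S <- cbind(
  age        = c(0.1, -0.2),
  color_red  = c(0.3,  0.0),
  color_blue = c(-0.1, 0.4)
)

# Named list: new column name -> SHAP columns to be summed
collapse <- list(color = c("color_red", "color_blue"))

# Hypothetical helper, not the {shapviz} implementation
collapse_sketch <- function(S, collapse) {
  keep <- setdiff(colnames(S), unlist(collapse))
  summed <- do.call(
    cbind,
    lapply(collapse, function(cols) rowSums(S[, cols, drop = FALSE]))
  )
  cbind(S[, keep, drop = FALSE], summed)
}

S_collapsed <- collapse_sketch(S, collapse)
colnames(S_collapsed)
```

Since columns are only summed, the row totals of the SHAP matrix (and thus the link to the predictions) are unchanged.
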
The calculations behind `sv_importance()` are unchanged, but defaults and some plot aspects have been reworked.
- Instead of a beeswarm plot, `sv_importance()` now shows a bar plot by default. Use `kind = "beeswarm"` to get a beeswarm plot.
- The bar plot of `sv_importance()` does not show SHAP feature importances as text anymore. Use `show_numbers = TRUE` to get them back. Furthermore, the numbers are now printed on top of the bars instead of at their bottom.
- The new argument `show_numbers` can be used to add SHAP feature importance values for all plot types.
- The default of `max_display` has been increased from 10 to 15.
- The bar width has been reduced from 0.9 to 2/3 relative width. It can be controlled by the new argument `bar_width`.
- The color bar title of the beeswarm plot can now be manually chosen by the new argument `color_bar_title`. Set to `NULL` to remove the color bar altogether.
- The argument `format_fun` now uses a right-aligned number formatter with aligned decimal separator by default.
- Added `dim()` method for "shapviz" object, implying `nrow()` and `ncol()`.
- To allow more flexible formatting, the `format_fun` argument of `sv_waterfall()` and `sv_force()` has been replaced by `format_shap` to format SHAP values and `format_feat` to format numeric feature values. By default, they use the new global options "shapviz.format_shap" and "shapviz.format_feat", both with default `function(z) prettyNum(z, digits = 3, scientific = FALSE)`.
- `sv_waterfall()` now uses the more consistent argument `order_fun = function(s) order(abs(s))` instead of the original `sort_fun = function(shap) abs(shap)` that was then passed to `order()`.
- Added argument `viridis_args = getOption("shapviz.viridis_args")` to `sv_dependence()` and `sv_importance()` to control the viridis color scale options. The default global option equals `list(begin = 0.25, end = 0.85, option = "inferno")`. For example, to switch to a standard viridis scale, you can either change the default with `options(shapviz.viridis_args = NULL)` or set `viridis_args = NULL`.
- Deprecated helper functions `shapviz_from_lgb_predict()` and `shapviz_from_xgb_predict()` in favour of the collapsing logic (see above). The functions will be removed in version 0.3.0.
- Added 'lightgbm' as "Enhances" dependency.
- Added 'h2o' as "Enhances" dependency.
- Anticipated changes in `predict()` arguments of LightGBM (data -> newdata, predcontrib = TRUE -> type = "contrib").
- More unit tests.
- Improved documentation.
- Fixed github installation instruction in README and vignette.
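
The switch from `sort_fun` to `order_fun` and the default formatter mentioned in the list above can be checked in base R. This is a small sketch on a made-up SHAP vector; it only reproduces the two function definitions quoted above and does not call {shapviz}.

```r
# Made-up SHAP values of a single observation
s <- c(x1 = -3, x2 = 0.5, x3 = 2)

# Old style: function result was passed to order()
sort_fun <- function(shap) abs(shap)
old_order <- order(sort_fun(s))

# New style: function returns the ordering itself
order_fun <- function(s) order(abs(s))
new_order <- order_fun(s)

identical(old_order, new_order)  # both defaults describe the same ordering
names(s)[new_order]              # features from smallest to largest |SHAP|

# Default formatter for SHAP and numeric feature values
format_shap <- function(z) prettyNum(z, digits = 3, scientific = FALSE)
format_shap(0.123456)
```

With the new interface, a custom ordering only needs to return an index vector, for instance `function(s) seq_along(s)` to keep the original column order.
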
This is the initial CRAN release.