Explainers¶
Simple example¶
In order to start an ExplainerDashboard you first need to construct an Explainer instance. They come in two flavours, and at their most basic they only need a model and a test set X and y:
from explainerdashboard import ClassifierExplainer, RegressionExplainer
explainer = ClassifierExplainer(model, X_test, y_test)
explainer = RegressionExplainer(model, X_test, y_test)
This is enough to launch an ExplainerDashboard:
from explainerdashboard import ExplainerDashboard
ExplainerDashboard(explainer).run()

Or you can use it interactively in a notebook to inspect your model using the built-in plotting methods, e.g.:
explainer.plot_confusion_matrix()
explainer.plot_contributions(index=0)
explainer.plot_dependence("Fare", color_col="Sex")

For the full lists of plots available see Plots.
Or you can start an interactive ExplainerComponent in your notebook using InlineExplainer, e.g.:
from explainerdashboard import InlineExplainer
InlineExplainer(explainer).tab.importances()
InlineExplainer(explainer).classifier.roc_auc()
InlineExplainer(explainer).regression.residuals_vs_col()
InlineExplainer(explainer).shap.overview()

Parameters¶
There are a number of optional parameters that either make sure the SHAP values get calculated in the appropriate way, or make the explainer's output a bit nicer and more convenient:
ClassifierExplainer(model, X_test, y_test,
shap='linear', # manually set shap type, overrides default 'guess'
X_background=X_train, # set background dataset for shap calculations
model_output='logodds', # set model_output to logodds (vs probability)
cats=['Sex', 'Deck', 'Embarked'], # makes it easy to group onehotencoded vars
idxs=test_names, # index with str identifier
index_name="Passenger", # description of index
descriptions=feature_descriptions, # show long feature descriptions in hovers
target='Survival', # the name of the target variable (y)
precision='float32', # save memory by setting lower precision. Default is 'float64'
labels=['Not survived', 'Survived']) # show target labels instead of ['0', '1']
cats¶
If you have onehot-encoded your categorical variables, they will show up as a lot of independent features. This clutters your feature space, and often makes it hard to interpret the effect of the underlying categorical feature.
You can pass a dict to the parameter cats specifying which are the onehot-encoded columns and what the grouped feature name should be:
ClassifierExplainer(model, X, y, cats={'Gender': ['Sex_male', 'Sex_female']})
If you encoded your feature with pd.get_dummies(df, prefix=['Name']), the resulting onehot-encoded columns will be named 'Name_John', 'Name_Mary', 'Name_Bob', etc. (in general CategoricalFeature_Category). In that case you can simply pass a list of the prefixes to cats:
ClassifierExplainer(model, X, y, cats=['Sex', 'Deck', 'Embarked'])
And you can also combine the two methods:
ClassifierExplainer(model, X, y,
cats=[{'Gender': ['Sex_male', 'Sex_female']}, 'Deck', 'Embarked'])
You can now use these categorical features directly as input for plotting methods, e.g. explainer.plot_dependence("Deck"), which will now generate violin plots instead of the default scatter plots.
cats_notencoded¶
When you have onehotencoded a categorical feature, you may have dropped some columns during feature selection. Or there may be new categories in the test set that were not encoded as columns in the training set. In those cases all columns in your onehot encoding may be equal to 0 for some rows. By default the value assigned to the aggregated feature for such cases is 'NOT_ENCODED', but this can be overridden with the cats_notencoded parameter:
ClassifierExplainer(model, X, y,
cats=[{'Gender': ['Sex_male', 'Sex_female']}, 'Deck', 'Embarked'],
cats_notencoded={'Gender': 'Gender Other', 'Deck': 'Unknown Deck', 'Embarked':'Stowaway'})
idxs¶
You may have specific identifiers (names, customer id's, etc) for each row in your dataset. By default X.index will be used to identify individual rows/records in the dashboard. You can index using either the numerical index, e.g. explainer.get_contrib_df(0) for the first row, or the identifier, e.g. explainer.get_contrib_df("Braund, Mr. Owen Harris"). You can override using X.index by passing a list/array/Series idxs to the explainer:
from explainerdashboard.datasets import titanic_names
test_names = titanic_names(test_only=True)
ClassifierExplainer(model, X_test, y_test, idxs=test_names)
index_name¶
By default X.index.name or idxs.name is used as the description of the index, but you can also pass it explicitly, e.g. index_name="Passenger".
descriptions¶
descriptions can be passed as a dictionary of descriptions for each feature. In order to be explanatory, you often have to explain the meaning of the features themselves (especially if the naming is not obvious). Passing the dict along to descriptions will show hover-over tooltips for the various features in the dashboard. If you grouped onehotencoded features with the cats parameter, you can also give descriptions of these groups, e.g.:
ClassifierExplainer(model, X, y,
cats=[{'Gender': ['Sex_male', 'Sex_female']}, 'Deck', 'Embarked'],
descriptions={
'Gender': 'Gender of the passenger',
'Fare': 'The price of the ticket paid for by the passenger',
'Deck': 'The deck of the cabin of the passenger',
'Age': 'Age of the passenger in years'
})
target¶
Name of the target variable. By default the name of y (y.name) is used if y is a pd.Series, else it defaults to 'target', but this can be overridden:
ClassifierExplainer(model, X, y, target="Survival")
labels¶
For ClassifierExplainer only: the outcome variable y of a classification is assumed to be encoded as 0, 1 (, 2, 3, ...). You can assign string labels by passing e.g. labels=['Not survived', 'Survived']:
ClassifierExplainer(model, X, y, labels=['Not survived', 'Survived'])
units¶
For RegressionExplainer only: the units of the y variable. E.g. if the model is predicting house prices in dollars you can set units='$'. If it is predicting maintenance time you can set units='hours', etc. This will then be displayed along the axes of various plots:
RegressionExplainer(model, X, y, units="$")
X_background¶
Some models like sklearn LogisticRegression (as well as certain gradient boosting algorithms such as xgboost in probability space) need a background dataset to calculate shap values. This can be passed as X_background. If you don't pass an X_background, the Explainer uses X instead, but issues a warning. (You want to limit the size of X_background in order to keep the SHAP calculations from getting too slow. Usually a representative background dataset of a couple of hundred rows should be enough to get decent shap values.)
model_output¶
By default model_output for classifiers is set to "probability", as this is more intuitively explainable to non data scientist stakeholders. However certain models (e.g. XGBClassifier, LGBMClassifier, CatBoostClassifier) need a background dataset X_background to calculate SHAP values in probability space, and are not able to calculate shap interaction values in probability space at all. Therefore you can also pass model_output='logodds', in which case shap values get calculated faster and interaction effects can be studied. Now you just need to explain to your stakeholders what logodds are :)
shap¶
By default shap='guess', which means that the Explainer will try to guess, based on the model, what kind of shap explainer it needs: e.g. shap.TreeExplainer(...), shap.LinearExplainer(...), etc. In case the guess fails or you'd like to override it, you can set it manually: e.g. shap='tree' for shap.TreeExplainer, shap='linear' for shap.LinearExplainer, shap='kernel' for shap.KernelExplainer, shap='deep' for shap.DeepExplainer, etc.
model_output, X_background example¶
An example of setting X_background and model_output with a LogisticRegression:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
explainer = ClassifierExplainer(model, X_test, y_test,
shap='linear',
X_background=X_train,
model_output='logodds')
ExplainerDashboard(explainer).run()
cv¶
Normally metrics and permutation importances get calculated over a single fold (assuming the data X is the test set). However if you pass the training set to the explainer, you may wish to calculate the permutation importances and metrics with cross-validation. In that case pass the number of folds to cv.
Note that custom metrics do not work with cross validation for now.
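For example (a minimal sketch, assuming you fitted the model on the training set that you now pass to the explainer):
ClassifierExplainer(model, X_train, y_train, cv=5)  # 5-fold cross-validated metrics and permutation importances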
na_fill¶
If you filled missing values with some extreme value such as -999 (typical for tree-based methods), these can mess up the horizontal axis of your plots. In order to filter these out, you need to tell the explainer what extreme value you used to fill. Defaults to -999.
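For example, if you imputed missing values with -1 before training (a hypothetical fill value, just for illustration):
ClassifierExplainer(model, X_test, y_test, na_fill=-1)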
precision¶
You can set the precision of the calculated shap values, predictions, etc., in order to save on memory usage. Default is 'float64', but 'float32' is probably fine, maybe even 'float16' for your application.
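For example:
ClassifierExplainer(model, X_test, y_test, precision='float32')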
Pre-calculated shap values¶
Perhaps you have already calculated the shap values somewhere, or you calculate them on a giant cluster somewhere, or your model supports GPU generated shap values. You can simply add these pre-calculated shap values to the explainer with the explainer.set_shap_values() and explainer.set_shap_interaction_values() methods.
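A minimal sketch, assuming a tree-based model so that the shap library can compute the values directly:
import shap

shap_explainer = shap.TreeExplainer(model)
explainer.set_shap_values(shap_explainer.expected_value, shap_explainer.shap_values(X_test))
explainer.set_shap_interaction_values(shap_explainer.shap_interaction_values(X_test))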
Plots¶
Classifier Plots¶
ClassifierExplainer defines a number of additional plotting methods:
plot_precision(bin_size=None, quantiles=None, cutoff=None, multiclass=False, pos_label=None)
plot_cumulative_precision(pos_label=None)
plot_classification(cutoff=0.5, percentage=True, pos_label=None)
plot_confusion_matrix(cutoff=0.5, normalized=False, binary=False, pos_label=None)
plot_lift_curve(cutoff=None, percentage=False, round=2, pos_label=None)
plot_roc_auc(cutoff=0.5, pos_label=None)
plot_pr_auc(cutoff=0.5, pos_label=None)
example code:
explainer = ClassifierExplainer(model, X, y, labels=['Not Survived', 'Survived'])
explainer.plot_confusion_matrix(cutoff=0.6)
explainer.plot_precision(quantiles=10, cutoff=0.6, multiclass=True)
explainer.plot_lift_curve(percentage=True)
explainer.plot_roc_auc(cutoff=0.7)
explainer.plot_pr_auc(cutoff=0.3)
More examples in the notebook on the github repo.
plot_precision¶
- ClassifierExplainer.plot_precision(bin_size=None, quantiles=None, cutoff=None, multiclass=False, pos_label=None)¶
plot precision vs predicted probability
plots predicted probability on the x-axis and observed precision (fraction of actual positive cases) on the y-axis.
Should pass either bin_size or quantiles, but not both.
- Parameters
bin_size (float, optional) – size of the bins on x-axis (e.g. 0.05 for 20 bins)
quantiles (int, optional) – number of equal-sized quantiles to split the predictions by, e.g. 20
cutoff – cutoff of model to include in the plot (Default value = None)
multiclass – whether to display all classes or only positive class, defaults to False
pos_label – positive label to display, defaults to self.pos_label
- Returns
Plotly fig
plot_cumulative_precision¶
- ClassifierExplainer.plot_cumulative_precision(percentile=None, pos_label=None)¶
plot cumulative precision
returns a cumulative precision plot, which is a slightly different representation of a lift curve.
- Parameters
pos_label – positive label to display, defaults to self.pos_label
- Returns
plotly fig
plot_classification¶
- ClassifierExplainer.plot_classification(cutoff=0.5, percentage=True, pos_label=None)¶
plot showing a barchart of the classification result for cutoff
- Parameters
cutoff (float, optional) – cutoff of positive class to calculate lift (Default value = 0.5)
percentage (bool, optional) – display percentages instead of counts, defaults to True
pos_label – positive label to display, defaults to self.pos_label
- Returns
plotly fig
plot_confusion_matrix¶
- ClassifierExplainer.plot_confusion_matrix(cutoff=0.5, percentage=False, normalize='all', binary=False, pos_label=None)¶
plot of a confusion matrix.
- Parameters
cutoff (float, optional, optional) – cutoff of positive class to calculate confusion matrix for, defaults to 0.5
percentage (bool, optional, optional) – display percentages instead of counts , defaults to False
normalize (str[‘observed’, ‘pred’, ‘all’]) – normalizes confusion matrix over the observed (rows), predicted (columns) conditions or all the population. Defaults to all.
binary (bool, optional, optional) – if multiclass display one-vs-rest instead, defaults to False
pos_label – positive label to display, defaults to self.pos_label
- Returns
plotly fig
plot_lift_curve¶
- ClassifierExplainer.plot_lift_curve(cutoff=None, percentage=False, add_wizard=True, round=2, pos_label=None)¶
plot of a lift curve.
- Parameters
cutoff (float, optional) – cutoff of positive class to calculate lift (Default value = None)
percentage (bool, optional) – display percentages instead of counts, defaults to False
add_wizard (bool, optional) – Add a line indicating how a perfect model would perform (“the wizard”). Defaults to True.
round – number of digits to round to (Default value = 2)
pos_label – positive label to display, defaults to self.pos_label
- Returns
plotly fig
plot_roc_auc¶
- ClassifierExplainer.plot_roc_auc(cutoff=0.5, pos_label=None)¶
plots ROC_AUC curve.
The TPR and FPR of a particular cutoff are displayed in crosshairs.
- Parameters
cutoff – cutoff value to be included in plot (Default value = 0.5)
pos_label – (Default value = None)
Returns:
plot_pr_auc¶
- ClassifierExplainer.plot_pr_auc(cutoff=0.5, pos_label=None)¶
plots PR_AUC curve.
The precision and recall of a particular cutoff are displayed in crosshairs.
- Parameters
cutoff – cutoff value to be included in plot (Default value = 0.5)
pos_label – (Default value = None)
Returns:
Regression Plots¶
The derived RegressionExplainer class again defines some additional plots:
explainer.plot_predicted_vs_actual(...)
explainer.plot_residuals(...)
explainer.plot_residuals_vs_feature(...)
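example code (a minimal sketch; "Age" is just an illustrative feature name):
explainer = RegressionExplainer(model, X, y, units="$")
explainer.plot_predicted_vs_actual(logs=True)
explainer.plot_residuals(vs_actual=True, residuals='ratio')
explainer.plot_residuals_vs_feature("Age")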
plot_predicted_vs_actual¶
- RegressionExplainer.plot_predicted_vs_actual(round=2, logs=False, log_x=False, log_y=False, plot_sample=None, **kwargs)¶
plot with predicted value on x-axis and actual value on y axis.
- Parameters
round (int, optional) – rounding to apply to outcome, defaults to 2
logs (bool, optional) – log both x and y axis, defaults to False
log_x (bool, optional) – only log the x axis. Defaults to False.
log_y (bool, optional) – only log the y axis. Defaults to False.
plot_sample (int, optional) – Instead of all points only plot a random sample of points. Defaults to None (=all points)
**kwargs –
- Returns
Plotly fig
plot_residuals¶
- RegressionExplainer.plot_residuals(vs_actual=False, round=2, residuals='difference', plot_sample=None)¶
plot of residuals. x-axis is the predicted outcome by default
- Parameters
vs_actual (bool, optional) – use actual value for x-axis, defaults to False
round (int, optional) – rounding to perform on values, defaults to 2
residuals (str, {'difference', 'ratio', 'log-ratio'} optional) – How to calculate residuals. Defaults to 'difference'.
plot_sample (int, optional) – Instead of all points only plot a random sample of points. Defaults to None (=all points)
- Returns
Plotly fig
plot_residuals_vs_feature¶
- RegressionExplainer.plot_residuals_vs_feature(col, residuals='difference', round=2, dropna=True, points=True, winsor=0, topx=None, sort='alphabet', plot_sample=None)¶
Plot residuals vs individual features
- Parameters
col (str) – Plot against feature col
residuals (str, {'difference', 'ratio', 'log-ratio'} optional) – How to calculate residuals. Defaults to 'difference'.
round (int, optional) – rounding to perform on residuals, defaults to 2
dropna (bool, optional) – drop missing values from plot, defaults to True.
points (bool, optional) – display point cloud next to violin plot. Defaults to True.
winsor (int, 0-50, optional) – percentage of outliers to winsor out of the y-axis. Defaults to 0.
plot_sample (int, optional) – Instead of all points only plot a random sample of points. Defaults to None (=all points)
- Returns
plotly fig
DecisionTree Plots¶
There are additional mixin classes specifically for sklearn RandomForests and for xgboost models that define additional methods and plots to investigate and visualize individual decision trees within the ensemble. These use the dtreeviz library to visualize individual decision trees.
You can get a pd.DataFrame summary of the path that a specific index row took through a specific decision tree. You can also plot the individual predictions of each individual tree for a specific row in your data, identified by index:
explainer.get_decisionpath_df(tree_idx, index)
explainer.get_decisionpath_summary_df(tree_idx, index)
explainer.plot_trees(index)
And for dtreeviz visualization of individual decision trees (svg format):
explainer.decisiontree(tree_idx, index)
explainer.decisiontree_file(tree_idx, index)
explainer.decisiontree_encoded(tree_idx, index)
These methods are part of the RandomForestExplainer and XGBExplainer mixin classes that get automatically loaded when you pass either a RandomForest or an XGBoost model.
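example code (a minimal sketch, assuming a fitted RandomForest model and that row 0 exists in X):
explainer.plot_trees(index=0)  # predictions of every individual tree for row 0
explainer.get_decisionpath_df(tree_idx=0, index=0)  # path of row 0 through the first tree
explainer.decisiontree(tree_idx=0, index=0, show_just_path=True)  # dtreeviz SVG of just the decision path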
plot_trees¶
- RandomForestExplainer.plot_trees(index, highlight_tree=None, round=2, higher_is_better=True, pos_label=None)¶
plot barchart of the predictions of each individual decision tree
- Parameters
index – index to display predictions for
highlight_tree – tree to highlight in plot (Default value = None)
round – rounding of numbers in plot (Default value = 2)
higher_is_better (bool) – flip red and green. Dummy bool for compatibility with gbm plot_trees().
pos_label – positive class (Default value = None)
Returns:
decisiontree¶
- RandomForestExplainer.decisiontree(tree_idx, index, show_just_path=False)¶
get a dtreeviz visualization of a particular tree in the random forest.
- Parameters
tree_idx – the n’th tree in the random forest
index – row index
show_just_path (bool, optional) – show only the path not rest of the tree. Defaults to False.
- Returns
an IPython display SVG object for e.g. jupyter notebook.
decisiontree_file¶
- RandomForestExplainer.decisiontree_file(tree_idx, index, show_just_path=False)¶
decisiontree_encoded¶
- RandomForestExplainer.decisiontree_encoded(tree_idx, index, show_just_path=False)¶
get a dtreeviz visualization of a particular tree in the random forest.
- Parameters
tree_idx – the n’th tree in the random forest
index – row index
show_just_path (bool, optional) – show only the path not rest of the tree. Defaults to False.
- Returns
a base64 encoded image, for inclusion in websites (e.g. dashboard)
Other explainer outputs¶
Base outputs¶
Some other useful tables and outputs you can get out of the explainer:
metrics()
get_mean_abs_shap_df(topx=None, cutoff=None, cats=False, pos_label=None)
get_permutation_importances_df(topx=None, cutoff=None, cats=False, pos_label=None)
get_importances_df(kind="shap", topx=None, cutoff=None, cats=False, pos_label=None)
get_contrib_df(index, cats=True, topx=None, cutoff=None, pos_label=None)
get_contrib_summary_df(index, cats=True, topx=None, cutoff=None, round=2, pos_label=None)
get_interactions_df(col, cats=False, topx=None, cutoff=None, pos_label=None)
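example code (a minimal sketch):
explainer.metrics()
explainer.get_mean_abs_shap_df(topx=10)
explainer.get_contrib_df(index=0, topx=5)  # top 5 shap contributions for the first row, remainder grouped as REST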
metrics¶
- BaseExplainer.metrics(*args, **kwargs)¶
returns a dict of metrics.
Implemented by either ClassifierExplainer or RegressionExplainer
metrics_descriptions¶
- ClassifierExplainer.metrics_descriptions(cutoff=0.5, round=3, pos_label=None)¶
Returns a metrics dict with the value replaced with a description/interpretation of the value
- Parameters
cutoff (float, optional) – Cutoff for calculating the metrics. Defaults to 0.5.
round (int, optional) – Round to apply to floats. Defaults to 3.
pos_label (None, optional) – positive label. Defaults to None.
- Returns
dict
- RegressionExplainer.metrics_descriptions(round=2)¶
Returns a metrics dict, with the metric values replaced by a descriptive string, explaining/interpreting the value of the metric
- Returns
dict
get_mean_abs_shap_df¶
- BaseExplainer.get_mean_abs_shap_df(topx=None, cutoff=None, pos_label=None)¶
sorted dataframe with mean_abs_shap
returns a pd.DataFrame with the mean absolute shap values per feature, sorted from highest to lowest.
- Parameters
topx (int, optional, optional) – Only return topx most importance features, defaults to None
cutoff (float, optional, optional) – Only return features with mean abs shap of at least cutoff, defaults to None
pos_label – (Default value = None)
- Returns
shap_df
- Return type
pd.DataFrame
get_permutation_importances_df¶
- BaseExplainer.get_permutation_importances_df(topx=None, cutoff=None, pos_label=None)¶
dataframe with features ordered by permutation importance.
For more about permutation importances, see https://explained.ai/rf-importance/index.html
- Parameters
topx (int, optional, optional) – only return topx most important features, defaults to None
cutoff (float, optional, optional) – only return features with importance of at least cutoff, defaults to None
pos_label – (Default value = None)
- Returns
importance_df
- Return type
pd.DataFrame
get_importances_df¶
- BaseExplainer.get_importances_df(kind='shap', topx=None, cutoff=None, pos_label=None)¶
wrapper function for get_mean_abs_shap_df() and get_permutation_importances_df()
- Parameters
kind (str) – ‘shap’ or ‘permutations’ (Default value = “shap”)
topx – only display topx highest features (Default value = None)
cutoff – only display features above cutoff (Default value = None)
pos_label – Positive class (Default value = None)
- Returns
pd.DataFrame
get_contrib_df¶
- BaseExplainer.get_contrib_df(index=None, X_row=None, topx=None, cutoff=None, sort='abs', pos_label=None)¶
shap value contributions to the prediction for index.
Used as input for the plot_contributions() method.
- Parameters
index (int or str) – index for which to calculate contributions
X_row (pd.DataFrame, single row) – single row of features for which to calculate contrib_df. Can use this instead of index
topx (int, optional) – Only return topx features, remainder called REST, defaults to None
cutoff (float, optional) – only return features with at least cutoff contributions, defaults to None
sort ({'abs', 'high-to-low', 'low-to-high', 'importance'}, optional) – sort by absolute shap value, or from high to low, low to high, or ordered by the global shap importances. Defaults to ‘abs’.
pos_label – (Default value = None)
- Returns
contrib_df
- Return type
pd.DataFrame
get_contrib_summary_df¶
- BaseExplainer.get_contrib_summary_df(index=None, X_row=None, topx=None, cutoff=None, round=2, sort='abs', pos_label=None)¶
Takes a contrib_df, and formats it to a more human readable format
- Parameters
index – index to show contrib_summary_df for
X_row (pd.DataFrame, single row) – single row of features for which to calculate contrib_df. Can use this instead of index
topx – Only show topx highest features (Default value = None)
cutoff – Only show features above cutoff (Default value = None)
round – round figures (Default value = 2)
sort ({'abs', 'high-to-low', 'low-to-high', 'importance'}, optional) – sort by absolute shap value, or from high to low, or low to high, or ordered by the global shap importances. Defaults to ‘abs’.
pos_label – Positive class (Default value = None)
- Returns
pd.DataFrame
get_interactions_df¶
- BaseExplainer.get_interactions_df(col, topx=None, cutoff=None, pos_label=None)¶
dataframe of mean absolute shap interaction values for col
- Parameters
col – Feature to get interactions_df for
topx – Only display topx most important features (Default value = None)
cutoff – Only display features with mean abs shap of at least cutoff (Default value = None)
pos_label – Positive class (Default value = None)
- Returns
pd.DataFrame
Classifier outputs¶
For ClassifierExplainer in addition:
random_index(y_values=None, return_str=False, pred_proba_min=None, pred_proba_max=None,
pred_percentile_min=None, pred_percentile_max=None, pos_label=None)
prediction_result_df(index, pos_label=None)
cutoff_from_percentile(percentile, pos_label=None)
get_precision_df(bin_size=None, quantiles=None, multiclass=False, round=3, pos_label=None)
get_liftcurve_df(pos_label=None)
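example code (a minimal sketch):
explainer.random_index(y_values=[0], pred_proba_min=0.8)  # random row with actual label 0 but pred_proba above 0.8
explainer.cutoff_from_percentile(0.9)  # cutoff that separates the top 10% of pred_proba
explainer.get_precision_df(quantiles=10)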
random_index¶
- ClassifierExplainer.random_index(y_values=None, return_str=False, pred_proba_min=None, pred_proba_max=None, pred_percentile_min=None, pred_percentile_max=None, pos_label=None)¶
random index satisfying various constraints
- Parameters
y_values – list of labels to include (Default value = None)
return_str – return str from self.idxs (Default value = False)
pred_proba_min – minimum pred_proba (Default value = None)
pred_proba_max – maximum pred_proba (Default value = None)
pred_percentile_min – minimum pred_proba percentile (Default value = None)
pred_percentile_max – maximum pred_proba percentile (Default value = None)
pos_label – positive class (Default value = None)
- Returns
index
cutoff_from_percentile¶
- ClassifierExplainer.cutoff_from_percentile(percentile, pos_label=None)¶
The cutoff equivalent to the percentile given
For example if you want the cutoff that splits the highest 20% pred_proba from the lowest 80%, you would set percentile=0.8 and get the correct cutoff.
- Parameters
percentile (float) – percentile to convert to cutoff
pos_label – positive class (Default value = None)
- Returns
cutoff
percentile_from_cutoff¶
- ClassifierExplainer.percentile_from_cutoff(cutoff, pos_label=None)¶
The percentile equivalent to the cutoff given
For example if you set the cutoff at 0.8, then what percentage of pred_proba is above this cutoff?
- Parameters
cutoff (float) – cutoff to convert to percentile
pos_label – positive class (Default value = None)
- Returns
percentile
get_precision_df¶
- ClassifierExplainer.get_precision_df(bin_size=None, quantiles=None, multiclass=False, round=3, pos_label=None)¶
dataframe with predicted probabilities and precision
- Parameters
bin_size (float, optional, optional) – group predictions in bins of size bin_size, defaults to 0.1
quantiles (int, optional, optional) – group predictions in evenly sized quantiles of size quantiles, defaults to None
multiclass (bool, optional, optional) – whether to calculate precision for every class (Default value = False)
round – (Default value = 3)
pos_label – (Default value = None)
- Returns
precision_df
- Return type
pd.DataFrame
get_liftcurve_df¶
- ClassifierExplainer.get_liftcurve_df(pos_label=None)¶
returns a pd.DataFrame with data needed to build a lift curve
- Parameters
pos_label – (Default value = None)
Returns:
get_classification_df¶
- ClassifierExplainer.get_classification_df(cutoff=0.5, pos_label=None)¶
Returns a dataframe with number of observations in each class above and below the cutoff.
- Parameters
cutoff (float, optional) – Cutoff to split on. Defaults to 0.5.
pos_label (int, optional) – Pos label to generate dataframe for. Defaults to self.pos_label.
- Returns
pd.DataFrame
roc_auc_curve¶
- ClassifierExplainer.roc_auc_curve(pos_label=None)¶
Returns a dict with output from sklearn.metrics.roc_curve() for pos_label: fpr, tpr, thresholds, score
pr_auc_curve¶
- ClassifierExplainer.pr_auc_curve(pos_label=None)¶
Returns a dict with output from sklearn.metrics.precision_recall_curve() for pos_label: fpr, tpr, thresholds, score
confusion_matrix¶
- ClassifierExplainer.confusion_matrix(cutoff=0.5, binary=True, pos_label=None)¶
Regression outputs¶
For RegressionExplainer:
random_index(y_min=None, y_max=None, pred_min=None, pred_max=None,
residuals_min=None, residuals_max=None,
abs_residuals_min=None, abs_residuals_max=None,
return_str=False)
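example code (a minimal sketch):
explainer.random_index(y_min=10, abs_residuals_min=5)  # random row with y >= 10 and an absolute residual of at least 5
explainer.random_index(pred_max=100, return_str=True)  # return the str identifier instead of the numerical index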
random_index¶
- RegressionExplainer.random_index(y_min=None, y_max=None, pred_min=None, pred_max=None, residuals_min=None, residuals_max=None, abs_residuals_min=None, abs_residuals_max=None, return_str=False, **kwargs)¶
random index conforming to various exclusion criteria
- Parameters
y_min – (Default value = None)
y_max – (Default value = None)
pred_min – (Default value = None)
pred_max – (Default value = None)
residuals_min – (Default value = None)
residuals_max – (Default value = None)
abs_residuals_min – (Default value = None)
abs_residuals_max – (Default value = None)
return_str – return the str index from self.idxs (Default value = False)
**kwargs –
- Returns
a random index that fits the exclusion criteria
RandomForest and XGBoost outputs¶
For RandomForest and XGBoost models, mixin classes that visualize individual decision trees will be loaded: RandomForestExplainer and XGBExplainer, with the following additional methods:
decisiontree_df(tree_idx, index, pos_label=None)
decisiontree_summary_df(tree_idx, index, round=2, pos_label=None)
decision_path_file(tree_idx, index)
decision_path_encoded(tree_idx, index)
decision_path(tree_idx, index)
get_decisionpath_df¶
- RandomForestExplainer.get_decisionpath_df(tree_idx, index, pos_label=None)¶
dataframe with all decision nodes of a particular decision tree for a particular observation.
- Parameters
tree_idx – the n’th tree in the random forest
index – row index
pos_label – positive class (Default value = None)
- Returns
dataframe with summary of the decision tree path
get_decisionpath_summary_df¶
- RandomForestExplainer.get_decisionpath_summary_df(tree_idx, index, round=2, pos_label=None)¶
formats decisiontree_df in a slightly more human readable format.
- Parameters
tree_idx – the n’th tree in the random forest or boosted ensemble
index – index
round – rounding to apply to floats (Default value = 2)
pos_label – positive class (Default value = None)
- Returns
dataframe with summary of the decision tree path
decisiontree_file¶
- RandomForestExplainer.decisiontree_file(tree_idx, index, show_just_path=False)¶
decisiontree_encoded¶
- RandomForestExplainer.decisiontree_encoded(tree_idx, index, show_just_path=False)¶
get a dtreeviz visualization of a particular tree in the random forest.
- Parameters
tree_idx – the n’th tree in the random forest
index – row index
show_just_path (bool, optional) – show only the path not rest of the tree. Defaults to False.
- Returns
a base64 encoded image, for inclusion in websites (e.g. dashboard)
decisiontree¶
- RandomForestExplainer.decisiontree(tree_idx, index, show_just_path=False)¶
get a dtreeviz visualization of a particular tree in the random forest.
- Parameters
tree_idx – the n’th tree in the random forest
index – row index
show_just_path (bool, optional) – show only the path not rest of the tree. Defaults to False.
- Returns
an IPython display SVG object for e.g. jupyter notebook.
Calculated Properties¶
In general Explainers don't calculate any properties of the model or the data until they are needed for an output, so-called lazy calculation. Once a property has been calculated, it is stored for next time. So the first time you invoke a plot involving shap values it may take a while to calculate; the next time will be basically instant. You can access these properties directly from the explainer, e.g. explainer.get_shap_values_df().
For classifier models, if you want values for a particular pos_label you can pass this label: e.g. explainer.get_shap_values_df(0) would get the shap values for the 0th class label.
In order to calculate all properties of the explainer at once, you can call explainer.calculate_properties(). (ExplainerComponents have a similar method component.calculate_dependencies() to calculate all properties that that specific component will need.)
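For example (a minimal sketch):
shap_df = explainer.get_shap_values_df()   # first call: shap values get calculated and stored
shap_df = explainer.get_shap_values_df()   # second call: returned instantly from storage
explainer.calculate_properties()           # or pre-calculate everything up front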
The various properties are:
explainer.preds
explainer.pred_percentiles
explainer.permutation_importances(pos_label)
explainer.mean_abs_shap_df(pos_label)
explainer.shap_base_value(pos_label)
explainer.get_shap_values_df(pos_label)
explainer.shap_interaction_values
For ClassifierExplainer:
explainer.y_binary
explainer.pred_probas_raw
explainer.pred_percentiles_raw
explainer.pred_probas(pos_label)
explainer.roc_auc_curve(pos_label)
explainer.pr_auc_curve(pos_label)
explainer.get_classification_df(cutoff, pos_label)
explainer.get_liftcurve_df(pos_label)
explainer.confusion_matrix(cutoff, binary, pos_label)
For RegressionExplainer:
explainer.residuals
explainer.abs_residuals
Setting pos_label¶
For ClassifierExplainer you can calculate most properties for multiple labels as the positive label. With a binary classification usually label '1' is the positive class, but in some cases you might also be interested in the '0' label. For multiclass classification you may want to investigate shap dependences for the various classes. You can pass a parameter pos_label to almost every property or method to get the output for that specific positive label. If you don't pass a pos_label manually to a specific method, the global self.pos_label will be used. You can set this directly on the explainer (you can even use str labels if you have set these):
explainer.pos_label = 0
explainer.plot_dependence("Fare") # will show plot for pos_label=0
explainer.pos_label = 'Survived'
explainer.plot_dependence("Fare") # will now show plot for pos_label=1
explainer.plot_dependence("Fare", pos_label=0) # show plot for label 0, without changing explainer.pos_label
The ExplainerDashboard will show a dropdown menu in the header to choose a particular pos_label. Changing this will basically update every single plot in the dashboard.
BaseExplainer¶
- class explainerdashboard.explainers.BaseExplainer(model, X, y=None, permutation_metric=sklearn.metrics.r2_score, shap='guess', X_background=None, model_output='raw', cats=None, cats_notencoded=None, idxs=None, index_name=None, target=None, descriptions=None, n_jobs=None, permutation_cv=None, cv=None, na_fill=-999, precision='float64', shap_kwargs=None)¶
Defines the basic functionality that is shared by both ClassifierExplainer and RegressionExplainer.
- Parameters
model – a model with a scikit-learn compatible .fit and .predict methods
X (pd.DataFrame) – a pd.DataFrame with your model features
y (pd.Series) – Dependent variable of your model, defaults to None
permutation_metric (function or str) – is a scikit-learn compatible metric function (or string). Defaults to r2_score
shap (str) – type of shap_explainer to fit: ‘tree’, ‘linear’, ‘kernel’. Defaults to ‘guess’.
X_background (pd.DataFrame) – background X to be used by shap explainers that need a background dataset (e.g. shap.KernelExplainer or shap.TreeExplainer with boosting models and model_output=’probability’).
model_output (str) – model_output of shap values, either ‘raw’, ‘logodds’ or ‘probability’. Defaults to ‘raw’ for regression and ‘probability’ for classification.
cats ({dict, list}) – dict of features that have been onehotencoded, e.g. cats={'Sex':['Sex_male', 'Sex_female']}. If all encoded columns are underscore-separated (as above), can simply pass a list of prefixes: cats=['Sex']. Allows to group onehot encoded categorical variables together in various plots. Defaults to None.
cats_notencoded (dict) – value to display when all onehot encoded columns are equal to zero. Defaults to ‘NOT_ENCODED’ for each onehot col.
idxs (pd.Series) – list of row identifiers. Can be names, id’s, etc. Defaults to X.index.
index_name (str) – identifier for row indexes. e.g. index_name=’Passenger’. Defaults to X.index.name or idxs.name.
target (Optional[str]) – name of the predicted target, e.g. "Survival", "Ticket price", etc. Defaults to y.name.
n_jobs (int) – for jobs that can be parallelized using joblib, how many processes to split the job in. For now only used for calculating permutation importances. Defaults to None.
permutation_cv (int) – Deprecated! Use parameter cv instead! (now also works for calculating metrics)
cv (int) – If not None then permutation importances and metrics will get calculated using cross validation across X. Use this when you are passing the training set to the explainer. Defaults to None.
na_fill (int) – The filler used for missing values, defaults to -999.
precision (str) – precision with which to store values. Defaults to "float64".
shap_kwargs (dict) – dictionary of keyword arguments to be passed to the shap explainer. Most typically used to suppress an additivity check, e.g. shap_kwargs=dict(check_additivity=False)
- get_permutation_importances_df(topx=None, cutoff=None, pos_label=None)¶
dataframe with features ordered by permutation importance.
For more about permutation importances, see https://explained.ai/rf-importance/index.html
- Parameters
topx (int, optional, optional) – only return topx most important features, defaults to None
cutoff (float, optional, optional) – only return features with importance of at least cutoff, defaults to None
pos_label – (Default value = None)
- Returns
importance_df
- Return type
pd.DataFrame
- get_shap_values_df(pos_label=None)¶
SHAP values calculated using the shap library
- set_shap_values(base_value, shap_values)¶
Set shap values manually. This is useful if you already have shap values calculated, and do not want to calculate them again inside the explainer instance. Especially for large models and large datasets you may want to calculate shap values on specialized hardware, and then add them to the explainer manually.
- Parameters
base_value (float) – the shap intercept generated by e.g. base_value = shap.TreeExplainer(model).expected_value
shap_values (np.ndarray) – Generated by e.g. shap_values = shap.TreeExplainer(model).shap_values(X_test)
- set_shap_interaction_values(shap_interaction_values)¶
Manually set shap interaction values in case you have already pre-computed these elsewhere and do not want to re-calculate them again inside the explainer instance.
- Parameters
shap_interaction_values (np.ndarray) – shap interaction values of shape (n, m, m)
- get_mean_abs_shap_df(topx=None, cutoff=None, pos_label=None)¶
sorted dataframe with mean_abs_shap
returns a pd.DataFrame with the mean absolute shap values per feature, sorted from highest to lowest.
- Parameters
topx (int, optional, optional) – Only return topx most importance features, defaults to None
cutoff (float, optional, optional) – Only return features with mean abs shap of at least cutoff, defaults to None
pos_label – (Default value = None)
- Returns
shap_df
- Return type
pd.DataFrame
- get_importances_df(kind='shap', topx=None, cutoff=None, pos_label=None)¶
wrapper function for get_mean_abs_shap_df() and get_permutation_importances_df()
- Parameters
kind (str) – ‘shap’ or ‘permutations’ (Default value = “shap”)
topx – only display topx highest features (Default value = None)
cutoff – only display features above cutoff (Default value = None)
pos_label – Positive class (Default value = None)
- Returns
pd.DataFrame
- plot_importances(kind='shap', topx=None, round=3, pos_label=None)¶
plot barchart of importances in descending order.
- Parameters
type (str, optional) – 'shap' for mean absolute shap values, 'permutation' for permutation importances, defaults to 'shap'
topx (int, optional, optional) – Only return topx features, defaults to None
kind – (Default value = ‘shap’)
round – (Default value = 3)
pos_label – (Default value = None)
- Returns
fig
- Return type
plotly.fig
- plot_importances_detailed(highlight_index=None, topx=None, max_cat_colors=5, plot_sample=None, pos_label=None)¶
Plot barchart of mean absolute shap value.
Displays all individual shap values for each feature in a horizontal scatter chart in descending order by mean absolute shap value.
- Parameters
highlight_index (str or int) – index to highlight
topx (int, optional) – Only display topx most important features, defaults to None
max_cat_colors (int, optional) – for categorical features, maximum number of categories to label with own color. Defaults to 5.
plot_sample (int, optional) – Instead of all points only plot a random sample of points. Defaults to None (=all points)
pos_label – positive class (Default value = None)
- Returns
plotly.Fig
- plot_contributions(index=None, X_row=None, topx=None, cutoff=None, sort='abs', orientation='vertical', higher_is_better=True, round=2, pos_label=None)¶
plot waterfall plot of shap value contributions to the model prediction for index.
- Parameters
index (int or str) – index for which to display prediction
X_row (pd.DataFrame single row) – a single row of features to plot shap contributions for. Can use this instead of index for what-if scenarios.
topx (int, optional, optional) – Only display topx features, defaults to None
cutoff (float, optional, optional) – Only display features with at least cutoff contribution, defaults to None
sort ({'abs', 'high-to-low', 'low-to-high', 'importance'}, optional) – sort by absolute shap value, or from high to low, or low to high, or by order of shap feature importance. Defaults to ‘abs’.
orientation ({'vertical', 'horizontal'}) – Horizontal or vertical bar chart. Horizontal may be better if you have lots of features. Defaults to ‘vertical’.
higher_is_better (bool) – if True, up=green, down=red. If false reversed. Defaults to True.
round (int, optional, optional) – round contributions to round precision, defaults to 2
pos_label – (Default value = None)
- Returns
fig
- Return type
plotly.Fig
- plot_dependence(col, color_col=None, highlight_index=None, topx=None, sort='alphabet', max_cat_colors=5, round=3, plot_sample=None, remove_outliers=False, pos_label=None)¶
plot shap dependence
- Plots a shap dependence plot: on the x-axis the possible values of the feature col, on the y-axis the associated individual shap values.
- Parameters
col (str) – feature to be displayed
color_col (str) – if color_col provided then shap values colored (blue-red) according to feature color_col (Default value = None)
highlight_index – individual observation to be highlighed in the plot. (Default value = None)
topx (int, optional) – for categorical features only display topx categories.
sort (str) – for categorical features, how to sort the categories: alphabetically ‘alphabet’, most frequent first ‘freq’, highest mean absolute value first ‘shap’. Defaults to ‘alphabet’.
max_cat_colors (int, optional) – for categorical features, maximum number of categories to label with own color. Defaults to 5.
round (int, optional) – rounding to apply to floats. Defaults to 3.
plot_sample (int, optional) – Instead of all points only plot a random sample of points. Defaults to None (=all points)
remove_outliers (bool, optional) – remove observations that are >1.5*IQR in either col or color_col. Defaults to False.
pos_label – positive class (Default value = None)
Returns:
- plot_interaction(col, interact_col, highlight_index=None, topx=10, sort='alphabet', max_cat_colors=5, plot_sample=None, remove_outliers=False, pos_label=None)¶
plots a dependence plot for shap interaction effects
- Parameters
col (str) – feature for which to find interaction values
interact_col (str) – feature for which interaction values are displayed
highlight_index (str, optional) – index that will be highlighted, defaults to None
topx (int, optional) – number of categorical features to display in violin plots.
sort (str, optional) – how to sort categorical features in violin plots. Should be in {‘alphabet’, ‘freq’, ‘shap’}.
max_cat_colors (int, optional) – for categorical features, maximum number of categories to label with own color. Defaults to 5.
plot_sample (int, optional) – Instead of all points only plot a random sample of points. Defaults to None (=all points)
remove_outliers (bool, optional) – remove observations that are >1.5*IQR in either col or color_col. Defaults to False.
pos_label – (Default value = None)
- Returns
Plotly Fig
- Return type
plotly.Fig
- plot_interactions_detailed(col, highlight_index=None, topx=None, max_cat_colors=5, plot_sample=None, pos_label=None)¶
Plot barchart of mean absolute shap interaction values
Displays all individual shap interaction values for each feature in a horizontal scatter chart in descending order by mean absolute shap value.
- Parameters
col (str) – feature for which to show interactions summary
highlight_index (str or int) – index to highlight
topx (int, optional) – only show topx most important features, defaults to None
max_cat_colors (int, optional) – for categorical features, maximum number of categories to label with own color. Defaults to 5.
plot_sample (int, optional) – Instead of all points only plot a random sample of points. Defaults to None (=all points)
pos_label – positive class (Default value = None)
- Returns
fig
- plot_pdp(col, index=None, X_row=None, drop_na=True, sample=100, gridlines=100, gridpoints=10, sort='freq', round=2, pos_label=None)¶
plot partial dependence plot (pdp)
returns a plotly fig for a partial dependence plot showing ICE lines for gridlines rows and the average pdp based on a sample of size sample. If index is given, display the pdp for this specific index.
- Parameters
col (str) – feature to display pdp graph for
index (int or str, optional, optional) – index to highlight in pdp graph, defaults to None
X_row (pd.Dataframe, single row, optional) – a row of features to highlight predictions for. Alternative to passing index.
drop_na (bool, optional, optional) – if true drop samples with value equal to na_fill, defaults to True
sample (int, optional, optional) – sample size on which the average pdp will be calculated, defaults to 100
gridlines (int, optional) – number of ice lines to display, defaults to 100
gridpoints (int, optional) – number of points on the x axis to calculate the pdp for, defaults to 10
sort (str, optional) – For categorical features: how to sort: ‘alphabet’, ‘freq’, ‘shap’. Defaults to ‘freq’.
round (int, optional) – round float prediction to number of digits. Defaults to 2.
pos_label – (Default value = None)
- Returns
fig
- Return type
plotly.Fig
ClassifierExplainer¶
For classification models (e.g. RandomForestClassifier) you use ClassifierExplainer. You can pass an additional parameter to __init__() with a list of label names. For multiclass classifiers you can set the positive class with e.g. explainer.pos_label=1. This will make sure that for example explainer.pred_probas will return the probability of that label.
More examples in the notebook on the github repo.
- class explainerdashboard.explainers.ClassifierExplainer(model, X, y=None, permutation_metric=sklearn.metrics.roc_auc_score, shap='guess', X_background=None, model_output='probability', cats=None, cats_notencoded=None, idxs=None, index_name=None, target=None, descriptions=None, n_jobs=None, permutation_cv=None, cv=None, na_fill=-999, precision='float64', shap_kwargs=None, labels=None, pos_label=1)
Explainer for classification models. Defines the shap values for each possible class in the classification.
You assign the positive label class afterwards with e.g. explainer.pos_label=0
In addition defines a number of plots specific to classification problems such as a precision plot, confusion matrix, roc auc curve and pr auc curve.
Compared to BaseExplainer, it defines two additional parameters:
- Parameters
model – a model with a scikit-learn compatible .fit and .predict methods
X (pd.DataFrame) – a pd.DataFrame with your model features
y (pd.Series) – Dependent variable of your model, defaults to None
permutation_metric (function or str) – a scikit-learn compatible metric function (or string). Defaults to roc_auc_score
shap (str) – type of shap_explainer to fit: ‘tree’, ‘linear’, ‘kernel’. Defaults to ‘guess’.
X_background (pd.DataFrame) – background X to be used by shap explainers that need a background dataset (e.g. shap.KernelExplainer or shap.TreeExplainer with boosting models and model_output=’probability’).
model_output (str) – model_output of shap values, either ‘raw’, ‘logodds’ or ‘probability’. Defaults to ‘raw’ for regression and ‘probability’ for classification.
cats ({dict, list}) – dict of features that have been onehotencoded, e.g. cats={'Sex':['Sex_male', 'Sex_female']}. If all encoded columns are underscore-separated (as above), can simply pass a list of prefixes: cats=['Sex']. Allows to group onehot encoded categorical variables together in various plots. Defaults to None.
cats_notencoded (dict) – value to display when all onehot encoded columns are equal to zero. Defaults to ‘NOT_ENCODED’ for each onehot col.
idxs (pd.Series) – list of row identifiers. Can be names, id’s, etc. Defaults to X.index.
index_name (str) – identifier for row indexes. e.g. index_name=’Passenger’. Defaults to X.index.name or idxs.name.
target (Optional[str]) – name of the predicted target, e.g. "Survival", "Ticket price", etc. Defaults to y.name.
n_jobs (int) – for jobs that can be parallelized using joblib, how many processes to split the job in. For now only used for calculating permutation importances. Defaults to None.
permutation_cv (int) – Deprecated! Use parameter cv instead! (now also works for calculating metrics)
cv (int) – If not None then permutation importances and metrics will get calculated using cross validation across X. Use this when you are passing the training set to the explainer. Defaults to None.
na_fill (int) – The filler used for missing values, defaults to -999.
precision (str) – precision with which to store values. Defaults to "float64".
shap_kwargs (dict) – dictionary of keyword arguments to be passed to the shap explainer. Most typically used to suppress an additivity check, e.g. shap_kwargs=dict(check_additivity=False)
labels (list) – list of str labels for the different classes, defaults to e.g. [‘0’, ‘1’] for a binary classification
pos_label (
int
) – class that should be used as the positive class, defaults to 1
- set_shap_values(base_value, shap_values)
Set shap values manually. This is useful if you already have shap values calculated, and do not want to calculate them again inside the explainer instance. Especially for large models and large datasets you may want to calculate shap values on specialized hardware, and then add them to the explainer manually.
- Parameters
base_value (list[float]) – list of shap intercepts generated by e.g. base_value = shap.TreeExplainer(model).expected_value. Should be a list with a float for each class. For binary classification and some models shap only provides the base value for the positive class, in which case you need to provide [1-base_value, base_value] or [-base_value, base_value] depending on whether the shap values are for probabilities or logodds.
shap_values (list[np.ndarray]) – Generated by e.g. shap_values = shap.TreeExplainer(model).shap_values(X_test) For binary classification and some models shap only provides the shap values for the positive class, in which case you need to provide [1-shap_values, shap_values] or [-shap_values, shap_values] depending on whether the shap values are for probabilities or logodds.
- set_shap_interaction_values(shap_interaction_values)
Manually set shap interaction values in case you have already pre-computed these elsewhere and do not want to re-calculate them again inside the explainer instance.
- Parameters
shap_interaction_values (np.ndarray) – shap interaction values of shape (n, m, m)
- random_index(y_values=None, return_str=False, pred_proba_min=None, pred_proba_max=None, pred_percentile_min=None, pred_percentile_max=None, pos_label=None)
random index satisfying various constraints
- Parameters
y_values – list of labels to include (Default value = None)
return_str – return str from self.idxs (Default value = False)
pred_proba_min – minimum pred_proba (Default value = None)
pred_proba_max – maximum pred_proba (Default value = None)
pred_percentile_min – minimum pred_proba percentile (Default value = None)
pred_percentile_max – maximum pred_proba percentile (Default value = None)
pos_label – positive class (Default value = None)
- Returns
index
- get_precision_df(bin_size=None, quantiles=None, multiclass=False, round=3, pos_label=None)
dataframe with predicted probabilities and precision
- Parameters
bin_size (float, optional, optional) – group predictions in bins of size bin_size, defaults to 0.1
quantiles (int, optional, optional) – group predictions in evenly sized quantiles of size quantiles, defaults to None
multiclass (bool, optional, optional) – whether to calculate precision for every class (Default value = False)
round – (Default value = 3)
pos_label – (Default value = None)
- Returns
precision_df
- Return type
pd.DataFrame
- get_liftcurve_df(pos_label=None)
returns a pd.DataFrame with data needed to build a lift curve
- Parameters
pos_label – (Default value = None)
Returns:
- get_classification_df(cutoff=0.5, pos_label=None)
Returns a dataframe with number of observations in each class above and below the cutoff.
- Parameters
cutoff (float, optional) – Cutoff to split on. Defaults to 0.5.
pos_label (int, optional) – Pos label to generate dataframe for. Defaults to self.pos_label.
- Returns
pd.DataFrame
- plot_precision(bin_size=None, quantiles=None, cutoff=None, multiclass=False, pos_label=None)
plot precision vs predicted probability
plots predicted probability on the x-axis and observed precision (fraction of actual positive cases) on the y-axis.
Should pass either bin_size or quantiles, but not both.
- Parameters
bin_size (float, optional) – size of the bins on x-axis (e.g. 0.05 for 20 bins)
quantiles (int, optional) – number of equal-sized quantiles to split the predictions by, e.g. 20
cutoff – cutoff of model to include in the plot (Default value = None)
multiclass – whether to display all classes or only positive class, defaults to False
pos_label – positive label to display, defaults to self.pos_label
- Returns
Plotly fig
- plot_cumulative_precision(percentile=None, pos_label=None)
plot cumulative precision
returns a cumulative precision plot, which is a slightly different representation of a lift curve.
- Parameters
pos_label – positive label to display, defaults to self.pos_label
- Returns
plotly fig
- plot_confusion_matrix(cutoff=0.5, percentage=False, normalize='all', binary=False, pos_label=None)
plot of a confusion matrix.
- Parameters
cutoff (float, optional, optional) – cutoff of positive class to calculate confusion matrix for, defaults to 0.5
percentage (bool, optional, optional) – display percentages instead of counts , defaults to False
normalize (str[‘observed’, ‘pred’, ‘all’]) – normalizes confusion matrix over the observed (rows), predicted (columns) conditions or all the population. Defaults to all.
binary (bool, optional, optional) – if multiclass display one-vs-rest instead, defaults to False
pos_label – positive label to display, defaults to self.pos_label
- Returns
plotly fig
- plot_lift_curve(cutoff=None, percentage=False, add_wizard=True, round=2, pos_label=None)
plot of a lift curve.
- Parameters
cutoff (float, optional) – cutoff of positive class to calculate lift (Default value = None)
percentage (bool, optional) – display percentages instead of counts, defaults to False
add_wizard (bool, optional) – Add a line indicating how a perfect model would perform (“the wizard”). Defaults to True.
round – number of digits to round to (Default value = 2)
pos_label – positive label to display, defaults to self.pos_label
- Returns
plotly fig
- plot_classification(cutoff=0.5, percentage=True, pos_label=None)
plot showing a barchart of the classification result for cutoff
- Parameters
cutoff (float, optional) – cutoff of positive class to calculate lift (Default value = 0.5)
percentage (bool, optional) – display percentages instead of counts, defaults to True
pos_label – positive label to display, defaults to self.pos_label
- Returns
plotly fig
- plot_roc_auc(cutoff=0.5, pos_label=None)
plots ROC_AUC curve.
The TPR and FPR of a particular cutoff are displayed in crosshairs.
- Parameters
cutoff – cutoff value to be included in plot (Default value = 0.5)
pos_label – (Default value = None)
Returns:
- plot_pr_auc(cutoff=0.5, pos_label=None)
plots PR_AUC curve.
The precision and recall of a particular cutoff are displayed in crosshairs.
- Parameters
cutoff – cutoff value to be included in plot (Default value = 0.5)
pos_label – (Default value = None)
Returns:
RegressionExplainer¶
For regression models (e.g. RandomForestRegressor) you use RegressionExplainer. You can pass units as an additional parameter for the units of the target variable (e.g. units="$").
More examples in the notebook on the github repo.
- class explainerdashboard.explainers.RegressionExplainer(model, X, y=None, permutation_metric=sklearn.metrics.r2_score, shap='guess', X_background=None, model_output='raw', cats=None, cats_notencoded=None, idxs=None, index_name=None, target=None, descriptions=None, n_jobs=None, permutation_cv=None, cv=None, na_fill=-999, precision='float64', shap_kwargs=None, units='')
Explainer for regression models.
In addition to BaseExplainer, it defines a number of plots specific to regression problems, such as a predicted vs actual plot and residual plots.
Compared to BaseExplainer, it defines one additional parameter (units).
- Parameters
model – a model with a scikit-learn compatible .fit and .predict methods
X (pd.DataFrame) – a pd.DataFrame with your model features
y (pd.Series) – Dependent variable of your model, defaults to None
permutation_metric (function or str) – is a scikit-learn compatible metric function (or string). Defaults to r2_score
shap (str) – type of shap_explainer to fit: ‘tree’, ‘linear’, ‘kernel’. Defaults to ‘guess’.
X_background (pd.DataFrame) – background X to be used by shap explainers that need a background dataset (e.g. shap.KernelExplainer or shap.TreeExplainer with boosting models and model_output=’probability’).
model_output (str) – model_output of shap values, either ‘raw’, ‘logodds’ or ‘probability’. Defaults to ‘raw’ for regression and ‘probability’ for classification.
cats ({dict, list}) – dict of features that have been onehotencoded, e.g. cats={'Sex':['Sex_male', 'Sex_female']}. If all encoded columns are underscore-separated (as above), can simply pass a list of prefixes: cats=['Sex']. Allows to group onehot encoded categorical variables together in various plots. Defaults to None.
cats_notencoded (dict) – value to display when all onehot encoded columns are equal to zero. Defaults to ‘NOT_ENCODED’ for each onehot col.
idxs (pd.Series) – list of row identifiers. Can be names, id’s, etc. Defaults to X.index.
index_name (str) – identifier for row indexes. e.g. index_name=’Passenger’. Defaults to X.index.name or idxs.name.
target (Optional[str]) – name of the predicted target, e.g. "Survival", "Ticket price", etc. Defaults to y.name.
n_jobs (int) – for jobs that can be parallelized using joblib, how many processes to split the job in. For now only used for calculating permutation importances. Defaults to None.
permutation_cv (int) – Deprecated! Use parameter cv instead! (now also works for calculating metrics)
cv (int) – If not None then permutation importances and metrics will get calculated using cross validation across X. Use this when you are passing the training set to the explainer. Defaults to None.
na_fill (int) – The filler used for missing values, defaults to -999.
precision (str) – precision with which to store values. Defaults to "float64".
shap_kwargs (dict) – dictionary of keyword arguments to be passed to the shap explainer. Most typically used to suppress an additivity check, e.g. shap_kwargs=dict(check_additivity=False)
units (str) – units to display for regression quantity
- property residuals
y-preds
- Type
residuals
- random_index(y_min=None, y_max=None, pred_min=None, pred_max=None, residuals_min=None, residuals_max=None, abs_residuals_min=None, abs_residuals_max=None, return_str=False, **kwargs)
random index conforming to various exclusion criteria
- Parameters
y_min – (Default value = None)
y_max – (Default value = None)
pred_min – (Default value = None)
pred_max – (Default value = None)
residuals_min – (Default value = None)
residuals_max – (Default value = None)
abs_residuals_min – (Default value = None)
abs_residuals_max – (Default value = None)
return_str – return the str index from self.idxs (Default value = False)
**kwargs –
- Returns
a random index that fits the exclusion criteria
- metrics(show_metrics=None)
dict of performance metrics: root_mean_squared_error, mean_absolute_error and R-squared
- Parameters
show_metrics (List) – list of metrics to display in order. Defaults to None, displaying all metrics.
- plot_predicted_vs_actual(round=2, logs=False, log_x=False, log_y=False, plot_sample=None, **kwargs)
plot with predicted value on x-axis and actual value on y axis.
- Parameters
round (int, optional) – rounding to apply to outcome, defaults to 2
logs (bool, optional) – log both x and y axis, defaults to False
log_x (bool, optional) – only log the x axis. Defaults to False.
log_y (bool, optional) – only log the y axis. Defaults to False.
plot_sample (int, optional) – Instead of all points only plot a random sample of points. Defaults to None (=all points)
**kwargs –
- Returns
Plotly fig
- plot_residuals(vs_actual=False, round=2, residuals='difference', plot_sample=None)
plot of residuals. x-axis is the predicted outcome by default
- Parameters
vs_actual (bool, optional) – use actual value for x-axis, defaults to False
round (int, optional) – rounding to perform on values, defaults to 2
residuals (str, {'difference', 'ratio', 'log-ratio'} optional) – How to calculate residuals. Defaults to 'difference'.
plot_sample (int, optional) – Instead of all points only plot a random sample of points. Defaults to None (=all points)
- Returns
Plotly fig
- plot_residuals_vs_feature(col, residuals='difference', round=2, dropna=True, points=True, winsor=0, topx=None, sort='alphabet', plot_sample=None)
Plot residuals vs individual features
- Parameters
col (str) – Plot against feature col
residuals (str, {'difference', 'ratio', 'log-ratio'} optional) – How to calculate residuals. Defaults to 'difference'.
round (int, optional) – rounding to perform on residuals, defaults to 2
dropna (bool, optional) – drop missing values from plot, defaults to True.
points (bool, optional) – display point cloud next to violin plot. Defaults to True.
winsor (int, 0-50, optional) – percentage of outliers to winsor out of the y-axis. Defaults to 0.
plot_sample (int, optional) – Instead of all points only plot a random sample of points. Defaults to None (=all points)
- Returns
plotly fig