Explainers

Simple example

In order to start an ExplainerDashboard you first need to construct an Explainer instance. They come in two flavours, and at their most basic they only need a model and a test set X and y:

from explainerdashboard import ClassifierExplainer, RegressionExplainer

explainer = ClassifierExplainer(model, X_test, y_test)
explainer = RegressionExplainer(model, X_test, y_test)

This is enough to launch an ExplainerDashboard:

from explainerdashboard import ExplainerDashboard
ExplainerDashboard(explainer).run()

Or you can use it interactively in a notebook to inspect your model using the built-in plotting methods, e.g.:

explainer.plot_confusion_matrix()
explainer.plot_contributions(index=0)
explainer.plot_dependence("Fare", color_col="Sex")

For the full list of available plots, see Plots.

Or you can start an interactive ExplainerComponent in your notebook using InlineExplainer, e.g.:

from explainerdashboard import InlineExplainer

InlineExplainer(explainer).tab.importances()
InlineExplainer(explainer).classifier.roc_auc()
InlineExplainer(explainer).regression.residuals_vs_col()
InlineExplainer(explainer).shap.overview()

Parameters

There are a number of optional parameters that can either make sure that SHAP values get calculated in the appropriate way, or that make the explainer's output a bit nicer and more convenient:

ClassifierExplainer(model, X_test, y_test,
        shap='linear', # manually set shap type, overrides default 'guess'
        X_background=X_train, # set background dataset for shap calculations
        model_output='logodds', # set model_output to logodds (vs probability)
        cats=['Sex', 'Deck', 'Embarked'], # makes it easy to group onehotencoded vars
        idxs=test_names, # index with str identifier
        index_name="Passenger", # description of index
        descriptions=feature_descriptions, # show long feature descriptions in hovers
        target='Survival', # the name of the target variable (y)
        precision='float32', # save memory by setting lower precision. Default is 'float64'
        labels=['Not survived', 'Survived']) # show target labels instead of ['0', '1']

cats

If you have onehot-encoded your categorical variables, they will show up as a lot of independent features. This clutters your feature space, and often makes it hard to interpret the effect of the underlying categorical feature.

You can pass a dict to the parameter cats specifying which are the onehotencoded columns, and what the grouped feature name should be:

ClassifierExplainer(model, X, y, cats={'Gender': ['Sex_male', 'Sex_female']})

If you encoded your features with pd.get_dummies(df, prefix=['Name']), the resulting onehot encoded columns will be named 'Name_John', 'Name_Mary', 'Name_Bob', etc. (or in general CategoricalFeature_Category). In that case you can simply pass a list of the prefixes to cats:

ClassifierExplainer(model, X, y, cats=['Sex', 'Deck', 'Embarked'])

And you can also combine the two methods:

ClassifierExplainer(model, X, y,
    cats=[{'Gender': ['Sex_male', 'Sex_female']}, 'Deck', 'Embarked'])

You can now use these categorical features directly as input for plotting methods, e.g. explainer.plot_dependence("Deck"), which will now generate violin plots instead of the default scatter plots.
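For example, a minimal sketch (assuming the explainer above, with 'Deck' grouped via cats):

explainer.plot_dependence("Deck")  # violin plot of shap values per deck category
explainer.plot_dependence("Fare", color_col="Deck")  # scatter colored by the grouped feature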

cats_notencoded

When you have onehotencoded a categorical feature, you may have dropped some columns during feature selection. Or there may be new categories in the test set that were not encoded as columns in the training set. In those cases all columns in your onehot encoding may be equal to 0 for some rows. By default the value assigned to the aggregated feature for such cases is 'NOT_ENCODED', but this can be overridden with the cats_notencoded parameter:

ClassifierExplainer(model, X, y,
    cats=[{'Gender': ['Sex_male', 'Sex_female']}, 'Deck', 'Embarked'],
    cats_notencoded={'Gender': 'Gender Other', 'Deck': 'Unknown Deck', 'Embarked':'Stowaway'})

idxs

You may have specific identifiers (names, customer ids, etc.) for each row in your dataset. By default X.index will be used to identify individual rows/records in the dashboard. You can then index using both the numerical index, e.g. explainer.get_contrib_df(0) for the first row, or using the identifier, e.g. explainer.get_contrib_df("Braund, Mr. Owen Harris").

You can override using X.index by passing a list/array/Series idxs to the explainer:

from explainerdashboard.datasets import titanic_names

test_names = titanic_names(test_only=True)
ClassifierExplainer(model, X_test, y_test, idxs=test_names)

index_name

By default X.index.name or idxs.name is used as the description of the index, but you can also pass it explicitly, e.g.: index_name="Passenger".
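For example, combining the idxs from the example above with an explicit index_name:

ClassifierExplainer(model, X_test, y_test, idxs=test_names, index_name="Passenger")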

descriptions

descriptions can be passed as a dictionary of descriptions for each feature. In order to be explanatory, you often have to explain the meaning of the features themselves (especially if the naming is not obvious). Passing the dict along to descriptions will show hover-over tooltips for the various features in the dashboard. If you grouped onehotencoded features with the cats parameter, you can also give descriptions of these groups, e.g.:

ClassifierExplainer(model, X, y,
    cats=[{'Gender': ['Sex_male', 'Sex_female']}, 'Deck', 'Embarked'],
    descriptions={
        'Gender': 'Gender of the passenger',
        'Fare': 'The price of the ticket paid for by the passenger',
        'Deck': 'The deck of the cabin of the passenger',
        'Age': 'Age of the passenger in years'
    })

target

Name of the target variable. By default the name of y (y.name) is used if y is a pd.Series, otherwise it defaults to 'target', but this can be overridden:

ClassifierExplainer(model, X, y, target="Survival")

labels

For ClassifierExplainer only: the outcome variables for a classification y are assumed to be encoded as 0, 1 (, 2, 3, ...). You can assign string labels by passing e.g. labels=['Not survived', 'Survived']:

ClassifierExplainer(model, X, y, labels=['Not survived', 'Survived'])

units

For RegressionExplainer only: the units of the y variable. E.g. if the model is predicting house prices in dollars you can set units='$'. If it is predicting maintenance time you can set units='hours', etc. This will then be displayed along the axis of various plots:

RegressionExplainer(model, X, y, units="$")

X_background

Some models like sklearn LogisticRegression (as well as certain gradient boosting algorithms such as xgboost in probability space) need a background dataset to calculate shap values. These can be passed as X_background. If you don't pass an X_background, the Explainer uses X instead, but gives off a warning. (You want to limit the size of X_background in order to keep the SHAP calculations from getting too slow. Usually a representative background dataset of a couple of hundred rows should be enough to get decent shap values.)
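For example, a minimal sketch that samples a small representative background set from the training data (the 200-row sample size is illustrative):

X_background = X_train.sample(200, random_state=42)
explainer = ClassifierExplainer(model, X_test, y_test, X_background=X_background)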

model_output

By default model_output for classifiers is set to "probability", as this is more intuitively explainable to non data scientist stakeholders. However certain models (e.g. XGBClassifier, LGBMClassifier, CatBoostClassifier) need a background dataset X_background to calculate SHAP values in probability space, and are not able to calculate shap interaction values in probability space at all. Therefore you can also pass model_output='logodds', in which case shap values get calculated faster and interaction effects can be studied. Now you just need to explain to your stakeholders what logodds are :)

shap

By default shap='guess', which means that the Explainer will try to guess based on the model what kind of shap explainer it needs: e.g. shap.TreeExplainer(...), shap.LinearExplainer(...), etc.

In case the guess fails or you’d like to override it, you can set it manually: e.g. shap='tree' for shap.TreeExplainer, shap='linear' for shap.LinearExplainer, shap='kernel' for shap.KernelExplainer, shap='deep' for shap.DeepExplainer, etc.
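For example, to force a KernelExplainer (which, as noted under X_background, needs a background dataset; the sample size here is illustrative):

explainer = ClassifierExplainer(model, X_test, y_test,
                                shap='kernel',
                                X_background=X_train.sample(100, random_state=42))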

model_output, X_background example

An example of setting X_background and model_output with a LogisticRegression:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

explainer = ClassifierExplainer(model, X_test, y_test,
                                    shap='linear',
                                    X_background=X_train,
                                    model_output='logodds')
ExplainerDashboard(explainer).run()

cv

Normally metrics and permutation importances get calculated over a single fold (assuming the data X is the test set). However if you pass the training set to the explainer, you may wish to calculate the permutation importances and metrics with cross-validation. In that case pass the number of folds to cv. Note that custom metrics do not work with cross validation for now.
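A minimal sketch, assuming you are passing the training set:

explainer = ClassifierExplainer(model, X_train, y_train, cv=5)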

na_fill

If you fill missing values with some extreme value such as -999 (typical for tree based methods), these can mess with the horizontal axis of your plots. In order to filter these out, you need to tell the explainer which extreme value you used to fill. Defaults to -999.
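For example, if you used -99 instead to fill missing values:

explainer = ClassifierExplainer(model, X_test, y_test, na_fill=-99)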

precision

You can set the precision of the calculated shap values, predictions, etc., in order to save on memory usage. Default is 'float64', but 'float32' is probably fine, and maybe even 'float16', depending on your application.

Pre-calculated shap values

Perhaps you have already calculated the shap values somewhere, or you can calculate them on a giant cluster somewhere, or your model supports GPU-generated shap values.

You can simply add these pre-calculated shap values to the explainer with the explainer.set_shap_values() and explainer.set_shap_interaction_values() methods.
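A minimal sketch using shap's TreeExplainer to pre-calculate the values (e.g. on specialized hardware) and then add them to the explainer:

import shap

shap_explainer = shap.TreeExplainer(model)
shap_values = shap_explainer.shap_values(X_test)   # pre-calculated shap values
base_value = shap_explainer.expected_value         # the shap intercept(s)

explainer.set_shap_values(base_value, shap_values)
explainer.set_shap_interaction_values(shap_explainer.shap_interaction_values(X_test))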

Plots

Shared Plots

The abstract base class BaseExplainer defines most of the functionality, such as feature importances (both SHAP and permutation based), SHAP values, SHAP interaction values, partial dependences and individual contributions, along with a number of convenient plotting methods. In practice you will use ClassifierExplainer or RegressionExplainer, but both inherit all of these basic methods:

plot_importances(kind='shap', topx=None, round=3, pos_label=None)
plot_contributions(index, topx=None, cutoff=None, round=2, pos_label=None)
plot_importances_detailed(topx=None, pos_label=None)
plot_interactions_detailed(col, topx=None, pos_label=None)
plot_dependence(col, color_col=None, highlight_idx=None, pos_label=None)
plot_interaction(interact_col, highlight_idx=None, pos_label=None)
plot_pdp(col, index=None, drop_na=True, sample=100, num_grid_lines=100, num_grid_points=10, pos_label=None)

example code:

explainer = ClassifierExplainer(model, X, y, cats=['Sex', 'Deck', 'Embarked'])
explainer.plot_importances()
explainer.plot_contributions(index=0, topx=5)
explainer.plot_dependence("Fare")
explainer.plot_interaction("Embarked", "PassengerClass")
explainer.plot_pdp("Sex", index=0)

plot_importances

BaseExplainer.plot_importances(kind='shap', topx=None, round=3, pos_label=None)

plot barchart of importances in descending order.

Parameters
  • kind (str, optional) – ‘shap’ for mean absolute shap values, ‘permutation’ for permutation importances, defaults to ‘shap’

  • topx (int, optional, optional) – Only return topx features, defaults to None

  • round – (Default value = 3)

  • pos_label – (Default value = None)

Returns

fig

Return type

plotly.fig

plot_importances_detailed

BaseExplainer.plot_importances_detailed(highlight_index=None, topx=None, max_cat_colors=5, plot_sample=None, pos_label=None)

Plot barchart of mean absolute shap value.

Displays all individual shap values for each feature in a horizontal scatter chart, in descending order by mean absolute shap value.

Parameters
  • highlight_index (str or int) – index to highlight

  • topx (int, optional) – Only display topx most important features, defaults to None

  • max_cat_colors (int, optional) – for categorical features, maximum number of categories to label with own color. Defaults to 5.

  • plot_sample (int, optional) – Instead of all points only plot a random sample of points. Defaults to None (=all points)

  • pos_label – positive class (Default value = None)

Returns

plotly.Fig

plot_contributions

BaseExplainer.plot_contributions(index=None, X_row=None, topx=None, cutoff=None, sort='abs', orientation='vertical', higher_is_better=True, round=2, pos_label=None)

plot waterfall plot of shap value contributions to the model prediction for index.

Parameters
  • index (int or str) – index for which to display prediction

  • X_row (pd.DataFrame single row) – a single row of features to plot shap contributions for. Can use this instead of index for what-if scenarios.

  • topx (int, optional, optional) – Only display topx features, defaults to None

  • cutoff (float, optional, optional) – Only display features with at least cutoff contribution, defaults to None

  • sort ({'abs', 'high-to-low', 'low-to-high', 'importance'}, optional) – sort by absolute shap value, or from high to low, or low to high, or by order of shap feature importance. Defaults to ‘abs’.

  • orientation ({'vertical', 'horizontal'}) – Horizontal or vertical bar chart. Horizontal may be better if you have lots of features. Defaults to ‘vertical’.

  • higher_is_better (bool) – if True, up=green, down=red. If false reversed. Defaults to True.

  • round (int, optional, optional) – round contributions to round precision, defaults to 2

  • pos_label – (Default value = None)

Returns

fig

Return type

plotly.Fig

plot_dependence

BaseExplainer.plot_dependence(col, color_col=None, highlight_index=None, topx=None, sort='alphabet', max_cat_colors=5, round=3, plot_sample=None, remove_outliers=False, pos_label=None)

plot shap dependence

Plots a shap dependence plot:
  • on the x axis the possible values of the feature col

  • on the y axis the associated individual shap values

Parameters
  • col (str) – feature to be displayed

  • color_col (str) – if color_col provided then shap values colored (blue-red) according to feature color_col (Default value = None)

  • highlight_index – individual observation to be highlighted in the plot. (Default value = None)

  • topx (int, optional) – for categorical features only display topx categories.

  • sort (str) – for categorical features, how to sort the categories: alphabetically ‘alphabet’, most frequent first ‘freq’, highest mean absolute value first ‘shap’. Defaults to ‘alphabet’.

  • max_cat_colors (int, optional) – for categorical features, maximum number of categories to label with own color. Defaults to 5.

  • round (int, optional) – rounding to apply to floats. Defaults to 3.

  • plot_sample (int, optional) – Instead of all points only plot a random sample of points. Defaults to None (=all points)

  • remove_outliers (bool, optional) – remove observations that are >1.5*IQR in either col or color_col. Defaults to False.

  • pos_label – positive class (Default value = None)

Returns

plotly.Fig

plot_interaction

BaseExplainer.plot_interaction(col, interact_col, highlight_index=None, topx=10, sort='alphabet', max_cat_colors=5, plot_sample=None, remove_outliers=False, pos_label=None)

plots a dependence plot for shap interaction effects

Parameters
  • col (str) – feature for which to find interaction values

  • interact_col (str) – feature for which interaction value are displayed

  • highlight_index (str, optional) – index that will be highlighted, defaults to None

  • topx (int, optional) – for categorical features, number of categories to display in the violin plots.

  • sort (str, optional) – how to sort categorical features in violin plots. Should be in {‘alphabet’, ‘freq’, ‘shap’}.

  • max_cat_colors (int, optional) – for categorical features, maximum number of categories to label with own color. Defaults to 5.

  • plot_sample (int, optional) – Instead of all points only plot a random sample of points. Defaults to None (=all points)

  • remove_outliers (bool, optional) – remove observations that are >1.5*IQR in either col or color_col. Defaults to False.

  • pos_label – (Default value = None)

Returns

Plotly Fig

Return type

plotly.Fig

plot_pdp

BaseExplainer.plot_pdp(col, index=None, X_row=None, drop_na=True, sample=100, gridlines=100, gridpoints=10, sort='freq', round=2, pos_label=None)

plot partial dependence plot (pdp)

returns plotly fig for a partial dependence plot showing ice lines for gridlines rows, and the average pdp based on a sample of sample rows. If index is given, display pdp for this specific index.

Parameters
  • col (str) – feature to display pdp graph for

  • index (int or str, optional, optional) – index to highlight in pdp graph, defaults to None

  • X_row (pd.Dataframe, single row, optional) – a row of features to highlight predictions for. Alternative to passing index.

  • drop_na (bool, optional, optional) – if true drop samples with value equal to na_fill, defaults to True

  • sample (int, optional, optional) – sample size on which the average pdp will be calculated, defaults to 100

  • gridlines (int, optional) – number of ice lines to display, defaults to 100

  • gridpoints (int, optional) – number of points on the x axis to calculate the pdp for, defaults to 10

  • sort (str, optional) – For categorical features: how to sort: ‘alphabet’, ‘freq’, ‘shap’. Defaults to ‘freq’.

  • round (int, optional) – round float prediction to number of digits. Defaults to 2.

  • pos_label – (Default value = None)

Returns

fig

Return type

plotly.Fig

plot_interactions_importance

BaseExplainer.plot_interactions_importance(col, topx=None, pos_label=None)

plot mean absolute shap interaction value for col.

Parameters
  • col – column for which to generate shap interaction value

  • topx (int, optional, optional) – Only return topx features, defaults to None

  • pos_label – (Default value = None)

Returns

fig

Return type

plotly.fig

plot_interactions_detailed

BaseExplainer.plot_interactions_detailed(col, highlight_index=None, topx=None, max_cat_colors=5, plot_sample=None, pos_label=None)

Plot barchart of mean absolute shap interaction values

Displays all individual shap interaction values for each feature in a horizontal scatter chart in descending order by mean absolute shap value.

Parameters
  • col (str) – feature for which to show interactions summary

  • highlight_index (str or int) – index to highlight

  • topx (int, optional) – only show topx most important features, defaults to None

  • max_cat_colors (int, optional) – for categorical features, maximum number of categories to label with own color. Defaults to 5.

  • plot_sample (int, optional) – Instead of all points only plot a random sample of points. Defaults to None (=all points)

  • pos_label – positive class (Default value = None)

Returns

fig

Classifier Plots

ClassifierExplainer defines a number of additional plotting methods:

plot_precision(bin_size=None, quantiles=None, cutoff=None, multiclass=False, pos_label=None)
plot_cumulative_precision(pos_label=None)
plot_classification(cutoff=0.5, percentage=True, pos_label=None)
plot_confusion_matrix(cutoff=0.5, normalized=False, binary=False, pos_label=None)
plot_lift_curve(cutoff=None, percentage=False, round=2, pos_label=None)
plot_roc_auc(cutoff=0.5, pos_label=None)
plot_pr_auc(cutoff=0.5, pos_label=None)

example code:

explainer = ClassifierExplainer(model, X, y, labels=['Not Survived', 'Survived'])
explainer.plot_confusion_matrix(cutoff=0.6)
explainer.plot_precision(quantiles=10, cutoff=0.6, multiclass=True)
explainer.plot_lift_curve(percentage=True)
explainer.plot_roc_auc(cutoff=0.7)
explainer.plot_pr_auc(cutoff=0.3)

More examples in the notebook on the github repo.

plot_precision

ClassifierExplainer.plot_precision(bin_size=None, quantiles=None, cutoff=None, multiclass=False, pos_label=None)

plot precision vs predicted probability

plots predicted probability on the x-axis and observed precision (fraction of actual positive cases) on the y-axis.

Should pass either a bin_size fraction or a number of quantiles, but not both.

Parameters
  • bin_size (float, optional) – size of the bins on x-axis (e.g. 0.05 for 20 bins)

  • quantiles (int, optional) – number of equal sized quantiles to split the predictions by, e.g. 20

  • cutoff – cutoff of model to include in the plot (Default value = None)

  • multiclass – whether to display all classes or only positive class, defaults to False

  • pos_label – positive label to display, defaults to self.pos_label

Returns

Plotly fig

plot_cumulative_precision

ClassifierExplainer.plot_cumulative_precision(percentile=None, pos_label=None)

plot cumulative precision

returns a cumulative precision plot, which is a slightly different representation of a lift curve.

Parameters
  • percentile – (Default value = None)

  • pos_label – positive label to display, defaults to self.pos_label

Returns

plotly fig

plot_classification

ClassifierExplainer.plot_classification(cutoff=0.5, percentage=True, pos_label=None)

plot showing a barchart of the classification result for cutoff

Parameters
  • cutoff (float, optional) – cutoff of positive class to calculate lift (Default value = 0.5)

  • percentage (bool, optional) – display percentages instead of counts, defaults to True

  • pos_label – positive label to display, defaults to self.pos_label

Returns

plotly fig

plot_confusion_matrix

ClassifierExplainer.plot_confusion_matrix(cutoff=0.5, percentage=False, normalize='all', binary=False, pos_label=None)

plot of a confusion matrix.

Parameters
  • cutoff (float, optional, optional) – cutoff of positive class to calculate confusion matrix for, defaults to 0.5

  • percentage (bool, optional, optional) – display percentages instead of counts , defaults to False

  • normalize ({'observed', 'pred', 'all'}) – normalizes confusion matrix over the observed (rows), predicted (columns) conditions or all the population. Defaults to 'all'.

  • binary (bool, optional, optional) – if multiclass display one-vs-rest instead, defaults to False

  • pos_label – positive label to display, defaults to self.pos_label

Returns

plotly fig

plot_lift_curve

ClassifierExplainer.plot_lift_curve(cutoff=None, percentage=False, add_wizard=True, round=2, pos_label=None)

plot of a lift curve.

Parameters
  • cutoff (float, optional) – cutoff of positive class to calculate lift (Default value = None)

  • percentage (bool, optional) – display percentages instead of counts, defaults to False

  • add_wizard (bool, optional) – Add a line indicating how a perfect model would perform (“the wizard”). Defaults to True.

  • round – number of digits to round to (Default value = 2)

  • pos_label – positive label to display, defaults to self.pos_label

Returns

plotly fig

plot_roc_auc

ClassifierExplainer.plot_roc_auc(cutoff=0.5, pos_label=None)

plots ROC_AUC curve.

The TPR and FPR at a particular cutoff are displayed in crosshairs.

Parameters
  • cutoff – cutoff value to be included in plot (Default value = 0.5)

  • pos_label – (Default value = None)

Returns

plotly fig

plot_pr_auc

ClassifierExplainer.plot_pr_auc(cutoff=0.5, pos_label=None)

plots PR_AUC curve.

The precision and recall at a particular cutoff are displayed in crosshairs.

Parameters
  • cutoff – cutoff value to be included in plot (Default value = 0.5)

  • pos_label – (Default value = None)

Returns

plotly fig

Regression Plots

The derived RegressionExplainer class adds some additional plots:

explainer.plot_predicted_vs_actual(...)
explainer.plot_residuals(...)
explainer.plot_residuals_vs_feature(...)
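example code (a sketch, assuming a fitted regression model and that "Age" is one of your feature columns):

explainer = RegressionExplainer(model, X, y, units="$")
explainer.plot_predicted_vs_actual(logs=True)
explainer.plot_residuals(vs_actual=True, residuals='log-ratio')
explainer.plot_residuals_vs_feature("Age", points=True, winsor=5)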

plot_predicted_vs_actual

RegressionExplainer.plot_predicted_vs_actual(round=2, logs=False, log_x=False, log_y=False, plot_sample=None, **kwargs)

plot with predicted value on the x-axis and actual value on the y-axis.

Parameters
  • round (int, optional) – rounding to apply to outcome, defaults to 2

  • logs (bool, optional) – log both x and y axis, defaults to False

  • log_y (bool, optional) – only log the y axis. Defaults to False.

  • log_x (bool, optional) – only log the x axis. Defaults to False.

  • plot_sample (int, optional) – Instead of all points only plot a random sample of points. Defaults to None (=all points)

  • **kwargs

Returns

Plotly fig

plot_residuals

RegressionExplainer.plot_residuals(vs_actual=False, round=2, residuals='difference', plot_sample=None)

plot of residuals. x-axis is the predicted outcome by default

Parameters
  • vs_actual (bool, optional) – use actual value for x-axis, defaults to False

  • round (int, optional) – rounding to perform on values, defaults to 2

  • residuals (str, {'difference', 'ratio', 'log-ratio'}, optional) – How to calculate residuals. Defaults to ‘difference’.

  • plot_sample (int, optional) – Instead of all points only plot a random sample of points. Defaults to None (=all points)

Returns

Plotly fig

plot_residuals_vs_feature

RegressionExplainer.plot_residuals_vs_feature(col, residuals='difference', round=2, dropna=True, points=True, winsor=0, topx=None, sort='alphabet', plot_sample=None)

Plot residuals vs individual features

Parameters
  • col (str) – Plot against feature col

  • residuals (str, {'difference', 'ratio', 'log-ratio'}, optional) – How to calculate residuals. Defaults to ‘difference’.

  • round (int, optional) – rounding to perform on residuals, defaults to 2

  • dropna (bool, optional) – drop missing values from plot, defaults to True.

  • points (bool, optional) – display point cloud next to violin plot. Defaults to True.

  • winsor (int, 0-50, optional) – percentage of outliers to winsor out of the y-axis. Defaults to 0.

  • plot_sample (int, optional) – Instead of all points only plot a random sample of points. Defaults to None (=all points)

Returns

plotly fig

DecisionTree Plots

There are additional mixin classes specifically for sklearn RandomForests and for xgboost models that define additional methods and plots to investigate and visualize individual decision trees within the ensemble. These use the dtreeviz library to visualize individual decision trees.

You can get a pd.DataFrame summary of the path that a specific index row took through a specific decision tree. You can also plot the individual predictions of each individual tree for a specific row in your data, identified by index:

explainer.get_decisionpath_df(tree_idx, index)
explainer.get_decisionpath_summary_df(tree_idx, index)
explainer.plot_trees(index)

And for dtreeviz visualization of individual decision trees (svg format):

explainer.decisiontree(tree_idx, index)
explainer.decisiontree_file(tree_idx, index)
explainer.decisiontree_encoded(tree_idx, index)

These methods are part of the RandomForestExplainer and XGBExplainer mixin classes that get automatically loaded when you pass either a RandomForest or XGBoost model.

plot_trees

RandomForestExplainer.plot_trees(index, highlight_tree=None, round=2, higher_is_better=True, pos_label=None)

plot barchart of the predictions of each individual tree in the ensemble

Parameters
  • index – index to display predictions for

  • highlight_tree – tree to highlight in plot (Default value = None)

  • round – rounding of numbers in plot (Default value = 2)

  • higher_is_better (bool) – flip red and green. Dummy bool for compatibility with gbm plot_trees().

  • pos_label – positive class (Default value = None)

Returns

plotly fig

decisiontree

RandomForestExplainer.decisiontree(tree_idx, index, show_just_path=False)

get a dtreeviz visualization of a particular tree in the random forest.

Parameters
  • tree_idx – the n’th tree in the random forest

  • index – row index

  • show_just_path (bool, optional) – show only the path, not the rest of the tree. Defaults to False.

Returns

an IPython display SVG object for e.g. jupyter notebook.

decisiontree_file

RandomForestExplainer.decisiontree_file(tree_idx, index, show_just_path=False)

decisiontree_encoded

RandomForestExplainer.decisiontree_encoded(tree_idx, index, show_just_path=False)

get a dtreeviz visualization of a particular tree in the random forest.

Parameters
  • tree_idx – the n’th tree in the random forest

  • index – row index

  • show_just_path (bool, optional) – show only the path, not the rest of the tree. Defaults to False.

Returns

a base64 encoded image, for inclusion in websites (e.g. dashboard)

Other explainer outputs

Base outputs

Some other useful tables and outputs you can get out of the explainer:

metrics()
get_mean_abs_shap_df(topx=None, cutoff=None, cats=False, pos_label=None)
get_permutation_importances_df(topx=None, cutoff=None, cats=False, pos_label=None)
get_importances_df(kind="shap", topx=None, cutoff=None, cats=False, pos_label=None)
get_contrib_df(index, cats=True, topx=None, cutoff=None, pos_label=None)
get_contrib_summary_df(index, cats=True, topx=None, cutoff=None, round=2, pos_label=None)
get_interactions_df(col, cats=False, topx=None, cutoff=None, pos_label=None)
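example code (a sketch, using the explainer from the earlier examples):

explainer.metrics()
explainer.get_mean_abs_shap_df(topx=5)
explainer.get_contrib_df(index=0, topx=5)
explainer.get_contrib_summary_df(index=0, round=2)
explainer.get_interactions_df("Fare", topx=5)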

metrics

BaseExplainer.metrics(*args, **kwargs)

returns a dict of metrics.

Implemented by either ClassifierExplainer or RegressionExplainer

metrics_descriptions

ClassifierExplainer.metrics_descriptions(cutoff=0.5, round=3, pos_label=None)

Returns a metrics dict with the value replaced with a description/interpretation of the value

Parameters
  • cutoff (float, optional) – Cutoff for calculating the metrics. Defaults to 0.5.

  • round (int, optional) – Round to apply to floats. Defaults to 3.

  • pos_label (None, optional) – positive label. Defaults to None.

Returns

dict

RegressionExplainer.metrics_descriptions(round=2)

Returns a metrics dict, with the metric values replaced by a descriptive string, explaining/interpreting the value of the metric

Returns

dict

get_mean_abs_shap_df

BaseExplainer.get_mean_abs_shap_df(topx=None, cutoff=None, pos_label=None)

sorted dataframe with mean_abs_shap

returns a pd.DataFrame with the mean absolute shap values per feature, sorted from highest to lowest.

Parameters
  • topx (int, optional, optional) – Only return topx most important features, defaults to None

  • cutoff (float, optional, optional) – Only return features with mean abs shap of at least cutoff, defaults to None

  • pos_label – (Default value = None)

Returns

shap_df

Return type

pd.DataFrame

get_permutation_importances_df

BaseExplainer.get_permutation_importances_df(topx=None, cutoff=None, pos_label=None)

dataframe with features ordered by permutation importance.

For more about permutation importances, see https://explained.ai/rf-importance/index.html

Parameters
  • topx (int, optional, optional) – only return topx most important features, defaults to None

  • cutoff (float, optional, optional) – only return features with importance of at least cutoff, defaults to None

  • pos_label – (Default value = None)

Returns

importance_df

Return type

pd.DataFrame

get_importances_df

BaseExplainer.get_importances_df(kind='shap', topx=None, cutoff=None, pos_label=None)

wrapper function for get_mean_abs_shap_df() and get_permutation_importances_df()

Parameters
  • kind (str) – ‘shap’ or ‘permutation’ (Default value = “shap”)

  • topx – only display topx highest features (Default value = None)

  • cutoff – only display features above cutoff (Default value = None)

  • pos_label – Positive class (Default value = None)

Returns

pd.DataFrame

get_contrib_df

BaseExplainer.get_contrib_df(index=None, X_row=None, topx=None, cutoff=None, sort='abs', pos_label=None)

shap value contributions to the prediction for index.

Used as input for the plot_contributions() method.

Parameters
  • index (int or str) – index for which to calculate contributions

  • X_row (pd.DataFrame, single row) – single row of features for which to calculate contrib_df. Can use this instead of index

  • topx (int, optional) – Only return topx features, remainder called REST, defaults to None

  • cutoff (float, optional) – only return features with at least cutoff contributions, defaults to None

  • sort ({'abs', 'high-to-low', 'low-to-high', 'importance'}, optional) – sort by absolute shap value, or from high to low, low to high, or ordered by the global shap importances. Defaults to ‘abs’.

  • pos_label – (Default value = None)

Returns

contrib_df

Return type

pd.DataFrame

get_contrib_summary_df

BaseExplainer.get_contrib_summary_df(index=None, X_row=None, topx=None, cutoff=None, round=2, sort='abs', pos_label=None)

Takes a contrib_df and formats it into a more human readable format

Parameters
  • index – index to show contrib_summary_df for

  • X_row (pd.DataFrame, single row) – single row of features for which to calculate contrib_df. Can use this instead of index

  • topx – Only show topx highest features (Default value = None)

  • cutoff – Only show features above cutoff (Default value = None)

  • round – round figures (Default value = 2)

  • sort ({'abs', 'high-to-low', 'low-to-high', 'importance'}, optional) – sort by absolute shap value, or from high to low, or low to high, or ordered by the global shap importances. Defaults to ‘abs’.

  • pos_label – Positive class (Default value = None)

Returns

pd.DataFrame

get_interactions_df

BaseExplainer.get_interactions_df(col, topx=None, cutoff=None, pos_label=None)

dataframe of mean absolute shap interaction values for col

Parameters
  • col – Feature to get interactions_df for

  • topx – Only display topx most important features (Default value = None)

  • cutoff – Only display features with mean abs shap of at least cutoff (Default value = None)

  • pos_label – Positive class (Default value = None)

Returns

pd.DataFrame

Classifier outputs

For ClassifierExplainer in addition:

random_index(y_values=None, return_str=False, pred_proba_min=None, pred_proba_max=None,
                pred_percentile_min=None, pred_percentile_max=None, pos_label=None)
prediction_result_df(index, pos_label=None)
cutoff_from_percentile(percentile, pos_label=None)
get_precision_df(bin_size=None, quantiles=None, multiclass=False, round=3, pos_label=None)
get_liftcurve_df(pos_label=None)
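example code (a sketch, assuming a fitted binary classifier explainer):

idx = explainer.random_index(y_values=[0], pred_proba_min=0.8)  # e.g. a likely false positive
cutoff = explainer.cutoff_from_percentile(0.8)
explainer.get_precision_df(quantiles=10)
explainer.get_liftcurve_df()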

random_index

ClassifierExplainer.random_index(y_values=None, return_str=False, pred_proba_min=None, pred_proba_max=None, pred_percentile_min=None, pred_percentile_max=None, pos_label=None)

random index satisfying various constraints

Parameters
  • y_values – list of labels to include (Default value = None)

  • return_str – return str from self.idxs (Default value = False)

  • pred_proba_min – minimum pred_proba (Default value = None)

  • pred_proba_max – maximum pred_proba (Default value = None)

  • pred_percentile_min – minimum pred_proba percentile (Default value = None)

  • pred_percentile_max – maximum pred_proba percentile (Default value = None)

  • pos_label – positive class (Default value = None)

Returns

index

cutoff_from_percentile

ClassifierExplainer.cutoff_from_percentile(percentile, pos_label=None)

The cutoff equivalent to the percentile given

For example if you want the cutoff that splits the highest 20% pred_proba from the lowest 80%, you would set percentile=0.8 and get the correct cutoff.

Parameters
  • percentile (float) – percentile to convert to cutoff

  • pos_label – positive class (Default value = None)

Returns

cutoff

percentile_from_cutoff

ClassifierExplainer.percentile_from_cutoff(cutoff, pos_label=None)

The percentile equivalent to the cutoff given

For example if you set the cutoff at 0.8, then what percentage of pred_proba is above this cutoff?

Parameters
  • cutoff (float) – cutoff to convert to percentile

  • pos_label – positive class (Default value = None)

Returns

percentile

get_precision_df

ClassifierExplainer.get_precision_df(bin_size=None, quantiles=None, multiclass=False, round=3, pos_label=None)

dataframe with predicted probabilities and precision

Parameters
  • bin_size (float, optional, optional) – group predictions in bins of size bin_size, defaults to 0.1

  • quantiles (int, optional, optional) – group predictions in evenly sized quantiles of size quantiles, defaults to None

  • multiclass (bool, optional, optional) – whether to calculate precision for every class (Default value = False)

  • round – (Default value = 3)

  • pos_label – (Default value = None)

Returns

precision_df

Return type

pd.DataFrame

get_liftcurve_df

ClassifierExplainer.get_liftcurve_df(pos_label=None)

returns a pd.DataFrame with data needed to build a lift curve

Parameters

pos_label – (Default value = None)

Returns

pd.DataFrame

get_classification_df

ClassifierExplainer.get_classification_df(cutoff=0.5, pos_label=None)

Returns a dataframe with number of observations in each class above and below the cutoff.

Parameters
  • cutoff (float, optional) – Cutoff to split on. Defaults to 0.5.

  • pos_label (int, optional) – Pos label to generate dataframe for. Defaults to self.pos_label.

Returns

pd.DataFrame

roc_auc_curve

ClassifierExplainer.roc_auc_curve(pos_label=None)

Returns a dict with output from sklearn.metrics.roc_curve() for pos_label: fpr, tpr, thresholds, score

pr_auc_curve

ClassifierExplainer.pr_auc_curve(pos_label=None)

Returns a dict with output from sklearn.metrics.precision_recall_curve() for pos_label: precision, recall, thresholds, score

confusion_matrix

ClassifierExplainer.confusion_matrix(cutoff=0.5, binary=True, pos_label=None)

Regression outputs

For RegressionExplainer:

random_index(y_min=None, y_max=None, pred_min=None, pred_max=None,
                residuals_min=None, residuals_max=None,
                abs_residuals_min=None, abs_residuals_max=None,
                return_str=False)

random_index

RegressionExplainer.random_index(y_min=None, y_max=None, pred_min=None, pred_max=None, residuals_min=None, residuals_max=None, abs_residuals_min=None, abs_residuals_max=None, return_str=False, **kwargs)

random index satisfying various exclusion criteria

Parameters
  • y_min – (Default value = None)

  • y_max – (Default value = None)

  • pred_min – (Default value = None)

  • pred_max – (Default value = None)

  • residuals_min – (Default value = None)

  • residuals_max – (Default value = None)

  • abs_residuals_min – (Default value = None)

  • abs_residuals_max – (Default value = None)

  • return_str – return the str index from self.idxs (Default value = False)

  • **kwargs

Returns

a random index that fits the exclusion criteria

RandomForest and XGBoost outputs

For RandomForest and XGBoost models, mixin classes that visualize individual decision trees will be loaded: RandomForestExplainer and XGBExplainer, with the following additional methods:

get_decisionpath_df(tree_idx, index, pos_label=None)
get_decisionpath_summary_df(tree_idx, index, round=2, pos_label=None)
decisiontree_file(tree_idx, index, show_just_path=False)
decisiontree_encoded(tree_idx, index, show_just_path=False)
decisiontree(tree_idx, index, show_just_path=False)
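example code (a sketch, assuming an explainer wrapping a RandomForest or XGBoost model):

explainer.get_decisionpath_df(tree_idx=0, index=0)
explainer.get_decisionpath_summary_df(tree_idx=0, index=0)
explainer.decisiontree(tree_idx=0, index=0, show_just_path=True)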

get_decisionpath_df

RandomForestExplainer.get_decisionpath_df(tree_idx, index, pos_label=None)

dataframe with all decision nodes of a particular decision tree for a particular observation.

Parameters
  • tree_idx – the n’th tree in the random forest

  • index – row index

  • pos_label – positive class (Default value = None)

Returns

dataframe with summary of the decision tree path

get_decisionpath_summary_df

RandomForestExplainer.get_decisionpath_summary_df(tree_idx, index, round=2, pos_label=None)

formats get_decisionpath_df() in a slightly more human readable format.

Parameters
  • tree_idx – the n’th tree in the random forest or boosted ensemble

  • index – index

  • round – rounding to apply to floats (Default value = 2)

  • pos_label – positive class (Default value = None)

Returns

dataframe with summary of the decision tree path

decisiontree_file

RandomForestExplainer.decisiontree_file(tree_idx, index, show_just_path=False)

decisiontree_encoded

RandomForestExplainer.decisiontree_encoded(tree_idx, index, show_just_path=False)

get a dtreeviz visualization of a particular tree in the random forest.

Parameters
  • tree_idx – the n’th tree in the random forest

  • index – row index

  • show_just_path (bool, optional) – show only the path, not the rest of the tree. Defaults to False.

Returns

a base64 encoded image, for inclusion in websites (e.g. dashboard)

decisiontree

RandomForestExplainer.decisiontree(tree_idx, index, show_just_path=False)

get a dtreeviz visualization of a particular tree in the random forest.

Parameters
  • tree_idx – the n’th tree in the random forest

  • index – row index

  • show_just_path (bool, optional) – show only the path, not the rest of the tree. Defaults to False.

Returns

an IPython display SVG object for e.g. jupyter notebook.

Calculated Properties

In general Explainers don’t calculate any properties of the model or the data until they are needed for an output: so-called lazy calculation. Once a property has been calculated, it is stored for next time. So the first time you invoke a plot involving shap values it may take a while to calculate; the next time will be basically instant.

You can access these properties directly from the explainer, e.g. explainer.get_shap_values_df(). For classifier models, if you want values for a particular pos_label you can pass this label, e.g. explainer.get_shap_values_df(0) would get the shap values for the 0’th class label.

In order to calculate all properties of the explainer at once, you can call explainer.calculate_properties(). (ExplainerComponents have a similar method component.calculate_dependencies() to calculate all properties that that specific component will need).
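For example, to calculate everything up front (e.g. before storing the explainer for use in a deployed dashboard):

explainer.calculate_properties()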

The various properties are:

explainer.preds
explainer.pred_percentiles
explainer.permutation_importances(pos_label)
explainer.mean_abs_shap_df(pos_label)
explainer.shap_base_value(pos_label)
explainer.get_shap_values_df(pos_label)
explainer.shap_interaction_values

For ClassifierExplainer:

explainer.y_binary
explainer.pred_probas_raw
explainer.pred_percentiles_raw
explainer.pred_probas(pos_label)
explainer.roc_auc_curve(pos_label)
explainer.pr_auc_curve(pos_label)
explainer.get_classification_df(cutoff, pos_label)
explainer.get_liftcurve_df(pos_label)
explainer.confusion_matrix(cutoff, binary, pos_label)

For RegressionExplainer:

explainer.residuals
explainer.abs_residuals

Setting pos_label

For ClassifierExplainer you can calculate most properties for any of the labels as the positive label. With a binary classification usually label ‘1’ is the positive class, but in some cases you might also be interested in the ‘0’ label.

For multiclass classification you may want to investigate shap dependences for the various classes.

You can pass a parameter pos_label to almost every property or method to get the output for that specific positive label. If you don’t pass a pos_label manually to a specific method, the global self.pos_label will be used. You can set this directly on the explainer (even using str labels if you have set these):

explainer.pos_label = 0
explainer.plot_dependence("Fare") # will show plot for pos_label=0
explainer.pos_label = 'Survived'
explainer.plot_dependence("Fare") # will now show plot for pos_label=1
explainer.plot_dependence("Fare", pos_label=0) # show plot for label 0, without changing explainer.pos_label

The ExplainerDashboard will show a dropdown menu in the header to choose a particular pos_label. Changing this will basically update every single plot in the dashboard.

BaseExplainer

class explainerdashboard.explainers.BaseExplainer(model, X, y=None, permutation_metric=sklearn.metrics.r2_score, shap='guess', X_background=None, model_output='raw', cats=None, cats_notencoded=None, idxs=None, index_name=None, target=None, descriptions=None, n_jobs=None, permutation_cv=None, cv=None, na_fill=-999, precision='float64', shap_kwargs=None)

Defines the basic functionality that is shared by both ClassifierExplainer and RegressionExplainer.

Parameters
  • model – a model with scikit-learn compatible .fit and .predict methods

  • X (pd.DataFrame) – a pd.DataFrame with your model features

  • y (pd.Series) – Dependent variable of your model, defaults to None

  • permutation_metric (function or str) – a scikit-learn compatible metric function (or string). Defaults to r2_score

  • shap (str) – type of shap_explainer to fit: ‘tree’, ‘linear’, ‘kernel’. Defaults to ‘guess’.

  • X_background (pd.DataFrame) – background X to be used by shap explainers that need a background dataset (e.g. shap.KernelExplainer or shap.TreeExplainer with boosting models and model_output=’probability’).

  • model_output (str) – model_output of shap values, either ‘raw’, ‘logodds’ or ‘probability’. Defaults to ‘raw’ for regression and ‘probability’ for classification.

  • cats ({dict, list}) – dict of features that have been onehotencoded. e.g. cats={‘Sex’:[‘Sex_male’, ‘Sex_female’]}. If all encoded columns are underscore-separated (as above), can simply pass a list of prefixes: cats=[‘Sex’]. Allows you to group onehot encoded categorical variables together in various plots. Defaults to None.

  • cats_notencoded (dict) – value to display when all onehot encoded columns are equal to zero. Defaults to ‘NOT_ENCODED’ for each onehot col.

  • idxs (pd.Series) – list of row identifiers. Can be names, id’s, etc. Defaults to X.index.

  • index_name (str) – identifier for row indexes. e.g. index_name=’Passenger’. Defaults to X.index.name or idxs.name.

  • target (Optional[str]) – name of the predicted target, e.g. “Survival”, “Ticket price”, etc. Defaults to y.name.

  • n_jobs (int) – for jobs that can be parallelized using joblib, how many processes to split the job in. For now only used for calculating permutation importances. Defaults to None.

  • permutation_cv (int) – Deprecated! Use parameter cv instead! (now also works for calculating metrics)

  • cv (int) – If not None then permutation importances and metrics will get calculated using cross validation across X. Use this when you are passing the training set to the explainer. Defaults to None.

  • na_fill (int) – The filler used for missing values, defaults to -999.

  • precision (str) – precision with which to store values. Defaults to “float64”.

  • shap_kwargs (dict) – dictionary of keyword arguments to be passed to the shap explainer. Most typically used to suppress an additivity check, e.g. shap_kwargs=dict(check_additivity=False)

get_permutation_importances_df(topx=None, cutoff=None, pos_label=None)

dataframe with features ordered by permutation importance.

For more about permutation importances, see https://explained.ai/rf-importance/index.html

Parameters
  • topx (int, optional, optional) – only return topx most important features, defaults to None

  • cutoff (float, optional, optional) – only return features with importance of at least cutoff, defaults to None

  • pos_label – (Default value = None)

Returns

importance_df

Return type

pd.DataFrame

get_shap_values_df(pos_label=None)

SHAP values calculated using the shap library

set_shap_values(base_value, shap_values)

Set shap values manually. This is useful if you already have shap values calculated, and do not want to calculate them again inside the explainer instance. Especially for large models and large datasets you may want to calculate shap values on specialized hardware, and then add them to the explainer manually.

Parameters
  • base_value (float) – the shap intercept generated by e.g. base_value = shap.TreeExplainer(model).expected_value

  • shap_values (np.ndarray) – Generated by e.g. shap_values = shap.TreeExplainer(model).shap_values(X_test)

set_shap_interaction_values(shap_interaction_values)

Manually set shap interaction values in case you have already pre-computed these elsewhere and do not want to re-calculate them again inside the explainer instance.

Parameters

shap_interaction_values (np.ndarray) – shap interactions values of shape (n, m, m)

get_mean_abs_shap_df(topx=None, cutoff=None, pos_label=None)

sorted dataframe with mean_abs_shap

returns a pd.DataFrame with the mean absolute shap values per feature, sorted from highest to lowest.

Parameters
  • topx (int, optional, optional) – Only return topx most important features, defaults to None

  • cutoff (float, optional, optional) – Only return features with mean abs shap of at least cutoff, defaults to None

  • pos_label – (Default value = None)

Returns

shap_df

Return type

pd.DataFrame

get_importances_df(kind='shap', topx=None, cutoff=None, pos_label=None)

wrapper function for get_mean_abs_shap_df() and get_permutation_importances_df()

Parameters
  • kind (str) – ‘shap’ or ‘permutation’ (Default value = “shap”)

  • topx – only display topx highest features (Default value = None)

  • cutoff – only display features above cutoff (Default value = None)

  • pos_label – Positive class (Default value = None)

Returns

pd.DataFrame

plot_importances(kind='shap', topx=None, round=3, pos_label=None)

plot barchart of importances in descending order.

Parameters
  • kind (str, optional) – ‘shap’ for mean absolute shap values, ‘permutation’ for permutation importances, defaults to ‘shap’

  • topx (int, optional, optional) – Only return topx features, defaults to None

  • round – (Default value = 3)

  • pos_label – (Default value = None)

Returns

fig

Return type

plotly.fig

plot_importances_detailed(highlight_index=None, topx=None, max_cat_colors=5, plot_sample=None, pos_label=None)

Plot barchart of mean absolute shap value.

Displays all individual shap values for each feature in a horizontal scatter chart, in descending order by mean absolute shap value.

Parameters
  • highlight_index (str or int) – index to highlight

  • topx (int, optional) – Only display topx most important features, defaults to None

  • max_cat_colors (int, optional) – for categorical features, maximum number of categories to label with own color. Defaults to 5.

  • plot_sample (int, optional) – Instead of all points only plot a random sample of points. Defaults to None (=all points)

  • pos_label – positive class (Default value = None)

Returns

plotly.Fig

plot_contributions(index=None, X_row=None, topx=None, cutoff=None, sort='abs', orientation='vertical', higher_is_better=True, round=2, pos_label=None)

plot waterfall plot of shap value contributions to the model prediction for index.

Parameters
  • index (int or str) – index for which to display prediction

  • X_row (pd.DataFrame single row) – a single row of features to plot shap contributions for. Can use this instead of index for what-if scenarios.

  • topx (int, optional, optional) – Only display topx features, defaults to None

  • cutoff (float, optional, optional) – Only display features with at least cutoff contribution, defaults to None

  • sort ({'abs', 'high-to-low', 'low-to-high', 'importance'}, optional) – sort by absolute shap value, or from high to low, or low to high, or by order of shap feature importance. Defaults to ‘abs’.

  • orientation ({'vertical', 'horizontal'}) – Horizontal or vertical bar chart. Horizontal may be better if you have lots of features. Defaults to ‘vertical’.

  • higher_is_better (bool) – if True, up=green, down=red. If false reversed. Defaults to True.

  • round (int, optional, optional) – round contributions to round precision, defaults to 2

  • pos_label – (Default value = None)

Returns

fig

Return type

plotly.Fig

plot_dependence(col, color_col=None, highlight_index=None, topx=None, sort='alphabet', max_cat_colors=5, round=3, plot_sample=None, remove_outliers=False, pos_label=None)

plot shap dependence

Plots a shap dependence plot:
  • on the x axis the possible values of the feature col

  • on the y axis the associated individual shap values

Parameters
  • col (str) – feature to be displayed

  • color_col (str) – if color_col provided then shap values colored (blue-red) according to feature color_col (Default value = None)

  • highlight_index – individual observation to be highlighted in the plot. (Default value = None)

  • topx (int, optional) – for categorical features only display topx categories.

  • sort (str) – for categorical features, how to sort the categories: alphabetically ‘alphabet’, most frequent first ‘freq’, highest mean absolute value first ‘shap’. Defaults to ‘alphabet’.

  • max_cat_colors (int, optional) – for categorical features, maximum number of categories to label with own color. Defaults to 5.

  • round (int, optional) – rounding to apply to floats. Defaults to 3.

  • plot_sample (int, optional) – Instead of all points only plot a random sample of points. Defaults to None (=all points)

  • remove_outliers (bool, optional) – remove observations that are >1.5*IQR in either col or color_col. Defaults to False.

  • pos_label – positive class (Default value = None)

Returns

plotly.Fig

plot_interaction(col, interact_col, highlight_index=None, topx=10, sort='alphabet', max_cat_colors=5, plot_sample=None, remove_outliers=False, pos_label=None)

plots a dependence plot for shap interaction effects

Parameters
  • col (str) – feature for which to find interaction values

  • interact_col (str) – feature for which interaction value are displayed

  • highlight_index (str, optional) – index that will be highlighted, defaults to None

  • topx (int, optional) – for categorical features, number of categories to display in the violin plots.

  • sort (str, optional) – how to sort categorical features in violin plots. Should be in {‘alphabet’, ‘freq’, ‘shap’}.

  • max_cat_colors (int, optional) – for categorical features, maximum number of categories to label with own color. Defaults to 5.

  • plot_sample (int, optional) – Instead of all points only plot a random sample of points. Defaults to None (=all points)

  • remove_outliers (bool, optional) – remove observations that are >1.5*IQR in either col or color_col. Defaults to False.

  • pos_label – (Default value = None)

Returns

Plotly Fig

Return type

plotly.Fig

plot_interactions_detailed(col, highlight_index=None, topx=None, max_cat_colors=5, plot_sample=None, pos_label=None)

Plot barchart of mean absolute shap interaction values

Displays all individual shap interaction values for each feature in a horizontal scatter chart in descending order by mean absolute shap value.

Parameters
  • col (str) – feature for which to show interactions summary

  • highlight_index (str or int) – index to highlight

  • topx (int, optional) – only show topx most important features, defaults to None

  • max_cat_colors (int, optional) – for categorical features, maximum number of categories to label with own color. Defaults to 5.

  • plot_sample (int, optional) – Instead of all points only plot a random sample of points. Defaults to None (=all points)

  • pos_label – positive class (Default value = None)

Returns

fig

plot_pdp(col, index=None, X_row=None, drop_na=True, sample=100, gridlines=100, gridpoints=10, sort='freq', round=2, pos_label=None)

plot partial dependence plot (pdp)

returns plotly fig for a partial dependence plot showing ice lines for gridlines rows, and the average pdp based on a sample of sample rows. If index is given, display pdp for this specific index.

Parameters
  • col (str) – feature to display pdp graph for

  • index (int or str, optional, optional) – index to highlight in pdp graph, defaults to None

  • X_row (pd.Dataframe, single row, optional) – a row of features to highlight predictions for. Alternative to passing index.

  • drop_na (bool, optional, optional) – if true drop samples with value equal to na_fill, defaults to True

  • sample (int, optional, optional) – sample size on which the average pdp will be calculated, defaults to 100

  • gridlines (int, optional) – number of ice lines to display, defaults to 100

  • gridpoints (int, optional) – number of points on the x axis to calculate the pdp for, defaults to 10

  • sort (str, optional) – For categorical features: how to sort: ‘alphabet’, ‘freq’, ‘shap’. Defaults to ‘freq’.

  • round (int, optional) – round float prediction to number of digits. Defaults to 2.

  • pos_label – (Default value = None)

Returns

fig

Return type

plotly.Fig

ClassifierExplainer

For classification (e.g. RandomForestClassifier) models you use ClassifierExplainer.

You can pass an additional parameter to __init__() with a list of label names. For a multiclass classifier you can set the positive class with e.g. explainer.pos_label=1. This will make sure that for example explainer.pred_probas will return the probability of that label.

More examples in the notebook on the github repo.

class explainerdashboard.explainers.ClassifierExplainer(model, X, y=None, permutation_metric=sklearn.metrics.roc_auc_score, shap='guess', X_background=None, model_output='probability', cats=None, cats_notencoded=None, idxs=None, index_name=None, target=None, descriptions=None, n_jobs=None, permutation_cv=None, cv=None, na_fill=-999, precision='float64', shap_kwargs=None, labels=None, pos_label=1)

Explainer for classification models. Defines the shap values for each possible class in the classification.

You assign the positive label class afterwards with e.g. explainer.pos_label=0

In addition defines a number of plots specific to classification problems such as a precision plot, confusion matrix, roc auc curve and pr auc curve.

Compared to BaseExplainer defines two additional parameters: labels and pos_label

Parameters
  • model – a model with scikit-learn compatible .fit and .predict methods

  • X (pd.DataFrame) – a pd.DataFrame with your model features

  • y (pd.Series) – Dependent variable of your model, defaults to None

  • permutation_metric (function or str) – is a scikit-learn compatible metric function (or string). Defaults to roc_auc_score

  • shap (str) – type of shap_explainer to fit: ‘tree’, ‘linear’, ‘kernel’. Defaults to ‘guess’.

  • X_background (pd.DataFrame) – background X to be used by shap explainers that need a background dataset (e.g. shap.KernelExplainer or shap.TreeExplainer with boosting models and model_output=’probability’).

  • model_output (str) – model_output of shap values, either ‘raw’, ‘logodds’ or ‘probability’. Defaults to ‘raw’ for regression and ‘probability’ for classification.

  • cats ({dict, list}) – dict of features that have been onehotencoded. e.g. cats={‘Sex’:[‘Sex_male’, ‘Sex_female’]}. If all encoded columns are underscore-separated (as above), can simply pass a list of prefixes: cats=[‘Sex’]. Allows you to group onehot encoded categorical variables together in various plots. Defaults to None.

  • cats_notencoded (dict) – value to display when all onehot encoded columns are equal to zero. Defaults to ‘NOT_ENCODED’ for each onehot col.

  • idxs (pd.Series) – list of row identifiers. Can be names, id’s, etc. Defaults to X.index.

  • index_name (str) – identifier for row indexes. e.g. index_name=’Passenger’. Defaults to X.index.name or idxs.name.

  • target (Optional[str]) – name of the predicted target, e.g. “Survival”, “Ticket price”, etc. Defaults to y.name.

  • n_jobs (int) – for jobs that can be parallelized using joblib, how many processes to split the job in. For now only used for calculating permutation importances. Defaults to None.

  • permutation_cv (int) – Deprecated! Use parameter cv instead! (now also works for calculating metrics)

  • cv (int) – If not None then permutation importances and metrics will get calculated using cross validation across X. Use this when you are passing the training set to the explainer. Defaults to None.

  • na_fill (int) – The filler used for missing values, defaults to -999.

  • precision (str) – precision with which to store values. Defaults to “float64”.

  • shap_kwargs (dict) – dictionary of keyword arguments to be passed to the shap explainer. Most typically used to suppress an additivity check, e.g. shap_kwargs=dict(check_additivity=False)

  • labels (list) – list of str labels for the different classes, defaults to e.g. [‘0’, ‘1’] for a binary classification

  • pos_label (int) – class that should be used as the positive class, defaults to 1

set_shap_values(base_value, shap_values)

Set shap values manually. This is useful if you already have shap values calculated, and do not want to calculate them again inside the explainer instance. Especially for large models and large datasets you may want to calculate shap values on specialized hardware, and then add them to the explainer manually.

Parameters
  • base_value (list[float]) – list of shap intercepts, generated by e.g. base_value = shap.TreeExplainer(model).expected_value. Should be a list with a float for each class. For binary classification and some models shap only provides the base value for the positive class, in which case you need to provide [1-base_value, base_value] or [-base_value, base_value] depending on whether the shap values are for probabilities or logodds.

  • shap_values (list[np.ndarray]) – Generated by e.g. shap_values = shap.TreeExplainer(model).shap_values(X_test). For binary classification and some models shap only provides the shap values for the positive class, in which case you need to provide [1-shap_values, shap_values] or [-shap_values, shap_values] depending on whether the shap values are for probabilities or logodds.
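
For example, a minimal sketch for a tree-based binary classifier (whether shap returns one array per class or only the positive class depends on your model and shap version, so inspect the output first):

import shap
from explainerdashboard import ClassifierExplainer

# compute shap values externally, e.g. on specialized hardware
shap_explainer = shap.TreeExplainer(model)
base_value = shap_explainer.expected_value        # one intercept per class
shap_values = shap_explainer.shap_values(X_test)  # one array per class

explainer = ClassifierExplainer(model, X_test, y_test)
explainer.set_shap_values(base_value, shap_values)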

set_shap_interaction_values(shap_interaction_values)

Manually set shap interaction values in case you have already pre-computed these elsewhere and do not want to re-calculate them inside the explainer instance.

Parameters

shap_interaction_values (np.ndarray) – shap interactions values of shape (n, m, m)
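
Pre-computed interaction values can be added the same way (a sketch continuing the example above; for binary classifiers some shap versions return a single (n, m, m) array rather than one per class):

shap_interaction_values = shap_explainer.shap_interaction_values(X_test)
explainer.set_shap_interaction_values(shap_interaction_values)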

random_index(y_values=None, return_str=False, pred_proba_min=None, pred_proba_max=None, pred_percentile_min=None, pred_percentile_max=None, pos_label=None)

random index satisfying various constraints

Parameters
  • y_values – list of labels to include (Default value = None)

  • return_str – return str from self.idxs (Default value = False)

  • pred_proba_min – minimum pred_proba (Default value = None)

  • pred_proba_max – maximum pred_proba (Default value = None)

  • pred_percentile_min – minimum pred_proba percentile (Default value = None)

  • pred_percentile_max – maximum pred_proba percentile (Default value = None)

  • pos_label – positive class (Default value = None)

Returns

index
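
For example, to inspect a random negative case that nevertheless got a high predicted probability (a sketch; assumes integer labels, with return_str=True returning the string identifier from idxs):

idx = explainer.random_index(y_values=[0], pred_proba_min=0.8, return_str=True)
explainer.plot_contributions(index=idx)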

get_precision_df(bin_size=None, quantiles=None, multiclass=False, round=3, pos_label=None)

dataframe with predicted probabilities and precision

Parameters
  • bin_size (float, optional) – group predictions in bins of size bin_size, defaults to 0.1

  • quantiles (int, optional) – group predictions into quantiles evenly sized quantiles, defaults to None

  • multiclass (bool, optional) – whether to calculate precision for every class (Default value = False)

  • round – number of digits to round the output values to (Default value = 3)

  • pos_label – (Default value = None)

Returns

precision_df

Return type

pd.DataFrame

get_liftcurve_df(pos_label=None)

returns a pd.DataFrame with data needed to build a lift curve

Parameters

pos_label – (Default value = None)

Returns

pd.DataFrame

get_classification_df(cutoff=0.5, pos_label=None)

Returns a dataframe with number of observations in each class above and below the cutoff.

Parameters
  • cutoff (float, optional) – Cutoff to split on. Defaults to 0.5.

  • pos_label (int, optional) – Pos label to generate dataframe for. Defaults to self.pos_label.

Returns

pd.DataFrame

plot_precision(bin_size=None, quantiles=None, cutoff=None, multiclass=False, pos_label=None)

plot precision vs predicted probability

plots predicted probability on the x-axis and observed precision (fraction of actual positive cases) on the y-axis.

You should pass either bin_size or quantiles, but not both.

Parameters
  • bin_size (float, optional) – size of the bins on x-axis (e.g. 0.05 for 20 bins)

  • quantiles (int, optional) – number of equally sized quantiles to split the predictions by, e.g. 20

  • cutoff – cutoff of model to include in the plot (Default value = None)

  • multiclass – whether to display all classes or only positive class, defaults to False

  • pos_label – positive label to display, defaults to self.pos_label

Returns

Plotly fig
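
For example (a sketch; pass either bin_size or quantiles, never both):

explainer.plot_precision(bin_size=0.05, cutoff=0.5)  # 20 bins of width 0.05
explainer.plot_precision(quantiles=10)               # 10 equally sized quantiles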

plot_cumulative_precision(percentile=None, pos_label=None)

plot cumulative precision

returns a cumulative precision plot, which is a slightly different representation of a lift curve.

Parameters
  • percentile – (Default value = None)

  • pos_label – positive label to display, defaults to self.pos_label

Returns

plotly fig

plot_confusion_matrix(cutoff=0.5, percentage=False, normalize='all', binary=False, pos_label=None)

plot of a confusion matrix.

Parameters
  • cutoff (float, optional) – cutoff of positive class to calculate confusion matrix for, defaults to 0.5

  • percentage (bool, optional) – display percentages instead of counts, defaults to False

  • normalize (str[‘observed’, ‘pred’, ‘all’]) – normalizes confusion matrix over the observed (rows), predicted (columns) conditions or all the population. Defaults to ‘all’.

  • binary (bool, optional) – if multiclass, display one-vs-rest instead, defaults to False

  • pos_label – positive label to display, defaults to self.pos_label

Returns

plotly fig
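
For example, to show percentages normalized over the observed outcomes at a stricter cutoff (a minimal sketch):

explainer.plot_confusion_matrix(cutoff=0.75, percentage=True, normalize='observed')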

plot_lift_curve(cutoff=None, percentage=False, add_wizard=True, round=2, pos_label=None)

plot of a lift curve.

Parameters
  • cutoff (float, optional) – cutoff of positive class to calculate lift (Default value = None)

  • percentage (bool, optional) – display percentages instead of counts, defaults to False

  • add_wizard (bool, optional) – Add a line indicating how a perfect model would perform (“the wizard”). Defaults to True.

  • round – number of digits to round to (Default value = 2)

  • pos_label – positive label to display, defaults to self.pos_label

Returns

plotly fig
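
For example (a minimal sketch):

explainer.plot_lift_curve(cutoff=0.75, percentage=True)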

plot_classification(cutoff=0.5, percentage=True, pos_label=None)

plot showing a barchart of the classification result for cutoff

Parameters
  • cutoff (float, optional) – cutoff of positive class to calculate the classification result for (Default value = 0.5)

  • percentage (bool, optional) – display percentages instead of counts, defaults to True

  • pos_label – positive label to display, defaults to self.pos_label

Returns

plotly fig

plot_roc_auc(cutoff=0.5, pos_label=None)

plots ROC_AUC curve.

The TPR and FPR at a particular cutoff are displayed in crosshairs.

Parameters
  • cutoff – cutoff value to be included in plot (Default value = 0.5)

  • pos_label – (Default value = None)

Returns

plotly fig

plot_pr_auc(cutoff=0.5, pos_label=None)

plots PR_AUC curve.

the precision and recall at a particular cutoff are displayed in crosshairs.

Parameters
  • cutoff – cutoff value to be included in plot (Default value = 0.5)

  • pos_label – (Default value = None)

Returns

plotly fig
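
For example, to show the crosshairs at different cutoffs (a minimal sketch):

explainer.plot_roc_auc(cutoff=0.5)
explainer.plot_pr_auc(cutoff=0.3)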

RegressionExplainer

For regression models (e.g. RandomForestRegressor) you use RegressionExplainer.

You can pass units as an additional parameter for the units of the target variable (e.g. units="$").
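
For example (a minimal sketch; model, X_test and y_test are assumed to be a fitted regression model and its test set):

from explainerdashboard import RegressionExplainer

explainer = RegressionExplainer(model, X_test, y_test,
                                target="Ticket price", units="$")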

More examples in the notebook on the github repo.

class explainerdashboard.explainers.RegressionExplainer(model, X, y=None, permutation_metric=sklearn.metrics.r2_score, shap='guess', X_background=None, model_output='raw', cats=None, cats_notencoded=None, idxs=None, index_name=None, target=None, descriptions=None, n_jobs=None, permutation_cv=None, cv=None, na_fill=-999, precision='float64', shap_kwargs=None, units='')

Explainer for regression models.

In addition to BaseExplainer, defines a number of plots specific to regression problems, such as a predicted vs actual plot and residual plots.

Compared to BaseExplainer defines one additional parameter: units.

Parameters
  • model – a model with scikit-learn compatible .fit and .predict methods

  • X (pd.DataFrame) – a pd.DataFrame with your model features

  • y (pd.Series) – Dependent variable of your model, defaults to None

  • permutation_metric (function or str) – is a scikit-learn compatible metric function (or string). Defaults to r2_score

  • shap (str) – type of shap_explainer to fit: ‘tree’, ‘linear’, ‘kernel’. Defaults to ‘guess’.

  • X_background (pd.DataFrame) – background X to be used by shap explainers that need a background dataset (e.g. shap.KernelExplainer or shap.TreeExplainer with boosting models and model_output=’probability’).

  • model_output (str) – model_output of shap values, either ‘raw’, ‘logodds’ or ‘probability’. Defaults to ‘raw’ for regression and ‘probability’ for classification.

  • cats ({dict, list}) – dict of features that have been onehotencoded. e.g. cats={‘Sex’:[‘Sex_male’, ‘Sex_female’]}. If all encoded columns are underscore-separated (as above), can simply pass a list of prefixes: cats=[‘Sex’]. Allows you to group onehot encoded categorical variables together in various plots. Defaults to None.

  • cats_notencoded (dict) – value to display when all onehot encoded columns are equal to zero. Defaults to ‘NOT_ENCODED’ for each onehot col.

  • idxs (pd.Series) – list of row identifiers. Can be names, id’s, etc. Defaults to X.index.

  • index_name (str) – identifier for row indexes. e.g. index_name=’Passenger’. Defaults to X.index.name or idxs.name.

  • target (Optional[str]) – name of the predicted target, e.g. “Survival”, “Ticket price”, etc. Defaults to y.name.

  • n_jobs (int) – for jobs that can be parallelized using joblib, how many processes to split the job in. For now only used for calculating permutation importances. Defaults to None.

  • permutation_cv (int) – Deprecated! Use parameter cv instead! (now also works for calculating metrics)

  • cv (int) – If not None then permutation importances and metrics will get calculated using cross validation across X. Use this when you are passing the training set to the explainer. Defaults to None.

  • na_fill (int) – The filler used for missing values, defaults to -999.

  • precision (str) – precision with which to store values. Defaults to “float64”.

  • shap_kwargs (dict) – dictionary of keyword arguments to be passed to the shap explainer. Most typically used to suppress an additivity check, e.g. shap_kwargs=dict(check_additivity=False)

  • units (str) – units to display for regression quantity

property residuals

y - preds

Type

pd.Series

random_index(y_min=None, y_max=None, pred_min=None, pred_max=None, residuals_min=None, residuals_max=None, abs_residuals_min=None, abs_residuals_max=None, return_str=False, **kwargs)

random index satisfying various exclusion criteria

Parameters
  • y_min – (Default value = None)

  • y_max – (Default value = None)

  • pred_min – (Default value = None)

  • pred_max – (Default value = None)

  • residuals_min – (Default value = None)

  • residuals_max – (Default value = None)

  • abs_residuals_min – (Default value = None)

  • abs_residuals_max – (Default value = None)

  • return_str – return the str index from self.idxs (Default value = False)

  • **kwargs

Returns

a random index that fits the exclusion criteria
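
For example, to pick a random row with a large absolute residual (a sketch; the threshold is in the units of your target variable):

idx = explainer.random_index(abs_residuals_min=10, return_str=True)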

metrics(show_metrics=None)

dict of performance metrics: root_mean_squared_error, mean_absolute_error and R-squared

Parameters

show_metrics (List) – list of metrics to display in order. Defaults to None, displaying all metrics.
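
For example, to display only a subset of metrics (a sketch using the metric names listed above):

explainer.metrics(show_metrics=['root_mean_squared_error', 'mean_absolute_error'])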

plot_predicted_vs_actual(round=2, logs=False, log_x=False, log_y=False, plot_sample=None, **kwargs)

plot with predicted value on x-axis and actual value on y axis.

Parameters
  • round (int, optional) – rounding to apply to outcome, defaults to 2

  • logs (bool, optional) – log both x and y axis, defaults to False

  • log_x (bool, optional) – only log the x axis. Defaults to False.

  • log_y (bool, optional) – only log the y axis. Defaults to False.

  • plot_sample (int, optional) – Instead of all points only plot a random sample of points. Defaults to None (=all points)

  • **kwargs

Returns

Plotly fig
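
For example, to log both axes (a minimal sketch):

explainer.plot_predicted_vs_actual(logs=True)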

plot_residuals(vs_actual=False, round=2, residuals='difference', plot_sample=None)

plot of residuals. x-axis is the predicted outcome by default

Parameters
  • vs_actual (bool, optional) – use actual value for x-axis, defaults to False

  • round (int, optional) – rounding to perform on values, defaults to 2

  • residuals (str, {'difference', 'ratio', 'log-ratio'} optional) – How to calculate residuals. Defaults to ‘difference’.

  • plot_sample (int, optional) – Instead of all points only plot a random sample of points. Defaults to None (=all points)

Returns

Plotly fig
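
For example, to plot log-ratio residuals against the actual outcome (a minimal sketch):

explainer.plot_residuals(vs_actual=True, residuals='log-ratio')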

plot_residuals_vs_feature(col, residuals='difference', round=2, dropna=True, points=True, winsor=0, topx=None, sort='alphabet', plot_sample=None)

Plot residuals vs individual features

Parameters
  • col (str) – Plot against feature col

  • residuals (str, {'difference', 'ratio', 'log-ratio'} optional) – How to calculate residuals. Defaults to ‘difference’.

  • round (int, optional) – rounding to perform on residuals, defaults to 2

  • dropna (bool, optional) – drop missing values from plot, defaults to True.

  • points (bool, optional) – display point cloud next to violin plot. Defaults to True.

  • winsor (int, 0-50, optional) – percentage of outliers to winsor out of the y-axis. Defaults to 0.

  • topx – (Default value = None)

  • sort (str, optional) – how to sort categorical features: ‘alphabet’, ‘freq’ or ‘shap’. Defaults to ‘alphabet’.

  • plot_sample (int, optional) – Instead of all points only plot a random sample of points. Defaults to None (=all points)

Returns

plotly fig
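
For example, using the Fare feature from the earlier examples and winsorizing the most extreme 5% of residuals out of the y-axis (a minimal sketch):

explainer.plot_residuals_vs_feature("Fare", residuals='ratio', winsor=5)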