Generate an Explainable Report with the Titanic Dataset using Contextual AI

This notebook demonstrates how to generate an explanation report using the compiler implemented in the Contextual AI library.

Motivation

Once the PoC is done (and you know where your data comes from, what it looks like, and what it can predict), the ideal next step is to put your model into production and make it useful for the rest of the business.

Does this sound familiar? Do you also need to answer the questions below before promoting your model into production?

  1. How can you be sure that your model is ready for production?

  2. How can you explain the model's performance in a business context that non-technical management can understand?

  3. How can you compare newly trained models against existing models, rather than doing the comparison manually every iteration?

In the Contextual AI project, our vision is simple:

  1. Speed up data validation

  2. Simplify model engineering

  3. Build trust

For more details, please refer to our whitepaper

Steps

  1. Create a model to predict survival on the Titanic, using the data provided in titanic.csv

  2. Evaluate the model performance with a Contextual AI report and generate a local explainer pkl

  3. Load the explainer pkl at inference time and explain an instance


1. Model Training and Performance Evaluation

[1]:
import numpy as np
import pandas as pd
import re as re
import warnings

1.1 Loading Data

[2]:
data = pd.read_csv("titanic.csv")

data.head(10)
[2]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
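
Before engineering features, it helps to see which columns actually contain missing values; a minimal check with pandas (a sketch, using the data loaded above):

print(data.isnull().sum())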

1.2 Feature Engineering

[3]:
data['family_size'] = data['SibSp'] + data['Parch'] + 1
data['is_alone'] = 0
data.loc[data['family_size'] == 1, 'is_alone'] = 1

data['Embarked'] = data['Embarked'].fillna('S')

data['Fare'] = data['Fare'].fillna(data['Fare'].median())

age_avg  = data['Age'].mean()
age_std  = data['Age'].std()
age_null = data['Age'].isnull().sum()

random_list = np.random.randint(age_avg - age_std, age_avg + age_std, size=age_null)
data.loc[np.isnan(data['Age']), 'Age'] = random_list
data['Age'] = data['Age'].astype(int)

def get_title(name):
    title_search = re.search(r' ([A-Za-z]+)\. ', name)
    if title_search:
        return title_search.group(1)
    return ""

data['title'] = data['Name'].apply(get_title)
data['title'] = data['title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'],'Rare')
data['title'] = data['title'].replace('Mlle','Miss')
data['title'] = data['title'].replace('Ms','Miss')
data['title'] = data['title'].replace('Mme','Mrs')


#Mapping Sex
sex_map = { 'female':0 , 'male':1 }
data['Sex'] = data['Sex'].map(sex_map).astype(int)

#Mapping Title
title_map = {'Mr':1, 'Miss':2, 'Mrs':3, 'Master':4, 'Rare':5}
data['title'] = data['title'].map(title_map)
data['title'] = data['title'].fillna(0)

#Mapping Embarked
embark_map = {'S':0, 'C':1, 'Q':2}
data['Embarked'] = data['Embarked'].map(embark_map).astype(int)

#Mapping Fare
data.loc[ data['Fare'] <= 7.91, 'Fare']                            = 0
data.loc[(data['Fare'] > 7.91) & (data['Fare'] <= 14.454), 'Fare'] = 1
data.loc[(data['Fare'] > 14.454) & (data['Fare'] <= 31), 'Fare']   = 2
data.loc[ data['Fare'] > 31, 'Fare']                               = 3
data['Fare'] = data['Fare'].astype(int)

#Mapping Age
data.loc[ data['Age'] <= 16, 'Age']                       = 0
data.loc[(data['Age'] > 16) & (data['Age'] <= 32), 'Age'] = 1
data.loc[(data['Age'] > 32) & (data['Age'] <= 48), 'Age'] = 2
data.loc[(data['Age'] > 48) & (data['Age'] <= 64), 'Age'] = 3
data.loc[ data['Age'] > 64, 'Age']                        = 4

1.3 Feature Selection

[4]:
# Create list of columns to drop
drop_elements = ["PassengerId", "Name", "Ticket", "Cabin", "SibSp", "Parch", "family_size"]

# Drop the unused columns from the data set
clean_data = data.drop(drop_elements, axis = 1)
X = clean_data.drop("Survived", axis=1)
y = clean_data["Survived"]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

X_train.to_csv("train_data.csv", index=False)
X_train.head()
[4]:
Pclass Sex Age Fare Embarked is_alone title
615 2 0 1 3 0 0 2
13 3 1 2 3 0 0 1
509 3 1 1 3 0 1 1
494 3 1 1 1 0 1 1
470 3 1 1 0 0 1 1

1.4 Train a Random Forest Model

[5]:
import pickle
from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier(max_depth=5)
random_forest.fit(X_train, y_train)

# Persist the trained model and its predict_proba function for later use
with open('model.pkl', 'wb') as pkl:
    pickle.dump(random_forest, pkl)

with open('func.pkl', 'wb') as func_pkl:
    pickle.dump(random_forest.predict_proba, func_pkl)

# Release the in-memory model; it is reloaded from model.pkl below
random_forest = None
/Users/i309943/opt/anaconda3/envs/xai/lib/python3.6/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)

1.5 Load the Model and Evaluate Performance

[6]:
with open('model.pkl', 'rb') as model_pkl:
    model = pickle.load(model_pkl)

train_accuracy = round(model.score(X_train, y_train) * 100, 2)
print("Model Training Accuracy: ", train_accuracy)
test_accuracy = round(model.score(X_test, y_test) * 100, 2)
print("Model Testing Accuracy: ", test_accuracy)
Model Training Accuracy:  83.43
Model Testing Accuracy:  81.56
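
Accuracy alone can hide per-class behaviour. A quick per-class breakdown with scikit-learn can complement the figures above; the class names used here are an assumption (0 = not survived, 1 = survived):

from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision and recall on the held-out test split
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=['Not Survived', 'Survived']))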

1.6 Run Inference and Save the Outputs

[7]:
y_conf = model.predict_proba(X_test)
np.savetxt("y_conf.csv", y_conf, delimiter=",")
np.savetxt("y_true.csv", y_test, delimiter=",")

2. Invoke the Contextual AI Compiler

[8]:
import os
import sys
from pprint import pprint
sys.path.append('../../../')
from xai.compiler.base import Configuration, Controller

2.1 Specify config file

[9]:
json_config = 'basic-report-explainer.json'

2.2 Initialize the Compiler Controller with the Config

[10]:
controller = Controller(config=Configuration(json_config))
pprint(controller.config)
{'content_table': True,
 'contents': [{'desc': 'This section summarized the training performance',
               'sections': [{'component': {'attr': {'labels_file': 'labels.json',
                                                    'y_pred_file': 'y_conf.csv',
                                                    'y_true_file': 'y_true.csv'},
                                           'class': 'ClassificationEvaluationResult',
                                           'module': 'compiler',
                                           'package': 'xai'},
                             'title': 'Training Result'}],
               'title': 'Training Result'},
              {'desc': 'This section provides the analysis on feature',
               'sections': [{'component': {'_comment': 'refer to document '
                                                       'section xxxx',
                                           'attr': {'train_data': 'train_data.csv',
                                                    'trained_model': 'model.pkl'},
                                           'class': 'FeatureImportanceRanking'},
                             'title': 'Feature Importance Ranking'}],
               'title': 'Feature Importance Analysis'},
              {'desc': 'This section provides a model-agnostic explainer',
               'sections': [{'component': {'attr': {'domain': 'tabular',
                                                    'feature_meta': 'feature_meta.json',
                                                    'method': 'lime',
                                                    'num_features': 5,
                                                    'predict_func': 'func.pkl',
                                                    'train_data': 'train_data.csv'},
                                           'class': 'ModelAgnosticExplainer',
                                           'module': 'compiler',
                                           'package': 'xai'},
                             'title': 'Result'}],
               'title': 'Model-Agnostic Explainer'},
              {'desc': 'This section provides the analysis on data',
               'sections': [{'component': {'_comment': 'refer to document '
                                                       'section xxxx',
                                           'attr': {'data': 'titanic.csv',
                                                    'label': 'Survived'},
                                           'class': 'DataStatisticsAnalysis'},
                             'title': 'Simple Data Statistic'}],
               'title': 'Data Statistics Analysis'}],
 'name': 'Report for Titanic Dataset',
 'overview': True,
 'writers': [{'attr': {'name': 'titanic-basic-report'}, 'class': 'Pdf'}]}
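
The configuration declares four report sections (training results, feature importance ranking, a LIME-based model-agnostic explainer, and simple data statistics) and a PDF writer that produces titanic-basic-report.pdf.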

2.3 Render the Report

[11]:
controller.render()
../../../xai/compiler/explainer.py:120: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  train_data = train_data.as_matrix()
../../../xai/data/helper.py:156: UserWarning: Warning: the feature [PassengerId] is suspected to be key feature as it is monotonic integer.
[Examples]: [1, 2, 3, 4, 5]

  '[Examples]: %s\n' % (column, col_data.tolist()[:5]))
../../../xai/data/helper.py:181: UserWarning: Warning: the feature [Ticket] is suspected to be identifiable feature.
[Examples]: ['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803', '373450']

  '[Examples]: %s\n' % (column, col_data.tolist()[:5]))
../../../xai/data/helper.py:181: UserWarning: Warning: the feature [Cabin] is suspected to be identifiable feature.
[Examples]: [nan, 'C85', nan, 'C123', nan]

  '[Examples]: %s\n' % (column, col_data.tolist()[:5]))
../../../xai/formatter/portable_document/publisher.py:639: UserWarning: Warning: figure will exceed the page bottom, adding a new page.
  warnings.warn(message='Warning: figure will exceed the page bottom, '
../../../xai/graphs/graph_generator.py:47: RuntimeWarning: invalid value encountered in double_scalars
  ave_acc = np.sum(accuracy[condition]) / sample_num
/Users/i309943/opt/anaconda3/envs/xai/lib/python3.6/site-packages/numpy/core/fromnumeric.py:3118: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/Users/i309943/opt/anaconda3/envs/xai/lib/python3.6/site-packages/numpy/core/_methods.py:85: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
../../../xai/formatter/portable_document/publisher.py:454: UserWarning: Warning: figure will exceed the page bottom, adding a new page.
  warnings.warn(message='Warning: figure will exceed the page bottom, '
../../../xai/formatter/portable_document/publisher.py:459: UserWarning: Warning: figure will exceed the page bottom, adding a new page.
  warnings.warn(message='Warning: figure will exceed the page bottom, '
../../../xai/formatter/portable_document/publisher.py:531: UserWarning: Warning: figure will exceed the page edge on the right, rescale the whole group.
  warnings.warn(message='Warning: figure will exceed the page edge '
../../../xai/formatter/portable_document/publisher.py:551: UserWarning: Warning: figure will exceed the page bottom, adding a new page.
  warnings.warn(message='Warning: figure will exceed the page bottom, '

Result

[12]:
print("report generated : %s/titanic-basic-report.pdf" % os.getcwd())
report generated : /Users/i309943/workspace/Explainable_AI/tutorials/compiler/titanic2/titanic-basic-report.pdf
[13]:
print("explainer generated : %s/explainer.pkl" % os.getcwd())
explainer generated : /Users/i309943/workspace/Explainable_AI/tutorials/compiler/titanic2/explainer.pkl

3. Inference with the Explainer

[14]:
import xai
from xai.explainer.explainer_factory import ExplainerFactory
from pprint import pprint

explainer = ExplainerFactory.get_explainer(domain=xai.DOMAIN.TABULAR, algorithm=xai.ALG.LIME)
explainer.load_explainer('explainer.pkl')
explanations = explainer.explain_instance(instance=X_test.values[0, :], num_features=5)
pprint(explanations)
{0: {'explanation': [{'feature': 'Sex=male', 'score': 0.2667696553126673},
                     {'feature': 'Embarked=S', 'score': 0.032281049021469756},
                     {'feature': 'is_alone=Yes',
                      'score': -0.030266655575790963},
                     {'feature': 'Pclass=2', 'score': -0.056222958837791485},
                     {'feature': 'title=Rare', 'score': -0.07360802296209705}],
     'prediction': 0.8584999999999999},
 1: {'explanation': [{'feature': 'title=Rare', 'score': 0.07360802296209706},
                     {'feature': 'Pclass=2', 'score': 0.05622295883779148},
                     {'feature': 'is_alone=Yes', 'score': 0.030266655575790943},
                     {'feature': 'Embarked=S', 'score': -0.03228104902146975},
                     {'feature': 'Sex=male', 'score': -0.2667696553126673}],
     'prediction': 0.14150000000000001}}
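
The explainer returns, for each class, the predicted probability and a list of feature contributions. A small sketch of how this structure could be consumed downstream; the summary format here is an assumption, not part of the library:

# Pick the class with the highest predicted probability and list the
# features that push towards it (positive scores).
predicted_class = max(explanations, key=lambda c: explanations[c]['prediction'])
supporting = [e['feature'] for e in explanations[predicted_class]['explanation'] if e['score'] > 0]
print("Predicted class %d (p=%.2f), supported by: %s"
      % (predicted_class, explanations[predicted_class]['prediction'], ", ".join(supporting)))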