Generate an Explainable Report for the Titanic Dataset Using Contextual AI¶

This notebook demonstrates how to generate an explanation report using the compiler implemented in the Contextual AI library.

Motivation¶

Once the PoC is done (and you know where your data comes from, what it looks like, and what it can predict), the ideal next step is to put your model into production and make it useful for the rest of the business.

Does this sound familiar? Before promoting your model to production, you may also need to answer the questions below:
1. How can you be sure that your model is ready for production?
2. How can you explain the model's performance in a business context that non-technical management can understand?
3. How can you compare newly trained models with existing models without repeating the comparison manually every iteration?

In the Contextual AI project, our vision is simple:
1. Speed up data validation
2. Simplify model engineering
3. Build trust

For more details, please refer to our whitepaper

Steps¶

Create a model to predict survival on the Titanic, using the data provided in titanic.csv
Evaluate the model performance with a Contextual AI report and generate a local explainer pkl
Load the explainer pkl at inference time and explain an instance

1. Perform Model Training¶

[1]:

import numpy as np
import pandas as pd
import re
import warnings

1.1 Loading Data¶

[2]:

data = pd.read_csv("titanic.csv")

data.head(10)

[2]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C

1.2 Feature Engineering¶

[3]:

data['family_size'] = data['SibSp'] + data['Parch'] + 1
data['is_alone'] = 0
data.loc[data['family_size'] == 1, 'is_alone'] = 1

data['Embarked'] = data['Embarked'].fillna('S')

data['Fare'] = data['Fare'].fillna(data['Fare'].median())

age_avg  = data['Age'].mean()
age_std  = data['Age'].std()
age_null = data['Age'].isnull().sum()

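# Fill missing ages with random integers within one standard deviation of the mean.
# Assigning through .loc avoids the chained-assignment SettingWithCopyWarning.
random_list = np.random.randint(int(age_avg - age_std), int(age_avg + age_std), size=age_null)
data.loc[data['Age'].isnull(), 'Age'] = random_list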
data['Age'] = data['Age'].astype(int)

def get_title(name):
    title_search = re.search(r' ([A-Za-z]+)\. ', name)
    if title_search:
        return title_search.group(1)
    return ""

data['title'] = data['Name'].apply(get_title)
data['title'] = data['title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'],'Rare')
data['title'] = data['title'].replace('Mlle','Miss')
data['title'] = data['title'].replace('Ms','Miss')
data['title'] = data['title'].replace('Mme','Mrs')


#Mapping Sex
sex_map = { 'female':0 , 'male':1 }
data['Sex'] = data['Sex'].map(sex_map).astype(int)

#Mapping Title
title_map = {'Mr':1, 'Miss':2, 'Mrs':3, 'Master':4, 'Rare':5}
data['title'] = data['title'].map(title_map)
data['title'] = data['title'].fillna(0)

#Mapping Embarked
embark_map = {'S':0, 'C':1, 'Q':2}
data['Embarked'] = data['Embarked'].map(embark_map).astype(int)

#Mapping Fare
data.loc[ data['Fare'] <= 7.91, 'Fare']                            = 0
data.loc[(data['Fare'] > 7.91) & (data['Fare'] <= 14.454), 'Fare'] = 1
data.loc[(data['Fare'] > 14.454) & (data['Fare'] <= 31), 'Fare']   = 2
data.loc[ data['Fare'] > 31, 'Fare']                               = 3
data['Fare'] = data['Fare'].astype(int)

#Mapping Age
data.loc[ data['Age'] <= 16, 'Age']                       = 0
data.loc[(data['Age'] > 16) & (data['Age'] <= 32), 'Age'] = 1
data.loc[(data['Age'] > 32) & (data['Age'] <= 48), 'Age'] = 2
data.loc[(data['Age'] > 48) & (data['Age'] <= 64), 'Age'] = 3
data.loc[ data['Age'] > 64, 'Age']                        = 4

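The fare and age bucketing above can also be written with `pandas.cut`, which assigns every bin in one pass instead of overwriting the column step by step. The following is a minimal sketch, not part of the original notebook: the sample values are made up for illustration, and the bin edges are the ones used in the cell above.

```python
import numpy as np
import pandas as pd

# Illustrative values only; they are not taken from titanic.csv.
sample = pd.DataFrame({
    "Fare": [7.25, 7.91, 8.05, 14.454, 30.0, 31.0, 71.28],
    "Age": [2, 16, 17, 32, 40, 64, 80],
})

# Right-closed bins reproduce the "<= edge" comparisons used above.
fare_edges = [-np.inf, 7.91, 14.454, 31, np.inf]
age_edges = [-np.inf, 16, 32, 48, 64, np.inf]

# labels=False returns the integer index of the bin each value falls into.
sample["Fare_bin"] = pd.cut(sample["Fare"], bins=fare_edges, labels=False).astype(int)
sample["Age_bin"] = pd.cut(sample["Age"], bins=age_edges, labels=False).astype(int)

print(sample)
```

Keeping the binned values in separate columns, as done here, also leaves the raw `Fare` and `Age` available for later inspection.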

1.3 Feature Selection¶

[4]:

# Create list of columns to drop
drop_elements = ["PassengerId", "Name", "Ticket", "Cabin", "SibSp", "Parch", "family_size"]

# Drop the unused columns, then separate the features from the label
clean_data = data.drop(drop_elements, axis = 1)
X = clean_data.drop("Survived", axis=1)
y = clean_data["Survived"]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

X_train.to_csv("train_data.csv", index=False)
X_train.head()

[4]:

	Pclass	Sex	Age	Fare	is_alone	title
615	2	0	1	3	0	2
13	3	1	2	3	0	1
509	3	1	1	3	1	1
494	3	1	1	1	1	1
470	3	1	1	0	1	1
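The split above draws a different random partition on every run, so the saved train_data.csv and the accuracy figures change between executions. The sketch below shows one way to make the split repeatable and to keep the class ratio similar in both parts. The toy arrays, the seed value, and the use of `stratify` are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for X and y (assumption: 100 rows, 38 positive labels).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 4, size=(100, 3))
y_demo = np.array([1] * 38 + [0] * 62)

# random_state fixes the shuffle; stratify keeps the label ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape, y_te.mean())
```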

1.4 Train a RandomForest Model¶

[5]:

import pickle
from sklearn.ensemble import RandomForestClassifier

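# Train a random forest; the two pickles below are the files the report config points to
forest = RandomForestClassifier(max_depth=5)
forest.fit(X_train, y_train)

# Persist the fitted model
with open('model.pkl', 'wb') as pkl:
    pickle.dump(forest, pkl)

# Persist the bound predict_proba method for the model-agnostic explainer
with open('func.pkl', 'wb') as func_pkl:
    pickle.dump(forest.predict_proba, func_pkl)

# Drop the in-memory model; section 1.5 reloads it from disk
forest = None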

/Users/i309943/opt/anaconda3/envs/xai/lib/python3.6/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)

1.5 Load the Model and Evaluate¶

[6]:

with open('model.pkl', 'rb') as model_pkl:
    model = pickle.load(model_pkl)

train_accuracy = round(model.score(X_train, y_train) * 100, 2)
print("Model Training Accuracy: ", train_accuracy)
test_accuracy = round(model.score(X_test, y_test) * 100, 2)
print("Model Testing Accuracy: ", test_accuracy)

Model Training Accuracy:  83.43
Model Testing Accuracy:  81.56
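Section 1.4 also writes func.pkl, which stores the bound `predict_proba` method rather than the estimator itself. Pickling a bound method serializes the object it belongs to, so the restored callable behaves like the original. The sketch below checks this on toy data with an in-memory round trip; the data, the fixed seeds, and the variable names are assumptions for illustration.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data (assumption): three integer features, label derived from the first one.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 4, size=(60, 3))
y_demo = (X_demo[:, 0] > 1).astype(int)

clf = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0)
clf.fit(X_demo, y_demo)

# Round-trip the bound method through pickle, as func.pkl does via a file.
restored_predict = pickle.loads(pickle.dumps(clf.predict_proba))

# The restored callable should return the same probabilities as the original.
same = np.allclose(restored_predict(X_demo), clf.predict_proba(X_demo))
print(same)
```

As with any pickle, the file should only be loaded from a trusted source and with a compatible scikit-learn version.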

1.6 Run Inference and Save the Results¶

[7]:

y_conf = model.predict_proba(X_test)
np.savetxt("y_conf.csv", y_conf, delimiter=",")
np.savetxt("y_true.csv", y_test, delimiter=",")
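The two files written above hold one row of class probabilities per test instance and the matching true labels. As a quick sanity check before handing them to the report, they can be read back and compared. The sketch below does this with made-up values in a temporary directory; only the file names and the `np.savetxt` calls mirror the cell above.

```python
import os
import tempfile

import numpy as np

# Toy probabilities and labels (assumption), shaped like predict_proba output.
y_conf_demo = np.array([[0.9, 0.1], [0.3, 0.7], [0.6, 0.4], [0.2, 0.8]])
y_true_demo = np.array([0, 1, 1, 1])

with tempfile.TemporaryDirectory() as tmp:
    conf_path = os.path.join(tmp, "y_conf.csv")
    true_path = os.path.join(tmp, "y_true.csv")
    np.savetxt(conf_path, y_conf_demo, delimiter=",")
    np.savetxt(true_path, y_true_demo, delimiter=",")

    # Read both files back; savetxt stores labels as floats, so cast to int.
    loaded_conf = np.loadtxt(conf_path, delimiter=",")
    loaded_true = np.loadtxt(true_path, delimiter=",").astype(int)

# The predicted class is the column with the highest probability.
y_pred_demo = loaded_conf.argmax(axis=1)
accuracy = float((y_pred_demo == loaded_true).mean())
print(loaded_conf.shape, accuracy)
```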

2. Invoke the Contextual AI Compiler¶

[8]:

import os
import sys
from pprint import pprint
sys.path.append('../../../')
from xai.compiler.base import Configuration, Controller

2.1 Specify config file¶

[9]:

json_config = 'basic-report-explainer.json'

2.2 Initialize the compiler controller with the config¶

[10]:

controller = Controller(config=Configuration(json_config))
pprint(controller.config)

{'content_table': True,
 'contents': [{'desc': 'This section summarized the training performance',
               'sections': [{'component': {'attr': {'labels_file': 'labels.json',
                                                    'y_pred_file': 'y_conf.csv',
                                                    'y_true_file': 'y_true.csv'},
                                           'class': 'ClassificationEvaluationResult',
                                           'module': 'compiler',
                                           'package': 'xai'},
                             'title': 'Training Result'}],
               'title': 'Training Result'},
              {'desc': 'This section provides the analysis on feature',
               'sections': [{'component': {'_comment': 'refer to document '
                                                       'section xxxx',
                                           'attr': {'train_data': 'train_data.csv',
                                                    'trained_model': 'model.pkl'},
                                           'class': 'FeatureImportanceRanking'},
                             'title': 'Feature Importance Ranking'}],
               'title': 'Feature Importance Analysis'},
              {'desc': 'This section provides a model-agnostic explainer',
               'sections': [{'component': {'attr': {'domain': 'tabular',
                                                    'feature_meta': 'feature_meta.json',
                                                    'method': 'lime',
                                                    'num_features': 5,
                                                    'predict_func': 'func.pkl',
                                                    'train_data': 'train_data.csv'},
                                           'class': 'ModelAgnosticExplainer',
                                           'module': 'compiler',
                                           'package': 'xai'},
                             'title': 'Result'}],
               'title': 'Model-Agnostic Explainer'},
              {'desc': 'This section provides the analysis on data',
               'sections': [{'component': {'_comment': 'refer to document '
                                                       'section xxxx',
                                           'attr': {'data': 'titanic.csv',
                                                    'label': 'Survived'},
                                           'class': 'DataStatisticsAnalysis'},
                             'title': 'Simple Data Statistic'}],
               'title': 'Data Statistics Analysis'}],
 'name': 'Report for Titanic Dataset',
 'overview': True,
 'writers': [{'attr': {'name': 'titanic-basic-report'}, 'class': 'Pdf'}]}
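Each component in the configuration above names the input files it reads. Before rendering, it can help to list those files and confirm that they exist. The sketch below walks an excerpt of the printed structure using plain Python; the helper `referenced_files` and the suffix filter are hypothetical and are not part of the Contextual AI API.

```python
# Excerpt of the configuration printed above, trimmed to the keys used here.
config_excerpt = {
    "contents": [
        {"sections": [{"component": {
            "class": "ClassificationEvaluationResult",
            "attr": {"labels_file": "labels.json",
                     "y_pred_file": "y_conf.csv",
                     "y_true_file": "y_true.csv"}}}]},
        {"sections": [{"component": {
            "class": "FeatureImportanceRanking",
            "attr": {"train_data": "train_data.csv",
                     "trained_model": "model.pkl"}}}]},
        {"sections": [{"component": {
            "class": "ModelAgnosticExplainer",
            "attr": {"domain": "tabular",
                     "feature_meta": "feature_meta.json",
                     "method": "lime",
                     "num_features": 5,
                     "predict_func": "func.pkl",
                     "train_data": "train_data.csv"}}}]},
        {"sections": [{"component": {
            "class": "DataStatisticsAnalysis",
            "attr": {"data": "titanic.csv", "label": "Survived"}}}]},
    ],
}

# Hypothetical filter: treat string attributes with these suffixes as files.
FILE_SUFFIXES = (".csv", ".json", ".pkl")


def referenced_files(config):
    """Map each component class to the file names found in its attributes."""
    files = {}
    for content in config["contents"]:
        for section in content["sections"]:
            component = section["component"]
            names = [value for value in component["attr"].values()
                     if isinstance(value, str) and value.endswith(FILE_SUFFIXES)]
            files[component["class"]] = sorted(names)
    return files


files_by_component = referenced_files(config_excerpt)
all_files = sorted({name for names in files_by_component.values() for name in names})
print(all_files)
```

Note that labels.json and feature_meta.json are not written by any cell in this notebook, so they presumably need to be present next to the config file before rendering.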

2.3 Render the report¶

[11]:

controller.render()

../../../xai/compiler/explainer.py:120: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  train_data = train_data.as_matrix()
../../../xai/data/helper.py:156: UserWarning: Warning: the feature [PassengerId] is suspected to be key feature as it is monotonic integer.
[Examples]: [1, 2, 3, 4, 5]

  '[Examples]: %s\n' % (column, col_data.tolist()[:5]))
../../../xai/data/helper.py:181: UserWarning: Warning: the feature [Ticket] is suspected to be identifiable feature.
[Examples]: ['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803', '373450']

  '[Examples]: %s\n' % (column, col_data.tolist()[:5]))
../../../xai/data/helper.py:181: UserWarning: Warning: the feature [Cabin] is suspected to be identifiable feature.
[Examples]: [nan, 'C85', nan, 'C123', nan]

  '[Examples]: %s\n' % (column, col_data.tolist()[:5]))
../../../xai/formatter/portable_document/publisher.py:639: UserWarning: Warning: figure will exceed the page bottom, adding a new page.
  warnings.warn(message='Warning: figure will exceed the page bottom, '
../../../xai/graphs/graph_generator.py:47: RuntimeWarning: invalid value encountered in double_scalars
  ave_acc = np.sum(accuracy[condition]) / sample_num
/Users/i309943/opt/anaconda3/envs/xai/lib/python3.6/site-packages/numpy/core/fromnumeric.py:3118: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/Users/i309943/opt/anaconda3/envs/xai/lib/python3.6/site-packages/numpy/core/_methods.py:85: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
../../../xai/formatter/portable_document/publisher.py:454: UserWarning: Warning: figure will exceed the page bottom, adding a new page.
  warnings.warn(message='Warning: figure will exceed the page bottom, '
../../../xai/formatter/portable_document/publisher.py:459: UserWarning: Warning: figure will exceed the page bottom, adding a new page.
  warnings.warn(message='Warning: figure will exceed the page bottom, '
../../../xai/formatter/portable_document/publisher.py:531: UserWarning: Warning: figure will exceed the page edge on the right, rescale the whole group.
  warnings.warn(message='Warning: figure will exceed the page edge '
../../../xai/formatter/portable_document/publisher.py:551: UserWarning: Warning: figure will exceed the page bottom, adding a new page.
  warnings.warn(message='Warning: figure will exceed the page bottom, '
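Most of the messages above are layout notices from the PDF publisher, and they can bury the few warnings that matter. One option is to wrap the call with the standard `warnings` module, which the notebook imports in its first cell, and silence only the `UserWarning` category. In the sketch below, `noisy_render` is a hypothetical stand-in for `controller.render()`, and its messages are shortened copies of the ones shown above.

```python
import warnings


def noisy_render():
    """Hypothetical stand-in for controller.render() that emits two warnings."""
    warnings.warn("figure will exceed the page bottom, adding a new page.", UserWarning)
    warnings.warn("invalid value encountered in double_scalars", RuntimeWarning)
    return "rendered"


with warnings.catch_warnings(record=True) as caught:
    # Record every warning, then add a more specific rule that drops UserWarning.
    warnings.simplefilter("always")
    warnings.filterwarnings("ignore", category=UserWarning)
    result = noisy_render()

# Only the warnings that were not ignored are recorded.
remaining = [(w.category.__name__, str(w.message)) for w in caught]
print(result, remaining)
```

The filters are restored when the `with` block exits, so other cells keep their default behavior.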

Result¶

[12]:

print("report generated : %s/titanic-basic-report.pdf" % os.getcwd())

report generated : /Users/i309943/workspace/Explainable_AI/tutorials/compiler/titanic2/titanic-basic-report.pdf

[13]:

print("explainer generated : %s/explainer.pkl" % os.getcwd())

explainer generated : /Users/i309943/workspace/Explainable_AI/tutorials/compiler/titanic2/explainer.pkl

Explain an Instance at Inference Time¶

[14]:

import xai
from xai.explainer.explainer_factory import ExplainerFactory
from pprint import pprint

explainer = ExplainerFactory.get_explainer(domain=xai.DOMAIN.TABULAR, algorithm=xai.ALG.LIME)
explainer.load_explainer('explainer.pkl')
explanations = explainer.explain_instance(instance=X_test.values[0,:],num_features=5)
pprint(explanations)

{0: {'explanation': [{'feature': 'Sex=male', 'score': 0.2667696553126673},
                     {'feature': 'Embarked=S', 'score': 0.032281049021469756},
                     {'feature': 'is_alone=Yes',
                      'score': -0.030266655575790963},
                     {'feature': 'Pclass=2', 'score': -0.056222958837791485},
                     {'feature': 'title=Rare', 'score': -0.07360802296209705}],
     'prediction': 0.8584999999999999},
 1: {'explanation': [{'feature': 'title=Rare', 'score': 0.07360802296209706},
                     {'feature': 'Pclass=2', 'score': 0.05622295883779148},
                     {'feature': 'is_alone=Yes', 'score': 0.030266655575790943},
                     {'feature': 'Embarked=S', 'score': -0.03228104902146975},
                     {'feature': 'Sex=male', 'score': -0.2667696553126673}],
     'prediction': 0.14150000000000001}}
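The dictionary above is keyed by class label, and each entry holds the predicted probability plus the per-feature scores for that class. The sketch below shows one way to condense it: pick the class with the highest probability and rank its features by absolute score. The scores are copied from the output above and rounded to four decimals; the helper `summarize` is hypothetical and not part of the library.

```python
# Scores copied from the output above, rounded to four decimals.
explanations_demo = {
    0: {"prediction": 0.8585,
        "explanation": [{"feature": "Sex=male", "score": 0.2668},
                        {"feature": "Embarked=S", "score": 0.0323},
                        {"feature": "is_alone=Yes", "score": -0.0303},
                        {"feature": "Pclass=2", "score": -0.0562},
                        {"feature": "title=Rare", "score": -0.0736}]},
    1: {"prediction": 0.1415,
        "explanation": [{"feature": "title=Rare", "score": 0.0736},
                        {"feature": "Pclass=2", "score": 0.0562},
                        {"feature": "is_alone=Yes", "score": 0.0303},
                        {"feature": "Embarked=S", "score": -0.0323},
                        {"feature": "Sex=male", "score": -0.2668}]},
}


def summarize(explanations):
    """Return the most probable class and its features ranked by |score|."""
    top_class = max(explanations, key=lambda label: explanations[label]["prediction"])
    ranked = sorted(explanations[top_class]["explanation"],
                    key=lambda item: abs(item["score"]), reverse=True)
    return top_class, [(item["feature"], item["score"]) for item in ranked]


top_class, ranked_features = summarize(explanations_demo)
print("Predicted class:", top_class)
for feature, score in ranked_features:
    print(f"{feature:>14}  {score:+.4f}")
```

For this binary model the two entries mirror each other: every score for class 1 is the negative of the score for class 0, so reading one of them is enough.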