Generate Explainable Report with Titanic dataset using Contextual AI¶
This notebook demonstrates how to generate an explanation report using the compiler implemented in the Contextual AI library.
Motivation¶
Once the PoC is done (and you know where your data comes from, what it looks like, and what it can predict), the ideal next step is to put your model into production and make it useful for the rest of the business.
Does this sound familiar? Before promoting your model into production, do you also need to answer the questions below?
1. How can you be sure that your model is ready for production?
2. How can you explain the model's performance in a business context that non-technical management can understand?
3. How can you compare newly trained models with existing models without doing it manually every iteration?
In the Contextual AI project, our vision is simply to:
1. Speed up data validation
2. Simplify model engineering
3. Build trust
For more details, please refer to our whitepaper
Steps¶
Create a model to predict survival on the Titanic, using the data provided in titanic.csv
Evaluate the model performance with a Contextual AI report and generate a local explainer pickle (pkl)
Load the explainer pkl at inference time and explain an instance
1. Perform Model Training¶
[1]:
import numpy as np
import pandas as pd
import re
import warnings
1.1 Loading Data¶
[2]:
data = pd.read_csv("titanic.csv")
data.head(10)
[2]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
1.2 Feature engineering¶
[3]:
# Family size counts the passenger plus siblings/spouses and parents/children.
data['family_size'] = data['SibSp'] + data['Parch'] + 1
data['is_alone'] = 0
data.loc[data['family_size'] == 1, 'is_alone'] = 1
data['Embarked'] = data['Embarked'].fillna('S')
data['Fare'] = data['Fare'].fillna(data['Fare'].median())
# Fill missing ages with random integers within one standard deviation of the mean.
age_avg = data['Age'].mean()
age_std = data['Age'].std()
age_null = data['Age'].isnull().sum()
random_list = np.random.randint(int(age_avg - age_std), int(age_avg + age_std), size=age_null)
# Assign through .loc; the chained form data['Age'][mask] = ... triggers SettingWithCopyWarning.
data.loc[data['Age'].isnull(), 'Age'] = random_list
data['Age'] = data['Age'].astype(int)
def get_title(name):
    """Extract the title (e.g. Mr, Mrs) from a passenger name."""
    title_search = re.search(r' ([A-Za-z]+)\. ', name)
    if title_search:
        return title_search.group(1)
    return ""
data['title'] = data['Name'].apply(get_title)
data['title'] = data['title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
data['title'] = data['title'].replace('Mlle', 'Miss')
data['title'] = data['title'].replace('Ms', 'Miss')
data['title'] = data['title'].replace('Mme', 'Mrs')
# Mapping Sex
sex_map = {'female': 0, 'male': 1}
data['Sex'] = data['Sex'].map(sex_map).astype(int)
# Mapping Title
title_map = {'Mr': 1, 'Miss': 2, 'Mrs': 3, 'Master': 4, 'Rare': 5}
data['title'] = data['title'].map(title_map)
data['title'] = data['title'].fillna(0)
# Mapping Embarked
embark_map = {'S': 0, 'C': 1, 'Q': 2}
data['Embarked'] = data['Embarked'].map(embark_map).astype(int)
# Mapping Fare
data.loc[data['Fare'] <= 7.91, 'Fare'] = 0
data.loc[(data['Fare'] > 7.91) & (data['Fare'] <= 14.454), 'Fare'] = 1
data.loc[(data['Fare'] > 14.454) & (data['Fare'] <= 31), 'Fare'] = 2
data.loc[data['Fare'] > 31, 'Fare'] = 3
data['Fare'] = data['Fare'].astype(int)
# Mapping Age
data.loc[data['Age'] <= 16, 'Age'] = 0
data.loc[(data['Age'] > 16) & (data['Age'] <= 32), 'Age'] = 1
data.loc[(data['Age'] > 32) & (data['Age'] <= 48), 'Age'] = 2
data.loc[(data['Age'] > 48) & (data['Age'] <= 64), 'Age'] = 3
data.loc[data['Age'] > 64, 'Age'] = 4
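The Fare and Age mappings in the cell above assign bin codes with one `.loc` statement per interval. As a minimal sketch of an alternative (not what the notebook itself does), `pandas.cut` can express the same thresholds in a single call. The toy fares below are taken from the first rows of the table in section 1.1, and the names `fares`, `fare_bins` and `fare_codes` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy fares taken from the first five rows of the table in section 1.1.
fares = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05], name="Fare")

# Bin edges match the thresholds used in the cell above. Intervals are
# right-inclusive by default, so a fare of exactly 7.91 falls in bin 0.
fare_bins = [-np.inf, 7.91, 14.454, 31, np.inf]
fare_codes = pd.cut(fares, bins=fare_bins, labels=[0, 1, 2, 3]).astype(int)

print(fare_codes.tolist())
```

Because all edges are declared in one list, the intervals cannot overlap or leave gaps, which is easy to get wrong when each interval is written as a separate comparison.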
1.3 Feature Selection¶
[4]:
# Create list of columns to drop
drop_elements = ["PassengerId", "Name", "Ticket", "Cabin", "SibSp", "Parch", "family_size"]
# Drop the columns that are not used as model features
clean_data = data.drop(drop_elements, axis = 1)
X = clean_data.drop("Survived", axis=1)
y = clean_data["Survived"]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
X_train.to_csv("train_data.csv", index=False)
X_train.head()
[4]:
| | Pclass | Sex | Age | Fare | Embarked | is_alone | title |
---|---|---|---|---|---|---|---|
615 | 2 | 0 | 1 | 3 | 0 | 0 | 2 |
13 | 3 | 1 | 2 | 3 | 0 | 0 | 1 |
509 | 3 | 1 | 1 | 3 | 0 | 1 | 1 |
494 | 3 | 1 | 1 | 1 | 0 | 1 | 1 |
470 | 3 | 1 | 1 | 0 | 0 | 1 | 1 |
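The split above is called without a seed, so the rows in `train_data.csv` and the accuracies reported later will change from run to run. The sketch below, on synthetic stand-in data rather than the Titanic features, shows how `random_state` makes the split repeatable and how `stratify` keeps the class ratio similar in both parts. The helper `split` and the `*_demo` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered features; this is not the Titanic data.
rng = np.random.RandomState(0)
X_demo = pd.DataFrame({"Pclass": rng.randint(1, 4, size=100),
                       "Sex": rng.randint(0, 2, size=100)})
y_demo = pd.Series(rng.randint(0, 2, size=100), name="Survived")


def split(seed):
    # random_state fixes the shuffle; stratify keeps the label ratio
    # similar in the training and test parts.
    return train_test_split(X_demo, y_demo, test_size=0.20,
                            random_state=seed, stratify=y_demo)


X_tr_a, X_te_a, y_tr_a, y_te_a = split(42)
X_tr_b, X_te_b, y_tr_b, y_te_b = split(42)

# The same seed selects the same rows both times.
print(X_te_a.index.equals(X_te_b.index))
print(len(X_tr_a), len(X_te_a))
```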
1.4 Train a RandomForest model¶
[5]:
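import pickle
from sklearn.ensemble import RandomForestClassifier
# Note: despite the variable name, this is a random forest, not a single decision tree.
decision_tree = RandomForestClassifier(max_depth=5)
decision_tree.fit(X_train, y_train)
# Persist the fitted model; the report config refers to it as 'trained_model'.
with open('model.pkl', 'wb') as pkl:
    pickle.dump(decision_tree, pkl)
# Persist the prediction function; the report config refers to it as 'predict_func'.
with open('func.pkl', 'wb') as func_pkl:
    pickle.dump(decision_tree.predict_proba, func_pkl)
decision_tree = None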
/Users/i309943/opt/anaconda3/envs/xai/lib/python3.6/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
"10 in version 0.20 to 100 in 0.22.", FutureWarning)
1.5 Load the model and evaluate it¶
[6]:
with open('model.pkl', 'rb') as model_pkl:
    model = pickle.load(model_pkl)
train_accuracy = round(model.score(X_train, y_train) * 100, 2)
print("Model Training Accuracy: ", train_accuracy)
test_accuracy = round(model.score(X_test, y_test) * 100, 2)
print("Model Testing Accuracy: ", test_accuracy)
Model Training Accuracy: 83.43
Model Testing Accuracy: 81.56
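Besides `model.pkl`, section 1.4 also pickles the bound method `predict_proba` into `func.pkl`. The sketch below, on synthetic data with a hypothetical `forest` variable, illustrates why that works: pickling a bound method stores the object it is bound to together with the method name, so the restored callable carries the whole fitted model.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only to obtain a fitted model.
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 4, size=(60, 3))
y_demo = (X_demo[:, 0] > 1).astype(int)

forest = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0)
forest.fit(X_demo, y_demo)

# The pickle of a bound method contains the estimator it is bound to,
# so the fitted forest travels with predict_proba.
payload = pickle.dumps(forest.predict_proba)
restored_predict_proba = pickle.loads(payload)

same = np.allclose(restored_predict_proba(X_demo), forest.predict_proba(X_demo))
print(same)
```

One consequence is that `func.pkl` is roughly as large as `model.pkl`, and both should be loaded with the same scikit-learn version that wrote them.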
1.6 Run inference and save the results¶
[7]:
y_conf = model.predict_proba(X_test)
np.savetxt("y_conf.csv", y_conf, delimiter=",")
np.savetxt("y_true.csv", y_test, delimiter=",")
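The two files written above hold one row per test passenger: `y_conf.csv` has one probability column per class, in the order of `model.classes_`, and `y_true.csv` has the true label. As a small sketch with made-up probabilities (the `*_demo` names and values are illustrative, not results from this notebook), the files can be read back and turned into predicted labels like this:

```python
import os
import tempfile

import numpy as np

# Made-up probabilities for three passengers; columns are [P(class 0), P(class 1)].
y_conf_demo = np.array([[0.86, 0.14], [0.30, 0.70], [0.55, 0.45]])
y_true_demo = np.array([0, 1, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    conf_path = os.path.join(tmp_dir, "y_conf.csv")
    true_path = os.path.join(tmp_dir, "y_true.csv")
    np.savetxt(conf_path, y_conf_demo, delimiter=",")
    np.savetxt(true_path, y_true_demo, delimiter=",")

    # Read the files back the way a downstream consumer might.
    loaded_conf = np.loadtxt(conf_path, delimiter=",")
    loaded_true = np.loadtxt(true_path, delimiter=",").astype(int)

# The predicted label is the column with the highest probability.
y_pred_demo = loaded_conf.argmax(axis=1)
accuracy = float((y_pred_demo == loaded_true).mean())
print(y_pred_demo.tolist(), accuracy)
```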
2. Invoke the Contextual AI compiler¶
[8]:
import os
import sys
from pprint import pprint
sys.path.append('../../../')
from xai.compiler.base import Configuration, Controller
2.1 Specify config file¶
[9]:
json_config = 'basic-report-explainer.json'
2.2 Initialize the compiler controller with the config¶
[10]:
controller = Controller(config=Configuration(json_config))
pprint(controller.config)
{'content_table': True,
'contents': [{'desc': 'This section summarized the training performance',
'sections': [{'component': {'attr': {'labels_file': 'labels.json',
'y_pred_file': 'y_conf.csv',
'y_true_file': 'y_true.csv'},
'class': 'ClassificationEvaluationResult',
'module': 'compiler',
'package': 'xai'},
'title': 'Training Result'}],
'title': 'Training Result'},
{'desc': 'This section provides the analysis on feature',
'sections': [{'component': {'_comment': 'refer to document '
'section xxxx',
'attr': {'train_data': 'train_data.csv',
'trained_model': 'model.pkl'},
'class': 'FeatureImportanceRanking'},
'title': 'Feature Importance Ranking'}],
'title': 'Feature Importance Analysis'},
{'desc': 'This section provides a model-agnostic explainer',
'sections': [{'component': {'attr': {'domain': 'tabular',
'feature_meta': 'feature_meta.json',
'method': 'lime',
'num_features': 5,
'predict_func': 'func.pkl',
'train_data': 'train_data.csv'},
'class': 'ModelAgnosticExplainer',
'module': 'compiler',
'package': 'xai'},
'title': 'Result'}],
'title': 'Model-Agnostic Explainer'},
{'desc': 'This section provides the analysis on data',
'sections': [{'component': {'_comment': 'refer to document '
'section xxxx',
'attr': {'data': 'titanic.csv',
'label': 'Survived'},
'class': 'DataStatisticsAnalysis'},
'title': 'Simple Data Statistic'}],
'title': 'Data Statistics Analysis'}],
'name': 'Report for Titanic Dataset',
'overview': True,
'writers': [{'attr': {'name': 'titanic-basic-report'}, 'class': 'Pdf'}]}
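The printed configuration shows the overall shape of a report definition: a report name, a list of content sections that each point to a component class with its attributes, and a list of writers. The sketch below rebuilds one section and the writer from that output and writes them to a JSON file. It is a reconstruction from the printed dictionary only; the actual `basic-report-explainer.json` may differ, and the file name `example-report-config.json` is hypothetical.

```python
import json
import os
import tempfile

# One section and the writer, copied from the configuration printed above.
# The real basic-report-explainer.json may contain more or differ in layout.
example_config = {
    "name": "Report for Titanic Dataset",
    "overview": True,
    "content_table": True,
    "contents": [
        {
            "title": "Training Result",
            "desc": "This section summarized the training performance",
            "sections": [
                {
                    "title": "Training Result",
                    "component": {
                        "package": "xai",
                        "module": "compiler",
                        "class": "ClassificationEvaluationResult",
                        "attr": {
                            "y_true_file": "y_true.csv",
                            "y_pred_file": "y_conf.csv",
                            "labels_file": "labels.json",
                        },
                    },
                }
            ],
        }
    ],
    "writers": [{"class": "Pdf", "attr": {"name": "titanic-basic-report"}}],
}

# Hypothetical file name, written to a temporary directory so nothing is overwritten.
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "example-report-config.json")
    with open(config_path, "w") as config_file:
        json.dump(example_config, config_file, indent=2)
    with open(config_path) as config_file:
        reloaded = json.load(config_file)

print(reloaded["writers"][0]["class"])
```

Note that the evaluation section reads `y_conf.csv` and `y_true.csv`, the two files saved in section 1.6, so those must exist before the report is rendered.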
2.3 Render the report with the compiler¶
[11]:
controller.render()
../../../xai/compiler/explainer.py:120: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
train_data = train_data.as_matrix()
../../../xai/data/helper.py:156: UserWarning: Warning: the feature [PassengerId] is suspected to be key feature as it is monotonic integer.
[Examples]: [1, 2, 3, 4, 5]
'[Examples]: %s\n' % (column, col_data.tolist()[:5]))
../../../xai/data/helper.py:181: UserWarning: Warning: the feature [Ticket] is suspected to be identifiable feature.
[Examples]: ['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803', '373450']
'[Examples]: %s\n' % (column, col_data.tolist()[:5]))
../../../xai/data/helper.py:181: UserWarning: Warning: the feature [Cabin] is suspected to be identifiable feature.
[Examples]: [nan, 'C85', nan, 'C123', nan]
'[Examples]: %s\n' % (column, col_data.tolist()[:5]))
../../../xai/formatter/portable_document/publisher.py:639: UserWarning: Warning: figure will exceed the page bottom, adding a new page.
warnings.warn(message='Warning: figure will exceed the page bottom, '
../../../xai/graphs/graph_generator.py:47: RuntimeWarning: invalid value encountered in double_scalars
ave_acc = np.sum(accuracy[condition]) / sample_num
/Users/i309943/opt/anaconda3/envs/xai/lib/python3.6/site-packages/numpy/core/fromnumeric.py:3118: RuntimeWarning: Mean of empty slice.
out=out, **kwargs)
/Users/i309943/opt/anaconda3/envs/xai/lib/python3.6/site-packages/numpy/core/_methods.py:85: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
../../../xai/formatter/portable_document/publisher.py:454: UserWarning: Warning: figure will exceed the page bottom, adding a new page.
warnings.warn(message='Warning: figure will exceed the page bottom, '
../../../xai/formatter/portable_document/publisher.py:459: UserWarning: Warning: figure will exceed the page bottom, adding a new page.
warnings.warn(message='Warning: figure will exceed the page bottom, '
../../../xai/formatter/portable_document/publisher.py:531: UserWarning: Warning: figure will exceed the page edge on the right, rescale the whole group.
warnings.warn(message='Warning: figure will exceed the page edge '
../../../xai/formatter/portable_document/publisher.py:551: UserWarning: Warning: figure will exceed the page bottom, adding a new page.
warnings.warn(message='Warning: figure will exceed the page bottom, '
Result¶
[12]:
print("report generated : %s/titanic-basic-report.pdf" % os.getcwd())
report generated : /Users/i309943/workspace/Explainable_AI/tutorials/compiler/titanic2/titanic-basic-report.pdf
[13]:
print("explainer generated : %s/explainer.pkl" % os.getcwd())
explainer generated : /Users/i309943/workspace/Explainable_AI/tutorials/compiler/titanic2/explainer.pkl
Inference Explainer¶
[14]:
import xai
from xai.explainer.explainer_factory import ExplainerFactory
from pprint import pprint
explainer = ExplainerFactory.get_explainer(domain=xai.DOMAIN.TABULAR, algorithm=xai.ALG.LIME)
explainer.load_explainer('explainer.pkl')
explanations = explainer.explain_instance(instance=X_test.values[0,:],num_features=5)
pprint(explanations)
{0: {'explanation': [{'feature': 'Sex=male', 'score': 0.2667696553126673},
{'feature': 'Embarked=S', 'score': 0.032281049021469756},
{'feature': 'is_alone=Yes',
'score': -0.030266655575790963},
{'feature': 'Pclass=2', 'score': -0.056222958837791485},
{'feature': 'title=Rare', 'score': -0.07360802296209705}],
'prediction': 0.8584999999999999},
1: {'explanation': [{'feature': 'title=Rare', 'score': 0.07360802296209706},
{'feature': 'Pclass=2', 'score': 0.05622295883779148},
{'feature': 'is_alone=Yes', 'score': 0.030266655575790943},
{'feature': 'Embarked=S', 'score': -0.03228104902146975},
{'feature': 'Sex=male', 'score': -0.2667696553126673}],
'prediction': 0.14150000000000001}}
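The output above is keyed by class label (0 and 1 are the values of `Survived`), and each entry holds the predicted probability plus a list of feature scores. The scores for class 1 are the negatives of those for class 0. The sketch below ranks the features for the predicted class, assuming the usual reading of LIME weights in which a positive score means the feature value pushed the prediction towards that class. The scores are copied from the output and rounded; `summarize` is a hypothetical helper, not part of the library.

```python
# Scores copied (rounded to four decimals) from the output above.
explanations_demo = {
    0: {"prediction": 0.8585,
        "explanation": [
            {"feature": "Sex=male", "score": 0.2668},
            {"feature": "Embarked=S", "score": 0.0323},
            {"feature": "is_alone=Yes", "score": -0.0303},
            {"feature": "Pclass=2", "score": -0.0562},
            {"feature": "title=Rare", "score": -0.0736},
        ]},
    1: {"prediction": 0.1415,
        "explanation": [
            {"feature": "title=Rare", "score": 0.0736},
            {"feature": "Pclass=2", "score": 0.0562},
            {"feature": "is_alone=Yes", "score": 0.0303},
            {"feature": "Embarked=S", "score": -0.0323},
            {"feature": "Sex=male", "score": -0.2668},
        ]},
}


def summarize(explanations):
    """Return the most probable class and its features ranked by |score|."""
    predicted = max(explanations, key=lambda label: explanations[label]["prediction"])
    items = explanations[predicted]["explanation"]
    ranked = sorted(items, key=lambda item: abs(item["score"]), reverse=True)
    return predicted, ranked


predicted_label, ranked_features = summarize(explanations_demo)
print("predicted class:", predicted_label)
for item in ranked_features:
    direction = "supports" if item["score"] > 0 else "opposes"
    print(f"{item['feature']:>14} {direction} class {predicted_label} ({item['score']:+.4f})")
```

Read this way, the instance is predicted as class 0 (did not survive) mainly because of `Sex=male`, while `title=Rare` and `Pclass=2` pull weakly in the other direction.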