Data Explorer via Contextual AI

This tutorial demonstrates how to explore different types of data using xai.data.explorer.

The following types of data are supported:

- Categorical
- Numerical
- Free Text
- Datetime

[1]:
# Some auxiliary imports for the tutorial
import sys
import pandas as pd
import numpy as np
from collections import Counter
from sklearn import datasets
sys.path.append('../../')
import xai

We use the following datasets as samples to demonstrate the above analyzers:

- Breast cancer dataset: demonstration for xai.data.explorer.categorical
- Iris plants dataset: demonstration for xai.data.explorer.numerical
- The 20 newsgroups text dataset: demonstration for xai.data.explorer.text
- Self-generated dataset: demonstration for xai.data.explorer.datetime

1. Categorical

Package xai.data.explorer.categorical analyzes categorical data and produces a frequency count for each categorical value. categorical_analyzer can be fed categorical values and returns a categorical_stats object, which contains the frequency count for each unique value. labelled_categorical_analyzer takes an additional class argument when fed values and generates frequency counts per class type.
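
For reference, here is a minimal sketch of the unlabelled analyzer (it is not executed in this tutorial). The import path categorical_analyzer, the class name CategoricalDataAnalyzer, and the single-argument feed_all are assumptions made by analogy with the labelled analyzer used in the cells below:

# Hedged sketch: the import path, class name and single-argument feed_all are assumed,
# mirroring the labelled analyzer demonstrated in the cells below.
from xai.data.explorer.categorical.categorical_analyzer import CategoricalDataAnalyzer

analyzer = CategoricalDataAnalyzer()
analyzer.feed_all(['left_low', 'right_up', 'left_low'])  # categorical values only, no class labels
stats = analyzer.get_statistics()  # a categorical_stats object with a frequency count per unique value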

[2]:
# load dataset
cols_names = ['Class', 'age', 'menopause', 'tumor-size', 'inv-nodes', 'node-caps',
              'deg-malig', 'breast', 'breast-quad', 'irradiat']
# read the data
breast_data = pd.read_csv('sample_data/breast-cancer.data', header=None,
                 names=cols_names).replace({'?': 'unknown'})  # NaN are represented by '?'

breast_data.head()
[2]:
Class age menopause tumor-size inv-nodes node-caps deg-malig breast breast-quad irradiat
0 no-recurrence-events 30-39 premeno 30-34 0-2 no 3 left left_low no
1 no-recurrence-events 40-49 premeno 20-24 0-2 no 2 right right_up no
2 no-recurrence-events 40-49 premeno 20-24 0-2 no 2 left left_low no
3 no-recurrence-events 60-69 ge40 15-19 0-2 no 2 right left_up no
4 no-recurrence-events 40-49 premeno 0-4 0-2 no 2 right right_low no
[3]:
from xai.data.explorer.categorical.labelled_categorical_analyzer import LabelledCategoricalDataAnalyzer

First, we analyze the breast-quad column based on Class. The frequency count for each breast-quad value is summarized per class type, as shown in the figure on the left. labelled_categorical_analyzer also generates a combined frequency count across all classes, as shown in the figure on the right.

[4]:
label_column = 'Class'
feature_column = 'breast-quad'
labelled_analyzer = LabelledCategoricalDataAnalyzer()
labelled_analyzer.feed_all(breast_data[feature_column].tolist(),breast_data[label_column].tolist())
labelled_stats, all_stats = labelled_analyzer.get_statistics()

We use the helper class NotebookPlots (from xai.plots.data_stats_notebook_plots) to visualize the generated statistics.

[5]:
from xai.plots.data_stats_notebook_plots import NotebookPlots
plotter = NotebookPlots()
plotter.plot_labelled_categorical_stats(labelled_stats, all_stats, label_column, feature_column)
../../_images/tutorials_data_tutorial_data_explorer_9_0.png
../../_images/tutorials_data_tutorial_data_explorer_9_1.png

Similarly, we can analyze the breast-quad column but choose menopause as the class. The frequency count for each breast-quad value is summarized per menopause type.

[6]:
label_column = 'menopause'
feature_column = 'breast-quad'
labelled_analyzer = LabelledCategoricalDataAnalyzer()
labelled_analyzer.feed_all(breast_data[feature_column].tolist(),breast_data[label_column].tolist())
labelled_stats, all_stats = labelled_analyzer.get_statistics()
plotter.plot_labelled_categorical_stats(labelled_stats, all_stats, label_column, feature_column)
../../_images/tutorials_data_tutorial_data_explorer_11_0.png
../../_images/tutorials_data_tutorial_data_explorer_11_1.png

2. Numerical

Package xai.data.explorer.numerical analyzes numerical data and produces a statistical analysis including min, max, mean, median, standard deviation, histogram, and KDE. numerical_analyzer takes all input numerical values and generates an overall analysis, while labelled_numerical_analyzer takes numerical values and class labels to generate an analysis per class type.
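
As in the categorical case, the unlabelled analyzer is not exercised in this tutorial; the sketch below assumes it mirrors the labelled API (the import path numerical_analyzer and the class name NumericalDataAnalyzer are assumptions):

# Hedged sketch: import path, class name and single-argument feed_all are assumed.
from xai.data.explorer.numerical.numerical_analyzer import NumericalDataAnalyzer

analyzer = NumericalDataAnalyzer()
analyzer.feed_all([5.1, 4.9, 4.7, 4.6, 5.0])  # numerical values only, no class labels
overall_stats = analyzer.get_statistics()  # min, max, mean, median, sd, histogram and kde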

[7]:
# load dataset
iris = datasets.load_iris()
iris_data = pd.DataFrame(iris.data,columns=iris.feature_names)
iris_data['Class'] = iris.target
iris_data.head()
[7]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) Class
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0
[8]:
from xai.data.explorer.numerical.labelled_numerical_analyzer import LabelledNumericalDataAnalyzer

First, we show statistics of sepal width (cm) with respect to Class.

[9]:
label_column = 'Class'
feature_column = 'sepal width (cm)'
labelled_analyzer = LabelledNumericalDataAnalyzer()
labelled_analyzer.feed_all(iris_data[feature_column].tolist(),iris_data[label_column].tolist())
labelled_stats, all_stats = labelled_analyzer.get_statistics(extreme_value_percentile=[0,100],num_of_bins=15)

plotter.plot_labelled_numerical_stats(labelled_stats,all_stats,label_column,feature_column)

sepal width (cm)

../../_images/tutorials_data_tutorial_data_explorer_16_1.png
class max mean median min sd total_count
0 0 4.4 3.418 0.377195 2.3 4.4 50
1 1 3.4 2.770 0.310644 2.0 3.4 50
2 2 3.8 2.974 0.319255 2.2 3.8 50
3 all 4.4 3.054 0.432147 2.0 4.4 150

Here is another example of statistics of petal length (cm) with respect to Class.

[10]:
label_column = 'Class'
feature_column = 'petal length (cm)'
labelled_analyzer = LabelledNumericalDataAnalyzer()
labelled_analyzer.feed_all(iris_data[feature_column].tolist(),iris_data[label_column].tolist())
labelled_stats, all_stats = labelled_analyzer.get_statistics(extreme_value_percentile=[0,100],num_of_bins=20)

plotter.plot_labelled_numerical_stats(labelled_stats,all_stats,label_column,feature_column)

petal length (cm)

../../_images/tutorials_data_tutorial_data_explorer_18_1.png
class max mean median min sd total_count
0 0 1.9 1.464000 0.171767 1.0 1.9 50
1 1 5.1 4.260000 0.465188 3.0 5.1 50
2 2 6.9 5.552000 0.546348 4.5 6.9 50
3 all 6.9 3.758667 1.758529 1.0 6.9 150

3. Text

Package xai.data.explorer.text analyzes free-text data and produces a statistical analysis including word count, character count, term frequency, document frequency, and TF-IDF. text_analyzer takes all input free text and generates an overall analysis, while labelled_text_analyzer takes free text and associated class labels to generate a statistical analysis per class type.
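
The unlabelled text analyzer is likewise not run here; the sketch below assumes its import path, class name (TextDataAnalyzer), and constructor keyword arguments mirror those of the labelled analyzer used in the cells that follow:

# Hedged sketch: import path, class name and constructor kwargs are assumed to
# mirror LabelledTextDataAnalyzer, which is demonstrated below.
from xai.data.explorer.text.text_analyzer import TextDataAnalyzer

analyzer = TextDataAnalyzer(stop_words_by_languages=['english'])
analyzer.feed_all(['first sample document', 'second sample document'])  # free text only, no labels
overall_stats = analyzer.get_statistics()  # word/character counts, TF, DF and TF-IDF over all documents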

[11]:
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups()
sample_num = 2000
texts = data['data'][:sample_num]
labels = [data['target_names'][i] for i in data['target']][:sample_num]

Users can define their own preprocessing function and a set of predefined patterns.

[12]:
from xai.data.explorer.text.labelled_text_analyzer import LabelledTextDataAnalyzer

def preprocess(text):
    import re
    text = re.sub(r'[^\w\s]',' ',text)
    return text

predefined_pattern = {'point':r'p\d','number':r'\d+','singlar':r'\s\w\s'}

First, we show TF-IDF and term frequency by class type. Then we show TF-IDF and term frequency across all classes.

[13]:
labelled_analyzer = LabelledTextDataAnalyzer(preprocess_fn=preprocess,stop_words_by_languages=['english'],
                                             predefined_pattern=predefined_pattern)
labelled_analyzer.feed_all(texts,labels)
labelled_stats, all_stats = labelled_analyzer.get_statistics()
plotter.plot_labelled_text_stats(labelled_stats, all_stats)

../../_images/tutorials_data_tutorial_data_explorer_24_0.png
../../_images/tutorials_data_tutorial_data_explorer_24_1.png
../../_images/tutorials_data_tutorial_data_explorer_24_2.png
../../_images/tutorials_data_tutorial_data_explorer_24_3.png
../../_images/tutorials_data_tutorial_data_explorer_24_4.png
../../_images/tutorials_data_tutorial_data_explorer_24_5.png
class longest doc number (doc count) number (term count) point (doc count) point (term count) singlar (doc count) singlar (term count) total count
0 rec.autos 1963 109 1748 2 2 109 2019 109
1 comp.sys.mac.hardware 566 108 1132 4 5 108 1324 108
2 comp.graphics 1595 92 1277 4 207 92 1632 92
3 sci.space 3782 87 1799 0 0 87 1794 87
4 talk.politics.guns 2552 100 1022 4 4 100 1924 100
5 all 15769 2000 69203 46 264 1997 62943 2000

4. Datetime

Package xai.data.explorer.datetime analyzes datetime data and produces frequency counts at a given resolution, which can be year, month, day, weekday, hour, minute, or second. datetime_analyzer takes all values and generates an overall analysis, while labelled_datetime_analyzer generates a statistical analysis per class type.
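
For reference, a sketch of the unlabelled analyzer is given below (the import path datetime_analyzer, the class name DatetimeDataAnalyzer, and the single-argument feed_all are assumptions; DatetimeResolution is the constants module used later in this tutorial):

# Hedged sketch: import path, class name and single-argument feed_all are assumed.
from xai.data.explorer.datetime.datetime_analyzer import DatetimeDataAnalyzer
from xai.data.constants import DatetimeResolution

analyzer = DatetimeDataAnalyzer()
analyzer.feed_all([20180309, 20180223, 20180227])  # same integer YYYYMMDD format as the sample data below
stats = analyzer.get_statistics(resolution_list=[DatetimeResolution.YEAR, DatetimeResolution.MONTH])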

[14]:
filename = 'sample_data/date.csv'
data = pd.read_csv(filename)
data.head()

[14]:
COMPANYCODE POSTINGDATE VALUEDATE
0 13 20180309 20180309
1 13 20180223 20180223
2 13 20180223 20180223
3 13 20180223 20180223
4 13 20180227 20180227
[15]:
from xai.data.explorer.datetime.labelled_datetime_analyzer import LabelledDatetimeDataAnalyzer
from xai.data.constants import DatetimeResolution

We set the resolution to year and month, so that the input data will be grouped by year and month.

[16]:
datetime_column = 'POSTINGDATE'
label_column = 'COMPANYCODE'
datetimes = [int(x) for x in data[datetime_column]]
labels =[int(x) for x in data[label_column]]

labelled_analyzer = LabelledDatetimeDataAnalyzer()
labelled_analyzer.feed_all(datetimes,labels)
labelled_stats, all_stats = labelled_analyzer.get_statistics(resolution_list=[DatetimeResolution.YEAR,
                                                                               DatetimeResolution.MONTH])

We define a helper function plot_labelled_datetime_stats to visualize the generated statistics. This function only supports plotting datetime distributions with the YEAR and MONTH resolutions.

[17]:
import matplotlib.pyplot as plt
def plot_labelled_datetime_stats(labelled_stats, all_stats, label_name):
    def _plot(_class,_frequency_count):
        plt.figure(figsize=(8,4))
        data_dist = {k: v for (k, v) in sorted(_frequency_count.items())}
        min_year = list(data_dist.keys())[0]
        max_year = list(data_dist.keys())[-1]

        earliest = (min_year, min([int(month) for month in data_dist[min_year].keys()]))
        latest = (max_year, max([int(month) for month in data_dist[max_year].keys()]))

        line_num = len(data_dist)
        data_frame = []
        for year in data_dist:
            year_data = [0] * 13
            year_data[0] = year
            for month in data_dist[year]:
                year_data[int(month)] = int(data_dist[year][month])
            data_frame.append(year_data)
        data_frame = np.array(data_frame).astype(int)
        bars = []
        colors = ['#1abc9c', '#2ecc71', '#3498db', '#7f8c8d', '#9b59b6', '#34495e', '#f1c40f', '#e67e22', '#e74c3c',
                  '#95a5a6', '#d35400', '#bdc3c7']
        legends = ['Jan', 'Feb', 'Mar', 'April', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
        current_sum = np.zeros(line_num)
        for month in range(1, 13):
            p = plt.barh(y=list(range(0, line_num)), width=data_frame[:, month].tolist(), left=current_sum.tolist(),
                         color=colors[month - 1])
            bars.append(p)
            current_sum = data_frame[:, month] + current_sum
        for idx in range(line_num):
            plt.text(current_sum[idx], idx, int(current_sum[idx]), color='black', ha="left")

        plt.yticks(list(range(0, line_num)), list(data_dist.keys()))
        plt.legend(bars, legends, bbox_to_anchor=(1.04, 1), loc="upper left")
        plt.title(
            '%s - %s\nDate Range: %s.%s - %s.%s' % (label_name,_class,earliest[0], legends[earliest[1] - 1], latest[0], legends[latest[1] - 1]))
        plt.show()

    for _class in list(labelled_stats.keys())[:3]:
        _plot(_class,labelled_stats[_class].frequency_count)

    _plot('all',all_stats.frequency_count)

The figures below show frequency counts by class based on year and month, as configured previously.

[18]:
plot_labelled_datetime_stats(labelled_stats, all_stats, label_column)
../../_images/tutorials_data_tutorial_data_explorer_33_0.png
../../_images/tutorials_data_tutorial_data_explorer_33_1.png
../../_images/tutorials_data_tutorial_data_explorer_33_2.png
../../_images/tutorials_data_tutorial_data_explorer_33_3.png

5. Data Analyzer Suite

xai.data.explorer.DataAnalyzerSuite is a tool to analyze a data set based on its defined schema. It supports structured data (e.g. CSV) and unstructured data (e.g. JSON Lines). It is also compatible with data sets that contain sequence features.

[19]:
## load meta data
import json
from pprint import pprint

with open('./sample_data/meta.json','r') as f:
    meta = json.load(f)

pprint(meta)

# get feature names, feature types and sequence feature names,
sequence_names = list(meta['sequence'].keys())

data_type = []
column_names = []

for key, value in meta['non_sequence'].items():
    data_type.append(value['type'])
    column_names.append(key)

for key, value in meta['sequence'].items():
    data_type.append(value['type'])
    column_names.append(key)

print(sequence_names)
print(column_names)
print(data_type)
{'non_sequence': {'IC_0/COUNTRY': {'type': 'categorical'},
                  'IC_0/REGION': {'type': 'categorical'},
                  'IC_0/SEX': {'type': 'categorical'}},
 'sequence': {'IA_0/COMM_MEDIUM': {'type': 'categorical'},
              'IA_0/IA_INTERVAL': {'type': 'numerical'},
              'IA_0/IA_REASON': {'type': 'categorical'},
              'IA_0/IA_TYPE': {'type': 'categorical'}}}
['IA_0/IA_TYPE', 'IA_0/IA_INTERVAL', 'IA_0/COMM_MEDIUM', 'IA_0/IA_REASON']
['IC_0/COUNTRY', 'IC_0/REGION', 'IC_0/SEX', 'IA_0/IA_TYPE', 'IA_0/IA_INTERVAL', 'IA_0/COMM_MEDIUM', 'IA_0/IA_REASON']
['categorical', 'categorical', 'categorical', 'categorical', 'numerical', 'categorical', 'categorical']

We import xai.data.explorer.data_analyzer_suite and initialize a DataAnalyzerSuite object. The initializer takes three parameters:

- data_type_list: a list of pre-defined data types
- column_names: a list of column names. If this is not provided, data_type_list should have the same length as the data
- sequence_names: a list of feature names that are treated as sequence features, so that each item of the value is aggregated to get the overall stats

[20]:
from xai.data.explorer.data_analyzer_suite import DataAnalyzerSuite

data_analyzer_suite = DataAnalyzerSuite(data_type_list=data_type,
                                        column_names=column_names,
                                        sequence_names=sequence_names)

Then we parse each JSON line of the data into a JSON object and feed it into the analyzer together with its label.

[21]:
## load json line data
with open('./sample_data/lead_score.data','r') as f:
    for line in f:
        data = json.loads(line)
        label = data['LABEL']
        data_analyzer_suite.feed_row(data,label)

Finally, the get_statistics() function is called to generate the statistics of the entire data set based on each column and its type. The returned stats is a dictionary that maps each feature name to a tuple of stats (class-based stats and overall stats); a small example of unpacking this tuple is shown after the output below.

[22]:
stats = data_analyzer_suite.get_statistics()
pprint(stats)
{'IA_0/COMM_MEDIUM': ({0: <xai.data.explorer.categorical.categorical_stats.CategoricalStats object at 0x7f83e163bc18>,
                       1: <xai.data.explorer.categorical.categorical_stats.CategoricalStats object at 0x7f83e163b7f0>},
                      <xai.data.explorer.categorical.categorical_stats.CategoricalStats object at 0x7f83e163b438>),
 'IA_0/IA_INTERVAL': ({0: <xai.data.explorer.numerical.numerical_stats.NumericalStats object at 0x7f83e163bba8>,
                       1: <xai.data.explorer.numerical.numerical_stats.NumericalStats object at 0x7f83e163b3c8>},
                      <xai.data.explorer.numerical.numerical_stats.NumericalStats object at 0x7f83e163bf60>),
 'IA_0/IA_REASON': ({0: <xai.data.explorer.categorical.categorical_stats.CategoricalStats object at 0x7f83e163bcf8>,
                     1: <xai.data.explorer.categorical.categorical_stats.CategoricalStats object at 0x7f83e163beb8>},
                    <xai.data.explorer.categorical.categorical_stats.CategoricalStats object at 0x7f83e163b470>),
 'IA_0/IA_TYPE': ({0: <xai.data.explorer.categorical.categorical_stats.CategoricalStats object at 0x7f83e163b320>,
                   1: <xai.data.explorer.categorical.categorical_stats.CategoricalStats object at 0x7f83e163bf28>},
                  <xai.data.explorer.categorical.categorical_stats.CategoricalStats object at 0x7f83d3a71ac8>),
 'IC_0/COUNTRY': ({0: <xai.data.explorer.categorical.categorical_stats.CategoricalStats object at 0x7f83d3a71d68>,
                   1: <xai.data.explorer.categorical.categorical_stats.CategoricalStats object at 0x7f83d3a71da0>},
                  <xai.data.explorer.categorical.categorical_stats.CategoricalStats object at 0x7f83d3a71358>),
 'IC_0/REGION': ({0: <xai.data.explorer.categorical.categorical_stats.CategoricalStats object at 0x7f83d3a71240>,
                  1: <xai.data.explorer.categorical.categorical_stats.CategoricalStats object at 0x7f83d3a71208>},
                 <xai.data.explorer.categorical.categorical_stats.CategoricalStats object at 0x7f83d3a71eb8>),
 'IC_0/SEX': ({0: <xai.data.explorer.categorical.categorical_stats.CategoricalStats object at 0x7f83d3a716d8>,
               1: <xai.data.explorer.categorical.categorical_stats.CategoricalStats object at 0x7f83d3a71cf8>},
              <xai.data.explorer.categorical.categorical_stats.CategoricalStats object at 0x7f83d3a71f60>)}
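
The tuple for a single feature can also be unpacked and inspected directly, as in the hedged snippet below; the frequency_count attribute on CategoricalStats is an assumption, made by analogy with how the datetime stats object was accessed earlier:

# Unpack the (class-based stats, overall stats) tuple for one categorical feature.
class_stats, overall = stats['IA_0/COMM_MEDIUM']
for label, s in class_stats.items():
    # frequency_count is assumed to hold the per-value counts (as for the datetime stats above)
    print(label, s.frequency_count)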

We plot IA_0/COMM_MEDIUM as an example of the categorical feature results, and IA_0/IA_INTERVAL as an example of the numerical results.

[23]:
feature_column = 'IA_0/COMM_MEDIUM'
plotter.plot_labelled_categorical_stats(labelled_stats = stats[feature_column][0],
                                        all_stats = stats[feature_column][1],
                                        label_column = 'LABEL',
                                        feature_column = feature_column)
../../_images/tutorials_data_tutorial_data_explorer_43_0.png
../../_images/tutorials_data_tutorial_data_explorer_43_1.png
[24]:
feature_column = 'IA_0/IA_INTERVAL'
plotter.plot_labelled_numerical_stats(labelled_stats = stats[feature_column][0],
                                      all_stats = stats[feature_column][1],
                                      label_column = 'LABEL',
                                      feature_column = feature_column)

IA_0/IA_INTERVAL

../../_images/tutorials_data_tutorial_data_explorer_44_1.png
class max mean median min sd total_count
0 0 365.0 18.147545 41.756111 0.0 365.0 17249
1 1 363.0 18.162088 46.937193 0.0 363.0 728
2 all 365.0 18.148134 41.978351 0.0 365.0 17977