Introduction: Identify Customer Segments

In this project, I will apply unsupervised learning techniques and algorithms to identify segments of the population that form the core customer base for a mail-order sales company in Germany. These segments can then be used to direct marketing campaigns towards the audiences with the highest expected rate of return. The dataset I will be using for this project has been provided by Udacity's partners at Bertelsmann Arvato Analytics, and represents a real-life data science task.

Part 1: Importing the Libraries

In [244]:
# Linear algebra and matrix manipulation
import numpy as np
import pandas as pd

# Visualisation
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Sklearn
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

Part 2: Loading the Datasets

There are four files associated with this project (not including this Jupyter Notebook):

  • Udacity_AZDIAS_Subset.csv: Demographics data for the general population of Germany; 891221 persons (rows) x 85 features (columns).
  • Udacity_CUSTOMERS_Subset.csv: Demographics data for customers of a mail-order company; 191652 persons (rows) x 85 features (columns).
  • AZDIAS_Feature_Summary.csv: Summary of feature attributes for demographics data; 85 features (rows) x 4 columns.
  • Data_Dictionary.md: Detailed information file about the features in the provided datasets.

Each row of the demographics files represents a single person, but the columns also include information beyond the individual, such as their household, building, and neighborhood. I will use this information to cluster the general population into groups with similar demographic properties. Then, I will see how the people in the customers dataset fit into those clusters. The hope is that certain clusters are over-represented in the customers data compared to the general population; those over-represented clusters will be assumed to be part of the core customer base. This information can then be used for further applications, such as targeting for a marketing campaign.

To start off, I will load the demographics data for the general population into a pandas DataFrame, and do the same for the feature attributes summary. Note that all of the .csv data files in this project are semicolon (;) delimited, so I need to pass an additional sep argument to read_csv() to read the data in properly. Also, if you are going to run this project on your own system, note that the main dataset may take a while to load completely, given its rather large size.

In [97]:
# General demographics data
azdias = pd.read_csv("Udacity_AZDIAS_Subset.csv", sep=";")

# Feature summary file
feature_info = pd.read_csv("AZDIAS_Feature_Summary.csv", sep=";")
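
If you ever need to confirm the delimiter before loading, a quick peek at the raw header line is enough. This is a minimal sketch; it assumes the file sits in the current working directory.

In [ ]:
# Peek at the first line of the raw file to confirm the semicolon delimiter
with open("Udacity_AZDIAS_Subset.csv") as f:
    print(f.readline()[:100])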

2.1. Exploring the Datasets

In [98]:
# Check the structure of the dataset
azdias.head()
Out[98]:
AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER ... PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
0 -1 2 1 2.0 3 4 3 5 5 3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 -1 1 2 5.0 1 5 2 5 4 5 ... 2.0 3.0 2.0 1.0 1.0 5.0 4.0 3.0 5.0 4.0
2 -1 3 2 3.0 1 4 1 2 3 5 ... 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 5.0 2.0
3 2 4 2 2.0 4 2 5 2 1 2 ... 2.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 3.0 3.0
4 -1 3 1 5.0 4 3 4 1 3 2 ... 2.0 4.0 2.0 1.0 2.0 3.0 3.0 4.0 6.0 5.0

5 rows × 85 columns

In [99]:
azdias.tail()
Out[99]:
AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER ... PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
891216 -1 3 2 5.0 1 4 2 5 4 4 ... 3.0 2.0 0.0 0.0 1.0 2.0 3.0 NaN NaN NaN
891217 -1 2 1 4.0 3 3 3 2 2 3 ... 1.0 3.0 3.0 2.0 4.0 5.0 3.0 4.0 6.0 5.0
891218 -1 2 2 4.0 2 4 2 5 4 3 ... 4.0 2.0 0.0 0.0 1.0 3.0 4.0 2.0 2.0 3.0
891219 -1 1 1 3.0 1 5 3 5 5 5 ... 1.0 4.0 3.0 1.0 5.0 1.0 1.0 4.0 7.0 5.0
891220 -1 4 1 1.0 4 2 5 2 1 5 ... 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 4.0 5.0

5 rows × 85 columns

In [100]:
azdias.describe()
Out[100]:
AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER ... PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
count 891221.000000 891221.000000 891221.000000 886367.000000 891221.000000 891221.000000 891221.000000 891221.000000 891221.000000 891221.000000 ... 774706.000000 774706.000000 774706.000000 774706.000000 774706.000000 774706.000000 774706.000000 794005.000000 794005.000000 794005.00000
mean -0.358435 2.777398 1.522098 3.632838 3.074528 2.821039 3.401106 3.033328 2.874167 3.075121 ... 2.253330 2.801858 1.595426 0.699166 1.943913 3.612821 3.381087 3.167854 5.293002 3.07222
std 1.198724 1.068775 0.499512 1.595021 1.321055 1.464749 1.322134 1.529603 1.486731 1.353248 ... 0.972008 0.920309 0.986736 0.727137 1.459654 0.973967 1.111598 1.002376 2.303739 1.36298
min -1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 0.000000 1.00000
25% -1.000000 2.000000 1.000000 2.000000 2.000000 1.000000 3.000000 2.000000 2.000000 2.000000 ... 1.000000 2.000000 1.000000 0.000000 1.000000 3.000000 3.000000 3.000000 4.000000 2.00000
50% -1.000000 3.000000 2.000000 4.000000 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000 ... 2.000000 3.000000 2.000000 1.000000 1.000000 4.000000 3.000000 3.000000 5.000000 3.00000
75% -1.000000 4.000000 2.000000 5.000000 4.000000 4.000000 5.000000 5.000000 4.000000 4.000000 ... 3.000000 3.000000 2.000000 1.000000 3.000000 4.000000 4.000000 4.000000 7.000000 4.00000
max 3.000000 9.000000 2.000000 6.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 ... 4.000000 4.000000 3.000000 2.000000 5.000000 5.000000 5.000000 9.000000 9.000000 9.00000

8 rows × 81 columns

In [101]:
azdias.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891221 entries, 0 to 891220
Data columns (total 85 columns):
AGER_TYP                 891221 non-null int64
ALTERSKATEGORIE_GROB     891221 non-null int64
ANREDE_KZ                891221 non-null int64
CJT_GESAMTTYP            886367 non-null float64
FINANZ_MINIMALIST        891221 non-null int64
FINANZ_SPARER            891221 non-null int64
FINANZ_VORSORGER         891221 non-null int64
FINANZ_ANLEGER           891221 non-null int64
FINANZ_UNAUFFAELLIGER    891221 non-null int64
FINANZ_HAUSBAUER         891221 non-null int64
FINANZTYP                891221 non-null int64
GEBURTSJAHR              891221 non-null int64
GFK_URLAUBERTYP          886367 non-null float64
GREEN_AVANTGARDE         891221 non-null int64
HEALTH_TYP               891221 non-null int64
LP_LEBENSPHASE_FEIN      886367 non-null float64
LP_LEBENSPHASE_GROB      886367 non-null float64
LP_FAMILIE_FEIN          886367 non-null float64
LP_FAMILIE_GROB          886367 non-null float64
LP_STATUS_FEIN           886367 non-null float64
LP_STATUS_GROB           886367 non-null float64
NATIONALITAET_KZ         891221 non-null int64
PRAEGENDE_JUGENDJAHRE    891221 non-null int64
RETOURTYP_BK_S           886367 non-null float64
SEMIO_SOZ                891221 non-null int64
SEMIO_FAM                891221 non-null int64
SEMIO_REL                891221 non-null int64
SEMIO_MAT                891221 non-null int64
SEMIO_VERT               891221 non-null int64
SEMIO_LUST               891221 non-null int64
SEMIO_ERL                891221 non-null int64
SEMIO_KULT               891221 non-null int64
SEMIO_RAT                891221 non-null int64
SEMIO_KRIT               891221 non-null int64
SEMIO_DOM                891221 non-null int64
SEMIO_KAEM               891221 non-null int64
SEMIO_PFLICHT            891221 non-null int64
SEMIO_TRADV              891221 non-null int64
SHOPPER_TYP              891221 non-null int64
SOHO_KZ                  817722 non-null float64
TITEL_KZ                 817722 non-null float64
VERS_TYP                 891221 non-null int64
ZABEOTYP                 891221 non-null int64
ALTER_HH                 817722 non-null float64
ANZ_PERSONEN             817722 non-null float64
ANZ_TITEL                817722 non-null float64
HH_EINKOMMEN_SCORE       872873 non-null float64
KK_KUNDENTYP             306609 non-null float64
W_KEIT_KIND_HH           783619 non-null float64
WOHNDAUER_2008           817722 non-null float64
ANZ_HAUSHALTE_AKTIV      798073 non-null float64
ANZ_HH_TITEL             794213 non-null float64
GEBAEUDETYP              798073 non-null float64
KONSUMNAEHE              817252 non-null float64
MIN_GEBAEUDEJAHR         798073 non-null float64
OST_WEST_KZ              798073 non-null object
WOHNLAGE                 798073 non-null float64
CAMEO_DEUG_2015          792242 non-null object
CAMEO_DEU_2015           792242 non-null object
CAMEO_INTL_2015          792242 non-null object
KBA05_ANTG1              757897 non-null float64
KBA05_ANTG2              757897 non-null float64
KBA05_ANTG3              757897 non-null float64
KBA05_ANTG4              757897 non-null float64
KBA05_BAUMAX             757897 non-null float64
KBA05_GBZ                757897 non-null float64
BALLRAUM                 797481 non-null float64
EWDICHTE                 797481 non-null float64
INNENSTADT               797481 non-null float64
GEBAEUDETYP_RASTER       798066 non-null float64
KKK                      770025 non-null float64
MOBI_REGIO               757897 non-null float64
ONLINE_AFFINITAET        886367 non-null float64
REGIOTYP                 770025 non-null float64
KBA13_ANZAHL_PKW         785421 non-null float64
PLZ8_ANTG1               774706 non-null float64
PLZ8_ANTG2               774706 non-null float64
PLZ8_ANTG3               774706 non-null float64
PLZ8_ANTG4               774706 non-null float64
PLZ8_BAUMAX              774706 non-null float64
PLZ8_HHZ                 774706 non-null float64
PLZ8_GBZ                 774706 non-null float64
ARBEIT                   794005 non-null float64
ORTSGR_KLS9              794005 non-null float64
RELAT_AB                 794005 non-null float64
dtypes: float64(49), int64(32), object(4)
memory usage: 578.0+ MB
In [102]:
feature_info.head()
Out[102]:
attribute information_level type missing_or_unknown
0 AGER_TYP person categorical [-1,0]
1 ALTERSKATEGORIE_GROB person ordinal [-1,0,9]
2 ANREDE_KZ person categorical [-1,0]
3 CJT_GESAMTTYP person categorical [0]
4 FINANZ_MINIMALIST person ordinal [-1]
In [103]:
feature_info.tail()
Out[103]:
attribute information_level type missing_or_unknown
80 PLZ8_HHZ macrocell_plz8 ordinal [-1]
81 PLZ8_GBZ macrocell_plz8 ordinal [-1]
82 ARBEIT community ordinal [-1,9]
83 ORTSGR_KLS9 community ordinal [-1,0]
84 RELAT_AB community ordinal [-1,9]
In [104]:
feature_info.describe()
Out[104]:
attribute information_level type missing_or_unknown
count 85 85 85 85
unique 85 9 5 9
top SEMIO_KRIT person ordinal [-1]
freq 1 43 49 26
In [105]:
feature_info.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85 entries, 0 to 84
Data columns (total 4 columns):
attribute             85 non-null object
information_level     85 non-null object
type                  85 non-null object
missing_or_unknown    85 non-null object
dtypes: object(4)
memory usage: 2.7+ KB

Part 3: Preprocessing

The feature summary file contains a summary of properties for each demographics data column. I will use this file to make data-cleaning decisions during this stage of the project. First of all, I will assess the demographics data in terms of missing data.

3.1. Converting Missing Value Codes to NaNs

The fourth column of the feature attributes summary (loaded above as feature_info) documents the codes from the data dictionary that indicate missing or unknown data. While the file encodes this as a list (e.g. [-1,0]), it gets read in as a string object. So, I need to do a little bit of parsing to make use of it to identify and clean the data. I also want to see how much data takes on a "missing" or "unknown" code versus how much is naturally missing, as a point of interest.

In [106]:
# Double sum to add them all
print("Number of missing values: {}".format(azdias.isnull().sum().sum()))
Number of missing values: 4896838

Take a look at the feature_info DataFrame: the missing_or_unknown column contains non-numeric codes such as X or XX, and even empty lists, so the parsing below needs to handle those cases.

In [107]:
feature_info[57:60] # Take a look at the missing_or_unknown column!
Out[107]:
attribute information_level type missing_or_unknown
57 CAMEO_DEUG_2015 microcell_rr4 categorical [-1,X]
58 CAMEO_DEU_2015 microcell_rr4 categorical [XX]
59 CAMEO_INTL_2015 microcell_rr4 mixed [-1,XX]
In [108]:
for row in range(len(feature_info)):
    # Parse the string representation of the list, e.g. "[-1,0]" -> [-1, 0]
    m_o_u = feature_info.iloc[row]["missing_or_unknown"]
    m_o_u = m_o_u.strip("[").strip("]").split(sep=",")
    # Convert numeric codes to integers; keep "X"/"XX" codes as strings
    m_o_u = [int(value) if value not in ("", "X", "XX") else value for value in m_o_u]

    # Replace this attribute's missing/unknown codes with NaN in azdias
    if m_o_u != [""]:
        azdias = azdias.replace({feature_info.iloc[row]["attribute"]: m_o_u}, np.nan)
In [109]:
# Sum them all up again to check
print("Number of missing values: {}".format(azdias.isnull().sum().sum()))
Number of missing values: 8373929

The count went up from 4,896,838 to 8,373,929, which means roughly 3.5 million additional values carried a "missing or unknown" code and have now been converted to NaNs. Looks just fine.

3.2. Assessing Missing Data in Each Column

There are a few columns that are outliers in terms of the proportion of values that are missing in them. I am going to use matplotlib's hist() function to visualize the distribution of missing value counts to find these columns.

In [110]:
# Finding the percentage of the missing data in each column
columns_missing_percentage = 100*(azdias.isnull().sum()/len(azdias))
In [111]:
# Plot it out to see it better
plt.figure(figsize=(8,4))
plt.hist(columns_missing_percentage, bins=40)
plt.xlabel("Percentage of Missing Data")
plt.ylabel("Number of Columns")
plt.show()
In [112]:
# Plot the percentage of missing values for each column
columns_sorted = columns_missing_percentage.sort_values(ascending=False)
plt.figure(figsize=(16,8))
plt.xlabel("Column Titles")
plt.ylabel("Percentage of Missing Values")
plt.title("Percentage of Missing Values of Each Respective Column")
columns_sorted.plot.bar(color="blue");

15% looks like a reasonable threshold for our purpose.

In [113]:
# Select the columns with a missing value rate above 15 percent
columns_missing_percentage_15 = columns_missing_percentage[columns_missing_percentage>15]
columns_missing_percentage_15_list = columns_missing_percentage_15.index.tolist()

# Plot it to see it better
plt.figure(figsize=(8,4))
plt.xlabel("Column Titles")
plt.ylabel("Percentage of Missing Values")
plt.title("Columns Which Have More Than 15 Percent of Missing Value Rate")
plt.grid()
columns_missing_percentage_15.sort_values(ascending=False).plot.bar(color="blue");
In [114]:
# Remove the outlier columns from the dataset
azdias = azdias.drop(columns_missing_percentage_15_list, axis=1)

# Communicate the step above
str1 = ", ".join(columns_missing_percentage_15_list).strip("[").strip("]")
print("Columns that had more that 15 percent of missing data "
      "and therefore dropped are: {}".format(str1)+".")
Columns that had more that 15 percent of missing data and therefore dropped are: AGER_TYP, GEBURTSJAHR, TITEL_KZ, ALTER_HH, KK_KUNDENTYP, W_KEIT_KIND_HH, KBA05_BAUMAX, KKK, REGIOTYP.

So we got rid of 9 columns, namely AGER_TYP, GEBURTSJAHR, TITEL_KZ, ALTER_HH, KK_KUNDENTYP, W_KEIT_KIND_HH, KBA05_BAUMAX, KKK, and REGIOTYP. Before this step we had 85 columns; now, let's check that everything is okay and that we indeed dropped those 9 columns.

In [115]:
azdias.head()
Out[115]:
ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER FINANZTYP ... PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
0 2.0 1 2.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1.0 2 5.0 1 5 2 5 4 5 1 ... 2.0 3.0 2.0 1.0 1.0 5.0 4.0 3.0 5.0 4.0
2 3.0 2 3.0 1 4 1 2 3 5 1 ... 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 5.0 2.0
3 4.0 2 2.0 4 2 5 2 1 2 6 ... 2.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 3.0 3.0
4 3.0 1 5.0 4 3 4 1 3 2 5 ... 2.0 4.0 2.0 1.0 2.0 3.0 3.0 4.0 6.0 5.0

5 rows × 76 columns

3.3. Assessing Missing Data in Each Row

Now, I will perform a similar assessment for the rows of the dataset. How much data is missing in each row? As with the columns, we should see groups of points with very different numbers of missing values. I will divide the data into two subsets: one for the data points that are above some threshold of missing values, and a second subset for the points below that threshold.

In order to know what to do with the outlier rows, we should check whether the distributions of values in columns with little or no missing data are similar or different between the two groups. I will select at least five such columns and compare the distributions of their values.

Depending on what we observe in this comparison, this will have implications for how we approach our conclusions later in the analysis. If the distributions of the non-missing features look similar between the data with many missing values and the data with few or no missing values, then we could argue that simply dropping those points from the analysis won't present a major issue. On the other hand, if the data with many missing values looks very different from the data with few or no missing values, then we should treat those data points as a special group and keep that caveat in mind when interpreting the results.

In [116]:
# Finding the number of missing data in each row
rows_missing_count = azdias.isnull().sum(axis=1)
In [117]:
rows_missing_count.describe()
Out[117]:
count    891221.00000
mean          5.12913
std          12.36693
min           0.00000
25%           0.00000
50%           0.00000
75%           1.00000
max          46.00000
dtype: float64

We can see above that the dataset is generally in decent shape: the mean number of missing values per row is only about 5, and at least half of the rows have no missing values at all. Still, considering that we already removed the 9 worst columns in step 3.2, a maximum of 46 missing values in a single row is quite high, so those rows need attention. I will split the data at a threshold of 6 missing values per row, just above the mean.

In [118]:
# Percentage of rows with no missing data
n_rows_wout_missing = rows_missing_count[rows_missing_count==0].count()
n_rows_wout_missing_percentage = 100*n_rows_wout_missing/azdias.shape[0]
print("The number of rows with no missing data is", n_rows_wout_missing)
print("Percentage of rows with no missing data over all rows is {:.2f}%".format(n_rows_wout_missing_percentage))
The number of rows with no missing data is 668241
Percentage of rows with no missing data over all rows is 74.98%
In [119]:
# Visualization of the number of missing values in each row
plt.figure(figsize=(8,4))
plt.hist(rows_missing_count, bins=25)
plt.xlabel("Number of Missing Values in a Row")
plt.ylabel("Frequency")
plt.title("Distribution of Missing Values per Row")
plt.show()
In [120]:
# Divide the data into two subsets based on the number of missing values in each row
azdias_under_6 = azdias[azdias.isnull().sum(axis=1) <= 6]
azdias_over_6 = azdias[azdias.isnull().sum(axis=1) > 6]
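
A quick sanity check on the split (a sketch; it simply reports the sizes of the two subsets):

In [ ]:
# How many rows ended up in each subset?
print("Rows with at most 6 missing values:", azdias_under_6.shape[0])
print("Rows with more than 6 missing values:", azdias_over_6.shape[0])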
In [121]:
# Compare the distribution of the values for at least five columns 
# where there are no or few missing values between the two datasets.
columns_wout_missing_values = columns_missing_percentage[columns_missing_percentage==0].index.tolist()
columns_selected = columns_wout_missing_values[:5]
columns_selected
Out[121]:
['ANREDE_KZ',
 'FINANZ_MINIMALIST',
 'FINANZ_SPARER',
 'FINANZ_VORSORGER',
 'FINANZ_ANLEGER']
In [122]:
# Visualizing those five columns
figure, axs = plt.subplots(nrows=len(columns_selected), ncols=2, figsize=(16,24))
figure.subplots_adjust(hspace=0.7, wspace=0.3)
for col in range(len(columns_selected)):
    sns.countplot(azdias_under_6[columns_selected[col]], ax=axs[col][0])
    axs[col][0].set_title("No or Few Missing Values")

    sns.countplot(azdias_over_6[columns_selected[col]], ax=axs[col][1])
    axs[col][1].set_title("Many Missing Values")

Part 4: Selecting and Re-Encoding the Features

Checking for missing data isn't the only way in which we can prepare a dataset for analysis. Since the unsupervised learning techniques to be used only work on numerically encoded data, we need to make a few encoding changes or additional assumptions to be able to make progress. In addition, while almost all of the values in the dataset are encoded using numbers, not all of them represent numeric values. Check the third column of the feature summary (feature_info) for a summary of the types of measurement.

  • Numeric and interval features can be kept without changes.
  • Most of the variables in the dataset are ordinal in nature. While ordinal values may technically be non-linear in spacing, I will make the simplifying assumption that they can be treated as interval (that is, kept without any changes).
  • Special handling is necessary for the remaining two variable types: categorical and mixed.

In the first two parts of this sub-step, I will investigate the categorical and mixed-type features and decide whether to keep, drop, or re-encode each of them. Then, in the last part, I will create a new data frame with only the selected and engineered columns.

Data wrangling is often the trickiest part of the data analysis process, and there's a lot of it to be done here. So, let's begin!

4.1. Assessing Feature Types

From this point on, I will continue with azdias_under_6, the subset of rows with at most six missing values, as the working dataset.

In [139]:
# How many features are there of each data type?
features = list(azdias_under_6.columns)
feature_info_updated = feature_info[feature_info["attribute"].isin(features)]
feature_count = feature_info_updated["type"].value_counts()

for feature_type, count in feature_count.items():
    print("There are {} {} features.".format(count, feature_type))
There are 46 ordinal features.
There are 18 categorical features.
There are 6 mixed features.
There are 6 numeric features.

Looking good.

4.2. Re-Encoding Categorical Features

In [140]:
# Assessing the categorical variables: which are binary,
# which are multi-level, and which ones need to be re-encoded?
categorical_features = feature_info_updated[feature_info_updated["type"]=="categorical"]["attribute"]
In [149]:
# Separating binary and multi-level features
features_binary = []
features_multilevel = []

for feature in categorical_features:
    if len(azdias_under_6[feature].unique())==2:
        features_binary.append(feature)
    else:
        features_multilevel.append(feature)

print("Binary features are: {}.".format(", ".join(features_binary)))
print("\nMulti-level features are: {}.".format(", ".join(features_multilevel)))
Binary features are: ANREDE_KZ, GREEN_AVANTGARDE, SOHO_KZ, OST_WEST_KZ.

Multi-level features are: CJT_GESAMTTYP, FINANZTYP, GFK_URLAUBERTYP, LP_FAMILIE_FEIN, LP_FAMILIE_GROB, LP_STATUS_FEIN, LP_STATUS_GROB, NATIONALITAET_KZ, SHOPPER_TYP, VERS_TYP, ZABEOTYP, GEBAEUDETYP, CAMEO_DEUG_2015, CAMEO_DEU_2015.
In [153]:
# Assessing the unique values
for feature in features_binary:
    print("Unique binary values for {} are: {}.".format(feature, azdias_under_6[feature].unique()))
Unique binary values for ANREDE_KZ are: [2 1].
Unique binary values for GREEN_AVANTGARDE are: [0 1].
Unique binary values for SOHO_KZ are: [1. 0.].
Unique binary values for OST_WEST_KZ are: ['W' 'O'].

It is clear that we need to work on OST_WEST_KZ's values.

In [155]:
# Re-encoding categorical variable(s) to be kept in the analysis
values_encoded = {"W":0, "O":1}
azdias_encoded = azdias_under_6.replace({"OST_WEST_KZ":values_encoded})
In [156]:
# Dropping multi-level features
for feature in features_multilevel:
    azdias_encoded = azdias_encoded.drop(feature, axis=1)
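
For reference, an alternative to dropping the multi-level categoricals would be to one-hot encode them with pandas' get_dummies. That is not the route taken here, since it would add one indicator column per level, but a minimal sketch would look like this:

In [ ]:
# Alternative (not used in this analysis): one-hot encode the multi-level
# categorical features instead of dropping them
azdias_onehot = pd.get_dummies(azdias_under_6, columns=features_multilevel)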

4.3. Engineering Mixed-Type Features

There are a handful of features that are marked as "mixed" in the feature summary and require special treatment in order to be included in the analysis. Two in particular deserve attention; the handling of the rest is up to our own choices:

  • "PRAEGENDE_JUGENDJAHRE" combines information on three dimensions: generation by decade, movement (mainstream vs. avantgarde), and nation (east vs. west). While there aren't enough levels to disentangle east from west, we should create two new variables to capture the other two dimensions: an interval-type variable for decade, and a binary variable for movement.
  • "CAMEO_INTL_2015" combines information on two axes: wealth and life stage. I will break up the two-digit codes by their 'tens'-place and 'ones'-place digits into two new ordinal variables (which, for the purposes of this project, is equivalent to just treating them as their raw numeric values).
  • The levels for "PRAEGENDE_JUGENDJAHRE" are as follows: (see Data_Dictionary.md)
    1. 40s - war years (Mainstream, E+W)
    2. 40s - reconstruction years (Avantgarde, E+W)
    3. 50s - economic miracle (Mainstream, E+W)
    4. 50s - milk bar / Individualisation (Avantgarde, E+W)
    5. 60s - economic miracle (Mainstream, E+W)
    6. 60s - generation 68 / student protestors (Avantgarde, W)
    7. 60s - opponents to the building of the Wall (Avantgarde, E)
    8. 70s - family orientation (Mainstream, E+W)
    9. 70s - peace movement (Avantgarde, E+W)
    10. 80s - Generation Golf (Mainstream, W)
    11. 80s - ecological awareness (Avantgarde, W)
    12. 80s - FDJ / communist party youth organisation (Mainstream, E)
    13. 80s - Swords into ploughshares (Avantgarde, E)
    14. 90s - digital media kids (Mainstream, E+W)
    15. 90s - ecological awareness (Avantgarde, E+W)
In [157]:
# Investigating the "PRAEGENDE_JUGENDJAHRE" feature
# and engineering two new variables from it.

# Grouping decades and movements into more understandable formats
decade = {1:1, 2:1, 3:2, 4:2, 5:3, 6:3, 7:3, 8:4, 9:4, 10:5, 11:5, 12:5, 13:5, 14:6, 15:6}
# 1: Mainstream, 0: Avantgarde
movement = {1:1, 2:0, 3:1, 4:0, 5:1, 6:0, 7:0, 8:1, 9:0, 10:1, 11:0, 12:1, 13:0, 14:1, 15:0}

# Creating two new variables
azdias_encoded["DECADE"] = azdias_encoded["PRAEGENDE_JUGENDJAHRE"]
azdias_encoded["MOVEMENT"] = azdias_encoded["PRAEGENDE_JUGENDJAHRE"]

# Assigning the data to the two new variables
azdias_encoded["DECADE"].replace(decade, inplace=True)
azdias_encoded["MOVEMENT"].replace(movement, inplace=True)

The levels for CAMEO_INTL_2015 are as follows: (see Data_Dictionary.md)

  11. Wealthy Households - Pre-Family Couples & Singles
  12. Wealthy Households - Young Couples With Children
  13. Wealthy Households - Families With School Age Children
  14. Wealthy Households - Older Families & Mature Couples
  15. Wealthy Households - Elders In Retirement
  21. Prosperous Households - Pre-Family Couples & Singles
  22. Prosperous Households - Young Couples With Children
  23. Prosperous Households - Families With School Age Children
  24. Prosperous Households - Older Families & Mature Couples
  25. Prosperous Households - Elders In Retirement
  31. Comfortable Households - Pre-Family Couples & Singles
  32. Comfortable Households - Young Couples With Children
  33. Comfortable Households - Families With School Age Children
  34. Comfortable Households - Older Families & Mature Couples
  35. Comfortable Households - Elders In Retirement
  41. Less Affluent Households - Pre-Family Couples & Singles
  42. Less Affluent Households - Young Couples With Children
  43. Less Affluent Households - Families With School Age Children
  44. Less Affluent Households - Older Families & Mature Couples
  45. Less Affluent Households - Elders In Retirement
  51. Poorer Households - Pre-Family Couples & Singles
  52. Poorer Households - Young Couples With Children
  53. Poorer Households - Families With School Age Children
  54. Poorer Households - Older Families & Mature Couples
  55. Poorer Households - Elders In Retirement
In [158]:
# Investigating the "CAMEO_INTL_2015" feature
# and engineering two new variables from it.

# Grouping wealth and life stage into more understandable formats
wealth = {"11":1, "12":1, "13":1, "14":1, "15":1,
          "21":2, "22":2, "23":2, "24":2, "25":2,
          "31":3, "32":3, "33":3, "34":3, "35":3,
          "41":4, "42":4, "43":4, "44":4, "45":4,
          "51":5, "52":5, "53":5, "54":5, "55":5}

life_stage = {"11":1, "12":2, "13":3, "14":4, "15":5,
              "21":1, "22":2, "23":3, "24":4, "25":5,
              "31":1, "32":2, "33":3, "34":4, "35":5,
              "41":1, "42":2, "43":3, "44":4, "45":5,
              "51":1, "52":2, "53":3, "54":4, "55":5}

# Creating two new variables
azdias_encoded["WEALTH"] = azdias_encoded["CAMEO_INTL_2015"]
azdias_encoded["LIFE_STAGE"] = azdias_encoded["CAMEO_INTL_2015"]

# Assigning the data to the two new variables
azdias_encoded["WEALTH"].replace(wealth, inplace=True)
azdias_encoded["LIFE_STAGE"].replace(life_stage, inplace=True)

4.4. Completing Feature Selection

In order to finish this step up, we need to make sure that our dataframe now only has the columns that we want to keep. To summarize, the dataframe should consist of the following:

  • All numeric, interval, and ordinal type columns from the original dataset.
  • Binary categorical features (all numerically-encoded).
  • Engineered features from other multi-level categorical features and mixed features.

We need to make sure that for any new columns we have engineered, the original columns are excluded from the final dataset. Otherwise, their values will interfere with the analysis later in the project. For example, we should not keep "PRAEGENDE_JUGENDJAHRE": its raw codes won't be useful for the algorithm, and only the engineered features derived from it should be retained.

In [159]:
# Make sure that the dataframe only contains the 
# columns that should be passed to the algorithm functions.
features_mixed = feature_info_updated[feature_info_updated["type"]=="mixed"]["attribute"]
for feature in features_mixed:
    azdias_encoded.drop(feature, axis=1, inplace=True)
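
As a final sanity check (a sketch), we can confirm that the source columns of the engineered features are gone and inspect the resulting shape:

In [ ]:
# Confirm the source columns of the engineered features have been dropped
for col in ["PRAEGENDE_JUGENDJAHRE", "CAMEO_INTL_2015"]:
    assert col not in azdias_encoded.columns, "{} should have been dropped".format(col)
print(azdias_encoded.shape)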

Part 5: Creating a Cleaning Function that "Does It All"

Even though I have finished cleaning up the general population demographics data, it is always important to look ahead and realize that I, or someone else, might need to perform the same cleaning steps on the customer demographics data. In this part, I will create a function that executes the main feature selection, encoding, and re-engineering steps performed above.

In [272]:
def clean_data(dataframe):
    """
    Perform feature trimming, re-encoding, and engineering for demographics data.
    
    INPUT: Demographics DataFrame
    OUTPUT: Trimmed and cleaned demographics DataFrame
    """

    # Work on a copy so the original DataFrame is left untouched
    dataframe_clean = dataframe.copy()

    # Convert the documented missing/unknown codes to NaNs
    for row in range(len(feature_info)):
        m_o_u = feature_info.iloc[row]["missing_or_unknown"]
        m_o_u = m_o_u.strip("[").strip("]").split(sep=",")
        m_o_u = [int(value) if value not in ("", "X", "XX") else value for value in m_o_u]

        if m_o_u != [""]:
            dataframe_clean = dataframe_clean.replace({feature_info.iloc[row]["attribute"]: m_o_u}, np.nan)

    # Catch any remaining "X"/"XX" codes in object-type columns
    for column in dataframe_clean.columns:
        dataframe_clean = dataframe_clean.replace({column: ["X", "XX"]}, np.nan)

    # Drop the columns identified earlier as having more than 15 percent missing data
    columns_15_missing = ["AGER_TYP", "GEBURTSJAHR", "TITEL_KZ", "ALTER_HH",
                          "KK_KUNDENTYP", "W_KEIT_KIND_HH", "KBA05_BAUMAX", "KKK", "REGIOTYP"]

    dataframe_clean = dataframe_clean.drop(columns_15_missing, axis=1)

    # Keep only the rows with at most 6 missing values
    dataframe_clean = dataframe_clean[dataframe_clean.isnull().sum(axis=1) <= 6]

    # Re-encode the binary categorical feature OST_WEST_KZ numerically
    encode = {"W":0, "O":1}
    dataframe_clean = dataframe_clean.replace({"OST_WEST_KZ":encode})

    # Drop the multi-level categorical features; the keep/drop decision is
    # based on the general population analysis carried out above
    categorical = feature_info_updated[feature_info_updated["type"]=="categorical"]["attribute"]
    multilevel = []

    for f in categorical:
        if (len(azdias_under_6[f].unique())>2): #if multilevel
            multilevel.append(f)

    for f in multilevel:
        dataframe_clean = dataframe_clean.drop(f, axis=1)

    # Grouping decades and movements into more understandable formats
    decade = {1:1, 2:1, 3:2, 4:2, 5:3, 6:3, 7:3, 8:4, 9:4, 10:5, 11:5, 12:5, 13:5, 14:6, 15:6}
    # 1: Mainstream, 0: Avantgarde
    movement = {1:1, 2:0, 3:1, 4:0, 5:1, 6:0, 7:0, 8:1, 9:0, 10:1, 11:0, 12:1, 13:0, 14:1, 15:0}

    # Engineer DECADE and MOVEMENT from PRAEGENDE_JUGENDJAHRE
    dataframe_clean["DECADE"] = dataframe_clean["PRAEGENDE_JUGENDJAHRE"].replace(decade)
    dataframe_clean["MOVEMENT"] = dataframe_clean["PRAEGENDE_JUGENDJAHRE"].replace(movement)

    # Grouping wealth and life stage into more understandable formats
    wealth = {"11":1, "12":1, "13":1, "14":1, "15":1,
              "21":2, "22":2, "23":2, "24":2, "25":2,
              "31":3, "32":3, "33":3, "34":3, "35":3,
              "41":4, "42":4, "43":4, "44":4, "45":4,
              "51":5, "52":5, "53":5, "54":5, "55":5}

    life_stage = {"11":1, "12":2, "13":3, "14":4, "15":5,
                  "21":1, "22":2, "23":3, "24":4, "25":5,
                  "31":1, "32":2, "33":3, "34":4, "35":5,
                  "41":1, "42":2, "43":3, "44":4, "45":5,
                  "51":1, "52":2, "53":3, "54":4, "55":5}

    # Engineer WEALTH and LIFE_STAGE from CAMEO_INTL_2015
    dataframe_clean["WEALTH"] = dataframe_clean["CAMEO_INTL_2015"].replace(wealth)
    dataframe_clean["LIFE_STAGE"] = dataframe_clean["CAMEO_INTL_2015"].replace(life_stage)

    # Drop the original mixed-type features now that they have been re-engineered
    mixed = feature_info_updated[feature_info_updated["type"]=="mixed"]["attribute"]
    for i in mixed:
        dataframe_clean.drop(i, axis=1, inplace=True)

    return dataframe_clean

Part 6: Feature Transformation

6.1. Feature Scaling

Before we apply dimensionality reduction techniques to the data, we need to perform feature scaling so that the principal component vectors are not influenced by the natural differences in scale between features. Refer to sklearn's API Reference.

In [270]:
# Impute the remaining NaN values with the column medians
missing = Imputer(strategy="median")
azdias_encoded_filled = pd.DataFrame(missing.fit_transform(azdias_encoded))

# Restore the original index and column names
azdias_encoded_filled.index = azdias_encoded.index
azdias_encoded_filled.columns = azdias_encoded.columns
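
Note that Imputer was deprecated in scikit-learn 0.20 and removed in 0.22. If you are running this notebook on a newer release, the equivalent (a sketch assuming scikit-learn >= 0.20) uses SimpleImputer from sklearn.impute:

In [ ]:
# Equivalent imputation on newer scikit-learn versions
from sklearn.impute import SimpleImputer

missing = SimpleImputer(strategy="median")
azdias_encoded_filled = pd.DataFrame(missing.fit_transform(azdias_encoded),
                                     index=azdias_encoded.index,
                                     columns=azdias_encoded.columns)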
In [172]:
# Apply feature scaling to the general population demographics data
scaler = StandardScaler()
azdias_encoded_scaled = scaler.fit_transform(azdias_encoded_filled)

# Check
columns = list(azdias_encoded_filled)
azdias_encoded_scaled = pd.DataFrame(azdias_encoded_scaled, columns=columns)
azdias_encoded_scaled.head()
Out[172]:
ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP ... PLZ8_ANTG4 PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB DECADE MOVEMENT WEALTH LIFE_STAGE
0 -1.745854 0.965965 -1.523782 1.578387 -1.048092 1.505338 1.008562 1.348136 -0.538188 1.072685 ... 0.417744 1.426809 0.555393 -0.174572 -0.128346 0.682723 1.159439 0.538188 1.184151 -1.267587
1 0.202381 0.965965 -1.523782 0.899009 -1.770702 -0.551466 0.286707 1.348136 1.858087 1.072685 ... -0.958160 0.399799 0.555393 -0.174572 -0.128346 -0.791908 1.159439 -1.858087 -0.864785 0.755094
2 1.176499 0.965965 0.672340 -0.459748 1.119738 -0.551466 -1.157002 -0.803376 -0.538188 -0.267794 ... -0.958160 -0.627212 0.555393 -1.179160 -0.998650 -0.054593 -0.221052 0.538188 -1.547764 -0.593360
3 0.202381 -1.035234 0.672340 0.219631 0.397128 -1.237068 0.286707 -0.803376 -0.538188 1.072685 ... 0.417744 -0.627212 -0.344396 0.830017 0.306807 1.420038 -0.221052 0.538188 0.501172 0.080867
4 -1.745854 0.965965 -0.059701 -1.139126 1.119738 -0.551466 -0.435147 1.348136 -0.538188 1.072685 ... 0.417744 1.426809 1.455183 -1.179160 -0.998650 -0.054593 -1.601544 0.538188 1.184151 0.755094

5 rows × 60 columns

Looking good.

6.2. Dimensionality Reduction

In [200]:
# Apply PCA to the data
pca = PCA()
pca.fit(azdias_encoded_scaled)
Out[200]:
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
In [201]:
# Investigate the variance accounted for by each principal component
plt.figure(figsize=(8,4))
plt.title("Explained Variance Per Principal Component")
plt.xlabel("Principal Component")
plt.ylabel("Variance Explained (%)")
plt.bar(range(len(pca.explained_variance_ratio_)), pca.explained_variance_ratio_);
In [202]:
# Investigate the cumulative explained variance of the principal components
plt.figure(figsize=(8,4))
plt.title("Cumulative Variance of Principal Components")
plt.xlabel("Principal Component")
plt.ylabel("Cumulative Explained Variance Ratio")
plt.plot(range(len(pca.explained_variance_ratio_)), np.cumsum(pca.explained_variance_ratio_));

Looking at the plot above, choosing 40 principal components should capture more than 90% of the total variance, which is great!

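To back that choice with a number rather than reading it off the plot, the cumulative ratio can be queried directly (a sketch):

In [ ]:
# Cumulative explained variance captured by the first 40 components
cum_var = np.cumsum(pca.explained_variance_ratio_)
print("Variance explained by 40 components: {:.2%}".format(cum_var[39]))
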
In [257]:
# Re-apply PCA to the data while selecting for number of components to retain
pca_with_40 = PCA(n_components=40)
pca_40 = pca_with_40.fit_transform(azdias_encoded_scaled)

6.3. Interpreting Principal Components

Now that we have our transformed principal components, it's a nice idea to check out the weight of each variable on the first few components to see if they can be interpreted in some fashion.

As a reminder, each principal component is a unit vector that points in the direction of highest variance (after accounting for the variance captured by earlier principal components). The further a weight is from zero, the more the principal component points in the direction of the corresponding feature. If two features have large weights of the same sign (both positive or both negative), increases in one can be expected to be associated with increases in the other. In contrast, features with weights of opposite signs can be expected to show a negative correlation: increases in one are associated with decreases in the other.

In [237]:
# Map weights for the first principal component to corresponding feature 
# names and then print the linked values, sorted by weight.
def pca_weights(pca, p_i, n_top=5, n_bottom=5):
    dataframe = pd.DataFrame(pca.components_, columns=list(azdias_encoded_scaled.columns))
    weights = dataframe.iloc[p_i].sort_values(ascending=False)

    print("### For the Principal Component {} ###".format(p_i+1))
    print("{} most relevant columns and their respective relevance ratios are:".format(n_top))
    print(weights[:n_top])
    print("\n")
    print("{} least relevant columns and their respective relevance ratios are:".format(n_bottom))
    print(weights[-n_bottom:])
In [238]:
# The first principal component
pca_weights(pca_with_40, 0, n_top=5)
### For the Principal Component 1 ###
5 most relevant columns and their respective relevance ratios are:
PLZ8_ANTG3            0.225926
PLZ8_ANTG4            0.219228
WEALTH                0.204273
HH_EINKOMMEN_SCORE    0.200852
ORTSGR_KLS9           0.196130
Name: 0, dtype: float64


5 least relevant columns and their respective relevance ratios are:
KBA05_GBZ           -0.215090
FINANZ_MINIMALIST   -0.221236
KBA05_ANTG1         -0.222210
PLZ8_ANTG1          -0.225669
MOBI_REGIO          -0.240875
Name: 0, dtype: float64
In [239]:
# The second principal component
pca_weights(pca_with_40, 1, n_top=5)
### For the Principal Component 2 ###
5 most relevant columns and their respective relevance ratios are:
ALTERSKATEGORIE_GROB    0.254967
SEMIO_ERL               0.234246
FINANZ_VORSORGER        0.225791
SEMIO_LUST              0.180244
RETOURTYP_BK_S          0.163419
Name: 1, dtype: float64


5 least relevant columns and their respective relevance ratios are:
SEMIO_PFLICHT   -0.225885
SEMIO_TRADV     -0.227181
FINANZ_SPARER   -0.231116
DECADE          -0.244539
SEMIO_REL       -0.257421
Name: 1, dtype: float64
In [242]:
# The third principal component
pca_weights(pca_with_40, 2, n_top=5)
### For the Principal Component 3 ###
5 most relevant columns and their respective relevance ratios are:
SEMIO_VERT           0.345742
SEMIO_SOZ            0.260257
SEMIO_FAM            0.246436
SEMIO_KULT           0.228801
FINANZ_MINIMALIST    0.159951
Name: 2, dtype: float64


5 least relevant columns and their respective relevance ratios are:
SEMIO_RAT    -0.223258
SEMIO_KRIT   -0.271582
SEMIO_DOM    -0.312256
SEMIO_KAEM   -0.333878
ANREDE_KZ    -0.367213
Name: 2, dtype: float64

Part 7: Clustering

7.1. Applying Clustering to General Population

I have assessed and cleaned the demographics data, then scaled and transformed them. Now, it's time to see how the data clusters in the principal components space.

In [254]:
# Define a function to get the KMeans score easily
def get_kmean_score(pca, n_clu):
    kmeans = KMeans(n_clusters= n_clu)
    model = kmeans.fit(pca)
    score = np.abs(model.score(pca))
    return score
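
Note that KMeans uses random centroid initialization, so the scores (and hence the elbow) can shift slightly from run to run. A seeded variant of the helper would make the curve reproducible; this is just a sketch, and random_state=42 is an arbitrary choice rather than part of the original run.

In [ ]:
# Reproducible variant of the scoring helper (fixed random seed)
def get_kmean_score_seeded(pca, n_clu, seed=42):
    kmeans = KMeans(n_clusters=n_clu, random_state=seed)
    model = kmeans.fit(pca)
    return np.abs(model.score(pca))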
In [260]:
# Over a number of different cluster counts,
# run k-means clustering on the data and
# compute the within-cluster sum of squared errors (SSE)
scores = []
k_values = list(range(1,20))
for k in k_values:
    scores.append(get_kmean_score(pca_40, k))
    print("Fitting k = {}".format(k))
Fitting k = 1
Fitting k = 2
Fitting k = 3
Fitting k = 4
Fitting k = 5
Fitting k = 6
Fitting k = 7
Fitting k = 8
Fitting k = 9
Fitting k = 10
Fitting k = 11
Fitting k = 12
Fitting k = 13
Fitting k = 14
Fitting k = 15
Fitting k = 16
Fitting k = 17
Fitting k = 18
Fitting k = 19
In [265]:
# Investigate the change in within-cluster distance across number of clusters.
plt.figure(figsize=(8,4))
plt.title("K-Means Clustering")
plt.xlabel("K Value")
plt.ylabel("SSE")
plt.plot(k_values, scores, linestyle="-", marker="o")
Out[265]:
[<matplotlib.lines.Line2D at 0xd6affb69b0>]

Looks like 15 is the elbow, so I am gonna go with it.

In [266]:
# Re-fit the k-means model with the selected number of clusters and obtain
# cluster predictions for the general population demographics data.
kmeans = KMeans(n_clusters=15)
model_k_15 = kmeans.fit(pca_40)
model_k_15_pred = model_k_15.predict(pca_40)
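
As a quick sanity check (a sketch, not executed here), we can look at how the general population spreads across the 15 clusters:

In [ ]:
# Number of general-population points assigned to each cluster
cluster_counts = pd.Series(model_k_15_pred).value_counts().sort_index()
print(cluster_counts)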

7.2. Working with the Customer Data

Now that we have clusters and cluster centers for the general population, it's time to see how the customer data maps on to those clusters.

In [267]:
# Load in the customer demographics data
customers = pd.read_csv("Udacity_CUSTOMERS_Subset.csv", sep=";")
In [273]:
# Apply preprocessing, feature transformation, and clustering 
# from the general demographics onto the customer data, obtaining
# cluster predictions for the customer demographics data.

# Using the function that we made earlier
customers_cleaned = clean_data(customers)
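
Assuming clean_data completes successfully, the remaining steps mirror the pipeline used for the general population: impute and scale the cleaned customer data with the objects fitted on azdias, project it with the fitted PCA, and assign clusters with the fitted k-means model. The cell below only sketches those steps; it has not been executed here, and the column re-alignment plus the final proportion comparison are assumptions about how the analysis proceeds, following the plan laid out in Part 2.

In [ ]:
# Align the customer columns with the general-population columns, then reuse
# the imputer, scaler, PCA, and k-means objects fitted on the general population
customers_cleaned = customers_cleaned[azdias_encoded.columns]
customers_filled = pd.DataFrame(missing.transform(customers_cleaned),
                                columns=customers_cleaned.columns)
customers_scaled = scaler.transform(customers_filled)
customers_pca = pca_with_40.transform(customers_scaled)
customers_pred = model_k_15.predict(customers_pca)

# Compare cluster proportions between customers and the general population;
# over-represented clusters hint at the core customer base
general_share = pd.Series(model_k_15_pred).value_counts(normalize=True).sort_index()
customer_share = pd.Series(customers_pred).value_counts(normalize=True).sort_index()
print(pd.DataFrame({"general": general_share, "customers": customer_share}))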
In [ ]: