Project 1: Customer Relationship Prediction

Customer Relationship Management (CRM) is a common strategy for managing a company's interactions with customers, clients, and sales prospects. It aims to organize, automate, and synchronize business processes, especially sales activities. The main goals are to find, attract, and win new clients, to retain and strengthen relationships with existing ones, and to reduce the costs of marketing and client service. One of the most practical ways to support these goals is to produce scores for customers. The target scores to be predicted in this project are appetency and up-selling. A score is computed by a prediction model whose input variables describe each instance. The scores can then be used to personalize the customer relationship.

Task:

For this project the task is to estimate the appetency and up-selling probability of customers. Hence there are two target values to be predicted as follows:

1)      Appetency: Appetency is the propensity to buy a service or a product.

2)      Up-selling: Up-selling is a sales technique in which the seller tries to have the customer purchase more expensive items, upgrades, or other add-ons in order to make a more profitable sale.

 

Data

 

The data set has 230 variables: the first 190 are numerical and the last 40 are categorical. The training and test data sets can be downloaded here, in 3 different formats:

 

1)      CRP_not_numerical_data.zip (This file contains all 230 variables. The first 190 are numerical and the last 40 are categorical.)

2)      CRP_numerical version_matlab_format.zip (*.mat file)

3)      CRP_numerical version_txt_format.zip (*.txt file) (This file contains the last 40 categorical variables converted to numerical values.)

 

CRP_data.zip (all three versions)

 

True task labels (real binary targets) are also available from here:

train_appentency_labels.labels, train_upselling_labels.labels

 

Data Format

 

The data sets use a format similar to the text export format of relational databases. You can easily open them with Excel or even Notepad.

·         One header line with the variable names (Var1, Var2, Var3, ... , Var230)

·         One line per instance (there are 50000 instances)

·         Values are separated by tabs

·         Missing values appear as consecutive tabs

Hint: The header line is present only in the first chunk of the data sets. The target values (.labels files) have one example per line, in the same order as the corresponding data files. Note that appetency and up-selling are two separate binary classification problems. The target values are +1 or -1. Examples with a +1 (resp. -1) target value are referred to as positive (resp. negative) examples. The categorical variables are mapped to integers. Missing values are replaced by NaN for the original numeric variables, while they are mapped to 0 for the categorical variables.
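As an illustration of the format described above, the following is a minimal sketch of reading a tab-separated chunk and its labels file using only the Python standard library. The function names (load_table, load_labels), the has_header parameter, and the tiny sample data are assumptions made for this example; they are not part of the distributed files.

```python
import csv
import io
import math


def load_table(handle, n_numeric=190, has_header=True):
    """Read a tab-separated chunk.

    Empty fields (consecutive tabs) in the first n_numeric columns become NaN.
    The remaining categorical columns are kept as strings, with '' marking a
    missing value. Only the first chunk is expected to carry a header line.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader) if has_header else None
    rows = []
    for fields in reader:
        numeric = [float(v) if v != "" else math.nan for v in fields[:n_numeric]]
        categorical = fields[n_numeric:]
        rows.append(numeric + categorical)
    return header, rows


def load_labels(handle):
    """Read one +1/-1 label per line, in the same order as the data rows."""
    return [int(line) for line in handle if line.strip()]


# Made-up sample for illustration: 2 numeric columns and 1 categorical column.
sample_data = io.StringIO("Var1\tVar2\tVar3\n1.5\t\tabc\n\t7\t\n")
sample_labels = io.StringIO("1\n-1\n")

header, rows = load_table(sample_data, n_numeric=2)
labels = load_labels(sample_labels)
```

With the real files, the same functions would be called on opened file objects, with n_numeric left at 190 and has_header set to False for any chunk after the first.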

 

Evaluation

 

True Positive Rate and False Positive Rate

 

The main objective of the project is to make good predictions of the target variables. The prediction of each target variable is thought of as a separate classification problem. The results of classification, obtained by thresholding the prediction score, may be represented in a confusion matrix, where TP (True Positive), FN (False Negative), TN (True Negative) and FP (False Positive) represent the number of examples falling into each possible outcome:

 

                     Prediction
                     Class +1    Class -1
Truth    Class +1    TP          FN
         Class -1    FP          TN

Any sort of numeric prediction score is allowed, with larger values indicating higher confidence in positive class membership. The True Positive Rate and the False Positive Rate are then defined as:

True Positive Rate = TP / (TP + FN)

False Positive Rate = FP / (TN + FP)
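The two rates can be computed directly from thresholded scores. The sketch below assumes that an example is predicted positive when its score is greater than or equal to the threshold (the project text does not fix this convention), and the labels and scores are made up for illustration.

```python
def confusion_counts(labels, scores, threshold):
    """Count TP, FN, FP, TN for +1/-1 labels at a given score threshold."""
    tp = fn = fp = tn = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold  # assumed convention
        if y == 1 and predicted_positive:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def rates(tp, fn, fp, tn):
    """Return (True Positive Rate, False Positive Rate)."""
    tpr = tp / (tp + fn)
    fpr = fp / (tn + fp)
    return tpr, fpr


# Made-up example: two positives and three negatives.
labels = [1, 1, -1, -1, -1]
scores = [0.9, 0.4, 0.6, 0.2, 0.1]

tp, fn, fp, tn = confusion_counts(labels, scores, threshold=0.5)
tpr, fpr = rates(tp, fn, fp, tn)
# Here TP=1, FN=1, FP=1, TN=2, so TPR = 1/2 and FPR = 1/3.
```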

Since there is only one data set, an n-fold cross-validation (e.g. 5-fold cross-validation) needs to be applied to assess the prediction results. The results will be evaluated with the so-called Area Under Curve (AUC, http://en.wikipedia.org/wiki/Receiver_operating_characteristic). It corresponds to the area under the curve obtained by plotting sensitivity (the True Positive Rate) against 1 - specificity (the False Positive Rate) while varying a threshold on the prediction values to determine the classification result.
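One possible way to organize this evaluation is sketched below in plain Python. The AUC is computed by sweeping the threshold over the sorted scores and summing trapezoids under the resulting curve. The stratified fold assignment is an assumption of this sketch (the project text only asks for n-fold cross-validation); it is used here so that every fold contains both classes. The fit_and_score callable, toy_scorer, and the toy data are hypothetical placeholders for an actual model and data.

```python
import random


def roc_auc(labels, scores):
    """Area under the ROC curve for +1/-1 labels (trapezoidal rule)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative example")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    auc, tp, fp = 0.0, 0, 0
    prev_tpr = prev_fpr = 0.0
    i = 0
    while i < len(order):
        j = i
        # Examples tied at the same score cross the threshold together.
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            if labels[order[j]] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        tpr, fpr = tp / n_pos, fp / n_neg
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_tpr, prev_fpr = tpr, fpr
        i = j
    return auc


def stratified_folds(labels, k=5, seed=0):
    """Split indices into k folds, spreading each class evenly (assumed design)."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (1, -1):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for position, idx in enumerate(members):
            folds[position % k].append(idx)
    return folds


def cross_validated_auc(labels, fit_and_score, k=5, seed=0):
    """Mean AUC over k folds; fit_and_score(train_idx, test_idx) returns test scores."""
    folds = stratified_folds(labels, k, seed)
    aucs = []
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        test_scores = fit_and_score(train_idx, test_idx)
        aucs.append(roc_auc([labels[t] for t in test_idx], test_scores))
    return sum(aucs) / k


# Made-up, perfectly separable toy data: 10 positives and 10 negatives.
toy_labels = [1, -1] * 10
toy_feature = [0.8 if y == 1 else 0.3 for y in toy_labels]


def toy_scorer(train_idx, test_idx):
    # Placeholder "model": ignores the training fold and scores by the feature.
    return [toy_feature[t] for t in test_idx]


mean_auc = cross_validated_auc(toy_labels, toy_scorer, k=5)
```

In a real run, toy_scorer would be replaced by a function that trains a model on the training indices and returns its prediction scores for the test indices, once for appetency and once for up-selling.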