Project 1: Customer Relationship Prediction
Customer Relationship Management (CRM) is a common strategy for managing a company's interactions with customers, clients, and sales prospects. It tries to organize, automate, and synchronize business processes, especially sales activities. The main goals are to find, attract, and win new clients, to improve and retain those the company already has, and to reduce the costs of marketing and client service. One of the most practical ways to do this is to produce scores for customers. The target scores to be predicted in this project are appetency and up-selling. A score is computed by building a prediction model from the input variables that describe each instance. The scores can then be used to personalize the customer relationship.
Task:
For this project the task is to estimate the appetency and up-selling probability of customers. Hence there are two target values to be predicted as follows:
1) Appetency: Appetency is the propensity to buy a service or a product.
2) Up-selling: Up-selling is a sales technique in which the seller attempts to have the customer purchase more expensive items, upgrades, or other add-ons in order to make a more profitable sale.
Data
The data consist of 230 numerical and categorical variables. The first 190 variables are numerical and the last 40 are categorical. Training and test data sets can be downloaded here, in 3 different formats:
1) CRP_not_numerical_data.zip (This file contains all 230 variables. The first 190 are numerical and the last 40 are categorical.)
2) CRP_numerical version_matlab_format.zip (*.mat file)
3) CRP_numerical version_txt_format.zip (*.txt file) (This file contains the converted numerical values of the last 40 categorical variables.)
CRP_data.zip (all three versions)
True task labels (real binary targets) are also available from here:
train_appentency_labels.labels, train_upselling_labels.labels
Data Format
The datasets use a format similar to the text export format of relational databases. You can easily open them with Excel or even Notepad.
· One header line with the variable names (Var1, Var2, Var3, ... , Var230)
· One line per instance (there are 50000 instances)
· Values separated by tabs
· Missing values appear as consecutive tabs
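As an illustration of the format described above, the following is a minimal sketch of a reader for the non-numerical data file, using only the Python standard library. The function name read_instances, its parameters, and the three-column sample are hypothetical; the real file has 230 columns, of which the first 190 are numerical. Representing a missing categorical value as None is a choice made for this sketch only.

```python
import csv
import io
import math


def read_instances(handle, n_numeric=190, has_header=True):
    """Read tab-separated instances from an open text file.

    The first n_numeric columns are parsed as floats, with empty fields
    (consecutive tabs) turned into NaN. The remaining columns are kept as
    strings, with empty fields turned into None (an assumption of this sketch).
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Only the first chunk of a data set carries the header line.
    header = next(reader) if has_header else None
    rows = []
    for fields in reader:
        row = []
        for position, value in enumerate(fields):
            if position < n_numeric:
                row.append(float(value) if value != "" else math.nan)
            else:
                row.append(value if value != "" else None)
        rows.append(row)
    return header, rows


# Tiny made-up sample: two numerical columns followed by one categorical column.
sample = "Var1\tVar2\tVar3\n1.5\t\tabc\n\t7\txyz\n2\t3\t\n"
header, rows = read_instances(io.StringIO(sample), n_numeric=2)
print(header)
print(rows)
```

For a real file, the same function would be called on an opened file object, for example open(path, newline=""), with the default n_numeric=190.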
Hint: The header line is present only in the first chunk of the datasets. The target values (.labels files) have one example per line, in the same order as the corresponding data files. Note that appetency and up-selling are two separate binary classification problems. The target values are +1 or -1. Examples with a +1 (resp. -1) target value are referred to as positive (resp. negative) examples. The categorical variables are mapped to integers. Missing values are replaced by NaN for the original numeric variables, while they are mapped to 0 for categorical variables.
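The conventions in the hint can be sketched as follows. Both function names are hypothetical. The exact integer mapping used to build the provided numerical files is not documented here, so the order-of-first-appearance numbering below is an assumption; only the rule that a missing categorical value maps to 0 comes from the hint.

```python
import io


def read_labels(handle):
    """Read a .labels file: one +1 or -1 target value per line."""
    labels = []
    for line in handle:
        line = line.strip()
        if line:
            labels.append(int(line))
    return labels


def encode_categorical(values):
    """Map categorical strings to integers, with missing values mapped to 0.

    Assumption: codes start at 1 and are assigned in order of first
    appearance. The provided numerical files may use a different mapping.
    """
    codes = {}
    encoded = []
    for value in values:
        if value is None or value == "":
            encoded.append(0)
        else:
            if value not in codes:
                codes[value] = len(codes) + 1
            encoded.append(codes[value])
    return encoded, codes


# Made-up examples, in the same order as the instances they would belong to.
labels = read_labels(io.StringIO("-1\n1\n-1\n"))
encoded, codes = encode_categorical(["abc", "", "xyz", "abc"])
print(labels)
print(encoded, codes)
```

Because the labels are in the same order as the data lines, the i-th label belongs to the i-th instance, and no key is needed to join the two files.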
Evaluation
True Positive Rate and False Positive Rate
The main objective of
the project is to make good predictions of the target variables. The prediction
of each target variable is thought of as a separate classification problem. The
results of classification, obtained by thresholding
the prediction score, may be represented in a confusion matrix, where TP (True
Positive), FN (False Negative), TN (True Negative) and FP (False Positive)
represent the number of examples falling into each possible outcome:
                | Prediction Class +1 | Prediction Class -1 |
Truth Class +1  | TP                  | FN                  |
Truth Class -1  | FP                  | TN                  |
Any sort of numeric prediction score is allowed, with larger numerical values indicating higher confidence in positive class membership. The True Positive Rate and the False Positive Rate are then defined as:
True Positive Rate = TP / (TP + FN)
False Positive Rate = FP / (TN + FP)
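A small worked example may help make these two rates concrete. The sketch below thresholds a handful of made-up scores, counts the four confusion-matrix cells, and applies the formulas above. The function names are hypothetical, and treating a score equal to the threshold as a positive prediction is an assumption of this sketch.

```python
def confusion_counts(labels, scores, threshold):
    """Return (TP, FN, FP, TN) for +1/-1 labels and numeric scores."""
    tp = fn = fp = tn = 0
    for label, score in zip(labels, scores):
        # Assumption: a score at or above the threshold predicts class +1.
        predicted_positive = score >= threshold
        if label == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn


def rates(labels, scores, threshold):
    """Return (True Positive Rate, False Positive Rate) at one threshold."""
    tp, fn, fp, tn = confusion_counts(labels, scores, threshold)
    return tp / (tp + fn), fp / (tn + fp)


# Made-up labels and scores, for illustration only.
labels = [1, 1, -1, -1, 1, -1]
scores = [0.9, 0.4, 0.6, 0.1, 0.8, 0.3]
counts = confusion_counts(labels, scores, 0.5)
tpr, fpr = rates(labels, scores, 0.5)
print(counts, tpr, fpr)
```

With the threshold at 0.5, two of the three positive examples and one of the three negative examples score at or above it, so TP = 2, FN = 1, FP = 1, TN = 2, giving a True Positive Rate of 2/3 and a False Positive Rate of 1/3.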
Since there is only one data set, n-fold cross-validation (i.e. 5-fold cross-validation) needs to be applied to assess the prediction results. The results will be evaluated with the so-called Area Under Curve (AUC, http://en.wikipedia.org/wiki/Receiver_operating_characteristic). It corresponds to the area under the curve obtained by plotting the True Positive Rate (sensitivity) against the False Positive Rate (1 - specificity) while varying the threshold applied to the prediction values to determine the classification result.
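The evaluation procedure can be sketched in plain Python as follows. This is only one possible way to organize it, not a prescribed implementation. The AUC is computed through the equivalent rank formulation (the fraction of positive/negative pairs in which the positive example has the higher score, with ties counted as one half). The folds are stratified so that every fold contains both classes, which is an assumption added here rather than a stated requirement. All function names, the single-feature toy model, and the toy data are hypothetical.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve, computed from average ranks of the scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # tied scores share one 1-based rank
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    n_pos = sum(1 for label in labels if label == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative example")
    rank_sum = sum(rank for rank, label in zip(ranks, labels) if label == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def stratified_folds(labels, k=5, seed=0):
    """Split example indices into k folds, spreading each class evenly."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (1, -1):
        members = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(members)
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds


def cross_validated_auc(rows, labels, fit, k=5, seed=0):
    """Average the AUC over k folds; fit(rows, labels) returns a scoring function."""
    folds = stratified_folds(labels, k, seed)
    fold_aucs = []
    for held_out in range(k):
        test = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        score = fit([rows[i] for i in train], [labels[i] for i in train])
        test_scores = [score(rows[i]) for i in test]
        fold_aucs.append(auc([labels[i] for i in test], test_scores))
    return sum(fold_aucs) / k


def fit_single_feature(rows, labels):
    """Toy model: score by the first feature, oriented so positives score higher."""
    positives = [row[0] for row, label in zip(rows, labels) if label == 1]
    negatives = [row[0] for row, label in zip(rows, labels) if label != 1]
    direction = 1.0 if sum(positives) / len(positives) >= sum(negatives) / len(negatives) else -1.0
    return lambda row: direction * row[0]


# Made-up, perfectly separable toy data: 20 negatives followed by 10 positives.
rows = [[float(i)] for i in range(30)]
labels = [1 if i >= 20 else -1 for i in range(30)]
small_auc = auc([1, 1, -1, -1, 1, -1], [0.9, 0.4, 0.6, 0.1, 0.8, 0.3])
mean_auc = cross_validated_auc(rows, labels, fit_single_feature, k=5)
print(small_auc, mean_auc)
```

In the six-example case, 8 of the 9 positive/negative pairs are ranked correctly, so the AUC is 8/9. The toy data set is perfectly separable on its single feature, so every fold yields an AUC of 1. On the real data, fit would be replaced by the training routine of whichever model is being evaluated, run once for appetency and once for up-selling.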