Information about Predicting Customer Conversion for HomeSite Insurance

Published on November 21, 2016

Author: Vamshi Vennamaneni

Source: slideshare.net

2. Problem Statement & Anticipated Business Value
Homesite is a leading provider of homeowners insurance looking for a dynamic rate-conversion model that would give them:
• Confidence that a quote will turn into a purchase
• An understanding of the impact of price changes
• The ability to maintain an ideal portfolio of customer segments
Goal: predict the likelihood of a customer purchasing an insurance plan from Homesite Insurance

3. Data - Source and Categories Variable Categories • Personal : 87 • Geography: 126 • Coverage: 16 • Property: 47 • Sales: 17 • Unknown: 7

4. Data: 299 variables, of which 51 are categorical; 260K observations. [Bar chart: number of levels in each of the categorical variables]

5. Initial Thoughts
From the data, we identify that the dependent variable is "QuoteConversion_Flag". The model is a logistic regression function of the remaining 298 variables. As the number of variables is too high, dimensionality-reduction techniques like PCA should be employed; the eigenvectors from PCA then become the input for logistic regression.
Pipeline: Data Analysis → Dimension Reduction → Logistic Regression

6. Data Analysis – Hurdles
• Huge: dataset with 260k observations (customers)
• Null: nearly 100K observations with missing fields; after deleting them, 170k observations remain
• Levels: categorical variables with many levels, no specific pattern, and 368 levels in total
• DD: data dictionary (DD) not provided due to security concerns
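The missing-field hurdle above can be quantified before deciding to drop rows. A minimal R sketch of such an inspection (the file name `train.csv` is from the deck; the checks themselves are an assumption about how one might audit the data, not code from the original slides):

```r
train <- read.csv("train.csv")

# How many rows have at least one missing field?
n_incomplete <- sum(!complete.cases(train))

# Listwise deletion, as done in the deck (260k -> ~170k rows)
train_complete <- train[complete.cases(train), ]

# Per-column missing counts, to see which variables drive the loss
miss_by_col <- sort(colSums(is.na(train)), decreasing = TRUE)
head(miss_by_col)
```

Looking at per-column counts first can show whether dropping a handful of sparse columns would preserve far more rows than listwise deletion.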

7. Dimension Reduction Techniques
Business / domain knowledge: application-specific knowledge of the business would help us eliminate a few variables or assign weights to prioritize the important ones. In this scenario, the lack of a data dictionary gives no information on the type of each variable.
Low variance / high correlation: variables with low variance do not improve the power of the model and hence can be eliminated. Variables with high mutual correlation carry the same information and hence reduce the power of the model, so they can also be eliminated. In this scenario, a high-correlation/low-variance variable search is impractical due to the large number of dimensions and the lack of prior information on the variables.
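For reference, the low-variance/high-correlation filter described above could be sketched in R with the caret package (caret is an assumption here, not a tool used in the deck; the 0.90 cutoff is illustrative):

```r
library(caret)

train <- read.csv("train.csv")

# Drop near-constant (low-variance) predictors
nzv <- nearZeroVar(train)
if (length(nzv) > 0) train <- train[, -nzv]

# Among numeric predictors, drop one of each highly correlated pair
num_cols <- sapply(train, is.numeric)
cor_mat  <- cor(train[, num_cols], use = "pairwise.complete.obs")
high_cor <- findCorrelation(cor_mat, cutoff = 0.90)
train    <- train[, !(names(train) %in% colnames(cor_mat)[high_cor])]
```

With ~300 columns this runs, but as the slide notes, without a data dictionary there is no way to sanity-check which eliminated variables were actually meaningful.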

8. Dimension Reduction Techniques – Principal Component Analysis
In this technique, variables are transformed into a new set of variables that are linear combinations of the original variables; these new variables are known as principal components. On running PCA on our data:
• 99% of the variance is explained by more than 125 components
• 95% of the variance is explained by nearly 75 components
• 90% of the variance is explained by nearly 55 components
The number of components is too high to identify the significant variables contributing to them, and the scree plot is nearly flat after 7 components.
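The PCA run above can be reproduced with base R's prcomp. A sketch of the standard recipe (the file name is from the deck; restricting to complete numeric columns and scaling are assumptions, since the slides do not show the PCA code):

```r
train <- read.csv("train.csv")

# PCA is defined for numeric data; categorical columns are excluded here
num <- train[, sapply(train, is.numeric)]
num <- num[complete.cases(num), ]

pca <- prcomp(num, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained per component
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
which(cum_var >= 0.90)[1]  # components needed for 90% of variance
which(cum_var >= 0.95)[1]  # components needed for 95%

screeplot(pca, type = "lines")  # flattens after the first few components
```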

9. PCA results

10. Dimension Reduction Techniques Principal Component Analysis

11. Dimension Reduction Techniques – Factor Analysis of Mixed Data (FAMD)
library(FactoMineR)
train <- read.csv("train.csv")
train <- train[sample(nrow(train), 50000), ]  # sample 50,000 observations first
out <- FAMD(train, ncp = 15, graph = TRUE)
We sampled 50,000 observations from the train data, and the outcomes were not very significant.

12. FAMD() Results
The 15 components output by FAMD are shown below. Also, the plot of the factors is very cluttered.

13. Gradient Boosting: XGBoost
Gradient boosting is a machine-learning technique used for regression and classification problems, typically when the individual predictors are weak. XGBoost is short for eXtreme Gradient Boosting, a variant of the gradient boosting machine (tree-based model).

14. XGBoost: Perfect Fit
• Handles missing values
• Considers all variables equally
• Supports logistic regression objectives
• Minimizes loss and improves the power of the model
• Handles categorical variables with any number of levels
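The missing-value handling claimed above is native to XGBoost: NAs can be left in the design matrix and the tree learner routes them down a learned default branch, so the 100K incomplete rows need not be deleted. A toy sketch in R (the data here is illustrative, not the deck's dataset):

```r
library(xgboost)

# Toy data with missing values left in place
x <- matrix(c(1, NA, 3, 4, NA, 6, 7, 8), ncol = 2)
y <- c(0, 1, 0, 1)

# 'missing' tells XGBoost which value marks a gap; NA is the default
dtrain <- xgb.DMatrix(data = x, label = y, missing = NA)

bst <- xgb.train(params = list(objective = "binary:logistic",
                               max_depth = 2, eta = 0.3),
                 data = dtrain, nrounds = 5)
```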

15. XGBoost Model Diagram: Feature Selection → Learning → Training → Testing

16. Data & Feature Selection
# Data split: first 175,000 rows for training, the rest for testing
train <- train1[1:175000, ]
test  <- train1[175001:nrow(train1), ]

# Feature selection: considering all the features
feature.names <- names(train)[c(3:298)]

# Integer-encode character columns using a common level set across train/test
for (f in feature.names) {
  if (class(train[[f]]) == "character") {
    levels <- unique(c(train[[f]], test[[f]]))
    train[[f]] <- as.integer(factor(train[[f]], levels = levels))
    test[[f]]  <- as.integer(factor(test[[f]], levels = levels))
  }
}

17. XGBoost Parameters
• Objective: regression or classification ("binary:logistic" here)
• Booster: gbtree
• Eta (learning rate): 0.020
• Evaluation metric: AUC
• Max depth: 7
• early.stop.round: 70
• nrounds: 100

18. Training:
library(xgboost)

dtrain <- xgb.DMatrix(data.matrix(train[, feature.names]),
                      label = train$QuoteConversion_Flag)
watchlist <- list(train = dtrain)

param <- list(objective = "binary:logistic",
              booster   = "gbtree",
              eta       = 0.020,
              max_depth = 7)

clf <- xgb.train(params = param,
                 eval_metric = "auc",
                 data = dtrain,
                 nrounds = 100,
                 early.stop.round = 70,
                 watchlist = watchlist)

19. Validating the Model
Predicting on the test dataset:
# Predicted conversion probability for each quote number
pred1 <- predict(clf, data.matrix(test[, feature.names]))
submission <- data.frame(QuoteNumber = test$QuoteNumber,
                         QuoteConversion_Flag = pred1)

Classification:
# Flag a quote as converted if its predicted probability exceeds 0.525
for (i in 1:nrow(submission)) {
  if (submission$QuoteConversion_Flag[i] > 0.525)
    submission$newflag[i] <- 1
  else
    submission$newflag[i] <- 0
}
write.csv(submission, "xgb1_Allnames.csv")

20. Accuracy
# Compare with the actual labels and calculate the accuracy
submission$actual <- test$QuoteConversion_Flag
match <- 0
for (i in 1:nrow(submission)) {
  if (submission$newflag[i] == submission$actual[i])
    match <- match + 1
}
accuracy <- 100 * match / nrow(submission)  # 92.33
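The thresholding and accuracy loops above can be collapsed into vectorized one-liners, which is the idiomatic R style. A sketch under the same variable names (the 0.525 cutoff is from the deck):

```r
# Vectorized classification at the 0.525 cutoff
submission$newflag <- as.integer(submission$QuoteConversion_Flag > 0.525)

# Vectorized accuracy against the held-out labels
accuracy <- 100 * mean(submission$newflag == test$QuoteConversion_Flag)
```

Besides being shorter, the vectorized form avoids per-row data-frame indexing, which is slow on an 85k-row test set.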

21. Thank you! Questions?
