# vtreat package

#### 2017-10-16

‘vtreat’ is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. A formal article on the method can be found here: arXiv:1611.09477 stat.AP.

A ‘vtreat’ clean data frame:

• Only has numeric columns (other than the outcome).
• Has no Infinite/NA/NaN in the effective variable columns.
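
The two properties above can be checked mechanically. The following is a minimal sketch in base R; `is_clean_frame()` is a hypothetical helper written for illustration and is not part of ‘vtreat’.

```r
# Hypothetical helper (not part of vtreat): checks the two properties listed above.
is_clean_frame <- function(d, outcome = NULL) {
  vars <- setdiff(colnames(d), outcome)
  cols <- d[, vars, drop = FALSE]
  # Property 1: every non-outcome column is numeric.
  if (!all(vapply(cols, is.numeric, logical(1)))) {
    return(FALSE)
  }
  # Property 2: is.finite() is FALSE for NA, NaN, Inf, and -Inf.
  all(vapply(cols, function(col) all(is.finite(col)), logical(1)))
}

raw <- data.frame(x = c('a', 'b', NA), z = c(1, NA, Inf),
                  y = c(TRUE, FALSE, TRUE))
clean <- data.frame(x_lev_x.a = c(1, 0, 0), z_clean = c(1, 1, 1),
                    y = c(TRUE, FALSE, TRUE))
print(is_clean_frame(raw, outcome = 'y'))    # FALSE
print(is_clean_frame(clean, outcome = 'y'))  # TRUE
```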

To achieve this, a number of techniques are used. For more details, see the ‘vtreat’ article and update.

The use pattern is:

1. Use designTreatmentsC() or designTreatmentsN() to design a treatment plan
2. Use the returned structure with prepare() to apply the plan to data frames.

The main feature of ‘vtreat’ is that all data preparation is “y-aware”: it uses the relations of effective variables to the dependent or outcome variable to encode the effective variables.
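
As a rough illustration of what “y-aware” can mean for a categorical variable, the sketch below encodes each level by how far the mean outcome for that level sits from the overall mean outcome, and encodes each level by its training prevalence. This is a simplified stand-in, not the package's implementation; `encode_impact()` and `encode_prevalence()` are hypothetical names. For this toy data the results happen to coincide with the x_catN and x_catP columns of the unscaled test frame in the numeric outcome example.

```r
# Simplified sketch of y-aware coding; vtreat's own coders are more
# elaborate, so treat this as an illustration only.
x <- c('a', 'a', 'a', 'a', 'b', 'b', NA)
y <- c(0, 0, 0, 1, 0, 1, 1)

# Treat NA as its own level so missingness can carry signal.
lev <- ifelse(is.na(x), 'NA', x)

# Impact: per-level mean of y minus the grand mean of y.
impact <- tapply(y, lev, mean) - mean(y)
# Prevalence: fraction of training rows carrying each level.
prevalence <- table(lev) / length(lev)

# Hypothetical lookup helper: levels unseen in training map to 0.
lookup <- function(codes, newx) {
  newlev <- ifelse(is.na(newx), 'NA', newx)
  out <- as.numeric(codes[newlev])
  out[is.na(out)] <- 0
  out
}
encode_impact <- function(newx) lookup(impact, newx)
encode_prevalence <- function(newx) lookup(prevalence, newx)

newx <- c('a', 'b', 'c', NA)
print(encode_impact(newx))
print(encode_prevalence(newx))
```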

The structure returned from designTreatmentsN() or designTreatmentsC() includes a list of “treatments”: objects that encapsulate the transformation process from the original variables to the new “clean” variables.

In addition to the treatment objects, designTreatmentsC() and designTreatmentsN() also return a data frame named scoreFrame, which contains the columns:

• varName: name of new variable
• origName: name of original variable that the variable was derived from (can repeat).
• code: what type of treatment was performed to create the derived variable (useful for filtering).
• varMoves: logical TRUE if the variable varied during training; only variables that move will be in the treated frame.
• sig: linear significance of regressing derived variable against a 0/1 indicator target for numeric targets, logistic regression significance otherwise.
• needsSplit: whether the variable is a sub-model and requires out-of-sample scoring.
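
A common use of scoreFrame is to pick which derived variables to keep. The sketch below rebuilds part of the scoreFrame printed in the categorical outcome example by hand (so it runs without ‘vtreat’) and filters it with base R. The threshold `sigThreshold` is an assumed value chosen only for illustration.

```r
# Values copied from the scoreFrame printed in the categorical example.
scoreFrame <- data.frame(
  varName = c('x_lev_NA', 'x_lev_x.a', 'x_lev_x.b', 'x_catP',
              'x_catB', 'z_clean', 'z_isBAD'),
  varMoves = rep(TRUE, 7),
  sig = c(0.2076623, 0.4097258, 1.0000000, 0.1547700,
          0.5160763, 0.1429977, 0.2076623),
  code = c('lev', 'lev', 'lev', 'catP', 'catB', 'clean', 'isBAD'),
  stringsAsFactors = FALSE
)

# Assumed threshold for illustration only; pick one suited to your problem.
sigThreshold <- 0.25

# Keep variables that moved during training and pass the threshold.
keep <- scoreFrame$varMoves & scoreFrame$sig < sigThreshold
selectedVars <- scoreFrame$varName[keep]
print(selectedVars)

# Filtering on the code column, here dropping the sub-model variables.
simpleVars <- scoreFrame$varName[!(scoreFrame$code %in% c('catP', 'catB'))]
print(simpleVars)
```

A vector such as `selectedVars` can then be used to restrict which columns are carried into model fitting.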

In all cases we have two undesirable upward biases on the scores:

• The treated variables view the training data during construction (for setting of NA values, missing values, levels, and more). So this gives an upward bias when trying to measure treated variable utility on training data. Until the data set has at least 1000 good rows, we ignore this effect. After 1000 rows we design variables on a pseudo-randomly chosen 80% of the rows and score on the complementary 20% of the rows.
• The scoring procedure itself involves a fit (linear regression for regression or logistic regression for classification). In each of these cases we would like the scoring itself to only be evaluated on variables constructed on held-out data. This is simulated through a cross-validation procedure.
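
The design/score separation described in the first bullet can be sketched in base R as follows. This is an illustration of the idea on synthetic data, not the package's internal code; the level code used here is a simplified per-level mean, and the seed and data are arbitrary assumptions.

```r
set.seed(2017)  # arbitrary seed so the sketch is repeatable
n <- 2000
d <- data.frame(x = sample(c('a', 'b', 'c'), n, replace = TRUE),
                stringsAsFactors = FALSE)
d$y <- rnorm(n) + ifelse(d$x == 'a', 1, 0)

# Pseudo-randomly pick 80% of rows for design; the rest are for scoring.
isDesign <- seq_len(n) %in% sample.int(n, size = round(0.8 * n))
dDesign <- d[isDesign, , drop = FALSE]
dScore <- d[!isDesign, , drop = FALSE]

# Design a (simplified) level code on the design rows only ...
levelMeans <- tapply(dDesign$y, dDesign$x, mean) - mean(dDesign$y)

# ... then score it on rows the code never saw.
dScore$x_code <- as.numeric(levelMeans[dScore$x])
fit <- lm(y ~ x_code, data = dScore)
print(summary(fit)$r.squared)
```

Because the code is estimated on one set of rows and evaluated on another, the reported fit quality is not inflated by the code having memorized its own training outcomes.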

‘vtreat’ uses a number of cross-training and jackknife style procedures to try to mitigate these effects. The suggested best practice (if you have enough data) is to split your data randomly into at least the following disjoint data sets:

• Encoding Calibration : a data set used for the designTreatmentsC() or designTreatmentsN() step and not used again for training or test.
• Training : a data set used (after prepare()) for training your model.
• Test : a data set used (after prepare()) for estimating your model’s out of training performance.

Taking the extra step to perform the designTreatmentsC() or designTreatmentsN() on data disjoint from training makes the training data more exchangeable with test and avoids the issue that ‘vtreat’ may be hiding a large number of degrees of freedom in variables it derives from large categoricals.
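
A minimal sketch of such a three-way split in base R is shown below. The data is synthetic and the 40/40/20 proportions are assumptions made for illustration; the comments indicate where the ‘vtreat’ calls described above would go.

```r
set.seed(2017)  # arbitrary seed so the sketch is repeatable
n <- 1000
d <- data.frame(x = sample(c('a', 'b', NA), n, replace = TRUE),
                z = rnorm(n),
                stringsAsFactors = FALSE)
d$y <- (d$z + rnorm(n)) > 0

# Assign every row to exactly one group (assumed 40/40/20 proportions).
group <- sample(c('cal', 'train', 'test'), n, replace = TRUE,
                prob = c(0.4, 0.4, 0.2))
dCal <- d[group == 'cal', , drop = FALSE]
dTrain <- d[group == 'train', , drop = FALSE]
dTest <- d[group == 'test', , drop = FALSE]

# Intended use with the functions described above:
#   1. design the treatment plan on dCal only (designTreatmentsC());
#   2. apply the plan with prepare() to dTrain and fit the model there;
#   3. apply the plan with prepare() to dTest and evaluate the model.
print(table(group))
```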

Some trivial execution examples (not demonstrating any cal/train/test split) are given below. Variables that do not move during hold-out testing are considered “not to move.”

## A Categorical Outcome Example

library(vtreat)
dTrainC <- data.frame(x=c('a','a','a','b','b',NA),
z=c(1,2,3,4,NA,6),y=c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE))
head(dTrainC)
##      x  z     y
## 1    a  1 FALSE
## 2    a  2 FALSE
## 3    a  3  TRUE
## 4    b  4 FALSE
## 5    b NA  TRUE
## 6 <NA>  6  TRUE
dTestC <- data.frame(x=c('a','b','c',NA),z=c(10,20,30,NA))
head(dTestC)
##      x  z
## 1    a 10
## 2    b 20
## 3    c 30
## 4 <NA> NA
treatmentsC <- designTreatmentsC(dTrainC,colnames(dTrainC),'y',TRUE)
## [1] "designing treatments Mon Oct 16 19:20:51 2017"
## [1] "designing treatments Mon Oct 16 19:20:51 2017"
## [1] " have level statistics Mon Oct 16 19:20:51 2017"
## [1] "design var x Mon Oct 16 19:20:51 2017"
## [1] "design var z Mon Oct 16 19:20:51 2017"
## [1] " scoring treatments Mon Oct 16 19:20:51 2017"
## [1] "have treatment plan Mon Oct 16 19:20:52 2017"
## [1] "rescoring complex variables Mon Oct 16 19:20:52 2017"
## [1] "done rescoring complex variables Mon Oct 16 19:20:52 2017"
print(treatmentsC)
## $treatments
## $treatments[[1]]
## [1] "vtreat 'Categoric Indicators'('x'(integer,factor)->character->'x_lev_NA','x_lev_x.a','x_lev_x.b')"
##
## $treatments[[2]]
## [1] "vtreat 'Prevalence Code'('x'(integer,factor)->character->'x_catP')"
##
## $treatments[[3]]
## [1] "vtreat 'Bayesian Impact Code'('x'(integer,factor)->character->'x_catB')"
##
## $treatments[[4]]
## [1] "vtreat 'Scalable pass through'('z'(double,numeric)->numeric->'z_clean')"
##
## $treatments[[5]]
##
##
## $scoreFrame
##     varName varMoves        rsq       sig needsSplit extraModelDegrees
## 1  x_lev_NA     TRUE 0.19087450 0.2076623      FALSE                 0
## 2 x_lev_x.a     TRUE 0.08170417 0.4097258      FALSE                 0
## 3 x_lev_x.b     TRUE 0.00000000 1.0000000      FALSE                 0
## 4    x_catP     TRUE 0.24340634 0.1547700       TRUE                 2
## 5    x_catB     TRUE 0.05070201 0.5160763       TRUE                 2
## 6   z_clean     TRUE 0.25792985 0.1429977      FALSE                 0
## 7   z_isBAD     TRUE 0.19087450 0.2076623      FALSE                 0
##   origName  code
## 1        x   lev
## 2        x   lev
## 3        x   lev
## 4        x  catP
## 5        x  catB
## 6        z clean
## 7        z isBAD
##
## $outcomename
## [1] "y"
##
## $vtreatVersion
## [1] '1.0.1'
##
## $outcomeType
## [1] "Binary"
##
## $outcomeTarget
## [1] TRUE
##
## $meanY
## [1] 0.5
##
## $splitmethod
## [1] "kwaycrossystratified"
##
## attr(,"class")
## [1] "treatmentplan"
print(treatmentsC$treatments[[1]])
## [1] "vtreat 'Categoric Indicators'('x'(integer,factor)->character->'x_lev_NA','x_lev_x.a','x_lev_x.b')"

Here we demonstrate the optional scaling feature of prepare(), which scales and centers all significant variables to mean 0 and slope 1 with respect to y. In other words, it rescales the variables to “y-units”. This is useful for downstream principal components analysis. Note: variables perfectly uncorrelated with y necessarily have slope 0 and can’t be “scaled” to slope 1; however, for the same reason these variables will be insignificant and can be pruned by pruneSig.

scale=FALSE by default.
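
To make “mean 0, slope 1” concrete, the sketch below rescales a single numeric variable by hand in base R, using the z and y columns of dTrainC. It assumes the missing value is replaced by the mean of the observed values and that the slope comes from a linear regression of y (as 0/1) on the variable. It is an illustration of the idea, not the package's code, although for this toy data the result coincides with the z_clean column of the scaled training frame in this example.

```r
# Hand-rolled illustration of "mean 0, slope 1"; not vtreat's implementation.
z <- c(1, 2, 3, 4, NA, 6)
y <- as.numeric(c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE))

# Assumption: replace the missing value with the mean of the observed values.
z_clean <- ifelse(is.na(z), mean(z, na.rm = TRUE), z)

# Linear-regression slope of y on the variable.
slope <- unname(coef(lm(y ~ z_clean))[2])

# Center the variable and rescale it by that slope ("y-units").
z_scaled <- slope * (z_clean - mean(z_clean))

print(round(z_scaled, 8))
print(mean(z_scaled))                     # approximately 0
print(unname(coef(lm(y ~ z_scaled))[2]))  # approximately 1
```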

dTrainCTreated <- prepare(treatmentsC,dTrainC,pruneSig=c(),scale=TRUE)
head(dTrainCTreated)
##   x_lev_NA  x_lev_x.a x_lev_x.b x_catP      x_catB     z_clean z_isBAD
## 1     -0.1 -0.1666667         0   -0.2 -0.11976374 -0.38648649    -0.1
## 2     -0.1 -0.1666667         0   -0.2 -0.11976374 -0.21081081    -0.1
## 3     -0.1 -0.1666667         0   -0.2 -0.11976374 -0.03513514    -0.1
## 4     -0.1  0.1666667         0    0.1 -0.07564865  0.14054054    -0.1
## 5     -0.1  0.1666667         0    0.1 -0.07564865  0.00000000     0.5
## 6      0.5  0.1666667         0    0.4  0.51058851  0.49189189    -0.1
##       y
## 1 FALSE
## 2 FALSE
## 3  TRUE
## 4 FALSE
## 5  TRUE
## 6  TRUE
varsC <- setdiff(colnames(dTrainCTreated),'y')
# all input variables should be mean 0
sapply(dTrainCTreated[,varsC,drop=FALSE],mean)
##      x_lev_NA     x_lev_x.a     x_lev_x.b        x_catP        x_catB
## -6.938894e-18  0.000000e+00  0.000000e+00  1.850372e-17  1.387779e-17
##       z_clean       z_isBAD
##  9.251859e-18 -6.938894e-18
# all slopes should be 1 for variables with treatmentsC$scoreFrame$sig<1
sapply(varsC,function(c) { glm(paste('y',c,sep='~'),family=binomial,
data=dTrainCTreated)$coefficients[[2]]})
##  x_lev_NA x_lev_x.a x_lev_x.b    x_catP    x_catB   z_clean   z_isBAD
## 31.619223  4.158883        NA  4.698112 15.815409  5.733441 31.619223
dTestCTreated <- prepare(treatmentsC,dTestC,pruneSig=c(),scale=TRUE)
head(dTestCTreated)
##   x_lev_NA  x_lev_x.a x_lev_x.b x_catP      x_catB  z_clean z_isBAD
## 1     -0.1 -0.1666667         0   -0.2 -0.11976374 1.194595    -0.1
## 2     -0.1  0.1666667         0    0.1 -0.07564865 2.951351    -0.1
## 3     -0.1  0.1666667         0    0.7 -0.07564865 4.708108    -0.1
## 4      0.5  0.1666667         0    0.4  0.51058851 0.000000     0.5

## A Numeric Outcome Example

# numeric example
dTrainN <- data.frame(x=c('a','a','a','a','b','b',NA),
z=c(1,2,3,4,5,NA,7),y=c(0,0,0,1,0,1,1))
head(dTrainN)
##   x  z y
## 1 a  1 0
## 2 a  2 0
## 3 a  3 0
## 4 a  4 1
## 5 b  5 0
## 6 b NA 1
dTestN <- data.frame(x=c('a','b','c',NA),z=c(10,20,30,NA))
head(dTestN)
##      x  z
## 1    a 10
## 2    b 20
## 3    c 30
## 4 <NA> NA
treatmentsN = designTreatmentsN(dTrainN,colnames(dTrainN),'y')
## [1] "designing treatments Mon Oct 16 19:20:52 2017"
## [1] "designing treatments Mon Oct 16 19:20:52 2017"
## [1] " have level statistics Mon Oct 16 19:20:52 2017"
## [1] "design var x Mon Oct 16 19:20:52 2017"
## [1] "design var z Mon Oct 16 19:20:52 2017"
## [1] " scoring treatments Mon Oct 16 19:20:52 2017"
## [1] "have treatment plan Mon Oct 16 19:20:52 2017"
## [1] "rescoring complex variables Mon Oct 16 19:20:52 2017"
## [1] "done rescoring complex variables Mon Oct 16 19:20:52 2017"
print(treatmentsN)
## $treatments
## $treatments[[1]]
## [1] "vtreat 'Categoric Indicators'('x'(integer,factor)->character->'x_lev_NA','x_lev_x.a','x_lev_x.b')"
##
## $treatments[[2]]
## [1] "vtreat 'Prevalence Code'('x'(integer,factor)->character->'x_catP')"
##
## $treatments[[3]]
## [1] "vtreat 'Scalable Impact Code'('x'(integer,factor)->character->'x_catN')"
##
## $treatments[[4]]
## [1] "vtreat 'Deviation Fact'('x'(integer,factor)->character->'x_catD')"
##
## $treatments[[5]]
## [1] "vtreat 'Scalable pass through'('z'(double,numeric)->numeric->'z_clean')"
##
## $treatments[[6]]
##
##
## $scoreFrame
##     varName varMoves         rsq       sig needsSplit extraModelDegrees
## 1  x_lev_NA     TRUE 0.222222222 0.2855909      FALSE                 0
## 2 x_lev_x.a     TRUE 0.173611111 0.3524132      FALSE                 0
## 3 x_lev_x.b     TRUE 0.008333333 0.8456711      FALSE                 0
## 4    x_catP     TRUE 0.213674336 0.2963404       TRUE                 2
## 5    x_catN     TRUE 0.001432768 0.9357866       TRUE                 2
## 6    x_catD     TRUE 0.173611111 0.3524132       TRUE                 2
## 7   z_clean     TRUE 0.336111111 0.1724763      FALSE                 0
## 8   z_isBAD     TRUE 0.222222222 0.2855909      FALSE                 0
##   origName  code
## 1        x   lev
## 2        x   lev
## 3        x   lev
## 4        x  catP
## 5        x  catN
## 6        x  catD
## 7        z clean
## 8        z isBAD
##
## $outcomename
## [1] "y"
##
## $vtreatVersion
## [1] '1.0.1'
##
## $outcomeType
## [1] "Numeric"
##
## $outcomeTarget
## [1] "y"
##
## $meanY
## [1] 0.4285714
##
## $splitmethod
## [1] "kwaycrossystratified"
##
## attr(,"class")
## [1] "treatmentplan"
dTrainNTreated <- prepare(treatmentsN,dTrainN,
pruneSig=c(),scale=TRUE)
head(dTrainNTreated)
##     x_lev_NA  x_lev_x.a   x_lev_x.b x_catP      x_catN     x_catD
## 1 -0.0952381 -0.1785714 -0.02857143   -0.2 -0.17857143 -0.1785714
## 2 -0.0952381 -0.1785714 -0.02857143   -0.2 -0.17857143 -0.1785714
## 3 -0.0952381 -0.1785714 -0.02857143   -0.2 -0.17857143 -0.1785714
## 4 -0.0952381 -0.1785714 -0.02857143   -0.2 -0.17857143 -0.1785714
## 5 -0.0952381  0.2380952  0.07142857    0.2  0.07142857  0.2380952
## 6 -0.0952381  0.2380952  0.07142857    0.2  0.07142857  0.2380952
##       z_clean    z_isBAD y
## 1 -0.41904762 -0.0952381 0
## 2 -0.26190476 -0.0952381 0
## 3 -0.10476190 -0.0952381 0
## 4  0.05238095 -0.0952381 1
## 5  0.20952381 -0.0952381 0
## 6  0.00000000  0.5714286 1
varsN <- setdiff(colnames(dTrainNTreated),'y')
# all input variables should be mean 0
sapply(dTrainNTreated[,varsN,drop=FALSE],mean)
##      x_lev_NA     x_lev_x.a     x_lev_x.b        x_catP        x_catN
## -3.965082e-18  0.000000e+00 -2.974054e-18 -5.551115e-17 -3.965082e-18
##        x_catD       z_clean       z_isBAD
## -9.515810e-17  4.757324e-17 -3.967986e-18
# all slopes should be 1 for variables with treatmentsN$scoreFrame$sig<1
sapply(varsN,function(c) { lm(paste('y',c,sep='~'),
data=dTrainNTreated)$coefficients[[2]]})
##  x_lev_NA x_lev_x.a x_lev_x.b    x_catP    x_catN    x_catD   z_clean
##         1         1         1         1         1         1         1
##   z_isBAD
##         1
# prepared frame
dTestNTreated <- prepare(treatmentsN,dTestN,
pruneSig=c())
head(dTestNTreated)
##   x_lev_NA x_lev_x.a x_lev_x.b    x_catP      x_catN    x_catD   z_clean
## 1        0         1         0 0.5714286 -0.17857143 0.5000000 10.000000
## 2        0         0         1 0.2857143  0.07142857 0.7071068 20.000000
## 3        0         0         0 0.0000000  0.00000000 0.7071068 30.000000
## 4        1         0         0 0.1428571  0.57142857 0.7071068  3.666667
##   z_isBAD
## 1       0
## 2       0
## 3       0
## 4       1
# scaled prepared frame
dTestNTreatedS <- prepare(treatmentsN,dTestN,
pruneSig=c(),scale=TRUE)
head(dTestNTreatedS)
##     x_lev_NA  x_lev_x.a   x_lev_x.b x_catP        x_catN     x_catD
## 1 -0.0952381 -0.1785714 -0.02857143   -0.2 -1.785714e-01 -0.1785714
## 2 -0.0952381  0.2380952  0.07142857    0.2  7.142857e-02  0.2380952
## 3 -0.0952381  0.2380952 -0.02857143    0.6 -1.586033e-17  0.2380952
## 4  0.5714286  0.2380952 -0.02857143    0.4  5.714286e-01  0.2380952
## 4 0.0000000  0.5714286