Statistical and Machine-learning Data Mining by Ratner Bruce

Author:Ratner, Bruce. [Ratner, Bruce.]
Language: eng
Format: epub
Published: 2014-08-14T22:01:36+00:00


a. Worthy of note is that statistics textbooks refer to multicollinearity as a “data problem, but not a weakness in the model.” Students are taught that model performance is not affected by the condition of multicollinearity. Multicollinearity is a data problem because it affects only the clear assignment of each predictor variable's contribution to the dependent variable: The assigned contribution of each predictor variable is muddied. Unfortunately, students are not taught that model performance is unaffected by multicollinearity only as long as the condition of multicollinearity remains the same as when the model was initially built. If the condition is the same as when the model was first built, then implementation of the model should yield good performance. However, for every reimplementation of the model after the first, the condition of multicollinearity has been shown in practice not to remain the same. Hence, I uphold and defend that multicollinearity is a data problem, and that multicollinearity does affect model performance.

2. Average correlation values in the range of 0.35 or less are desirable. In this situation, a soundly honest assessment of the contributions of the predictor variables to the performance of the model can be made.

3. Average correlation values that are greater than 0.35 and less than 0.55 are moderately desirable. In this situation, a somewhat honest assessment of the contributions of the predictor variables to the performance of the model can be made.

4. Average correlation values that are greater than 0.55 are not desirable as they indicate the predictor variables are excessively redundant. In this situation, a questionably honest assessment of the contributions of the predictor variables to the performance of the model can be made.
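The thresholds in the rules of thumb above can be expressed as a small function. The following is a minimal Python sketch, not the author's code. It assumes the average correlation is the mean of the absolute pairwise correlation coefficients among the predictor variables, and it assumes a value of exactly 0.55, which the rules leave unassigned, falls in the “not desirable” band. The function names and the sample data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_correlation(predictors):
    """Mean absolute pairwise correlation among predictor columns.

    ASSUMPTION: the average correlation is taken over every distinct
    pair of predictors, using absolute values so that positive and
    negative correlations do not cancel.
    `predictors` maps a variable name to its list of values.
    """
    pairs = list(combinations(predictors.values(), 2))
    return sum(abs(pearson(x, y)) for x, y in pairs) / len(pairs)


def rate_average_correlation(value):
    """Map an average correlation to the rule-of-thumb label."""
    if value <= 0.35:
        return "desirable"
    if value < 0.55:
        return "moderately desirable"
    # ASSUMPTION: exactly 0.55 is treated as "not desirable".
    return "not desirable"


# Hypothetical predictors, made up for illustration only.
sample = {
    "x1": [1, 2, 3, 4],
    "x2": [2, 4, 6, 8],    # perfectly correlated with x1
    "x3": [1, -1, -1, 1],  # uncorrelated with x1 and x2
}
avg = average_correlation(sample)
print(round(avg, 4), rate_average_correlation(avg))
```

Under these assumptions, the LTV5 model's reported average correlation of 0.33502 would be rated by `rate_average_correlation(0.33502)` as “desirable”.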

As long as the average correlation value is acceptable (less than 0.40), the second proposed item of assessing competing models (every modeler builds several models and must choose the best one) is in play. If a project session brings forth models within the acceptable range of average correlation values, the model builder uses both the average correlation value and the set of individual correlations of each predictor variable with the dependent variable. The individual correlations indicate the content validity of the model. Rules of thumb for the values of the individual correlation coefficients are as follows:

1. Values between 0.0 and 0.3 (0.0 and -0.3) indicate poor validity.

2. Values between 0.3 and 0.7 (-0.3 and -0.7) indicate moderate validity.

3. Values between 0.7 and 1.0 (-0.7 and -1.0) indicate strong validity.
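These rules of thumb depend only on the magnitude of the correlation, so they can be sketched as a function of the absolute value. This is an illustrative Python sketch with a hypothetical function name. Because the stated ranges share their endpoints, it assumes that values of exactly 0.3 and 0.7 belong to the higher band.

```python
def rate_validity(r):
    """Label the validity of a predictor from its correlation with
    the dependent variable, using the rules of thumb above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    size = abs(r)  # the sign does not change the validity label
    # ASSUMPTION: boundary values 0.3 and 0.7 go to the higher band.
    if size < 0.3:
        return "poor"
    if size < 0.7:
        return "moderate"
    return "strong"


# Hypothetical correlations, for illustration only.
for r in (0.1, -0.5, 0.85):
    print(r, rate_validity(r))
```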

In sum, the model builder uses the average correlation and the individual correlations to assess competing predictive models and the importance of the predictor variables. I continue with the illustration of the LTV5 model to make sense of these discussions and rules of thumb in the next section.

13.5.2 Continuing with the illustration of the Average Correlation with an LTV5 Model

The average correlation of the LTV5 model is 0.33502. The individual correlations of the predictor variables with LTV5 (Table 13.3) indicate the variables have moderate to strong validity, except for VAR2. The combination of the average correlation of 0.33502 and the values in Table 13.3 gives any modeler compelling reason to be pleased with the reliability and validity of the LTV5 model.

13.5.3 Continuing with the illustration with a


