
On Thursday, October 16, 2015, a disbelieving student posted on Reddit: "My stats professor just went on a rant about how R-squared values are essentially useless, is there any truth to this?" It attracted a fair amount of attention, at least compared to other posts about statistics on Reddit.

It turns out the student’s stats professor was Cosma Shalizi of Carnegie Mellon University. Shalizi provides free and open access to his class lecture materials so we can see what exactly he was “ranting” about. It all begins in Section 3.2 of his Lecture 10 notes.

In case you forgot or didn’t know, R-squared is a statistic that often accompanies regression output. It ranges in value from 0 to 1 and is usually interpreted as summarizing the percent of variation in the response that the regression model explains. So an R-squared of 0.65 might mean that the model explains about 65% of the variation in our dependent variable. Given this logic, we prefer our regression models have a high R-squared. Shalizi, however, disputes this logic with convincing arguments.

In R, we typically get R-squared by calling the summary function on a model object. Here’s a quick example using simulated data:
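The article's own code isn't reproduced here, so the following is a minimal sketch along the same lines: simulate a predictor and a response that follows a straight line plus Normal noise, fit the model with lm, and read R-squared from the summary. The particular values (intercept 2, slope 1.2, sd = 2) are illustrative assumptions, not the article's exact numbers.

# Simulate data satisfying the simple linear regression model
set.seed(1)
x <- seq(1, 10, length.out = 100)                 # predictor values
y <- 2 + 1.2 * x + rnorm(100, mean = 0, sd = 2)   # true line plus Normal noise

# Fit the model; R-squared appears in the summary output
fit <- lm(y ~ x)
summary(fit)$r.squared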

One way to express R-squared is as the sum of squared fitted-value deviations divided by the sum of squared original-value deviations:

$$R^{2} = \frac{\sum (\hat{y} - \bar{\hat{y}})^{2}}{\sum (y - \bar{y})^{2}}$$

We can calculate it directly using our model object like so:
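Continuing the sketch above, the same number can be computed by hand from the fitted values, matching the formula:

f <- fitted(fit)                                        # the y-hat values
r2_manual <- sum((f - mean(f))^2) / sum((y - mean(y))^2)
r2_manual
all.equal(r2_manual, summary(fit)$r.squared)            # the two agree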

Now let’s take a look at a few of Shalizi’s statements about R-squared and demonstrate them with simulations in R.

1. R-squared does not measure goodness of fit. It can be arbitrarily low when the model is completely correct. By making \(\sigma^{2}\) large, we drive R-squared towards 0, even when every assumption of the simple linear regression model is correct in every particular.

What is \(\sigma^{2}\)? When we perform linear regression, we assume our model almost predicts our dependent variable. The difference between "almost" and "exact" is assumed to be a draw from a Normal distribution with mean 0 and some variance we call \(\sigma^{2}\).

Shalizi’s statement is easy enough to demonstrate. The way we do it here is to create a function that (1) generates data meeting the assumptions of simple linear regression (independent observations, normally distributed errors with constant variance), (2) fits a simple linear model to the data, and (3) reports the R-squared. Notice that, for the sake of simplicity, the only parameter is sigma. We then "apply" this function to a series of increasing \(\sigma\) values and plot the results.
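A sketch of such a function and the simulation loop follows; the grid of sigma values and the coefficients (2 and 1.2) are illustrative choices, not necessarily Shalizi's.

# R-squared as a function of sigma, with the model correct throughout
r2.0 <- function(sig) {
  x <- seq(1, 10, length.out = 100)              # fixed predictor values
  y <- 2 + 1.2 * x + rnorm(100, 0, sd = sig)     # correct linear model, noise sd = sig
  summary(lm(y ~ x))$r.squared                   # R-squared of the (correct) fit
}

sigmas  <- seq(0.5, 20, length.out = 20)
r2_vals <- sapply(sigmas, r2.0)                  # "apply" the function to each sigma
plot(r2_vals ~ sigmas, type = "b",
     xlab = "sigma", ylab = "R-squared")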

Sure enough, R-squared tanks hard with increasing sigma, even though the model is completely correct in every respect.

2. R-squared can be arbitrarily close to 1 when the model is totally wrong.

Again, the point being made is that R-squared does not measure goodness of fit. Here we use code from a different section of Shalizi’s lecture 10 notes to generate non-linear data.
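Shalizi's exact code isn't reproduced here; any data with a clearly non-linear mean function will do. A sketch with the same flavor (the parameters below are illustrative):

# Non-linear data: y is essentially a quadratic function of x
set.seed(1)
x <- rexp(50, rate = 0.005)                        # skewed predictor
y <- (x - 1)^2 * runif(50, min = 0.8, max = 1.2)   # quadratic trend, multiplicative noise
plot(x, y)                                         # clearly not a straight line

fit <- lm(y ~ x)                                   # a simple linear model: totally wrong here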

Now check the R-squared:
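Using the fit from the sketch above (the exact number depends on the simulated draw):

summary(fit)$r.squared    # high -- about 0.85 in the article's simulation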

It’s very high at about 0.85, but the model is completely wrong. Using R-squared to justify the “goodness” of our model in this instance would be a mistake. Hopefully one would plot the data first and recognize that a simple linear regression in this case would be inappropriate.

3. R-squared says nothing about prediction error, even with \(\sigma^{2}\) exactly the same, and no change in the coefficients. R-squared can be anywhere between 0 and 1 just by changing the range of X. We’re better off using Mean Square Error (MSE) as a measure of prediction error.

MSE is the average squared difference between the observed and fitted y values: take the observed values minus the fitted values, square them, sum, and divide by the number of observations.
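In symbols, with \(n\) observations:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}$$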

Let’s demonstrate this statement by first generating data that meets all simple linear regression assumptions and then regressing y on x to assess both R-squared and MSE.
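A sketch, with illustrative coefficients and a noise standard deviation of 0.9 (the article's reported values of 0.94 and 0.15 come from its own simulation):

# Data meeting the assumptions, with x covering a wide range
set.seed(1)
x <- seq(1, 10, length.out = 100)
y <- 2 + 1.2 * x + rnorm(100, 0, sd = 0.9)

fit <- lm(y ~ x)
summary(fit)$r.squared            # high, roughly 0.9
mean((y - fitted(fit))^2)         # MSE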

Now repeat the above code, but this time with a different range of x. Leave everything else the same:
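Continuing the sketch, with only the range of x changed:

# Same model, same sigma; only the range of x changes
set.seed(1)
x <- seq(1, 2, length.out = 100)
y <- 2 + 1.2 * x + rnorm(100, 0, sd = 0.9)

fit <- lm(y ~ x)
summary(fit)$r.squared            # much lower
mean((y - fitted(fit))^2)         # essentially the same MSE as before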

The R-squared falls from 0.94 to 0.15 but the MSE remains the same. In other words the predictive ability is the same for both data sets, but the R-squared would lead you to believe the first example somehow had a model with more predictive power.

4. R-squared cannot be compared between a model with untransformed Y and one with transformed Y, or between different transformations of Y. R-squared can easily go down when the model assumptions are better fulfilled.

Let’s examine this by generating data that would benefit from transformation. Notice the R code below is very much like our previous efforts but now we exponentiate our y variable.
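A sketch: generate y on the log scale and then exponentiate it, so the errors are multiplicative and right-skewed on the original scale. The coefficients below are illustrative, not the article's exact values.

# Exponentiate the response: skewed, non-constant-variance errors on the original scale
set.seed(1)
x <- seq(1, 2, length.out = 100)
y <- exp(-0.5 + 0.5 * x + rnorm(100, 0, sd = 0.5))

fit <- lm(y ~ x)
summary(fit)$r.squared      # very low
plot(fit, which = 1)        # residuals vs. fitted: outliers, non-constant variance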

R-squared is very low and our residuals vs. fitted plot reveals outliers and non-constant variance. A common fix for this is to log transform the data. Let’s try that and see what happens:
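Continuing the sketch (whether R-squared actually drops depends on the simulated draw; it did in the article's example):

# Log transform the response and refit
fit_log <- lm(log(y) ~ x)
plot(fit_log, which = 1)             # diagnostics look much better
summary(fit_log)$r.squared           # yet R-squared can come out even lower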

The diagnostic plot looks much better. Our assumption of constant variance appears to be met. But look at the R-squared:

It’s even lower! This is an extreme case and it doesn’t always happen like this. In fact, a log transformation will usually produce an increase in R-squared. But as just demonstrated, assumptions that are better fulfilled don’t always lead to higher R-squared. And hence R-squared cannot be compared between models.

5. It is very common to say that R-squared is “the fraction of variance explained” by the regression. [Yet] if we regressed X on Y, we’d get exactly the same R-squared. This in itself should be enough to show that a high R-squared says nothing about explaining one variable by another.

This is the easiest statement to demonstrate:
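A sketch using freshly simulated data; the point is only that lm(y ~ x) and lm(x ~ y) report the same R-squared.

# Regress y on x, then x on y: identical R-squared either way
set.seed(1)
x <- seq(1, 10, length.out = 100)
y <- 2 + 1.2 * x + rnorm(100, 0, sd = 2)

summary(lm(y ~ x))$r.squared
summary(lm(x ~ y))$r.squared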

Does x explain y, or does y explain x? Are we saying “explain” to dance around the word “cause”? In a simple scenario with two variables such as this, R-squared is simply the square of the correlation between x and y:
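Using the x and y simulated just above:

cor(x, y)^2     # the same value that both regressions report as R-squared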

Why not just use correlation instead of R-squared in this case? Then again, correlation only summarizes linear relationships, which may not be appropriate for the data. This is another instance where plotting your data is strongly advised.

Let’s recap:

  • R-squared does not measure goodness of fit.
  • R-squared does not measure predictive error.
  • R-squared does not allow you to compare models using transformed responses.
  • R-squared does not measure how one variable explains another.


And that’s just what we covered in this article. Shalizi gives even more reasons in his lecture notes. And it should be noted that Adjusted R-squared does nothing to address any of these issues.

So is there any reason at all to use R-squared? Shalizi says no. (“I have never found a situation where it helped at all.”) No doubt, some statisticians and Redditors might disagree. Whatever your view, if you choose to use R-squared to inform your data analysis, it would be wise to double-check that it’s telling you what you think it’s telling you.

For questions or clarifications regarding this article, contact the UVA Library StatLab: statlab@virginia.edu

View the entire collection of UVA Library StatLab articles.

Clay Ford
Statistical Research Consultant
University of Virginia Library
Oct 17, 2015

When developing more complex models it is often desirable to report a p-value for the model as a whole as well as an R-squared for the model.

p-values for models

The p-value for a model determines the significance of the model compared with a null model. For a linear model, the null model is defined as the dependent variable being equal to its mean. So, the p-value for the model is answering the question: Does this model explain the data significantly better than would just looking at the average value of the dependent variable?

R-squared and pseudo R-squared

The R-squared value is a measure of how well the model explains the data. It is an example of a goodness-of-fit statistic.

R-squared for linear (ordinary least squares) models

In R, models fit with the lm function are linear models fit with ordinary least squares (OLS). For these models, R-squared indicates the proportion of the variability in the dependent variable that is explained by the model. That is, an R-squared of 0.60 indicates that 60% of the variability in the dependent variable is explained by the model.

Pseudo R-squared

For many types of models, R-squared is not defined. These include relatively common models like logistic regression and the cumulative link models used in this book. For these models, pseudo R-squared measures can be calculated. A pseudo R-squared is not directly comparable to the R-squared for OLS models, nor can it be interpreted as the proportion of the variability in the dependent variable that is explained by the model. Instead, pseudo R-squared measures are relative measures among similar models, indicating how well the model explains the data.

This book uses three pseudo R-squared measures: McFadden; Cox and Snell (also referred to as ML); and Nagelkerke (also referred to as Cragg and Uhler).

In general I favor the Nagelkerke pseudo R-squared, but there is no agreement as to which pseudo R-squared measurement should be used.

p-values and R-squared values

p-values and R-squared values measure different things. The p-value indicates if there is a significant relationship described by the model, and the R-squared measures the degree to which the data is explained by the model. It is therefore possible to get a significant p-value with a low R-squared value. This often happens when there is a lot of variability in the dependent variable, but there are enough data points for a significant relationship to be indicated.

Packages used in this chapter

The packages used in this chapter include:

• psych

• lmtest

• boot

• rcompanion

The following commands will install these packages if they are not already installed:


if(!require(psych)){install.packages('psych')}
if(!require(lmtest)){install.packages('lmtest')}
if(!require(boot)){install.packages('boot')}
if(!require(rcompanion)){install.packages('rcompanion')}

Example of model p-value, R-squared, and pseudo R-squared

The following example uses some hypothetical data of a sample of people for which typing speed (Words.per.minute) and age were measured. After plotting the data, we decide to construct a polynomial model with Words.per.minute as the dependent variable and Age and Age2 as the independent variables. Notice in this example that all variables are treated as interval/ratio variables, and that the independent variables are also continuous variables.

The data will first be fit with a linear model with the lm function. Passing this model to the summary function will display the p-value for the model and the R-squared value for the model.

The same data will then be fit with a generalized linear model with the glm function. This type of model allows more latitude in the types of data that can be fit, but in this example we’ll use the family=gaussian option, which will mimic the model fit with the lm function, though the underlying math is different.

Importantly, the summary of the glm function produces neither a p-value for the model nor an R-squared for the model.

For the model fit with glm, the p-value can be determined with the anova function, comparing the fitted model to a null model. The null model is fit with only an intercept term on the right side of the model. As an alternative, the nagelkerke function described below also reports a p-value for the model, using the likelihood ratio test.

There is no R-squared defined for a glm model. Instead, a pseudo R-squared can be calculated. The function nagelkerke produces pseudo R-squared values for a variety of models. It reports three types: McFadden, Cox and Snell, and Nagelkerke. In general I recommend using the Nagelkerke measure, though there is no agreement on which pseudo R-squared measurement to use, if any at all.

The Nagelkerke is the same as the Cox and Snell, except that the value is adjusted upward so that the Nagelkerke has a maximum value of 1.
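A compact way to state that adjustment (standard, though not spelled out in the text above): the Cox and Snell value is \(1 - (L_{0}/L_{M})^{2/n}\), whose largest possible value is \(1 - L_{0}^{2/n}\), and the Nagelkerke value divides by that maximum,

$$R^{2}_{\text{Nagelkerke}} = \frac{1 - (L_{0}/L_{M})^{2/n}}{1 - L_{0}^{\,2/n}},$$

where \(L_{0}\) and \(L_{M}\) are the likelihoods of the null and fitted models and \(n\) is the number of observations.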

It has been suggested that a McFadden value of 0.2–0.4 indicates a good fit.

Note that these models make certain assumptions about the distribution of the data, but for simplicity this example will ignore the need to determine whether the data met these assumptions.

Input =('
Age Words.per.minute
12 23
12 32
12 25
13 27
13 30
15 29
15 33
16 37
18 29
22 33
23 37
24 33
25 43
27 35
33 30
42 25
53 22
')
Data = read.table(textConnection(Input),header=TRUE)
### Check the data frame
library(psych)
headTail(Data)
str(Data)

summary(Data)
### Remove unnecessary objects
rm(Input)

Plot the data


plot(Words.per.minute ~ Age,
     data = Data,
     pch  = 16,
     xlab = 'Age',
     ylab = 'Words per minute')


Prepare data


### Create new variable for the square of Age
Data$Age2 = Data$Age ^ 2
### Double check data frame
library(psych)
headTail(Data)


Age Words.per.minute Age2
1 12 23 144
2 12 32 144
3 12 25 144
4 13 27 169
... ... ... ...
14 27 35 729
15 33 30 1089
16 42 25 1764
17 53 22 2809

Linear model


model = lm(Words.per.minute ~ Age + Age2,
           data = Data)

summary(model)   ### Shows coefficients,
                 ### p-value for model, and R-squared


Multiple R-squared: 0.5009, Adjusted R-squared: 0.4296
F-statistic: 7.026 on 2 and 14 DF, p-value: 0.00771
### p-value and (multiple) R-squared value

Simple plot of data and model

For bivariate data, the function plotPredy will plot the data and the predicted line for the model. It also works for polynomial functions, if the order option is changed.


library(rcompanion)

plotPredy(data  = Data,
          y     = Words.per.minute,
          x     = Age,
          x2    = Age2,
          model = model,
          order = 2,
          xlab  = 'Age',
          ylab  = 'Words per minute')


Generalized linear model


model = glm(Words.per.minute ~ Age + Age2,
            data   = Data,
            family = gaussian)

summary(model)   ### Shows coefficients

Calculate p-value for model

In R, the most common way to calculate the p-value for a fitted model is to compare the fitted model to a null model with the anova function. The null model is usually formulated with just a constant on the right side.


null = glm(Words.per.minute ~ 1,     ### Create null model
           data   = Data,            ### with only a constant on the right side
           family = gaussian)

anova(model,
      null,
      test = 'Chisq')                ### Test options: 'Rao', 'LRT',
                                     ### 'Chisq', 'F', 'Cp'.
                                     ### But some work with only some model types

  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)
1        14     243.07
2        16     487.06 -2  -243.99 0.0008882 ***

Calculate pseudo R-squared and p-value for model

As an alternative test for the p-value for a fitted model, the nagelkerke function will report the p-value for a model using the likelihood ratio test.

The nagelkerke function also reports the McFadden, Cox and Snell, and Nagelkerke pseudo R-squared values for the model.


library(rcompanion)
nagelkerke(model)


$Pseudo.R.squared.for.model.vs.null
                             Pseudo.R.squared
McFadden                             0.112227
Cox and Snell (ML)                   0.500939
Nagelkerke (Cragg and Uhler)         0.501964

$Likelihood.ratio.test
 Df.diff LogLik.diff  Chisq   p.value
      -2     -5.9077 11.815 0.0027184

Likelihood ratio test for p-value for model

The p-value for a model by the likelihood ratio test can also be determined with the lrtest function in the lmtest package.


library(lmtest)
lrtest(model)


  #Df  LogLik Df  Chisq Pr(>Chisq)
1   4 -46.733
2   2 -52.641 -2 11.815   0.002718 **

Simple plot of data and model


library(rcompanion)

plotPredy(data  = Data,
          y     = Words.per.minute,
          x     = Age,
          x2    = Age2,
          model = model,
          order = 2,
          xlab  = 'Age',
          ylab  = 'Words per minute',
          col   = 'red')   ### line color


Optional analyses: Confidence intervals for R-squared values

It is relatively easy to produce confidence intervals for R-squared values or other results from model fitting, such as coefficients for regression terms. This can be accomplished with bootstrapping. Here the boot.ci function from the boot package is used.

The code below is a little complicated, but relatively flexible.

Function can contain any function of interest, as long as it includes an input vector or data frame (input in this case) and an indexing variable (index in this case). Stat is set to produce the actual statistic of interest on which to perform the bootstrap (r.squared from the summary of the lm in this case).

The code Function(Data, 1:n) is there simply to test Function on the data frame Data. In this case, it will produce the output of Function for the first n rows of Data. Since n is defined as the length of the first column in Data, this should return the value for Stat for the whole data frame, if Function is set up correctly.


Input =('
Age Words.per.minute
12 23
12 32
12 25
13 27
13 30
15 29
15 33
16 37
18 29
22 33
23 37
24 33
25 43
27 35
33 30
42 25
53 22
')
Data = read.table(textConnection(Input),header=TRUE)
Data$Age2 = Data$Age ^ 2

### Check the data frame

library(psych)
headTail(Data)
str(Data)

summary(Data)
### Remove unnecessary objects
rm(Input)

Confidence intervals for R-squared by bootstrap

R-squared value


library(boot)

Function = function(input, index){
   Input  = input[index,]
   Result = lm(Words.per.minute ~ Age + Age2,
               data = Input)
   Stat   = summary(Result)$r.squared
   return(Stat)}

### Test Function

n = length(Data[,1])
Function(Data, 1:n)


[1] 0.5009385

### Produce Stat estimate by bootstrap

Boot = boot(Data,
            Function,
            R = 5000)

mean(Boot$t[,1])


[1] 0.5754582

### Produce confidence interval by bootstrap

boot.ci(Boot,
        conf = 0.95,
        type = 'perc')      ### Options: 'norm', 'basic', 'stud', 'perc', 'bca', 'all'


BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
Intervals :
Level Percentile
95% ( 0.3796, 0.7802 )
Calculations and Intervals on Original Scale

### Other information
Boot
hist(Boot$t[,1],
col = 'darkgray')

Confidence intervals for pseudo R-squared by bootstrap

Pseudo R-squared value

The nagelkerke function produces a list. The second item in the list is a matrix named Pseudo.R.squared.for.model.vs.null.

The third element of this matrix is the value for the Nagelkerke pseudo R-squared. So,

nagelkerke(Result, Null)[[2]][3]

yields the value of the Nagelkerke pseudo R-squared.


library(boot)
library(rcompanion)

Function = function(input, index){
   Input  = input[index,]
   Result = glm(Words.per.minute ~ Age + Age2,
                data   = Input,
                family = 'gaussian')
   Null   = glm(Words.per.minute ~ 1,
                data   = Input,
                family = 'gaussian')
   Stat   = nagelkerke(Result, Null)[[2]][3]
   return(Stat)}

### Test Function

n = length(Data[,1])
Function(Data, 1:n)


[1] 0.501964

### Produce Stat estimate by bootstrap

Boot = boot(Data,
            Function,
            R = 1000)       ### In this case, even 1000 iterations can take a while

mean(Boot$t[,1])



[1] 0.5803598

### Produce confidence interval by bootstrap

boot.ci(Boot,
        conf = 0.95,
        type = 'perc')      ### Options: 'norm', 'basic', 'stud', 'perc', 'bca', 'all'



BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
Intervals :
Level Percentile
95% ( 0.3836, 0.7860 )
Calculations and Intervals on Original Scale

### Other information
Boot
hist(Boot$t[,1],
col = 'darkgray')

