How to Perform and Interpret Regression Analysis

Introduction

Regression is a technique used to predict the value of a dependent variable from one or more independent variables. For example, you can predict a salesperson's total yearly sales (the dependent variable) from their age, education, and years of experience (the independent variables). There are two types of regression analysis: simple and multiple. Simple regression involves two variables, the dependent variable and one independent variable. Multiple regression involves one dependent variable and two or more independent variables.

Mathematically, the simple regression equation is as shown below:

y1 = b0 + b1x

Mathematically, the multiple regression equation is as shown below:

y1 = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn

where y1 is the estimated value of y (the dependent variable); b1, b2, b3, ... are the partial regression coefficients; x, x1, x2, x3, ... are the independent variables; and b0 is the regression constant. These coefficients are generated automatically when you run the regression procedure.
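As a minimal sketch of how the two equations above are evaluated, the Python function below computes y1 from a constant and a list of coefficients. The function name predict and the numbers passed to it are hypothetical and chosen only for illustration; they are not taken from the survey data.

```python
def predict(b0, coefficients, values):
    """Return b0 + b1*x1 + ... + bn*xn for a single observation."""
    if len(coefficients) != len(values):
        raise ValueError("need exactly one value per coefficient")
    return b0 + sum(b * x for b, x in zip(coefficients, values))


# Simple regression: one coefficient and one independent variable.
simple_estimate = predict(10.0, [2.0], [3.0])  # 10 + 2*3

# Multiple regression: three coefficients and three independent variables.
multiple_estimate = predict(10.0, [2.0, 0.5, -1.0], [3.0, 4.0, 5.0])  # 10 + 6 + 2 - 5

print(simple_estimate, multiple_estimate)
```

The simple equation is just the special case in which the coefficient list has a single entry.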

Residuals

It is important to understand the concept of residuals. They not only help you to understand the analysis, they also form the basis for measuring the accuracy of the estimates and the extent to which the regression model gives a good account of the collected data. A residual is simply the difference between the actual and the predicted value (i.e. y - y1). A simple correlation analysis between y and y1 gives an indication of the accuracy of the model.
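The two ideas in this section, the residual y - y1 and the correlation between y and y1, can be sketched in a few lines of Python. The actual and predicted values below are hypothetical numbers invented for illustration, and the helper name correlation is an assumption of this sketch.

```python
import math

# Hypothetical actual values (y) and predicted values (y1), for illustration only.
actual = [3.0, 5.0, 7.0, 10.0]
predicted = [2.5, 5.5, 7.5, 9.5]

# A residual is the actual value minus the predicted value.
residuals = [y - y1 for y, y1 in zip(actual, predicted)]


def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# The closer this is to 1, the better the predictions track the actual values.
accuracy = correlation(actual, predicted)
print(residuals, round(accuracy, 3))
```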

Simple Regression

The data shown in Table 1 were collected through a questionnaire survey. Thirty salespeople were approached and asked for their ages and their total sales values in the preceding year. We will use the data to illustrate the procedure of simple regression analysis.

Table 1: Ages and sales total

Age   Sales in £000     Age   Sales in £000     Age   Sales in £000
29    195               42    169               38    164
35    145               36    142               32    140
26    114               21    114               29    112
23    105               28    103               27    100
29    95                21    94                25    101
20    78                27    76                24    90
24    65                23    61                20    91
41    50                20    50                19    74
25    126               35    45                19    49
27    50                33    25                18    38

Before we can conduct any statistical procedure, the data have to be entered correctly into a suitable statistical package such as SPSS. Using the techniques described in Getting Started with SPSS for Windows, define the variables age and sales, using the labelling procedure to provide more informative names such as Age for salesperson and Total sales. Type the data into columns and save the file under a suitable name such as simreg. Note that all SPSS data files have the extension .sav. You can leave out the thousands when entering the sales values, but remember to multiply by a thousand when calculating the total sales of a salesperson.

The Simple Regression Procedure

From the menus choose:
Statistics
Regression
Linear...
The Linear Regression dialog box will be loaded on the screen as shown below.

Finding the Linear Regression procedure

The Linear Regression dialog box

The two variable names age and sales will appear in the left-hand box. Transfer the dependent variable sales to the Dependent text box by clicking on the variable name and then on the arrow >. Transfer the independent variable age to the Independent text box in the same way.

To obtain additional descriptive statistics and residual analysis, click on the Statistics button. The Linear Regression: Statistics dialog box will be loaded on the screen as shown below. Click on the Descriptives check box and then on Continue to return to the Linear Regression dialog box.

The Linear Regression: Statistics dialog box

Residual analysis can be obtained by clicking on the Plots button. The Linear Regression: Plots dialog box will be loaded on the screen as shown below. Click to check the boxes for Histogram and Normal probability plot.

We recommend that you plot the residuals against the predicted values. The correct variables for this plot are *zpred and *zresid. Click on *zresid and then on the arrow > to transfer it to the Y: text box. Transfer *zpred to the X: text box in the same way. The completed box is as shown below. Click on Continue and then OK to run the regression. Now let's look at the output after running the procedure.

The Linear Regression: Plots dialog box

Output listing for Simple Regression

You will be surprised by the amount of output that the simple regression procedure generates. We will attempt to explain and interpret the output for you. You should be able to interpret the output of any statistical procedure that you run.

The descriptive statistics and correlation coefficient are shown in the tables below. The mean total sales in a year for all 30 salespersons is £95370 (i.e. 95.37 x 1000). The mean age is 27.20, and N stands for the sample size. In the correlation table, the value 0.393 is the correlation between total sales value and age, and it is significant at the 5% level (0.016 < 0.05).
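The figures quoted above can be cross-checked outside SPSS. The sketch below is plain Python, not SPSS output: it assumes the 30 pairs in Table 1 have been transcribed correctly, and the helper name pearson_r is introduced only for this illustration. It does not reproduce the significance value, which requires the t distribution.

```python
import math

# (age, sales in £000) pairs transcribed row by row from Table 1.
pairs = [
    (29, 195), (42, 169), (38, 164), (35, 145), (36, 142), (32, 140),
    (26, 114), (21, 114), (29, 112), (23, 105), (28, 103), (27, 100),
    (29, 95), (21, 94), (25, 101), (20, 78), (27, 76), (24, 90),
    (24, 65), (23, 61), (20, 91), (41, 50), (20, 50), (19, 74),
    (25, 126), (35, 45), (19, 49), (27, 50), (33, 25), (18, 38),
]
ages = [age for age, _ in pairs]
sales = [s for _, s in pairs]

n = len(pairs)
mean_age = sum(ages) / n
mean_sales = sum(sales) / n


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(ages, sales)
print(n, round(mean_age, 2), round(mean_sales, 2), round(r, 3))
# prints: 30 27.2 95.37 0.393
```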

The table below shows which variables have been entered into or removed from the analysis. It is more relevant to multiple regression.

The next table below gives a summary of the model. The R value stands for the correlation coefficient, which is the same as r. R is used mainly to refer to multiple regression, while r refers to simple regression. There is also an ANOVA table, which tests whether the two variables have a linear relationship. In this example, the F value of 5.109 is significant at the 5% level, indicating a linear relationship between the two variables. However, only an examination of the scatter plot of the variables can ensure that the relationship is genuinely linear.
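To show where the F value in the ANOVA table comes from, the sketch below splits the total variation in sales into a regression part and a residual part and takes the ratio of their mean squares. It is a plain Python illustration under the assumption that Table 1 has been transcribed correctly; the variable names are hypothetical and it does not compute the significance level.

```python
# (age, sales in £000) pairs transcribed row by row from Table 1.
pairs = [
    (29, 195), (42, 169), (38, 164), (35, 145), (36, 142), (32, 140),
    (26, 114), (21, 114), (29, 112), (23, 105), (28, 103), (27, 100),
    (29, 95), (21, 94), (25, 101), (20, 78), (27, 76), (24, 90),
    (24, 65), (23, 61), (20, 91), (41, 50), (20, 50), (19, 74),
    (25, 126), (35, 45), (19, 49), (27, 50), (33, 25), (18, 38),
]
n = len(pairs)
mean_x = sum(a for a, _ in pairs) / n
mean_y = sum(s for _, s in pairs) / n

# Least-squares slope and intercept for sales on age.
sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
sxy = sum((a - mean_x) * (s - mean_y) for a, s in pairs)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Sums of squares as they appear in an ANOVA table.
ss_total = sum((s - mean_y) ** 2 for _, s in pairs)
ss_residual = sum((s - (intercept + slope * a)) ** 2 for a, s in pairs)
ss_regression = ss_total - ss_residual

r_square = ss_regression / ss_total
# One independent variable gives 1 regression and n - 2 residual degrees of freedom.
f_value = (ss_regression / 1) / (ss_residual / (n - 2))
print(round(r_square ** 0.5, 3), round(f_value, 3))
```

With one independent variable, the square root of R Square is the same 0.393 reported in the correlation table.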

The table below contains the main result of a regression analysis: the regression equation. The values of the regression coefficient and constant are given in column B of the table. Because the sales values were entered in thousands, don't forget to multiply the constant and coefficient by a thousand. The equation is, therefore,

Total sales value = 28595 + 2455 x (age)

Thus a salesperson who is 24 years old would be predicted to generate a yearly sales total of

28595 + 2455 x 24 = £87515

Notice from the data that one of the two 24-year-old salespersons actually generated £90000 worth of sales. For that person, the residual is £90000 - £87515 = £2485.
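The equation and the worked prediction above can be reproduced with the ordinary least-squares formulas. The sketch below is plain Python rather than SPSS, assumes Table 1 has been transcribed correctly, and rounds the constant and coefficient to whole pounds in the same way as the text; all variable names are hypothetical.

```python
# (age, sales in £000) pairs transcribed row by row from Table 1.
pairs = [
    (29, 195), (42, 169), (38, 164), (35, 145), (36, 142), (32, 140),
    (26, 114), (21, 114), (29, 112), (23, 105), (28, 103), (27, 100),
    (29, 95), (21, 94), (25, 101), (20, 78), (27, 76), (24, 90),
    (24, 65), (23, 61), (20, 91), (41, 50), (20, 50), (19, 74),
    (25, 126), (35, 45), (19, 49), (27, 50), (33, 25), (18, 38),
]
n = len(pairs)
mean_age = sum(a for a, _ in pairs) / n
mean_sales = sum(s for _, s in pairs) / n

# Least-squares estimates: b1 = Sxy / Sxx and b0 = mean(y) - b1 * mean(x).
sxx = sum((a - mean_age) ** 2 for a, _ in pairs)
sxy = sum((a - mean_age) * (s - mean_sales) for a, s in pairs)
b1 = sxy / sxx
b0 = mean_sales - b1 * mean_age

# Convert from £000 to £ and round to whole pounds, as in the text.
constant = round(b0 * 1000)
coefficient = round(b1 * 1000)

predicted = constant + coefficient * 24  # predicted sales for a 24-year-old
residual = 90000 - predicted             # actual minus predicted
print(constant, coefficient, predicted, residual)
# prints: 28595 2455 87515 2485
```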

The remaining output relates to the residual analysis. The table below contains the residuals statistics. It comprises the unstandardized predicted and residual values. It also contains the standardized (std.) predicted and residual values. Standardized means that the values have been scaled so that they have a mean of 0 and a standard deviation of 1.
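The scaling described above can be sketched as a z-score transformation: subtract the mean and divide by the standard deviation. The Python below applies it to the residuals of the simple regression, assuming Table 1 has been transcribed correctly. The helper name standardize is hypothetical, and SPSS may use a slightly different divisor for its standardized residuals, so treat this as an illustration of the idea rather than a reproduction of the SPSS table.

```python
import statistics

# (age, sales in £000) pairs transcribed row by row from Table 1.
pairs = [
    (29, 195), (42, 169), (38, 164), (35, 145), (36, 142), (32, 140),
    (26, 114), (21, 114), (29, 112), (23, 105), (28, 103), (27, 100),
    (29, 95), (21, 94), (25, 101), (20, 78), (27, 76), (24, 90),
    (24, 65), (23, 61), (20, 91), (41, 50), (20, 50), (19, 74),
    (25, 126), (35, 45), (19, 49), (27, 50), (33, 25), (18, 38),
]
mean_age = statistics.mean(a for a, _ in pairs)
mean_sales = statistics.mean(s for _, s in pairs)
b1 = (sum((a - mean_age) * (s - mean_sales) for a, s in pairs)
      / sum((a - mean_age) ** 2 for a, _ in pairs))
b0 = mean_sales - b1 * mean_age

# Unstandardized residuals: actual minus predicted sales (in £000).
residuals = [s - (b0 + b1 * a) for a, s in pairs]


def standardize(values):
    """Rescale values to have mean 0 and sample standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


z_residuals = standardize(residuals)
print(round(min(z_residuals), 2), round(max(z_residuals), 2))
```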

The histogram of the standardized residuals is shown below. The bars show the frequencies, while the superimposed curve represents the ideal normal distribution for the residuals.

The next plot, shown below, is a cumulative probability plot of the standardized residuals. If all the points lie on the diagonal, the residuals are normally distributed.

The last plot of the output listing (shown below) is a scatter plot of the predicted scores against the residuals. No pattern is visible, which is consistent with a linear relationship.

So far, we have looked at how to generate and interpret a simple regression analysis. Now let us look at how to generate and interpret a multiple regression analysis.

Multiple Regression

It has already been mentioned that multiple regression involves two or more independent variables and one dependent variable. The data used for the simple regression above will be extended and used to illustrate multiple regression. Two extra variables, the salesperson's education (educ) and years of experience (years), have been added. See Table 2 below. Each salesperson's education was assessed by the score obtained on a relevant academic project.

In discussing the output listing from the multiple regression procedure, there are two main questions that we need to address:

  1. How does the addition of more independent variables affect the accurate prediction of total sales?
  2. How can we determine the relative importance of the new variables?

Data Entry

Restore the file named simreg into the Data Editor window. Define and label the two new variables. Type in the new data. Save the file under a new name such as mulreg.

Table 2: Extension of Table 1

sales  age  educ  years     sales  age  educ  years
195    29   65    10        76     27   75    8
145    35   84    14        61     23   65    4
114    26   76    7         50     20   70    3
105    23   60    5         45     35   68    15
95     29   84    11        25     33   78    13
78     20   79    3         164    38   64    17
65     24   77    5         140    32   69    10
50     41   70    15        112    29   60    9
126    25   74    6         100    27   68    8
50     27   72    7         101    25   61    6
169    42   60    16        90     24   65    5
142    36   68    14        91     20   82    3
114    21   50    7         74     19   60    2
103    28   69    8         49     19   54    3
94     21   72    3         38     18   75    2

After the data have been entered successfully into the Data Editor, it is time to conduct some analysis. The multiple regression procedure is the same as the simple regression procedure except that the Linear Regression dialog box is filled out as shown below. Transfer the variable names age, educ and years into the Independent text box by highlighting them and clicking on the appropriate arrow (>) button. The Dependent text box must contain the variable sales. For Method, select Enter and click OK to run the procedure. The Enter method enters all the variables in the block in a single step. Other entry methods include Backward, Forward, and Stepwise.

The Linear Regression dialog box filled out for multiple regression

Output listing for multiple regression

The first table from the multiple regression procedure is shown below. It shows what method was selected to enter the variables. It also shows all the variables entered.

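The next table, shown below, gives a summary of the regression model. The multiple correlation coefficient (R) is 0.447. Recalling that R was 0.393 in the simple regression case, we see that the answer to the question of whether adding more independent variables improves the predictive power of the regression equation is 'yes'. Bear in mind, though, that R can never decrease when variables are added, so the Adjusted R Square value in the same table is a fairer basis for comparison.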

The next table from the multiple regression output listing is the ANOVA shown below.

The final table contains the coefficients of the variables. From column B of the table, we can write the regression equation of total sales upon age, educ and years as:

Total sales = 124906 + 312x(age) + 3343x(years) - 935x(educ)

Note the coefficients have been multiplied by a thousand.
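For readers who want to see what the procedure does behind the dialog box, the sketch below fits the same model by solving the least-squares normal equations in plain Python. It is not how SPSS is implemented; the helper name solve and all variable names are hypothetical, and the rows are assumed to be a faithful transcription of Table 2. If they are, the printed coefficients (in £000) should come out close to those in column B, and the printed R close to the value in the model summary.

```python
import math

# (sales in £000, age, educ, years) rows transcribed from Table 2.
rows = [
    (195, 29, 65, 10), (145, 35, 84, 14), (114, 26, 76, 7), (105, 23, 60, 5),
    (95, 29, 84, 11), (78, 20, 79, 3), (65, 24, 77, 5), (50, 41, 70, 15),
    (126, 25, 74, 6), (50, 27, 72, 7), (169, 42, 60, 16), (142, 36, 68, 14),
    (114, 21, 50, 7), (103, 28, 69, 8), (94, 21, 72, 3), (76, 27, 75, 8),
    (61, 23, 65, 4), (50, 20, 70, 3), (45, 35, 68, 15), (25, 33, 78, 13),
    (164, 38, 64, 17), (140, 32, 69, 10), (112, 29, 60, 9), (100, 27, 68, 8),
    (101, 25, 61, 6), (90, 24, 65, 5), (91, 20, 82, 3), (74, 19, 60, 2),
    (49, 19, 54, 3), (38, 18, 75, 2),
]


def solve(a, b):
    """Solve the linear system a.x = b by Gaussian elimination with pivoting."""
    size = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * size
    for i in range(size - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, size))
        x[i] = (m[i][size] - known) / m[i][i]
    return x


# Design matrix with a leading 1 for the constant term.
X = [[1.0, age, educ, years] for _, age, educ, years in rows]
y = [float(sales) for sales, *_ in rows]
k = len(X[0])

# Normal equations: (X'X) b = X'y.
xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
b0, b_age, b_educ, b_years = solve(xtx, xty)

fitted = [b0 + b_age * r[1] + b_educ * r[2] + b_years * r[3] for r in X]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
mean_y = sum(y) / len(y)
ss_residual = sum(e * e for e in residuals)
ss_total = sum((yi - mean_y) ** 2 for yi in y)
multiple_r = math.sqrt(1 - ss_residual / ss_total)

print(round(b0, 3), round(b_age, 3), round(b_educ, 3), round(b_years, 3))
print(round(multiple_r, 3))
```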

This equation tells us nothing about the relative importance of each variable. The values of the coefficients reflect the original units in which the variables were measured. Therefore, we cannot conclude that years of experience, with a larger coefficient than age, is a more important variable. The column of the table headed Beta gives us more information about the relative importance of the variables. Beta contains standardized coefficients. A change of one standard deviation in years will produce a change of 0.365 standard deviations in total sales. A change of one standard deviation in age will produce a change of only 0.05 standard deviations in total sales. In this example, the independent variable with the largest beta weight also has the largest correlation with the dependent variable.
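A beta weight is the unstandardized coefficient rescaled by the ratio of standard deviations: Beta = B x (standard deviation of the independent variable) / (standard deviation of the dependent variable). The sketch below applies that standard formula to the age and years coefficients reported in column B (in £000), using the data transcribed from Table 2. It is plain Python, not SPSS output, and the variable names are hypothetical.

```python
import statistics

# Columns transcribed from Table 2 (left half first, then right half).
sales = [195, 145, 114, 105, 95, 78, 65, 50, 126, 50, 169, 142, 114, 103, 94,
         76, 61, 50, 45, 25, 164, 140, 112, 100, 101, 90, 91, 74, 49, 38]
age = [29, 35, 26, 23, 29, 20, 24, 41, 25, 27, 42, 36, 21, 28, 21,
       27, 23, 20, 35, 33, 38, 32, 29, 27, 25, 24, 20, 19, 19, 18]
years = [10, 14, 7, 5, 11, 3, 5, 15, 6, 7, 16, 14, 7, 8, 3,
         8, 4, 3, 15, 13, 17, 10, 9, 8, 6, 5, 3, 2, 3, 2]

# Unstandardized coefficients as reported in column B, in £000 per unit.
b_age = 0.312
b_years = 3.343

sd_sales = statistics.stdev(sales)

# Beta = B * sd(independent variable) / sd(dependent variable).
beta_age = b_age * statistics.stdev(age) / sd_sales
beta_years = b_years * statistics.stdev(years) / sd_sales

print(round(beta_age, 2), round(beta_years, 3))
# prints: 0.05 0.365
```

Because both variables are expressed in standard deviation units after this rescaling, the two beta weights can be compared directly, which the raw coefficients cannot.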

Conclusion

You should now be able to conduct and interpret a regression analysis, be it simple or multiple regression.