Regression is a technique used to predict the value of a dependent variable from one or more independent variables. For example, you can predict a salesperson's total yearly sales (the dependent variable) from their age, education, and years of experience (the independent variables). There are two types of regression analysis: simple and multiple regression. Simple regression involves two variables: the dependent variable and one independent variable. Multiple regression involves one dependent variable and two or more independent variables.
Mathematically, the simple regression equation is as shown below:
y1 = b0 + b1x
Mathematically, the multiple regression equation is as shown below:
y1 = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn
where y1 is the estimated value of y (the dependent variable), b1, b2, b3,... are the partial regression coefficients, x, x1, x2, x3,... are the independent variables and b0 is the regression constant. These coefficients are generated automatically when you run the regression procedure.
It is important to understand the concept of residuals. Residuals not only help you to understand the analysis; they also form the basis for measuring the accuracy of the estimates and the extent to which the regression model gives a good account of the collected data. A residual is simply the difference between the actual and the predicted value (i.e. y - y1). A simple correlation analysis between y and y1 gives an indication of the accuracy of the model.
The data shown in Table 1 were collected through a questionnaire survey. Thirty salespeople were approached and asked for their ages and their total sales values in the preceding year. We will use the data to illustrate the procedure of simple regression analysis.
Table 1: Ages and sales total
| Age | Sales in £000 | Age | Sales in £000 | Age | Sales in £000 |
| 29 | 195 | 42 | 169 | 38 | 164 |
| 35 | 145 | 36 | 142 | 32 | 140 |
| 26 | 114 | 21 | 114 | 29 | 112 |
| 23 | 105 | 28 | 103 | 27 | 100 |
| 29 | 95 | 21 | 94 | 25 | 101 |
| 20 | 78 | 27 | 76 | 24 | 90 |
| 24 | 65 | 23 | 61 | 20 | 91 |
| 41 | 50 | 20 | 50 | 19 | 74 |
| 25 | 126 | 35 | 45 | 19 | 49 |
| 27 | 50 | 33 | 25 | 18 | 38 |
Before we can conduct any statistical procedure, the data have to be entered correctly into a suitable statistical package such as SPSS. Using the techniques described in Getting Started with SPSS for Windows, define the variables age and sales, using the labelling procedure to provide more informative names such as Age of salesperson and Total sales. Type the data into columns and save the file under a suitable name such as simreg. Note that all SPSS data files have the extension .sav. You can leave out the thousands when entering the sales values, but remember to multiply by a thousand when calculating the total sales of a salesperson.
From the menus choose:
Statistics
Regression
Linear...
The Linear Regression dialog box will be loaded on the screen as shown below.

The Linear Regression dialog box

The two variable names age and sales will appear in the left-hand box. Transfer the dependent variable sales to the Dependent text box by clicking on the variable name and then on the arrow >. Transfer the independent variable age to the Independent text box in the same way.
To obtain additional descriptive statistics and residuals analysis click on the Statistics button. The Linear Regression: Statistics dialog box will be loaded on the screen as shown below. Click on the Descriptives check box and then on Continue to return to the Linear Regression dialog box.
The Linear Regression: Statistics dialog box

Residuals analysis can be obtained by clicking on the Plots button. The Linear Regression: Plots dialog box will be loaded on the screen as shown below. Click to check the boxes for Histogram and Normal probability plot.
We recommend that you plot the residuals against the predicted values. The correct variables for this plot are *zpred and *zresid. Click on *zresid and then on the arrow > to transfer it to the Y: text box. Transfer *zpred to the X: text box in the same way. The completed box is as shown below. Click on Continue and then OK to run the regression. Now let's look at the output after running the procedure.
The Linear Regression: Plots dialog box

You will be surprised by the amount of output that the simple regression procedure generates. We will attempt to explain and interpret the output for you. You should be able to interpret the output of any statistical procedure that you generate.
The descriptive statistics and correlation coefficient are shown in the tables below. The mean total sales in a year for all 30 salespersons is £95,370 (i.e. 95.37 x 1000). The mean age is 27.20 and N stands for the sample size. In the correlation table, the value 0.393 is the correlation between total sales value and age, and it is significant at the 5% level (0.016 < 0.05).
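The descriptive figures quoted above can be checked by hand. The following is a minimal Python sketch, not part of the SPSS procedure, that recomputes the sample size, the two means and the correlation coefficient from the Table 1 data. The names `mean` and `pearson_r` are our own illustrative helpers.

```python
from math import sqrt

# Table 1 data: age in years and sales in £000 for the thirty salespeople.
age = [29, 35, 26, 23, 29, 20, 24, 41, 25, 27,
       42, 36, 21, 28, 21, 27, 23, 20, 35, 33,
       38, 32, 29, 27, 25, 24, 20, 19, 19, 18]
sales = [195, 145, 114, 105, 95, 78, 65, 50, 126, 50,
         169, 142, 114, 103, 94, 76, 61, 50, 45, 25,
         164, 140, 112, 100, 101, 90, 91, 74, 49, 38]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


n = len(age)
mean_age = mean(age)
mean_sales = mean(sales)
r = pearson_r(age, sales)

# Prints: N = 30, mean age = 27.20, mean sales = 95.37, r = 0.393
print(f"N = {n}, mean age = {mean_age:.2f}, mean sales = {mean_sales:.2f}, r = {r:.3f}")
```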


The table below shows which variables have been entered into or removed from the analysis. It is more relevant to multiple regression.

The next table gives a summary of the model. The R value stands for the correlation coefficient, which here is the same as r. R is used mainly in multiple regression, while r refers to simple regression. There is also an ANOVA table, which tests whether the two variables have a linear relationship. In this example, the F value of 5.109 is significant at the 5% level, indicating a linear relationship between the two variables. Only an examination of the scatter plot of the variables can ensure that the relationship is genuinely linear.
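To see where the F value comes from, the sketch below splits the total variation in sales into a regression part and a residual part, and takes the ratio of their mean squares. It is an illustration in plain Python under the usual least squares definitions, not a reproduction of the SPSS output. The variable names are our own.

```python
# Table 1 data: age in years and sales in £000.
age = [29, 35, 26, 23, 29, 20, 24, 41, 25, 27,
       42, 36, 21, 28, 21, 27, 23, 20, 35, 33,
       38, 32, 29, 27, 25, 24, 20, 19, 19, 18]
sales = [195, 145, 114, 105, 95, 78, 65, 50, 126, 50,
         169, 142, 114, 103, 94, 76, 61, 50, 45, 25,
         164, 140, 112, 100, 101, 90, 91, 74, 49, 38]

n = len(age)
mean_age = sum(age) / n
mean_sales = sum(sales) / n

# Least squares slope and constant for sales on age.
sxy = sum((a - mean_age) * (s - mean_sales) for a, s in zip(age, sales))
sxx = sum((a - mean_age) ** 2 for a in age)
b1 = sxy / sxx
b0 = mean_sales - b1 * mean_age

predicted = [b0 + b1 * a for a in age]

# Sums of squares: explained by the regression, and left in the residuals.
ss_regression = sum((p - mean_sales) ** 2 for p in predicted)
ss_residual = sum((s - p) ** 2 for s, p in zip(sales, predicted))

# One predictor gives 1 regression degree of freedom and n - 2 residual ones.
f_value = (ss_regression / 1) / (ss_residual / (n - 2))
r_squared = ss_regression / (ss_regression + ss_residual)

# Prints: F = 5.109, R squared = 0.154
print(f"F = {f_value:.3f}, R squared = {r_squared:.3f}")
```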


The table below is the main result of a regression analysis, because it contains the regression equation. The values of the regression coefficient and constant are given in column B of the table. Don't forget to multiply the constant and coefficient by a thousand. The equation is, therefore,
Total sales value = 28595 + 2455 x (age)
Thus a salesperson who is 24 years old would be predicted to generate a yearly sales total of
28595 + 2455 x 24 = £87515
Notice from the data that one of the two 24-year-old salespersons actually generated £90000 worth of sales. The residual for that salesperson is £90000 - £87515 = £2485.
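The worked example above can be reproduced with a few lines of Python. This is a sketch only: it fits the line by the standard least squares formulas, rounds the constant and coefficient to whole pounds as in the text, and then repeats the prediction for a 24-year-old. The function name `predict_sales` is hypothetical.

```python
# Table 1 data: age in years and sales in £000.
age = [29, 35, 26, 23, 29, 20, 24, 41, 25, 27,
       42, 36, 21, 28, 21, 27, 23, 20, 35, 33,
       38, 32, 29, 27, 25, 24, 20, 19, 19, 18]
sales = [195, 145, 114, 105, 95, 78, 65, 50, 126, 50,
         169, 142, 114, 103, 94, 76, 61, 50, 45, 25,
         164, 140, 112, 100, 101, 90, 91, 74, 49, 38]

n = len(age)
mean_age = sum(age) / n
mean_sales = sum(sales) / n

# Least squares estimates: slope = Sxy / Sxx, constant = mean(y) - slope * mean(x).
sxy = sum((a - mean_age) * (s - mean_sales) for a, s in zip(age, sales))
sxx = sum((a - mean_age) ** 2 for a in age)
b1 = sxy / sxx
b0 = mean_sales - b1 * mean_age

# Convert from £000 to pounds and round, as done in the text.
b0_pounds = round(b0 * 1000)
b1_pounds = round(b1 * 1000)


def predict_sales(age_years):
    """Predicted yearly sales in pounds for a salesperson of the given age."""
    return b0_pounds + b1_pounds * age_years


predicted = predict_sales(24)
residual = 90_000 - predicted  # actual minus predicted

# Prints: Total sales value = 28595 + 2455 x (age)
print(f"Total sales value = {b0_pounds} + {b1_pounds} x (age)")
# Prints: predicted = 87515, residual = 2485
print(f"predicted = {predicted}, residual = {residual}")
```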

The remaining output relates to the residuals analysis. The table below contains the residuals statistics. It comprises the unstandardized predicted and residual values. It also contains the standardized (std.) predicted and residual values. Standardized means that the values have been scaled so that they have a mean of 0 and a standard deviation of 1.
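As an illustration of what standardizing means, the sketch below computes the predicted values and residuals for the Table 1 data and rescales each set to a mean of 0 and a standard deviation of 1. It assumes each set is scaled by its own sample standard deviation; a statistical package may instead scale residuals by the standard error of the estimate, which gives a slightly different spread. The helper `standardize` is our own.

```python
from statistics import mean, stdev

# Table 1 data: age in years and sales in £000.
age = [29, 35, 26, 23, 29, 20, 24, 41, 25, 27,
       42, 36, 21, 28, 21, 27, 23, 20, 35, 33,
       38, 32, 29, 27, 25, 24, 20, 19, 19, 18]
sales = [195, 145, 114, 105, 95, 78, 65, 50, 126, 50,
         169, 142, 114, 103, 94, 76, 61, 50, 45, 25,
         164, 140, 112, 100, 101, 90, 91, 74, 49, 38]

# Least squares fit of sales on age.
mean_age, mean_sales = mean(age), mean(sales)
b1 = (sum((a - mean_age) * (s - mean_sales) for a, s in zip(age, sales))
      / sum((a - mean_age) ** 2 for a in age))
b0 = mean_sales - b1 * mean_age

predicted = [b0 + b1 * a for a in age]
residuals = [s - p for s, p in zip(sales, predicted)]


def standardize(values):
    """Rescale values to mean 0 and sample standard deviation 1 (assumed scaling)."""
    m, sd = mean(values), stdev(values)
    return [(v - m) / sd for v in values]


z_pred = standardize(predicted)
z_resid = standardize(residuals)

print(f"mean of residuals = {mean(residuals):.6f}")
print(f"standardized predicted: mean = {mean(z_pred):.3f}, sd = {stdev(z_pred):.3f}")
print(f"standardized residuals: mean = {mean(z_resid):.3f}, sd = {stdev(z_resid):.3f}")
```

Plotting `z_resid` against `z_pred` gives the same kind of scatter plot as the *zresid against *zpred plot requested in the Plots dialog box.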

The histogram of the standardized residuals is shown below. The bars show the frequencies, while the superimposed curve represents the ideal normal distribution for the residuals.

The next plot, shown below, is a cumulative probability plot of the standardized residuals. If all the points lie on the diagonal, the residuals are normally distributed.

The last plot of the output listing (shown below) is a scatter plot of the predicted scores against the residuals. No pattern is evident, which supports the linearity of the relationship.

So far, we have looked at how to generate and interpret a simple regression analysis. Now let us look at how to generate and interpret a multiple regression analysis.
It has already been mentioned that multiple regression involves two or more independent variables and one dependent variable. The data used for the simple regression above will be extended and used to illustrate multiple regression. Two extra variables, the salesperson's education (educ) and years of experience (years), have been added. See Table 2 below. The salespersons' education was assessed by the scores they obtained on a relevant academic project.
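In discussing the output listing from the multiple regression procedure, there are two main questions that we need to address:
1. Does adding more independent variables improve the predictive power of the regression equation?
2. What is the relative importance of each independent variable?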
Restore the file named simreg into the Data Editor window. Define and label the two new variables. Type in the new data. Save the file under a new name such as mulreg.
Table 2: Extension of Table 1
| sales | age | educ | years | sales | age | educ | years |
| 195 | 29 | 65 | 10 | 76 | 27 | 75 | 8 |
| 145 | 35 | 84 | 14 | 61 | 23 | 65 | 4 |
| 114 | 26 | 76 | 7 | 50 | 20 | 70 | 3 |
| 105 | 23 | 60 | 5 | 45 | 35 | 68 | 15 |
| 95 | 29 | 84 | 11 | 25 | 33 | 78 | 13 |
| 78 | 20 | 79 | 3 | 164 | 38 | 64 | 17 |
| 65 | 24 | 77 | 5 | 140 | 32 | 69 | 10 |
| 50 | 41 | 70 | 15 | 112 | 29 | 60 | 9 |
| 126 | 25 | 74 | 6 | 100 | 27 | 68 | 8 |
| 50 | 27 | 72 | 7 | 101 | 25 | 61 | 6 |
| 169 | 42 | 60 | 16 | 90 | 24 | 65 | 5 |
| 142 | 36 | 68 | 14 | 91 | 20 | 82 | 3 |
| 114 | 21 | 50 | 7 | 74 | 19 | 60 | 2 |
| 103 | 28 | 69 | 8 | 49 | 19 | 54 | 3 |
| 94 | 21 | 72 | 3 | 38 | 18 | 75 | 2 |
After the data have been entered successfully into the Data Editor, it is time to conduct some analysis. The multiple regression procedure is the same as the simple regression procedure except that the Linear Regression dialog box is filled out as shown below. Transfer the variable names age, educ and years into the Independent text box by highlighting them and clicking on the appropriate arrow (>) button. The Dependent text box must contain the variable sales. For Method, select Enter and click OK to run the procedure. The Enter method enters all the variables in the block in a single step. Other entry methods include Backward, Forward, and Stepwise.
The Linear Regression dialog box filled out for multiple regression

The first table from the multiple regression procedure is shown below. It shows what method was selected to enter the variables. It also shows all the variables entered.

The next table, shown below, gives a summary of the regression model. The multiple correlation coefficient (R) is 0.447. Recalling that for the simple regression case R was 0.393, we see that the answer to the question of whether adding more independent variables improves the predictive power of the regression equation is 'yes'. Bear in mind that R can never decrease when independent variables are added, so it is the size of the improvement that matters.

The next table from the multiple regression output listing is the ANOVA shown below.

The final table gives the coefficients of the variables. From column B of the table, we can write the regression equation of total sales upon age, educ and years as:
Total sales = 124906 + 312x(age) + 3343x(years) - 935x(educ)
Note the coefficients have been multiplied by a thousand.
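For readers who want to see how such coefficients arise, the following is a rough Python sketch that fits sales on age, educ and years by ordinary least squares, solving the normal equations with a small hand-written Gaussian elimination. It is an illustration of the Enter method under standard least squares assumptions, not a copy of the SPSS algorithm. The helper `solve` and all variable names are our own. The printed coefficients are in £000 and can be compared with column B.

```python
from math import sqrt

# Table 2 data, one tuple per salesperson: (sales in £000, age, educ, years).
rows = [
    (195, 29, 65, 10), (145, 35, 84, 14), (114, 26, 76, 7), (105, 23, 60, 5),
    (95, 29, 84, 11), (78, 20, 79, 3), (65, 24, 77, 5), (50, 41, 70, 15),
    (126, 25, 74, 6), (50, 27, 72, 7), (169, 42, 60, 16), (142, 36, 68, 14),
    (114, 21, 50, 7), (103, 28, 69, 8), (94, 21, 72, 3), (76, 27, 75, 8),
    (61, 23, 65, 4), (50, 20, 70, 3), (45, 35, 68, 15), (25, 33, 78, 13),
    (164, 38, 64, 17), (140, 32, 69, 10), (112, 29, 60, 9), (100, 27, 68, 8),
    (101, 25, 61, 6), (90, 24, 65, 5), (91, 20, 82, 3), (74, 19, 60, 2),
    (49, 19, 54, 3), (38, 18, 75, 2),
]


def solve(a, b):
    """Solve the linear system a.x = b by Gaussian elimination with partial pivoting."""
    size = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * size
    for i in reversed(range(size)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, size))
        x[i] = (m[i][size] - known) / m[i][i]
    return x


y = [row[0] for row in rows]
# Design matrix columns: constant, age, educ, years.
X = [[1.0, row[1], row[2], row[3]] for row in rows]
k = 4

# Normal equations: (X'X) b = X'y.
xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
b0, b_age, b_educ, b_years = solve(xtx, xty)

fitted = [b0 + b_age * r[1] + b_educ * r[2] + b_years * r[3] for r in X]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Multiple correlation coefficient R from the residual and total sums of squares.
mean_y = sum(y) / len(y)
ss_total = sum((yi - mean_y) ** 2 for yi in y)
ss_residual = sum(e ** 2 for e in residuals)
multiple_r = sqrt(1 - ss_residual / ss_total)

print(f"constant = {b0:.3f}, age = {b_age:.3f}, educ = {b_educ:.3f}, years = {b_years:.3f}")
print(f"R = {multiple_r:.3f}")
```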
This equation tells us nothing about the relative importance of each variable. The values of the coefficients reflect the original units in which the variables were measured. Therefore, we cannot conclude that years of experience, with a larger coefficient than age, is a more important variable. The column of the table headed Beta gives us more information about the relative importance of the variables. Beta contains standardized coefficients. A change of one standard deviation in years will produce a change of 0.365 standard deviations in total sales. A change of one standard deviation in age will produce a change of only 0.05 standard deviations in total sales. It should also be noted that the independent variable with the largest beta weight also has the largest correlation with the dependent variable.
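The link between column B and the Beta column can be sketched in a few lines of Python. A standardized coefficient is the unstandardized coefficient multiplied by the standard deviation of its independent variable and divided by the standard deviation of the dependent variable. The sketch below takes the unstandardized coefficients from the regression equation above (in £000) and the standard deviations from the Table 2 data. The dictionary names are our own.

```python
from statistics import stdev

# Table 2 data in row order: sales in £000, age, educ and years.
sales = [195, 145, 114, 105, 95, 78, 65, 50, 126, 50,
         169, 142, 114, 103, 94, 76, 61, 50, 45, 25,
         164, 140, 112, 100, 101, 90, 91, 74, 49, 38]
age = [29, 35, 26, 23, 29, 20, 24, 41, 25, 27,
       42, 36, 21, 28, 21, 27, 23, 20, 35, 33,
       38, 32, 29, 27, 25, 24, 20, 19, 19, 18]
educ = [65, 84, 76, 60, 84, 79, 77, 70, 74, 72,
        60, 68, 50, 69, 72, 75, 65, 70, 68, 78,
        64, 69, 60, 68, 61, 65, 82, 60, 54, 75]
years = [10, 14, 7, 5, 11, 3, 5, 15, 6, 7,
         16, 14, 7, 8, 3, 8, 4, 3, 15, 13,
         17, 10, 9, 8, 6, 5, 3, 2, 3, 2]

# Unstandardized coefficients in £000, taken from the regression equation in the text.
b = {"age": 0.312, "educ": -0.935, "years": 3.343}
predictors = {"age": age, "educ": educ, "years": years}

# beta = b * (standard deviation of predictor) / (standard deviation of sales)
sd_sales = stdev(sales)
beta = {name: b[name] * stdev(values) / sd_sales
        for name, values in predictors.items()}

for name, value in beta.items():
    print(f"beta({name}) = {value:.3f}")
```

Because every beta is expressed in standard deviation units, the values can be compared directly with one another, which is not true of the coefficients in column B.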

You should now be able to conduct and interpret a regression analysis, be it simple or multiple regression.