# Statistics 2 Course, Unit 5: Learn with Questions

**Açıköğretim (Open Education) lecture notes** are prepared by students while they study and are shared for other students who will study the same courses. If you would like to share lecture notes you have prepared, you can send them to us.

Below you can access the study document (course summary / learn with questions) prepared for Unit 5 of the Açıköğretim Statistics 2 course. With AÖF Lecture Notes you can study for your exams much more effectively. We wish you success in your exams.

## Correlation And Regression Analysis

**Question 1**

What is correlation analysis used for? Explain in terms of real-life problems.

**Answer**

In real-life problems, a researcher may suspect that there is a relationship between some variables. Is it possible to show that when the value of one variable changes, the value of another variable also tends to change? Take a baby as an example: normally, as the baby gets older, his or her weight increases. However, not every baby’s weight increases by the same amount; the change in weight differs from baby to baby. It is therefore very useful to be able to measure whether there is a relationship between two variables and how strong that relationship is. The method of finding the degree of relationship between variables is called correlation analysis.

**Question 2**

Explain the use of a scatterplot of data.

**Answer**

To examine the relationship between two variables, a scatterplot of the data points can be drawn. In a scatterplot, each pair of values of the two continuous variables is plotted as a point in two dimensions.

**Question 3**

Discuss correlation analysis in terms of correlation and causation.

**Answer**

We should caution the reader not to confuse correlation with causation. Correlation and causation are two different things. Finding a correlation between two variables does not mean that one variable causes the other. The study of causation is entirely different from the study of correlation.

**Question 4**

Explain Pearson’s correlation coefficient in terms of symbols and values.

**Answer**

To show the degree of the relationship between two continuous variables, Pearson’s correlation coefficient (also called Pearson’s product-moment correlation coefficient) is widely used. Two symbols are used for it: the population correlation is represented by $\rho$ (rho) and the sample correlation is represented by $r$. Pearson’s correlation coefficient takes values between -1 and +1. If the result is exactly -1 or +1, there is a perfect linear relationship between the variables. The sign of the coefficient indicates the direction of the relationship: a positive sign indicates that the variables change in the same direction, and a negative sign indicates that they change in opposite directions.
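As a minimal sketch of how the sample coefficient $r$ can be computed, the Python function below uses the standard deviations-from-the-mean form of the formula. The function name `pearson_r` and the baby age and weight figures are hypothetical and chosen only for illustration; they do not come from the course material.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient r for paired observations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: age in months and weight in kg for five babies
age = [1, 3, 6, 9, 12]
weight = [4.2, 5.9, 7.6, 8.8, 9.7]
r = pearson_r(age, weight)
print(f"r = {r:.3f}")  # a value close to +1: strong positive relationship
```

A perfectly increasing straight-line data set gives $r = +1$ and a perfectly decreasing one gives $r = -1$, which matches the limits described above.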

**Question 5**

Discuss Pearson’s correlation coefficient in terms of correlation and causation.

**Answer**

Pearson’s correlation coefficient is a very useful tool for showing the degree of the linear relationship between two variables. However, it is not a perfect tool, and we should be careful when using it. First of all, Pearson’s correlation coefficient does not show a causal relationship. A relationship shown by the coefficient does not necessarily mean that one variable is the cause of the other. There may always be many known or unknown effects acting on a variable.

**Question 6**

How can the outlier problem be controlled for Pearson’s correlation coefficient?

**Answer**

Pearson’s correlation coefficient is easily affected by outliers. To control this problem, careful inspection of the relationship with a scatterplot is very helpful. If a data pair is identified as an outlier or an unusual observation, the researcher may investigate why these values occurred and then decide whether or not to include these data points in the analysis.
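To make this sensitivity concrete, the sketch below computes $r$ for a small, nearly linear data set and then again after adding one unusual pair. The data and the helper name `pearson_r` are hypothetical and serve only as an illustration.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient r for paired observations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical, nearly linear data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
r_clean = pearson_r(x, y)

# The same data plus a single unusual observation (6, 0.5)
r_with_outlier = pearson_r(x + [6], y + [0.5])

print(f"without outlier: r = {r_clean:.3f}")
print(f"with outlier:    r = {r_with_outlier:.3f}")
```

In this sketch a single added point is enough to turn a very strong correlation into a weak one, which is why the scatterplot should be inspected before $r$ is interpreted.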

**Question 7**

Explain the coefficient of determination.

**Answer**

In correlation and regression problems, the coefficient of determination measures the proportion of the variance of the dependent variable ($y$) that is accounted for by the independent variable ($x$) when $y$ is expressed as a linear regression on $x$. In brief, the coefficient of determination shows how much of the variability of the dependent variable ($y$) is explained by the independent variable ($x$). More than one independent variable may be needed to explain the total variability of $y$; it is not realistic to expect one variable to explain all the variability in the dependent variable.

The coefficient of determination is calculated as the square of Pearson’s correlation coefficient, $r^2$. Because Pearson’s correlation coefficient takes values between -1 and +1, the coefficient of determination takes values between 0 and +1. For example, suppose that Pearson’s correlation coefficient is -0.48 in a problem, so there is a moderate negative correlation between the variables. The coefficient of determination in the same problem is $r^2 = (-0.48)^2 = 0.2304 \approx 0.23$. The practical meaning of this result can be written as follows: “the independent variable in this problem explains 23% of the variability of the dependent variable”. If we consider the total variability of the dependent variable as 100%, then 77% of its variability is still unaccounted for. There must be some other effects on the variability of the dependent variable.
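The arithmetic of this example can be checked with a few lines of Python. The variable names are illustrative only.

```python
# Pearson's correlation coefficient from the example in the text
r = -0.48

# Coefficient of determination: the square of r
r_squared = r ** 2

# Share of the variability of y explained by x, as a whole percentage
explained_pct = round(100 * r_squared)
unexplained_pct = 100 - explained_pct

print(f"r^2 = {r_squared:.4f}")
print(f"explained: {explained_pct}%, unaccounted for: {unexplained_pct}%")
```

Note that squaring removes the sign, so $r = -0.48$ and $r = +0.48$ lead to the same coefficient of determination.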

**Question 8**

Explain the significance test for Pearson’s correlation coefficient.

**Answer**

To decide whether two continuous variables are correlated in the population, we use sample information. We can test the null hypothesis “the population correlation is equal to 0”, or “there is no correlation between the two variables”, by using the sample correlation coefficient. In this case, we use the t distribution with $n - 2$ degrees of freedom and a specified significance level ($\alpha$). The distribution table was given in Chapter 3 in Table 3.3 and can be used here as well. To use this hypothesis test, we assume that the underlying distributions of the two variables are Normal. The hypotheses in the significance test for Pearson’s correlation coefficient are written as follows:

$H_0: \rho = 0$ (There is no correlation. The population correlation is zero.)

$H_1: \rho \neq 0$ (There is a correlation. The population correlation is not zero.)

Then the observed test statistic is calculated from the sample information by:

$$t = r \sqrt{\frac{n - 2}{1 - r^2}}$$
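As a rough sketch, the test statistic can be computed as shown below. The function name `correlation_t_statistic` and the sample size $n = 20$ are assumptions made for illustration; the value $r = -0.48$ is borrowed from the coefficient of determination example. The critical value 2.101 is the two-sided 5% point of the t distribution with 18 degrees of freedom from a standard t table.

```python
import math

def correlation_t_statistic(r, n):
    """Observed t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    if n <= 2:
        raise ValueError("at least 3 observations are needed")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and +1")
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r = -0.48          # sample correlation from the earlier example
n = 20             # hypothetical sample size
t = correlation_t_statistic(r, n)

critical_value = 2.101  # t table: alpha = 0.05 (two-sided), df = 18
reject_h0 = abs(t) > critical_value

print(f"t = {t:.4f}, df = {n - 2}, reject H0: {reject_h0}")
```

Under these assumptions the absolute value of the observed statistic exceeds the critical value, so the null hypothesis of zero population correlation would be rejected at the 5% level.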

**Question 9**

Why is regression analysis used?

**Answer**

With Pearson’s correlation coefficient, it is possible to find the degree of a linear relationship between two continuous variables, and even to test the significance of the correlation coefficient for the population. But what if you want to express the relationship between two variables mathematically? Can you create a model between these two variables? Specifically, can you define a model that represents the linear relationship between an independent variable ($x$; predictor, explanatory) and a dependent variable ($y$; response, outcome)? If we can define such a model, it becomes possible to estimate the values of the dependent variable from the values of the independent variable. The process of creating a model of the linear relationship between the independent variable and the dependent variable is called regression analysis.

**Question 10**

Explain simple linear regression.

**Answer**

When there is only one independent variable and one dependent variable, the analysis is called simple linear regression analysis.

Suppose that a data set contains observed values of two variables with $n$ observations, $x = x_1, x_2, x_3, \ldots, x_n$ and $y = y_1, y_2, y_3, \ldots, y_n$. To show a possible linear relationship between the independent variable ($x$) and the dependent variable ($y$), the following simple linear regression model can be written:

$$y_i = \alpha + \beta x_i + \varepsilon_i$$

In this simple linear regression model:

$y_i$: the $i$th observation’s value of the dependent variable,

$x_i$: the $i$th observation’s value of the independent variable,

$\varepsilon_i$: the random error (its mean is zero),

$\alpha$ and $\beta$: the population parameters to be estimated from sample data.

In simple linear regression, a model is created by using the sample data. As the regression equation above shows, the value of the dependent variable is divided into components. The random error $\varepsilon$ represents the amount of variability in the dependent variable that cannot be explained by the linear relationship between the independent and dependent variables. The regression model creates a line passing through the middle of the data pairs, and the method of least squares minimizes the overall distance of the data pairs from the regression line. The parameter $\alpha$ is the intercept of the regression line and $\beta$ is its slope. The parameter $\alpha$ is usually called the constant of the model.
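A minimal sketch of fitting the intercept and slope from sample data is given below. It uses the standard closed-form least squares estimates (slope equal to the sum of cross-products divided by the sum of squared deviations of the independent variable; intercept chosen so that the line passes through the point of means). The function name and the age and weight data are hypothetical.

```python
def fit_simple_linear_regression(x, y):
    """Return (intercept, slope): least squares estimates of alpha and beta."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: age in months (x) and weight in kg (y) for five babies
age = [1, 3, 6, 9, 12]
weight = [4.2, 5.9, 7.6, 8.8, 9.7]
intercept, slope = fit_simple_linear_regression(age, weight)

# Estimated weight of an 8-month-old baby under the fitted model
predicted = intercept + slope * 8
print(f"weight = {intercept:.3f} + {slope:.3f} * age")
print(f"predicted weight at 8 months: {predicted:.2f} kg")
```

Here the slope is read as the estimated change in the dependent variable for a one-unit increase in the independent variable, and the intercept as the estimated value when the independent variable is zero.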

**Question 11**

Explain the least squares method used to estimate the parameters of a regression model.

**Answer**

To estimate the parameters of a regression model, we may use a method called least squares. The least squares method minimizes the sum of the squares of the residuals, where a residual is the difference between an observed value and the value obtained from the proposed model for that observation.
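The minimizing property can be illustrated numerically. The sketch below fits a line by least squares and then compares its sum of squared residuals with that of slightly shifted lines. The data and function names are hypothetical and the comparison is only a spot check, not a proof.

```python
def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared differences between observed y and the line's predictions."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

def least_squares_line(x, y):
    """Closed-form least squares intercept and slope for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares_line(x, y)
best = sum_squared_residuals(x, y, a, b)

# Shifting the intercept or the slope away from the least squares
# estimates increases the sum of squared residuals.
shifts = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]
for delta_a, delta_b in shifts:
    shifted = sum_squared_residuals(x, y, a + delta_a, b + delta_b)
    print(f"shift ({delta_a:+.2f}, {delta_b:+.2f}): {shifted:.4f} vs minimum {best:.4f}")
```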

**Question 12**

Explain the role of the scatterplot and Pearson’s correlation coefficient in regression problems.

**Answer**

In regression analysis problems, a scatterplot is always helpful. By looking at the scatterplot, it is possible to see whether there is a relationship between the variables. If there is one, Pearson’s correlation coefficient can be used to show the direction and the degree of the relationship. The regression line gives similar results.

**Question 13**

What are the assumptions of simple linear regression analysis?

**Answer**

The following assumptions need to be considered:

The random error $\varepsilon$ is statistically independent of the values of the independent variable.

The random error $\varepsilon$ follows a Normal distribution.

The arithmetic mean of the random error $\varepsilon$ is zero.

Any two random errors $\varepsilon_i$ and $\varepsilon_j$, $i \neq j$, are independent of each other.

The relationship between the variables is linear.

The variance of the residuals is constant at all levels of the independent variable.

**Question 14**

Explain the significance test of the regression line.

**Answer**

Once a simple linear regression model has been created, you may ask yourself the following question: “Do we really need to know the values of the independent variable in order to estimate the values of the dependent variable?” Specifically, let us focus on the slope of the regression line. In the simple linear regression model for the population, the slope is represented by $\beta$. If $\beta$ is equal to zero, this component drops out of the simple linear regression model and the population regression line is just an average, $y_i = \bar{y}$. This means that we do not need the independent variable ($x$) to predict the values of the dependent variable ($y$). If $\beta$ is not equal to zero, the values of the independent variable can be used to predict the values of the dependent variable. Therefore, we need to test whether $\beta = 0$. The alternative hypothesis in this case is two-sided and is written as $\beta \neq 0$.
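The text above does not give the test statistic, so the sketch below relies on the standard textbook form, which is an assumption with respect to these notes: the estimated slope divided by its standard error, compared with the t distribution with $n - 2$ degrees of freedom. The data and the function name are hypothetical. The critical value 3.182 is the two-sided 5% point for 3 degrees of freedom from a standard t table.

```python
import math

def slope_t_test(x, y):
    """Return (slope, standard error of slope, t statistic) for H0: beta = 0."""
    n = len(x)
    if n < 3:
        raise ValueError("at least 3 observations are needed")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares and mean squared error with n - 2 degrees of freedom
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)
    se_slope = math.sqrt(mse / sxx)
    return slope, se_slope, slope / se_slope

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, se_slope, t = slope_t_test(x, y)

critical_value = 3.182  # t table: alpha = 0.05 (two-sided), df = 3
print(f"slope = {slope:.3f}, SE = {se_slope:.4f}, t = {t:.2f}")
print("reject H0 (beta = 0):", abs(t) > critical_value)
```

For these made-up data the statistic is far beyond the critical value, so the independent variable would be judged useful for predicting the dependent variable.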

**Question 15**

Explain multiple linear regression.

**Answer**

In multiple linear regression, there are observed values of $k$ ($k \geq 2$) independent variables with $n$ observations: $x_1 = x_{11}, x_{12}, x_{13}, \ldots, x_{1n}$; $x_2 = x_{21}, x_{22}, x_{23}, \ldots, x_{2n}$; …; $x_k = x_{k1}, x_{k2}, x_{k3}, \ldots, x_{kn}$. There are also the values of one dependent variable, $y = y_1, y_2, y_3, \ldots, y_n$. To show a possible linear relationship between the $k$ ($k \geq 2$) independent variables and the dependent variable ($y$), the following multiple linear regression model can be written:

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \ldots + \beta_k x_{ki} + \varepsilon_i \quad (i = 1, \ldots, n;\ n \geq k + 1)$$

In this multiple linear regression model:

$y_i$: the $i$th observation’s value of the dependent variable,

$x_{1i}$: the $i$th observation’s value of the first independent variable,

$x_{ki}$: the $i$th observation’s value of the $k$th independent variable,

$\varepsilon_i$: the random error (its mean is zero),

$k$: the number of independent variables,

$\beta_0, \beta_1, \beta_2, \ldots, \beta_k$: the population parameters to be estimated from sample data.

Compared with the simple linear regression model, the general structure is still the same. In multiple linear regression, the constant $\alpha$ of simple linear regression is represented by $\beta_0$. The linearity of the model refers to the parameters of the model, so it is possible to include powers of any independent variable (such as squares) in the model as a new independent variable. Here again, the method of least squares is used to estimate the values of the population parameters of the multiple linear regression. However, to use the method of least squares, we need to put the data into matrix form.

**Question 16**

Express the least squares estimator *b* of the multiple linear regression coefficients $\beta$.

**Answer**

$$b = (X'X)^{-1} X'y$$
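The estimator can be sketched in plain Python as follows. Instead of forming the inverse explicitly, this version solves the equivalent normal equations $(X'X)b = X'y$ by Gauss-Jordan elimination, which gives the same *b* whenever $X'X$ is invertible. All function names and the data are hypothetical. The first column of `X` is a column of ones so that the first coefficient plays the role of $\beta_0$.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*b)] for row in a]

def solve(a, rhs):
    """Solve a z = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular; the columns of X are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def least_squares(X, y):
    """Least squares estimate b, obtained from the normal equations (X'X) b = X'y."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(v * yi for v, yi in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)

# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [9, 8, 19, 18, 29]
X = [[1.0, a, b] for a, b in zip(x1, x2)]  # leading 1.0 is the intercept column

b = least_squares(X, y)
print([round(coef, 6) for coef in b])
```

Because the made-up response contains no random error, the estimates recover the generating coefficients up to floating-point rounding; with real data the residuals would be nonzero.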
