Assessment of the statistical significance of the regression equation and its parameters

After assessing the individual statistical significance of each of the regression coefficients, their cumulative significance is usually analyzed, i.e. the significance of the equation as a whole. This analysis is based on testing the hypothesis of the simultaneous equality to zero of all regression coefficients at the explanatory variables:

H0: b1 = b2 = ... = bm = 0.

If this hypothesis is not rejected, then it is concluded that the cumulative effect of all m explanatory variables X1, X2, ..., Xm of the model on the dependent variable Y can be considered statistically insignificant, and the overall quality of the regression equation is low.

This hypothesis is tested on the basis of analysis of variance — a comparison of the explained and residual variances:

H0: (explained variance) = (residual variance);

H1: (explained variance) > (residual variance).

The following F-statistic is constructed:

F = [Σ(ŷi − ȳ)² / m] / [Σ(yi − ŷi)² / (n − m − 1)], (8.19)

where the numerator is the variance explained by the regression (the explained sum of squared deviations divided by its number of degrees of freedom m), and the denominator is the residual variance (the residual sum of squared deviations divided by its number of degrees of freedom n − m − 1). When the OLS prerequisites are met, the constructed F-statistic has a Fisher distribution with degrees of freedom n1 = m, n2 = n − m − 1. Therefore, if at the required significance level α we have F_obs > F_α;m;n−m−1 (the critical point of the Fisher distribution), then H0 is rejected in favor of H1. This means that the variance explained by the regression is significantly greater than the residual variance and, consequently, the regression equation reflects the dynamics of the dependent variable Y quite well. If F_obs < F_α;m;n−m−1 = F_cr, there is no reason to reject H0: the explained variance is comparable to the variance caused by random factors. This gives grounds to conclude that the cumulative influence of the model's explanatory variables is insignificant and, consequently, the overall quality of the model is low.

However, in practice, instead of this hypothesis one tests the closely related hypothesis about the statistical significance of the coefficient of determination R²:

H0: R² = 0 (against the alternative H1: R² > 0).

To test this hypothesis, the following F-statistic is used:

F = (R² / m) / ((1 − R²) / (n − m − 1)). (8.20)

The value of F, provided that the OLS prerequisites are met and that H0 is valid, has the same Fisher distribution as the statistic (8.19). Indeed, dividing the numerator and denominator of the fraction in (8.19) by the total sum of squared deviations and using the fact that it decomposes into the sum of squared deviations explained by the regression and the residual sum of squared deviations (a consequence, as will be shown later, of the system of normal equations)

Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²,

we obtain formula (8.20).

From (8.20) it is obvious that the quantities F and R² are equal to zero only simultaneously. If F = 0, then R² = 0, the best OLS regression line is Y = ȳ, and therefore the value of Y does not depend linearly on X1, X2, ..., Xm. To test the null hypothesis H0: R² = 0 at a given significance level α, the critical value F_cr = F_α;m;n−m−1 is found from the tables of critical points of the Fisher distribution. The null hypothesis is rejected if F > F_cr. This is equivalent to R² > 0, i.e. R² is statistically significant.

An analysis of the statistic F shows that accepting the hypothesis of the simultaneous equality to zero of all linear regression coefficients requires that the coefficient of determination R² not differ significantly from zero; the critical value of R² decreases as the number of observations increases and can become arbitrarily small.

Suppose, for example, that in a regression with two explanatory variables X1, X2 estimated on 30 observations, R² = 0.65. Then

F_obs = (0.65 / 2) / ((1 − 0.65) / 27) = 25.07.

From the tables of critical points of the Fisher distribution we find F_0.05;2;27 = 3.36 and F_0.01;2;27 = 5.49. Since F_obs = 25.07 > F_cr both at the 5% and at the 1% significance level, the null hypothesis is rejected in both cases.

If in the same situation R² = 0.4, then

F_obs = (0.4 / 2) / ((1 − 0.4) / 27) = 9.

The assumption that the relationship is insignificant is rejected here as well.
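These calculations are easy to reproduce by machine. Below is a minimal sketch (Python with scipy; the tooling choice is ours, not the source's) that computes F_obs by formula (8.20) and the critical points of the Fisher distribution for both examples:

from scipy import stats

def f_from_r2(r2, m, n):
    # F-statistic of formula (8.20): F = (R^2/m) / ((1 - R^2)/(n - m - 1))
    return (r2 / m) / ((1.0 - r2) / (n - m - 1))

m, n = 2, 30
for r2 in (0.65, 0.40):
    f_obs = f_from_r2(r2, m, n)
    f_05 = stats.f.ppf(0.95, m, n - m - 1)   # critical point F(0.05; 2; 27)
    f_01 = stats.f.ppf(0.99, m, n - m - 1)   # critical point F(0.01; 2; 27)
    print(round(f_obs, 2), round(f_05, 2), round(f_01, 2))

# F_obs comes out to 25.07 for R^2 = 0.65 and 9.0 for R^2 = 0.4,
# both above the critical values, as in the text.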

Note that in the case of paired regression, testing the null hypothesis with the F-statistic is equivalent to testing the null hypothesis with the t-statistic of the correlation coefficient; in this case the F-statistic equals the square of the t-statistic. The coefficient R² acquires independent significance in the case of multiple linear regression.

8.6. Analysis of variance to decompose the total sum of squared deviations. Degrees of freedom for the corresponding sums of squared deviations

Let's apply the above theory to paired linear regression.

After the linear regression equation is found, the significance of both the equation as a whole and its individual parameters is assessed.

The assessment of the significance of the regression equation as a whole is given using the Fisher F-test. In this case, a null hypothesis is put forward that the regression coefficient is equal to zero, i.e. b = 0, and hence the factor x has no effect on the result y.

The direct calculation of the F-criterion is preceded by an analysis of variance. The central place in it is occupied by the decomposition of the total sum of squared deviations of the variable y from the mean value ȳ into two parts — the "explained" and the "unexplained":

Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)², (8.21)

where the first term on the right is the sum of squared deviations explained by the regression (the factor sum of squares) and the second is the residual sum of squares.

Equation (8.21) is a consequence of the system of normal equations derived in one of the previous topics.

Proof of expression (8.21). Writing yi − ȳ = (ŷi − ȳ) + (yi − ŷi) and squaring, we get

Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)² + 2·Σ(ŷi − ȳ)(yi − ŷi).

It remains to prove that the last term is equal to zero.

If we add up all the equations from 1 to n

yi = a + b·xi + ei, (8.22)

then we get Σyi = a·n + b·Σxi + Σei. Since Σei = 0, we obtain

ȳ = a + b·x̄. (8.23)

If we subtract equation (8.23) from expression (8.22), we get

yi − ȳ = b·(xi − x̄) + ei, and for the fitted values ŷi − ȳ = b·(xi − x̄).

As a result, we get

Σ(ŷi − ȳ)(yi − ŷi) = Σ b·(xi − x̄)·ei = b·(Σxi·ei − x̄·Σei).

The last sums are equal to zero due to the system of two normal equations.

The total sum of squared deviations of the individual values of the resultant feature y from the average value ȳ is caused by the influence of many causes. We conditionally divide the entire set of causes into two groups: the studied factor x and all other factors. If the factor x has no effect on the result, then the regression line is parallel to the OX axis and ŷ = ȳ. Then the entire dispersion of the resultant feature is due to the influence of other factors, and the total sum of squared deviations coincides with the residual sum. If other factors do not affect the result, then y is functionally related to x and the residual sum of squares is zero. In this case, the sum of squared deviations explained by the regression coincides with the total sum of squares.

Since not all points of the correlation field lie on the regression line, their scatter is always present, both due to the influence of the factor x (the regression of y on x) and due to the action of other causes (unexplained variation). The suitability of the regression line for prediction depends on what part of the total variation of the feature y falls on the explained variation. Obviously, if the sum of squared deviations due to the regression is much greater than the residual sum of squares, then the regression equation is statistically significant and the factor x has a significant impact on the feature y. This is equivalent to the coefficient of determination approaching unity.

Any sum of squares is associated with a number of degrees of freedom (df, degrees of freedom) — the number of independently varying deviations of the feature. The number of degrees of freedom is related to the number of units of the population n and to the number of constants determined from it. In relation to the problem under study, the number of degrees of freedom shows how many independent deviations out of n possible are required to form a given sum of squares. Thus, for the total sum of squares, (n − 1) independent deviations are required, because in a set of n units, after calculating the average, only (n − 1) deviations vary freely. For example, take the series of y values: 1, 2, 3, 4, 5. Their average is 3, and the n deviations from the average are: −2, −1, 0, 1, 2. Since Σ(yi − ȳ) = 0, only four deviations vary freely; the fifth can be determined once the other four are known.

When calculating the explained, or factor, sum of squares, the theoretical (calculated) values of the resultant feature ŷi = a + b·xi are used. Then the sum of squared deviations due to the linear regression equals

Σ(ŷi − ȳ)² = b²·Σ(xi − x̄)².

Since, for a given set of observations of x and y, the factor sum of squares in linear regression depends only on the one regression constant b, this sum of squares has only one degree of freedom.

The numbers of degrees of freedom of the total, factor and residual sums of squared deviations are connected by an equality. The number of degrees of freedom of the residual sum of squares in linear regression is n − 2. The number of degrees of freedom of the total sum of squares is determined by the number of units with varying feature values, and since we use the average calculated from the sample data, we lose one degree of freedom, i.e. df_total = n − 1.

So we have two equalities:

Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²; (n − 1) = 1 + (n − 2).

Dividing each sum of squares by its corresponding number of degrees of freedom, we obtain the mean square of deviations or, equivalently, the variance per one degree of freedom D:

D_total = Σ(yi − ȳ)² / (n − 1);

D_fact = Σ(ŷi − ȳ)² / 1;

D_rest = Σ(yi − ŷi)² / (n − 2).

Determining the variance per one degree of freedom brings the variances to a comparable form. Comparing the factor and residual variances per one degree of freedom, we obtain the value of Fisher's F-criterion:

F = D_fact / D_rest,

the F-criterion for testing the null hypothesis H0: D_fact = D_rest.

If the null hypothesis is true, the factor and residual variances do not differ from each other. For H0 to be refuted, the factor variance must exceed the residual variance several times over. The English statistician Snedecor developed tables of critical values of the F-ratio for various significance levels of the null hypothesis and various numbers of degrees of freedom. The tabular value of the F-criterion is the maximum value of the ratio of the variances that can occur in the case of their random divergence at the given probability level of the null hypothesis. The calculated value of the F-ratio is recognized as reliable if it is greater than the tabular one. If F_fact > F_table, the null hypothesis H0: D_fact = D_rest about the absence of a relationship between the features is rejected and a conclusion is drawn about the significance of this relationship.

If F_fact < F_table, then the probability of the null hypothesis H0: D_fact = D_rest is higher than the specified level (for example, 0.05), and it cannot be rejected without a serious risk of drawing a wrong conclusion about the presence of a relationship. In this case, the regression equation is considered statistically insignificant; H0 is not rejected.

In the example from Chapter 3 (n = 7):

Σy² − n·ȳ² = 131200 − 7·120² = 30400 — the total sum of squares;

1057.878·(135.43 − 7·3.92571²) = 28979.8 — the factor sum of squares;

30400 − 28979.8 = 1420.197 — the residual sum of squares;

D_fact = 28979.8;

D_rest = 1420.197 / (n − 2) = 284.0394;

F_fact = 28979.8 / 284.0394 = 102.0274;

F_0.05;1;5 = 6.61; F_0.01;1;5 = 16.26.

Since F_fact > F_table both at the 1% and at the 5% significance levels, we conclude that the regression equation is significant (the relationship is proven).
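The same ANOVA arithmetic can be checked with a few lines of code (Python with scipy assumed; the sums of squares are those quoted above):

from scipy import stats

n = 7
ss_total = 30400.0
ss_factor = 28979.8
ss_resid = ss_total - ss_factor       # about 1420.2 - residual sum of squares

d_fact = ss_factor / 1                # factor variance, 1 degree of freedom
d_rest = ss_resid / (n - 2)           # residual variance, n - 2 = 5 degrees of freedom
f_fact = d_fact / d_rest              # about 102.0

print(f_fact)
print(stats.f.ppf(0.95, 1, n - 2))    # about 6.61
print(stats.f.ppf(0.99, 1, n - 2))    # about 16.26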

The value of the F-criterion is related to the coefficient of determination. The factor sum of squared deviations can be represented as

Σ(ŷi − ȳ)² = R²·Σ(yi − ȳ)²,

and the residual sum of squares as

Σ(yi − ŷi)² = (1 − R²)·Σ(yi − ȳ)².

Then the value of the F-criterion can be expressed as

F = R² / (1 − R²) · (n − 2).

An assessment of the significance of a regression is usually presented in the form of an analysis of variance table; the actual F value is compared with the tabular value at the chosen significance level α and the numbers of degrees of freedom 1 and (n − 2).

Sources of variation | Degrees of freedom | Sum of squared deviations | Variance per degree of freedom | F-ratio, actual | F-ratio, tabular at α = 0.05
Total     | n − 1 = 6 | 30400    | —        | —        | —
Explained | 1         | 28979.8  | 28979.8  | 102.0274 | 6.61
Residual  | n − 2 = 5 | 1420.197 | 284.0394 | —        | —

Final tests in econometrics

1. The assessment of the significance of the parameters of the regression equation is carried out on the basis of:

A) Student's t-criterion;

b) Fisher-Snedecor F-criterion;

c) mean square error;

d) average approximation error.

2. The regression coefficient in an equation characterizing the relationship between sales volume (million rubles) and the annual profit of automotive industry enterprises (million rubles) means that when sales volume increases by 1 million rubles, profit increases by:

a) 0.5 million rub.;

b) 500 thousand rub.;

C) 1.5 million rub.

3. The correlation ratio (correlation index) measures the degree of closeness of the relationship between X and Y:

a) only with a non-linear form of dependence;

B) with any form of dependence;

c) only with a linear form of dependence.

4. By the direction of the relationship, one distinguishes:

a) moderate;

B) direct;

c) rectilinear.

5. Based on 17 observations, a regression equation was built. To check the significance of the equation, the observed value of the t-statistic was calculated: 3.9. Conclusion:

A) The equation is significant at α = 0.05;

b) The equation is insignificant at a = 0.01;

c) The equation is not significant at a = 0.05.

6. What are the consequences of violating the OLS assumption “the expectation of regression residuals is zero”?

A) Biased estimates of regression coefficients;

b) Efficient but inconsistent estimates of regression coefficients;

c) Inefficient estimates of regression coefficients;

d) Inconsistent estimates of regression coefficients.

7. Which of the following statements is true in case of heteroskedasticity of residuals?

A) Conclusions on t and F-statistics are unreliable;

d) Estimates of the parameters of the regression equation are biased.

8. What is Spearman's rank correlation test based on?

A) On the use of t - statistics;

c) On the use of χ²-statistics;

9. What is the White test based on?

b) On the use of F-statistics;

C) On the use of χ²-statistics;

d) On the graphical analysis of the residuals.

10. What method can be used to eliminate autocorrelation?

11. What is the violation of the assumption of the constancy of the variance of residuals called?

a) Multicollinearity;

b) Autocorrelation;

B) Heteroskedasticity;

d) Homoscedasticity.

12. Dummy variables are introduced into:

a) only in linear models;

b) only in multiple non-linear regression;

c) only in nonlinear models;

D) both linear and non-linear models reduced to a linear form.

13. If the matrix of paired correlation coefficients contains coefficients close to unity between explanatory variables, this shows:

A) About the presence of multicollinearity;

b) About the absence of multicollinearity;

c) About the presence of autocorrelation;

d) About the absence of heteroscedasticity.

14. Which measure cannot help to get rid of multicollinearity?

a) Increasing the sample size;

D) Transformation of the random component.

15. If … and the rank of matrix A is less than (K − 1), then the equation is:

a) over-identified;

B) not identified;

c) exactly identified.

16. The regression equation has the form:

A) … ;

b) … ;

c) … .

17. What is the problem of model identification?

A) obtaining uniquely defined parameters of the model given by the system of simultaneous equations;

b) selection and implementation of methods for statistical estimation of unknown parameters of the model according to the initial statistical data;

c) checking the adequacy of the model.

18. What method is used to estimate the parameters of an over-identified equation?

C) two-step least squares (TSLS), indirect least squares (ILS);

19. If a qualitative variable has k alternative values, then the simulation uses:

A) (k − 1) dummy variables;

b) k dummy variables;

c) (k + 1) dummy variables.

20. Analysis of the closeness and direction of the relationship between two features is carried out on the basis of:

A) pair correlation coefficient;

b) coefficient of determination;

c) multiple correlation coefficient.

21. In the linear equation y = a0 + a1·x, the regression coefficient shows:

a) the closeness of the connection;

b) the proportion of the variance of "Y" that depends on "X";

C) how much "Y" will change on average when "X" changes by one unit;

d) correlation coefficient error.

22. What indicator is used to determine the part of the variation due to a change in the value of the factor under study?

a) coefficient of variation;

b) correlation coefficient;

C) coefficient of determination;

d) coefficient of elasticity.

23. The coefficient of elasticity shows:

A) by what% will the value of y change when x changes by 1%;

b) by how many units of its measurement the value of y will change when x changes by 1%;

c) by what percent the value of y will change when x changes by one unit of its measurement.

24. What methods can be applied to detect heteroscedasticity?

A) Goldfeld-Quandt test;

B) Spearman's rank correlation test;

c) Durbin-Watson test.

25. What is the Goldfeld-Quandt test based on?

a) On the use of t-statistics;

B) On the use of F - statistics;

c) On the use of χ²-statistics;

d) On the graphical analysis of the residuals.

26. What methods cannot be used to eliminate the autocorrelation of residuals?

a) Generalized least squares;

B) Weighted least squares method;

C) the maximum likelihood method;

D) Two-step method of least squares.

27. What is the violation of the assumption of independence of residuals called?

a) Multicollinearity;

B) Autocorrelation;

c) Heteroskedasticity;

d) Homoscedasticity.

28. What method can be used to eliminate heteroscedasticity?

A) Generalized method of least squares;

b) Weighted least squares method;

c) The maximum likelihood method;

d) Two-step least squares method.

30. If by the t-criterion most of the regression coefficients are statistically significant, while by the F-criterion the model as a whole is insignificant, this may indicate:

a) Multicollinearity;

B) On the autocorrelation of residuals;

c) On heteroscedasticity of residues;

d) This option is not possible.

31. Is it possible to get rid of multicollinearity by transforming variables?

a) This measure is effective only when the sample size is increased;

32. What method can be used to find estimates of the parameters of the linear regression equation:

A) the least squares method;

b) correlation and regression analysis;

c) analysis of variance.

33. A multiple linear regression equation with dummy variables has been constructed. To check the significance of individual coefficients, the following distribution is used:

a) Normal;

b) Student;

c) Pearson;

d) Fisher-Snedecor.

34. If … and the rank of matrix A is greater than (K − 1), then the equation is:

A) over-identified;

b) not identified;

c) exactly identified.

35. To estimate the parameters of an exactly identified system of equations, the following is used:

a) TSLS, ILS;

b) TSLS, OLS, ILS;

36. Chow's criterion is based on the application of:

A) F - statistics;

b) t - statistics;

c) the Durbin-Watson criterion.

37. Dummy variables can take on the following values:

d) any values.

39. Based on 20 observations, a regression equation was built. To check the significance of the equation, the value of the statistic was calculated: 4.2. Conclusions:

a) The equation is significant at a=0.05;

b) The equation is not significant at a=0.05;

c) The equation is not significant at a=0.01.

40. Which of the following statements is not true if the residuals are heteroscedastic?

a) Conclusions on t and F statistics are unreliable;

b) Heteroskedasticity manifests itself through a low value of the Durbin-Watson statistic;

c) With heteroscedasticity, estimates remain effective;

d) Estimates are biased.

41. The Chow test is based on a comparison:

A) dispersions;

b) coefficients of determination;

c) mathematical expectations;

d) medium.

42. If in the Chow test … , then it is considered:

A) that partitioning into subintervals is useful from the point of view of improving the quality of the model;

b) the model is statistically insignificant;

c) the model is statistically significant;

d) that it makes no sense to split the sample into parts.

43. Dummy variables are variables:

a) qualitative;

b) random;

B) quantitative;

d) logical.

44. Which of the following methods cannot be used to detect autocorrelation?

a) Series method;

b) Durbin-Watson test;

c) Spearman's rank correlation test;

D) White's test.

45. The simplest structural form of the model is:

A) … ;

b) … ;

c) … ;

d) … .

46. What measures can be taken to get rid of multicollinearity?

a) Increasing the sample size;

b) Exclusion of variables highly correlated with the rest;

c) Change of model specification;

d) Transformation of the random component.

47. If … and the rank of matrix A is equal to (K − 1), then the equation is:

a) over-identified;

b) not identified;

B) exactly identified;

48. A model is considered identified if:

a) among the equations of the model there is at least one normal one;

B) each equation of the system is identifiable;

c) among the model equations there is at least one unidentified one;

d) among the equations of the model there is at least one overidentified.

49. What method is used to estimate the parameters of an unidentified equation?

a) TSLS, ILS;

b) TSLS, OLS;

C) the parameters of such an equation cannot be estimated.

50. At the junction of what areas of knowledge did econometrics arise:

A) economic theory; economic and mathematical statistics;

b) economic theory, mathematical statistics and probability theory;

c) economic and mathematical statistics, probability theory.

51. In the multiple linear regression equation, confidence intervals for the regression coefficients are built using the distribution:

a) Normal;

B) Student;

c) Pearson;

d) Fisher-Snedecor.

52. Based on 16 observations, a paired linear regression equation was constructed. To check the significance of the regression coefficient, the observed value of the t-statistic was computed: 2.5.

a) The coefficient is insignificant at a=0.05;

b) The coefficient is significant at a=0.05;

c) The coefficient is significant at a=0.01.

53. It is known that there is a positive relationship between the quantities X and Y. Within what limits does the pairwise correlation coefficient lie?

a) from -1 to 0;

b) from 0 to 1;

C) from -1 to 1.

54. The multiple correlation coefficient is 0.9. What percentage of the variance of the resultant attribute is explained by the influence of all the factor traits?

55. Which of the following methods cannot be used to detect heteroscedasticity?

A) Goldfeld-Quandt test;

b) Spearman's rank correlation test;

c) series method.

56. The reduced form of the model is:

a) a system of nonlinear functions expressing the exogenous variables in terms of the endogenous ones;

B) a system of linear functions expressing the endogenous variables in terms of the exogenous ones;

c) a system of linear functions expressing the exogenous variables in terms of the endogenous ones;

d) a system of normal equations.

57. Within what limits does the partial correlation coefficient calculated by recursive formulas change?

a) from −∞ to +∞;

b) from 0 to 1;

c) from 0 to +∞;

D) from −1 to +1.

58. Within what limits does the partial correlation coefficient calculated through the coefficient of determination change?

a) from −∞ to +∞;

B) from 0 to 1;

c) from 0 to +∞;

d) from −1 to +1.

59. Exogenous variables:

a) dependent variables;

B) independent variables;

61. When another explanatory factor is added to the regression equation, the multiple correlation coefficient:

a) will decrease

b) will increase;

c) retain its value.

62. A hyperbolic regression equation was built: Y = a + b/X. To test the significance of the equation, the following distribution is used:

a) Normal;

B) Student;

c) Pearson;

d) Fisher-Snedecor.

63. For what types of systems can the parameters of individual econometric equations be found using the traditional least squares method?

a) a system of normal equations;

B) a system of independent equations;

C) a system of recursive equations;

D) a system of interdependent equations.

64. Endogenous variables:

A) dependent variables;

b) independent variables;

c) dated from previous points in time.

65. Within what limits does the coefficient of determination change?

a) from 0 to +∞;

b) from −∞ to +∞;

C) from 0 to +1;

d) from −1 to +1.

66. A multiple linear regression equation has been built. To check the significance of individual coefficients, the following distribution is used:

a) Normal;

b) Student;

c) Pearson;

D) Fisher-Snedecor.

67. When another explanatory factor is added to the regression equation, the coefficient of determination:

a) will decrease

B) will increase;

c) retain its value;

d) will not decrease.

68. The essence of the least squares method is that:

A) the estimate is determined from the condition of minimizing the sum of squared deviations of the sample data from the determined estimate;

b) the estimate is determined from the condition of minimizing the sum of deviations of sample data from the determined estimate;

c) the estimate is determined from the condition of minimizing the sum of squared deviations of the sample mean from the sample variance.

69. To what class of non-linear regressions does the parabola belong:

73. To what class of non-linear regressions does the exponential curve belong:

74. To what class of non-linear regressions does a function of the form ŷ = … belong:

A) regressions that are non-linear with respect to the variables included in the analysis, but linear with respect to the estimated parameters;

b) non-linear regressions on the estimated parameters.

78. To what class of non-linear regressions does a function of the form ŷ = … belong:

a) regressions that are non-linear with respect to the variables included in the analysis, but linear with respect to the estimated parameters;

B) non-linear regressions on the estimated parameters.

79. In the hyperbolic regression equation ŷ = a + b/x, if b > 0, then:

A) as the factor trait x increases, the value of the resultant trait y decreases slowly, and as x → ∞ the average value of y approaches a;

b) the value of the resultant trait y increases with slowing growth as the factor trait x increases, and as x → ∞ …

81. The coefficient of elasticity is determined by the formula … for a regression model in the form of:

A) a linear function;

b) a parabola;

c) a hyperbola;

d) an exponential curve;

e) a power function.

82. The coefficient of elasticity is determined by the formula … for a regression model in the form of:

a) a linear function;

B) a parabola;

c) a hyperbola;

d) an exponential curve;

e) a power function.

86. The equation … is called:

A) a linear trend

b) parabolic trend;

c) hyperbolic trend;

d) exponential trend.

89. The equation … is called:

a) a linear trend;

b) parabolic trend;

c) hyperbolic trend;

D) an exponential trend.

90. A system of equations of the form … is called:

A) a system of independent equations;

b) a system of recursive equations;

c) a system of interdependent (simultaneous, simultaneous) equations.

93. Econometrics can be defined as:

A) an independent scientific discipline that combines a set of theoretical results, techniques, methods and models designed to give concrete quantitative expression to the general (qualitative) patterns established by economic theory, on the basis of economic statistics and mathematical-statistical tools;

B) the science of economic measurements;

C) statistical analysis of economic data.

94. The tasks of econometrics include:

A) forecast of economic and socio-economic indicators characterizing the state and development of the analyzed system;

B) simulation of possible scenarios for the socio-economic development of the system to identify how the planned changes in certain manageable parameters will affect the output characteristics;

c) testing of hypotheses according to statistical data.

95. By their nature, relationships are divided into:

A) functional and correlation;

b) functional, curvilinear and rectilinear;

c) correlation and inverse;

d) statistical and direct.

96. With a direct relationship, as the factor trait increases:

a) the resultant trait decreases;

b) the resultant trait does not change;

C) the resultant trait increases.

97. What methods are used in statistics to identify the presence, nature and direction of a relationship?

a) average values;

B) comparison of parallel rows;

C) analytical grouping method;

d) relative values;

D) graphical method.

98. What method is used to identify the form of the influence of some factors on others?

a) correlation analysis;

B) regression analysis;

c) index analysis;

d) analysis of variance.

99. What method is used to quantify the strength of the influence of some factors on others:

A) correlation analysis;

b) regression analysis;

c) the method of averages;

d) analysis of variance.

100. Which of the following indicators takes values in the range from minus one to plus one:

a) coefficient of determination;

b) correlation ratio;

C) linear correlation coefficient.

101. The regression coefficient for a one-factor model shows:

A) how many units the function changes when the argument changes by one unit;

b) how many percent the function changes per unit change in the argument.

102. The coefficient of elasticity shows:

a) by how many percent does the function change with a change in the argument by one unit of its measurement;

B) by how many percent does the function change with a change in the argument by 1%;

c) by how many units of its measurement the function changes with a change in the argument by 1%.

105. A correlation index value equal to 0.087 indicates:

A) a weak dependence;

b) a strong dependence;

c) errors in the calculations.

107. A pair correlation coefficient value equal to 1.12 indicates:

a) a weak dependence;

b) a strong dependence;

C) errors in the calculations.

109. Which of the given numbers can be the values ​​of the pair correlation coefficient:

111. Which of the given numbers can be the values ​​of the multiple correlation coefficient:

115. Mark the correct form of the linear regression equation:

a) … ;

b) … ;

c) … ;

D) … .

Estimating the significance of the multiple regression equation

The construction of an empirical regression equation is the initial stage of econometric analysis. The first regression equation built on a sample is very rarely satisfactory in all its characteristics. Therefore, the next most important task of econometric analysis is checking the quality of the regression equation. In econometrics, a well-established scheme for such verification is adopted.

So, the verification of the statistical quality of the estimated regression equation is carried out in the following areas:

Checking the significance of the regression equation;

Checking the statistical significance of the coefficients of the regression equation;

Checking the properties of the data that were assumed to hold when estimating the equation (verification of the OLS prerequisites).

Checking the significance of the multiple regression equation, as with paired regression, is carried out using the Fisher criterion. In this case (unlike paired regression), the null hypothesis H0 is put forward that all regression coefficients are simultaneously equal to zero (b1 = 0, b2 = 0, …, bm = 0). The Fisher criterion is determined by the following formula:

F = D_fact / D_rest = R² / (1 − R²) · (n − m − 1) / m,

where D_fact is the factor variance explained by the regression, per one degree of freedom; D_rest is the residual variance per one degree of freedom; R² is the coefficient of multiple determination; m is the number of parameters at the variables x in the regression equation (in paired linear regression m = 1); n is the number of observations.

The obtained value of the F-criterion is compared with the tabular value at a certain significance level. If the actual value is greater than the tabular one, the hypothesis H0 about the insignificance of the regression equation is rejected and the alternative hypothesis about its statistical significance is accepted.

Using the Fisher criterion, one can evaluate the significance of not only the regression equation as a whole, but also the significance of the additional inclusion of each factor in the model. Such an assessment is necessary in order not to load the model with factors that do not significantly affect the result. In addition, since the model consists of several factors, they can be introduced into it in a different sequence, and since there is a correlation between the factors, the significance of including the same factor in the model may differ depending on the sequence of factors introduced into it.

To assess the significance of including an additional factor in the model, Fisher's partial criterion F_xi is calculated. It is based on comparing the increase in the factor variance due to the inclusion of the additional factor in the model with the residual variance per one degree of freedom for the regression as a whole. Therefore, the calculation formula of the partial F-criterion for the factor xi looks like this:

F_xi = (R²_yx1x2…xi…xp − R²_yx1x2…xi−1xi+1…xp) / (1 − R²_yx1x2…xi…xp) · (n − m − 1),

where R²_yx1x2…xi…xp is the coefficient of multiple determination for the model with the full set of p factors; R²_yx1x2…xi−1xi+1…xp is the coefficient of multiple determination for the model that does not include the factor xi; n is the number of observations; m is the number of parameters at the factors x in the regression equation.

The actual value of Fisher's partial criterion is compared with the tabular one at a significance level of 0.05 or 0.1 and the corresponding numbers of degrees of freedom. If the actual value F_xi exceeds F_table, then the additional inclusion of the factor xi in the model is statistically justified, and the "pure" regression coefficient bi at the factor xi is statistically significant. If F_xi is less than F_table, then the additional inclusion of the factor in the model does not significantly increase the share of explained variation of the result y, and therefore its inclusion in the model makes no sense; the regression coefficient for this factor is in that case statistically insignificant.

Fisher's partial test can test for the significance of all regression coefficients, assuming that each corresponding factor x i is entered last into the multiple regression equation, and all other factors have already been included in the model before.
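As an illustration of the partial F-criterion, here is a sketch in Python (scipy assumed; the two R² values and the sample size are invented for the example, not taken from the text):

from scipy import stats

def partial_f(r2_full, r2_reduced, n, m):
    # Increase in explained variation from adding the factor, compared with
    # the residual variance per one degree of freedom of the full model.
    return (r2_full - r2_reduced) / ((1.0 - r2_full) / (n - m - 1))

n, m = 30, 3                          # hypothetical observations and factors
f_xi = partial_f(0.76, 0.70, n, m)    # factor x_i raises R^2 from 0.70 to 0.76
f_table = stats.f.ppf(0.95, 1, n - m - 1)
print(f_xi, f_table)                  # include x_i in the model if f_xi > f_table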

The significance of the "pure" regression coefficients b_i can be assessed using Student's t-criterion without calculating the partial F-criteria. In that case, as in paired regression, the following formula is applied for each factor:

t_bi = b_i / m_bi,

where b_i is the "pure" regression coefficient at the factor x_i, and m_bi is the standard error of the regression coefficient b_i.

You can check the significance of the parameters of the regression equation using t-statistics.
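In practice both checks are produced by any regression package. A compact sketch (Python with the numpy and statsmodels packages — an assumed tooling choice, not prescribed by the text; the data are synthetic):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 2))                    # two explanatory factors
y = 3.0 + 1.5 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(scale=0.5, size=n)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.fvalue, model.f_pvalue)            # significance of the equation as a whole
print(model.tvalues, model.pvalues)            # t-statistics of individual coefficients
print(model.rsquared)                          # coefficient of determination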

Exercise:
For a group of enterprises producing the same type of product, cost functions are considered:
y = α + β·x;
y = α·x^β;
y = α·β^x;
y = α + β/x,
where y is the cost of production, thousand c.u.;
x is the output, thousand units.

Required:
1. Build paired regression equations y from x:

  • linear;
  • power;
  • exponential;
  • equilateral hyperbola.
2. Calculate the linear pair correlation coefficient and the coefficient of determination. Draw conclusions.
3. Assess the statistical significance of the regression equation as a whole.
4. Assess the statistical significance of the regression and correlation parameters.
5. Perform a forecast of production costs with a forecast output of 195% of the average level.
6. Assess the accuracy of the forecast, calculate the forecast error and confidence interval.
7. Evaluate the model through the average approximation error.

Solution:

1. The equation has the form y = α + βx
1. Parameters of the regression equation.
Averages

Dispersion

standard deviation

Correlation coefficient

The relationship between the trait Y and the factor X is strong and direct.
Regression Equation

Coefficient of determination
R² = 0.94² = 0.89, i.e. 88.98% of the variation in y is explained by the variation in x. In other words, the accuracy of the fit of the regression equation is high.

x    y    x²    y²    x·y    y(x)    (y−ȳ)²    (y−y(x))²    (x−x̄)²
78 133 6084 17689 10374 142.16 115.98 83.83 1
82 148 6724 21904 12136 148.61 17.9 0.37 9
87 134 7569 17956 11658 156.68 95.44 514.26 64
79 154 6241 23716 12166 143.77 104.67 104.67 0
89 162 7921 26244 14418 159.9 332.36 4.39 100
106 195 11236 38025 20670 187.33 2624.59 58.76 729
67 139 4489 19321 9313 124.41 22.75 212.95 144
88 158 7744 24964 13904 158.29 202.51 0.08 81
73 152 5329 23104 11096 134.09 67.75 320.84 36
87 162 7569 26244 14094 156.68 332.36 28.33 64
76 159 5776 25281 12084 138.93 231.98 402.86 9
115 173 13225 29929 19895 201.86 854.44 832.66 1296
1027 1869 89907 294377 161808 1869 25672.31 2829.74 8774

Note: the y(x) values are found from the resulting regression equation, which the fitted column implies is approximately ŷ ≈ 1.613·x + 16.36:
y(78) ≈ 1.613·78 + 16.36 ≈ 142.16
y(82) ≈ 1.613·82 + 16.36 ≈ 148.6
... ... ...

2. Estimating the parameters of the regression equation
Significance of the correlation coefficient

Using the table of Student's t-distribution, we find t_table:
t_table(n − m − 1; α/2) = (11; 0.05/2) = 1.796
Since t_obs > t_table, we reject the hypothesis that the correlation coefficient is equal to 0. In other words, the correlation coefficient is statistically significant.

Analysis of the accuracy of determining estimates of regression coefficients





Sa = 0.1712
Confidence intervals for the dependent variable

Let us calculate the boundaries of the interval in which 95% of the possible values ​​of Y will be concentrated for an unlimited number of observations and X = 1
(-20.41;56.24)
Hypothesis testing regarding the coefficients of the linear regression equation
1) t-statistic


The statistical significance of the regression coefficient a is confirmed

The statistical significance of the regression coefficient b is not confirmed
Confidence interval for coefficients of the regression equation
Let us determine the confidence intervals of the regression coefficients, which, with 95% reliability, will be as follows:
(a − t·S_a; a + t·S_a)
(1.306; 1.921)
(b − t·S_b; b + t·S_b)
(−9.2733; 41.876)
where t = 1.796
2) F-statistics


F_cr = 4.84
Since F > F_cr, the coefficient of determination is statistically significant.
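For reference, the basic computations of steps 1-4 can be scripted; a sketch in Python with numpy/scipy (an assumed tooling choice) using the x and y columns of the table above — scipy.stats.linregress returns the slope, intercept, correlation coefficient and standard errors in one call:

import numpy as np
from scipy import stats

x = np.array([78, 82, 87, 79, 89, 106, 67, 88, 73, 87, 76, 115], dtype=float)
y = np.array([133, 148, 134, 154, 162, 195, 139, 158, 152, 162, 159, 173], dtype=float)
n = len(x)

res = stats.linregress(x, y)            # linear model y = intercept + slope * x
r2 = res.rvalue ** 2                    # coefficient of determination

f_obs = r2 / (1.0 - r2) * (n - 2)       # F-test of the equation as a whole
f_cr = stats.f.ppf(0.95, 1, n - 2)      # tabular value at alpha = 0.05
t_b = res.slope / res.stderr            # t-statistic of the regression coefficient

print(res.slope, res.intercept, res.rvalue, r2)
print(f_obs, f_cr, t_b)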

TOPIC 4. STATISTICAL METHODS FOR STUDYING RELATIONSHIPS

The regression equation is an analytical representation of a correlation dependence. It describes a hypothetical functional relationship between the conditional average value of the resultant feature and the value of the factor feature (or features), i.e., the underlying tendency of the dependence.

A paired correlation dependence is described by a paired regression equation; a multiple correlation dependence, by a multiple regression equation.

In the regression equation, the feature-result is the dependent variable (response, explained variable), and the feature-factor is the independent variable (argument, explanatory variable).

The simplest type of regression equation is the equation of a paired linear relationship:

y = a + b·x + ε,

where y is the dependent variable (feature-result); x is the independent variable (feature-factor); a and b are the parameters of the regression equation; ε is the estimation error.

Various mathematical functions can be used as the regression equation. In practice, the equations of a linear dependence, a parabola, a hyperbola, a power function, etc. are applied most often.

As a rule, the analysis begins with a linear relationship, since its results are easy to interpret meaningfully. The choice of the type of the relationship equation is a rather important step in the analysis. In the "pre-computer" era this procedure involved certain difficulties and required the analyst to know the properties of mathematical functions. At present, specialized programs make it possible to quickly construct a set of relationship equations and to choose the best model on the basis of formal criteria (however, the mathematical literacy of the analyst has not lost its relevance).

A hypothesis about the type of correlation dependence can be put forward based on the results of constructing the correlation field (see lecture 6). Based on the nature of the location of the points on the graph (the coordinates of the points correspond to the values ​​of the dependent and independent variables), the trend of the relationship between the signs (indicators) is revealed. If the regression line passes through all points of the correlation field, then this indicates a functional relationship. In the practice of socio-economic research, such a picture cannot be observed, since there is a statistical (correlation) dependence. Under the conditions of correlation dependence, when drawing a regression line on a scatterplot, a deviation of the points of the correlation field from the regression line is observed, which demonstrates the so-called residuals or estimation errors (see Figure 7.1).

The presence of an error in the equation is due to the fact that:

§ not all factors influencing the result are taken into account in the regression equation;

§ the form of the relationship — the type of regression equation — may be chosen incorrectly;

§ not all factors can be included in the equation.

To construct a regression equation means to calculate the values of its parameters. The regression equation is built on the basis of the actual values of the analyzed features. The parameters are usually calculated by the method of least squares (OLS).

The essence of OLS is that it yields those values of the equation's parameters at which the sum of squared deviations of the theoretical values of the feature-result (calculated from the regression equation) from its actual values is minimized:

S = Σ(y_i − ŷ_i)² → min,

where y_i is the actual value of the feature-result for the i-th unit of the population, and ŷ_i is the value of the feature-result for the i-th unit of the population obtained from the regression equation.

Thus, the problem reduces to an extremum problem: it is necessary to find the values of the parameters at which the function S reaches its minimum.

Differentiating and equating the partial derivatives to zero, we obtain a system of normal equations, whose solution gives the parameter estimates:

b = (mean(x·y) − x̄·ȳ) / σx², (7.3)

a = ȳ − b·x̄, (7.4)

where mean(x·y) is the average of the products of the factor and result values; x̄ is the average value of the feature-factor; ȳ is the average value of the feature-result; σx² is the variance of the feature-factor.

The parameter b in the regression equation characterizes the slope of the regression line on the graph. This parameter is called the regression coefficient, and its value shows by how many units of its measurement the feature-result changes when the feature-factor changes by one unit of its measurement. The sign of the regression coefficient reflects the direction of the dependence (direct or inverse) and coincides with the sign of the correlation coefficient (in the case of a paired dependence).

In the example under consideration, the STATISTICA program was used to calculate the parameters of the regression equation describing the relationship between the level of average per capita money income of the population and the value of the gross regional product per capita in the regions of Russia; see Table 7.1.

Table 7.1 - Calculation and evaluation of the parameters of the equation describing the relationship between the level of average per capita monetary income of the population and the value of the gross regional product per capita in the regions of Russia, 2013

Column "B" of the table contains the values ​​of the parameters of the pair regression equation, therefore, you can write: = 13406.89 + 22.82 x. This equation describes the trend of the relationship between the analyzed characteristics. The parameter is the regression coefficient. In this case, it is equal to 22.82 and characterizes the following: with an increase in GRP per capita by 1 thousand rubles, average per capita cash incomes increase on average (as indicated by the "+" sign) by 22.28 rubles.

The parameter a of the regression equation is, as a rule, not meaningfully interpreted in socio-economic studies. Formally, it reflects the value of the feature-result when the feature-factor equals zero. The parameter characterizes the location of the regression line on the graph; see Figure 7.1.

Figure 7.1 - Correlation field and regression line, reflecting the dependence of the level of average per capita monetary income of the population in the regions of Russia and the value of GRP per capita

The value of a corresponds to the point of intersection of the regression line with the Y axis, at X = 0.

The construction of the regression equation is accompanied by an assessment of the statistical significance of the equation as a whole and of its parameters. The need for such procedures is associated with the limited amount of data, which may prevent the law of large numbers from operating and, therefore, the identification of the true tendency in the relationship of the analyzed indicators. In addition, any studied population can be considered a sample from a general population, and the characteristics obtained in the analysis as estimates of the general parameters.

Assessing the statistical significance of the parameters and of the equation as a whole substantiates whether the constructed relationship model can be used for making managerial decisions and for forecasting (modeling).

The statistical significance of the regression equation as a whole is estimated using Fisher's F-test, which is the ratio of the factor and residual variances calculated per one degree of freedom:

F = [Σ(ŷ_i − ȳ)² / k] / [Σ(y_i − ŷ_i)² / (n − k − 1)], (7.5)

where Σ(ŷ_i − ȳ)²/k is the factor variance of the feature-result; k is the number of degrees of freedom of the factor variance (the number of factors in the regression equation); ȳ is the mean value of the dependent variable; ŷ_i is the theoretical (obtained from the regression equation) value of the dependent variable for the i-th unit of the population; Σ(y_i − ŷ_i)²/(n − k − 1) is the residual variance of the feature-result; n is the size of the population; n − k − 1 is the number of degrees of freedom of the residual variance.

The value of Fisher's F-test calculated by this formula characterizes the ratio between the factor and residual variances of the dependent variable, demonstrating, in essence, how many times the explained part of the variation exceeds the unexplained part.

Fisher's F-test is tabulated; the inputs to the table are the numbers of degrees of freedom of the factor and residual variances. Comparison of the calculated value of the criterion with the tabular (critical) one answers the question: is the part of the variation of the feature-result that can be explained by the factors included in this equation statistically significant? If F_calc > F_table, the regression equation is recognized as statistically significant, and accordingly the coefficient of determination is statistically significant as well. Otherwise (F_calc < F_table) the equation is statistically insignificant, i.e., the variation of the factors taken into account in the equation does not explain a statistically significant part of the variation of the feature-result, or the form of the relationship equation is not correctly chosen.

The statistical significance of the parameters of the equation is assessed on the basis of t-statistics, calculated as the ratio of the absolute value of each parameter of the regression equation to its standard error (m_a, m_b):

t_a = |a| / m_a, where m_a = m_b · sqrt(mean(x²)); (7.6)

t_b = |b| / m_b, where m_b = (σy / σx) · sqrt((1 − R²) / (n − 2)), (7.7)

where σx and σy are the standard deviations of the feature-factor and the feature-result, and R² is the coefficient of determination.

In specialized statistical programs, the calculation of parameters is always accompanied by the calculation of their standard (root-mean-square) errors and t-statistics (see Table 7.1). The calculated value of the t-statistic is compared with the tabular one: if the size of the studied population is less than 30 units (definitely a small sample), one should use the Student's t-distribution table; if the population is large, one should use the normal distribution table (the Laplace probability integral). An equation parameter is considered statistically significant if t_calc > t_table.
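A numeric sketch of formulas (7.6)-(7.7) in Python/numpy (the data are illustrative; the formulas above are the standard ones for paired regression):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.8, 4.1, 5.9, 8.2, 9.7, 12.3])
n = len(x)

b = ((x * y).mean() - x.mean() * y.mean()) / x.var()
a = y.mean() - b * x.mean()
r2 = np.corrcoef(x, y)[0, 1] ** 2

m_b = (y.std() / x.std()) * np.sqrt((1 - r2) / (n - 2))   # formula (7.7)
m_a = m_b * np.sqrt((x ** 2).mean())                      # formula (7.6)
print(abs(b) / m_b, abs(a) / m_a)                         # t-statistics of b and a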

Estimation of parameters on the basis of t-statistics is, in essence, a test of the null hypothesis that the corresponding general parameters are equal to zero (H0: a = 0; H0: b = 0), i.e., that the parameters of the regression equation are statistically insignificant. The significance level of the hypothesis is usually taken as α = 0.05. If the calculated significance level is less than 0.05, the null hypothesis is rejected and the alternative is accepted: the parameter is statistically significant.

Let's continue with the example. In Table 7.1, column "B" shows the parameter values, column Std.Err.ofB the standard errors of the parameters (m_a, m_b), and column t(77) (77 is the number of degrees of freedom) the t-statistics calculated with that number of degrees of freedom. To assess the statistical significance of the parameters, the calculated t-statistics must be compared with the tabular value. The chosen significance level (0.05) corresponds to t = 1.96 in the normal distribution table. Since 18.02 > 1.96 and 10.84 > 1.96, the obtained parameter values should be recognized as statistically significant: they are formed under the influence of non-random factors and reflect the tendency of the relationship between the analyzed indicators.

To assess the statistical significance of the equation as a whole, we turn to the value of Fisher's F-test (see Table 7.1). The calculated value of the F-test is 117.51; the tabular value, based on the corresponding numbers of degrees of freedom (d.f. = 1 for the factor variance, d.f. = 77 for the residual variance), is 4.00 (see the Appendix). Thus F_calc > F_table, so the regression equation as a whole is statistically significant. In this situation we can also speak of the statistical significance of the coefficient of determination: about 60 percent of the variation in average per capita money income across the regions of Russia can be explained by the variation in gross regional product per capita.

By assessing the statistical significance of the regression equation and its parameters, we can get a different combination of results.

· The equation is statistically significant by the F-test, and all parameters of the equation are also statistically significant by their t-statistics. Such an equation can be used both for making managerial decisions (which factors should be influenced to obtain the desired result) and for predicting the behavior of the feature-result for particular values of the factors.

· The equation is statistically significant by the F-test, but some parameters (a parameter) of the equation are insignificant. The equation can be used for making managerial decisions (concerning those factors whose influence is confirmed as statistically significant), but it cannot be used for forecasting.

· The equation is not statistically significant by the F-test. The equation cannot be used. The search for significant feature-factors or for an analytical form of the relationship between the argument and the response should be continued.

If the statistical significance of the equation and its parameters is confirmed, a so-called point forecast can be made, i.e., an estimate of the value of the feature-result (y) is obtained for certain values of the factor (x).

It is quite obvious that the predicted value of the dependent variable calculated from the relationship equation will not coincide with its actual value. Graphically, this is confirmed by the fact that not all points of the correlation field lie on the regression line; only with a functional relationship would the regression line pass through all points of the scatter diagram. The presence of discrepancies between the actual and theoretical values of the dependent variable is primarily due to the very essence of a correlation dependence: many factors affect the result at once, and only some of them can be taken into account in a particular relationship equation. In addition, the form of the relationship between the result and the factor (the type of regression equation) may be chosen incorrectly. In this regard, the question arises of how informative the constructed relationship equation is. This question is answered by two indicators: the coefficient of determination (already discussed above) and the standard error of estimation.

The differences between the actual and theoretical values of the dependent variable are called deviations, errors, or residuals. The residual variance is calculated from these values. The square root of the residual variance is the root-mean-square (standard) error of estimation:

S = sqrt( Σ(y_i − ŷ_i)² / (n − k − 1) ). (7.8)

The standard error of the equation is measured in the same units as the predicted indicator. If the equation errors follow a normal distribution (with large amounts of data), then 95 percent of the values should lie within a distance of no more than 2S from the regression line (by the property of the normal distribution — the three-sigma rule). The value of the standard error of estimation is used in calculating confidence intervals when predicting the value of the feature-result for a particular unit of the population.
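In code, formula (7.8) is essentially one line (numpy; the y and ŷ arrays here are illustrative, with k = 1 factor):

import numpy as np

y = np.array([1.8, 4.1, 5.9, 8.2, 9.7, 12.3])       # actual values
y_hat = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])  # theoretical (fitted) values
k = 1                                                # one factor in the equation

S = np.sqrt(((y - y_hat) ** 2).sum() / (len(y) - k - 1))
print(S)                                             # standard error of estimation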

In practical research, it often becomes necessary to predict the average value of the feature-result for a particular value of the feature-factor. In this case, the calculation of the confidence interval for the mean value of the dependent variable uses the average forecast error:

m_ŷ = S · sqrt(1/n + (x_p − x̄)² / Σ(x_i − x̄)²). (7.9)

The use of different error values is explained by the fact that the variability of indicator levels for specific units of the population is much higher than the variability of the mean value, so the forecast error of the mean value is smaller.

The confidence interval of the forecast of the mean value of the dependent variable is:

ŷ_p − t·m_ŷ ≤ ȳ_p ≤ ŷ_p + t·m_ŷ, (7.10)

where Δ = t·m_ŷ is the marginal estimation error (see sampling theory), and t is the confidence coefficient, whose value is found in the corresponding table on the basis of the probability level adopted by the researcher and the number of degrees of freedom (see sampling theory).

The confidence interval for the predicted value of the feature-result can also be calculated with a correction for the shift of the regression line. The correction factor is determined as:

K = sqrt(1 + 1/n + (x_p − x̄)² / Σ(x_i − x̄)²), (7.11)

where x_p is the value of the feature-factor from which the value of the feature-result is predicted.

It follows that the more x_p differs from the average value of the feature-factor, the greater the correction factor and, hence, the greater the forecast error. Taking this coefficient into account, the confidence interval of the forecast is calculated as ŷ_p ± t·S·K.

Various causes can affect the accuracy of a forecast based on a regression equation. First of all, it should be borne in mind that the assessment of the quality of the equation and of its parameters rests on the assumption that the random residuals are normally distributed. Violation of this assumption may be due to sharply different values in the data, to non-uniform variation, or to the presence of a non-linear relationship. In such cases the quality of the forecast decreases. Second, the values of the factors used in prediction should not go beyond the range of variation of the data on which the equation was built.
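A closing sketch (Python with numpy/scipy, illustrative data) of an individual-value forecast interval using the correction factor (7.11):

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.8, 4.1, 5.9, 8.2, 9.7, 12.3])
n = len(x)

b = ((x * y).mean() - x.mean() * y.mean()) / x.var()
a = y.mean() - b * x.mean()
S = np.sqrt(((y - (a + b * x)) ** 2).sum() / (n - 2))   # formula (7.8) with k = 1

x_p = 4.5                                               # forecast value of the factor
K = np.sqrt(1 + 1 / n + (x_p - x.mean()) ** 2 / ((x - x.mean()) ** 2).sum())
t = stats.t.ppf(0.975, n - 2)                           # 95% two-sided coefficient
y_p = a + b * x_p
print(y_p - t * S * K, y_p + t * S * K)                 # forecast confidence interval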
