Analysis of variance. One-way analysis of variance

All men by nature desire to know. (Aristotle, Metaphysics)

Analysis of variance

Introductory overview

In this section, we will review the basic methods, assumptions, and terminology of ANOVA.

Note that in the English-language literature analysis of variance is commonly abbreviated as ANOVA (ANalysis Of VAriance); for brevity, we will use the term ANOVA for univariate analysis of variance and MANOVA for multivariate analysis of variance. In this section we will consider in turn the main ideas of analysis of variance (ANOVA), analysis of covariance (ANCOVA), multivariate analysis of variance (MANOVA), and multivariate analysis of covariance (MANCOVA). After a brief discussion of the merits of contrast analysis and post hoc tests, we will look at the assumptions on which ANOVA methods are based. Towards the end of this section, the advantages of the multivariate approach to repeated measures analysis over the traditional univariate approach are explained.

Key Ideas

The purpose of analysis of variance. The main purpose of analysis of variance is to test the significance of differences between means. The chapter Elementary concepts of statistics (Chapter 8) provides a brief introduction to testing statistical significance. If you are only comparing the means of two samples, analysis of variance gives the same result as the ordinary t-test: the t-test for independent samples (when two independent groups of objects or observations are compared), or the t-test for dependent samples (when two variables are compared on the same set of objects or observations). If you are not familiar with these tests, we recommend the introductory overview of Chapter 9 (Basic statistics and tables).

Where did the name "analysis of variance" come from? It may seem strange that a procedure for comparing means is called analysis of variance. The reason is that when we test the statistical significance of a difference between means, we are actually analyzing variances.

Splitting the sum of squares

For a sample of size n, the sample variance is calculated as the sum of squared deviations from the sample mean divided by n-1 (sample size minus one). Thus, for a fixed sample size n, the variance is a function of the sum of squares (of deviations), denoted, for brevity, SS (Sum of Squares). Analysis of variance is based on partitioning (splitting) this sum of squares into parts. Consider the following data set, consisting of two groups of three observations each:

Group 1    Group 2
2          6
3          7
1          5

The means of the two groups differ noticeably (2 and 6, respectively). The sum of squared deviations within each group equals 2; adding them, we get 4. If we now repeat the calculation ignoring group membership, that is, if we compute SS from the combined mean of the two samples, we get 28. In other words, the variance (sum of squares) based on within-group variability gives a much smaller value than the one based on total variability (about the overall mean). The reason is clearly the substantial difference between the group means, and it is this difference that accounts for the difference between the sums of squares. Indeed, if we analyze these data with the Analysis of variance module, the following results are obtained:

Effect    SS = 24    df = 1    MS = 24    F = 24    p < .01
Error     SS = 4     df = 4    MS = 1

As the table shows, the total sum of squares SS = 28 is partitioned into the sum of squares due to within-group variability (2 + 2 = 4; see the second row of the table) and the sum of squares due to the difference between the means (28 - (2 + 2) = 24; see the first row of the table).
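The following minimal sketch (Python with NumPy) reproduces this partition by hand for the two groups above; the variable names are illustrative.

```python
import numpy as np

group1 = np.array([2.0, 3.0, 1.0])   # mean 2
group2 = np.array([6.0, 7.0, 5.0])   # mean 6
combined = np.concatenate([group1, group2])

# Within-group (error) sum of squares: deviations from each group's own mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in (group1, group2))

# Total sum of squares: deviations from the overall (combined) mean
ss_total = ((combined - combined.mean()) ** 2).sum()

# Between-group (effect) sum of squares is the remainder of the partition
ss_between = ss_total - ss_within

print(ss_within, ss_between, ss_total)   # 4.0 24.0 28.0
```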

SS error and SS effect. Within-group variability (SS error) is usually called error variance, meaning that it ordinarily cannot be predicted or explained when the experiment is carried out. On the other hand, SS effect (between-group variability) can be explained by the differences between the means of the groups under study. In other words, membership in a particular group explains the between-group variability, because we know that the groups have different means.

Significance testing. The main ideas of testing for statistical significance are discussed in the chapter Elementary concepts of statistics (Chapter 8). That chapter also explains why many tests use the ratio of explained to unexplained variance; analysis of variance is an example of exactly this approach. Significance testing in ANOVA is based on comparing the variance due to between-group variability (called the mean square effect, or MSeffect) with the variance due to within-group variability (called the mean square error, or MSerror). If the null hypothesis is true (the means of the two populations are equal), we can expect only a relatively small difference between the sample means, due to random variability. Therefore, under the null hypothesis, the within-group variance will practically coincide with the total variance computed without regard to group membership. The resulting variances are compared with the F-test, which checks whether their ratio is significantly greater than 1. In the example above, the F-test shows that the difference between the means is statistically significant.
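A sketch of the same significance test in Python, using the groups above and the SS values derived earlier (scipy.stats.f_oneway implements the one-way ANOVA F-test):

```python
from scipy import stats

group1 = [2, 3, 1]
group2 = [6, 7, 5]

# One-way ANOVA: F = MS_effect / MS_error
f_value, p_value = stats.f_oneway(group1, group2)
print(f"F = {f_value:.1f}, p = {p_value:.4f}")   # F = 24.0, p < 0.01

# The same F ratio computed from the sums of squares derived earlier
ms_effect = 24.0 / 1   # SS_effect / df_effect (2 groups -> df = 1)
ms_error = 4.0 / 4     # SS_error / df_error (6 observations - 2 groups -> df = 4)
print(ms_effect / ms_error)                      # 24.0
```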

Basic logic of ANOVA. Summing up, the purpose of analysis of variance is to test the statistical significance of differences between means (of groups or variables). This is done by partitioning the total variance (variation) into parts, one of which is due to random error (within-group variability) and another to differences between the means. The latter component of variance is then used to assess the statistical significance of the difference between the means. If this difference is significant, the null hypothesis is rejected and the alternative hypothesis, that a difference between the means exists, is accepted.

Dependent and independent variables. Variables whose values are determined by measurement during an experiment (for example, a test score) are called dependent variables. Variables that can be manipulated in the experiment (for example, teaching methods, or other criteria that divide the observations into groups) are called factors or independent variables. These concepts are described in more detail in the chapter Elementary concepts of statistics (Chapter 8).

Multi-factor analysis of variance

In the simple example above, you could immediately compute the t-test for independent samples using the corresponding option of the Basic statistics and tables module; its results, of course, coincide with the results of the analysis of variance. However, analysis of variance offers flexible and powerful techniques that can be used in much more complex studies.

Multiple factors. The world is inherently complex and multidimensional. Situations in which some phenomenon is completely described by a single variable are extremely rare. For example, if we are trying to learn how to grow large tomatoes, we should consider factors related to the genetic makeup of the plants, soil type, light, temperature, and so on. Thus, in a typical experiment you have to deal with a large number of factors. The main reason why ANOVA is preferable to repeated comparison of two samples at different factor levels with the t-test is that analysis of variance is more efficient and, for small samples, more informative.

Controlling for factors. Suppose that in the two-sample example discussed above we add one more factor, Gender. Let each group consist of 3 men and 3 women. The design of this experiment can be presented as a 2 by 2 table:

         Experimental Group 1    Experimental Group 2
Men      2                       6
         3                       7
         1                       5
Mean     2                       6
Women    4                       8
         5                       9
         3                       7
Mean     4                       8

Before doing the calculations, you can see that in this example the total variance has at least three sources:

(1) random error (within group variance),

(2) variability associated with membership in the experimental group, and

(3) variability due to the gender of the observed objects.

(Note that there is one more possible source of variability, the interaction of factors, which we will discuss later.) What happens if we do not include gender as a factor in the analysis and compute the ordinary t-test? If we compute the sums of squares ignoring gender (i.e., pooling subjects of both sexes into one group when computing the within-group variance, which gives a sum of squares of SS = 10 for each group and a total sum of squares of SS = 10 + 10 = 20), we obtain a larger within-group variance than in the more precise analysis with an additional split into subgroups by gender (where the within-subgroup sums of squares each equal 2, and the total within-group sum of squares is SS = 2 + 2 + 2 + 2 = 8). The difference arises because the mean for men is lower than the mean for women, and this difference between the means inflates the total within-group variability when gender is ignored. Controlling the error variance increases the sensitivity (power) of the test.
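A short sketch (Python/NumPy, using the cell values from the 2 x 2 table above) that contrasts the error sum of squares with and without the Gender factor:

```python
import numpy as np

# Cells of the 2 x 2 design: (gender, experimental group) -> observations
cells = {
    ("men",   "group1"): np.array([2.0, 3.0, 1.0]),
    ("men",   "group2"): np.array([6.0, 7.0, 5.0]),
    ("women", "group1"): np.array([4.0, 5.0, 3.0]),
    ("women", "group2"): np.array([8.0, 9.0, 7.0]),
}

def ss_about_own_mean(x):
    return ((x - x.mean()) ** 2).sum()

# Error SS when Gender is ignored: pool the sexes within each experimental group
ss_ignoring_gender = sum(
    ss_about_own_mean(np.concatenate([cells[("men", g)], cells[("women", g)]]))
    for g in ("group1", "group2")
)

# Error SS when Gender is included: deviations from each cell's own mean
ss_with_gender = sum(ss_about_own_mean(x) for x in cells.values())

print(ss_ignoring_gender)  # 20.0  (10 + 10)
print(ss_with_gender)      # 8.0   (2 + 2 + 2 + 2)
```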

This example shows another advantage of analysis of variance over the ordinary two-sample t-test. Analysis of variance allows each factor to be studied while controlling for the values of the other factors; this is, in fact, the main reason for its greater statistical power (smaller samples are needed to obtain meaningful results). For this reason, analysis of variance, even on small samples, gives statistically more significant results than the simple t-test.

Interaction effects

There is another advantage of ANOVA over the ordinary t-test: analysis of variance can detect interactions between factors and therefore allows more complex models to be studied. To illustrate, consider another example.

Main effects and pairwise (two-factor) interactions. Suppose there are two groups of students: psychologically, the students of the first group are motivated to complete the tasks they are given and are more purposeful than the students of the second group, which consists of lazier students. Let us split each group randomly in half and give one half of each group a difficult task and the other half an easy one. We then measure how hard the students work on these tasks. The means for this (fictitious) study are shown in the table:

What conclusion can be drawn from these results? Can we conclude that (1) students work harder on a difficult task, and (2) motivated students work harder than lazy ones? Neither of these statements captures the systematic pattern of the means in the table. A more accurate description of the results is that only motivated students work harder on the difficult task, while only lazy students work harder on the easy task. In other words, the students' character and the difficulty of the task interact in their effect on the amount of effort expended. This is an example of a pairwise interaction between the students' character and the difficulty of the task. Note that statements 1 and 2 describe main effects.

Higher-order interactions. While pairwise interactions are relatively easy to explain, higher-order interactions are much harder to interpret. Imagine that in the example above one more factor, Gender, is introduced, and we obtain the following table of means:

What conclusions can now be drawn from these results? Plots of means make it easy to interpret complex effects. The Analysis of variance module allows these plots to be built with practically one click.

The plots below depict the three-way interaction under study.

Looking at the plots, we can say that for women there is an interaction between character and task difficulty: motivated women work harder on the difficult task than on the easy one, whereas for men the same interaction is reversed. Clearly, the description of interactions between factors becomes more and more involved.

A general way of describing interactions. In the general case, an interaction between factors is described as a change of one effect under the influence of another. In the example above, the two-factor interaction can be described as a change in the main effect of the task-difficulty factor under the influence of the factor describing the student's character. For the three-factor interaction of the previous paragraph, we can say that the interaction of two factors (task difficulty and student character) changes under the influence of Gender. If an interaction of four factors is studied, we can say that the interaction of three factors changes under the influence of the fourth factor, i.e., different types of interaction exist at different levels of the fourth factor. In many fields, interactions of five or even more factors are not unusual.

Complex designs

Between-group and within-group designs (repeated measures designs)

When two distinct groups are compared, the t-test for independent samples (from the Basic statistics and tables module) is usually used. When two variables are compared on the same set of objects (observations), the t-test for dependent samples is used. For analysis of variance it likewise matters whether the samples are dependent or not. If the same variables are measured repeatedly (under different conditions or at different times) on the same objects, we speak of a repeated measures factor (also called a within-group factor, since the within-group sum of squares is computed to evaluate its significance). If different groups of objects are compared (for example, men and women, or three strains of bacteria), the difference between the groups is described by a between-group factor. The computation of the significance tests differs for the two types of factor, but their general logic and interpretation are the same.

Mixed between- and within-group designs. In many cases the experiment requires including both a between-group factor and a repeated measures factor in the design. For example, the math skills of female and male students are measured (gender being the between-group factor) at the beginning and at the end of the semester. The two measurements of each student's skills form the within-group (repeated measures) factor. The interpretation of main effects and interactions is the same for between-group and repeated measures factors, and the two types of factor can obviously interact with each other (for example, women gain skills during the semester while men lose them).

Incomplete (nested) designs

In many cases the interaction effect can be neglected. This happens either when it is known that there is no interaction effect in the population, or when a full factorial design cannot be realized. For example, suppose the effect of four fuel additives on fuel consumption is being studied, with four cars and four drivers selected. A full factorial experiment would require every combination of additive, driver, and car to occur at least once. This requires at least 4 x 4 x 4 = 64 test groups, which is too time-consuming. Moreover, there is hardly any interaction between driver and fuel additive. With this in mind, one can use a Latin square design, which contains only 16 test groups (the four additives are denoted by the letters A, B, C, and D):

Latin squares are described in most books on experimental design (e.g., Hays, 1988; Lindman, 1974; Milliken and Johnson, 1984; Winer, 1962) and will not be discussed in detail here. Note that Latin squares are incomplete designs: they do not include all combinations of factor levels. For example, driver 1 drives car 1 only with additive A, and driver 3 drives car 1 only with additive C. The levels of the additive factor (A, B, C, and D) are nested in the cells of the car x driver table, like eggs in a nest. This mnemonic is useful for understanding the nature of nested designs. The Analysis of variance module provides simple ways to analyze designs of this type.
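The Latin square table itself is not reproduced above. A cyclic 4 x 4 square consistent with the two combinations just mentioned (driver 1 / car 1 receives additive A, driver 3 / car 1 receives additive C) can be generated with the sketch below; the actual square used in a study may of course be any other valid arrangement.

```python
# Generate a cyclic 4 x 4 Latin square: each additive appears exactly once
# in every row (driver) and every column (car).
additives = ["A", "B", "C", "D"]
n = len(additives)

square = [[additives[(driver + car) % n] for car in range(n)] for driver in range(n)]

print("          " + "   ".join(f"Car {c + 1}" for c in range(n)))
for driver, row in enumerate(square, start=1):
    print(f"Driver {driver}     " + "       ".join(row))
# Driver 1: A B C D; Driver 2: B C D A; Driver 3: C D A B; Driver 4: D A B C
```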

Analysis of covariance

Main idea

The chapter Key Ideas briefly discussed the idea of controlling for factors and how including additional factors can reduce the error sum of squares and increase the statistical power of the design. All of this extends to variables with a continuous set of values. When such continuous variables are included as factors in the design, they are called covariates.

Fixed covariates

Suppose we are comparing the mathematical skills of two groups of students taught from two different textbooks. Suppose also that intelligence quotient (IQ) data are available for each student. We may assume that IQ is related to math skills and use this information. For each of the two groups of students, the correlation coefficient between IQ and math skills can be computed. Using this correlation, we can separate the share of variance within the groups explained by IQ from the unexplained share of variance (see also Elementary concepts of statistics (Chapter 8) and Basic statistics and tables (Chapter 9)). The remaining share of the variance is used in the analysis as error variance. If there is a correlation between IQ and math skills, the error variance SS/(n-1) can be substantially reduced.
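A hedged sketch of this kind of analysis of covariance in Python, using the statsmodels formula interface; the column names `skill`, `textbook`, and `iq` and the file name are hypothetical placeholders, not data from the text.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical data set: one row per student with columns
# 'skill' (math test score), 'textbook' (group label) and 'iq' (covariate).
df = pd.read_csv("students.csv")

# ANOVA without the covariate: all within-group variability goes into the error term
anova_fit = smf.ols("skill ~ C(textbook)", data=df).fit()
print(sm.stats.anova_lm(anova_fit, typ=2))

# ANCOVA: the IQ covariate absorbs part of the within-group variance,
# which reduces MS_error and (if IQ is unrelated to the groups) increases F
ancova_fit = smf.ols("skill ~ C(textbook) + iq", data=df).fit()
print(sm.stats.anova_lm(ancova_fit, typ=2))
```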

Effect of covariates on the F-test. The F-test evaluates the statistical significance of the difference between group means by computing the ratio of the between-group variance (MSeffect) to the error variance (MSerror). If MSerror decreases, for example when IQ is taken into account, the value of F increases.

Multiple covariates. The reasoning above for a single covariate (IQ) extends easily to several covariates. For example, besides IQ one can include measures of motivation, spatial reasoning, and so on. Instead of the ordinary correlation coefficient, the multiple correlation coefficient is then used.

When the F-value decreases. Sometimes introducing covariates into the experimental design reduces the F-value. This usually indicates that the covariates are correlated not only with the dependent variable (such as math skills) but also with the factors (such as the different textbooks). Suppose IQ is measured at the end of the semester, after the two groups of students have spent nearly a year studying from the two different textbooks. Although the students were assigned to groups randomly, it may turn out that the difference between the textbooks is so large that both IQ and math skills differ greatly across the groups. In that case the covariate reduces not only the error variance but also the between-group variance. In other words, after controlling for the between-group difference in IQ, the difference in math skills is no longer significant. Put differently, by "removing" the influence of IQ we inadvertently remove the influence of the textbook on the development of math skills.

Adjusted means. When a covariate affects the between-group factor, one should compute adjusted means, i.e., the group means obtained after removing the influence of the covariates.

Interactions between covariates and factors. Just as interactions between factors can be explored, so can interactions between covariates and between-group factors. Suppose one of the textbooks is especially suitable for bright students, while the other textbook is boring for bright students and at the same time difficult for less bright students. The result is a positive correlation between IQ and learning outcome in the first group (brighter students, better results) and a zero or slightly negative correlation in the second group (the brighter the student, the less likely they are to acquire math skills from the second textbook). Some texts discuss this situation as an example of a violation of the assumptions of analysis of covariance. However, since the Analysis of variance module uses the most general methods of analysis of covariance, it is possible, in particular, to assess the statistical significance of the interaction between factors and covariates.

Variable covariates

While fixed covariates are discussed quite often in textbooks, variable (time-varying) covariates are mentioned much less frequently. Usually, in experiments with repeated measurements, we are interested in differences between measurements of the same quantities at different points in time, namely, in the significance of these differences. If the covariate is measured at the same times as the dependent variable, the correlation between the covariate and the dependent variable can be computed.

For example, you can study interest in mathematics and math skills at the beginning and at the end of the semester. It would be interesting to check whether changes in interest in mathematics are correlated with changes in mathematical skills.

The Analysis of variance module in STATISTICA automatically assesses the statistical significance of changes in covariates in those designs where this is possible.

Multivariate Designs: Multivariate ANOVA and Covariance Analysis

Between-group designs

All the examples considered so far included only one dependent variable. When there are several dependent variables at once, only the complexity of the calculations increases; the content and basic principles do not change.

For example, suppose a study of two different textbooks is conducted, and students' performance in both physics and mathematics is examined. There are then two dependent variables, and we need to find out how the two textbooks affect them simultaneously. For this, multivariate analysis of variance (MANOVA) can be used. Instead of the univariate F-test, a multivariate F-test (Wilks' lambda test) is used, based on a comparison of the error covariance matrix and the between-group covariance matrix.
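A minimal sketch of how the Wilks' lambda statistic is formed for a one-way MANOVA with two dependent variables (Python/NumPy); the (math, physics) scores below are made-up illustrative numbers, not data from the text.

```python
import numpy as np

# Made-up (math, physics) scores for students taught from two textbooks
textbook1 = np.array([[5.0, 6.0], [6.0, 7.0], [7.0, 7.0], [6.0, 6.0]])
textbook2 = np.array([[4.0, 4.0], [5.0, 5.0], [4.0, 5.0], [5.0, 4.0]])

groups = [textbook1, textbook2]
grand_mean = np.vstack(groups).mean(axis=0)

# E: pooled within-group (error) SSCP matrix
E = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)

# H: between-group (hypothesis) SSCP matrix
H = sum(len(g) * np.outer(g.mean(axis=0) - grand_mean, g.mean(axis=0) - grand_mean)
        for g in groups)

# Wilks' lambda compares the error covariance structure with the total one
wilks_lambda = np.linalg.det(E) / np.linalg.det(E + H)
print(wilks_lambda)   # values close to 0 indicate a strong group effect
```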

If the dependent variables are correlated with each other, this correlation should be taken into account when computing the significance test. Obviously, if the same measurement is repeated twice, nothing new is gained. But if a correlated measurement is added to an existing one, some new information is obtained; the new variable, however, also contains redundant information, which is reflected in the covariance between the variables.

Interpretation of results. If the overall multivariate test is significant, we can conclude that the corresponding effect (e.g., textbook type) is significant. However, the following questions arise: does the type of textbook affect only math skills, only physics skills, or both? In fact, after a significant multivariate test is obtained for a particular main effect or interaction, the univariate F-tests are examined; in other words, the dependent variables that contribute to the significance of the multivariate test are examined separately.

Designs with repeated measures

If the mathematics and physics skills of students are measured at the beginning and at the end of the semester, these are repeated measures. Significance testing in such designs is a logical extension of the univariate case. Note that multivariate ANOVA methods are also commonly used to test the significance of univariate repeated measures factors with more than two levels. The corresponding applications are discussed later in this part.

Summation of variable values and multivariate analysis of variance

Even experienced users of univariate and multivariate ANOVA are often puzzled by getting different results when multivariate ANOVA is applied to, say, three variables and when univariate ANOVA is applied to the sum of those three variables as a single variable.

The idea behind summing variables is that each variable contains a true component, which is what is being studied, plus random measurement error. Therefore, when the values of the variables are averaged, the measurement error will be closer to 0 across measurements and the averaged values will be more reliable. In such a case, applying ANOVA to the sum of the variables is a reasonable and powerful technique. However, if the dependent variables are genuinely multivariate in nature, summing their values is inappropriate.

For example, suppose the dependent variables consist of four measures of success in society, each characterizing a quite independent side of human activity (for example, professional success, business success, family well-being, etc.). Adding these variables together is like adding apples and oranges: their sum would not be a suitable univariate measure. Such data should therefore be treated as multidimensional indicators in multivariate analysis of variance.

Contrast analysis and post hoc tests

Why are individual sets of means compared?

Usually, hypotheses about experimental data are formulated not simply in terms of main effects or interactions. An example is the following hypothesis: a certain textbook improves mathematical skills only in male students, while another textbook is roughly equally effective for both sexes but still less effective for men. One might predict that textbook effectiveness interacts with student gender. But this prediction also concerns the nature of the interaction: a large difference between the sexes is expected for students using one textbook, and practically gender-independent results for students using the other. Hypotheses of this type are usually explored with contrast analysis.

Contrast Analysis

In short, contrast analysis allows the statistical significance of selected linear combinations of complex effects to be assessed. Contrast analysis is a fundamental and indispensable element of any complex ANOVA design. The Analysis of variance module has quite varied contrast analysis facilities that allow any type of comparison of means to be selected and analyzed.
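A small sketch of how a single planned contrast among group means can be tested by hand (Python/NumPy and SciPy); the three groups and the contrast weights are illustrative assumptions, not data from the text.

```python
import numpy as np
from scipy import stats

# Three illustrative groups; the contrast compares group 1
# with the average of groups 2 and 3: c = (1, -1/2, -1/2)
groups = [np.array([2.0, 3.0, 1.0]),
          np.array([6.0, 7.0, 5.0]),
          np.array([4.0, 5.0, 3.0])]
c = np.array([1.0, -0.5, -0.5])

means = np.array([g.mean() for g in groups])
ns = np.array([len(g) for g in groups])

# Pooled error term from the one-way ANOVA
ss_error = sum(((g - g.mean()) ** 2).sum() for g in groups)
df_error = ns.sum() - len(groups)
ms_error = ss_error / df_error

# t statistic for the contrast: estimate / standard error
estimate = c @ means
se = np.sqrt(ms_error * np.sum(c ** 2 / ns))
t = estimate / se
p = 2 * stats.t.sf(abs(t), df_error)
print(f"contrast = {estimate:.2f}, t = {t:.2f}, p = {p:.4f}")
```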

A posteriori comparisons

Sometimes processing an experiment reveals an unexpected effect. Although a creative researcher can usually explain any result, this provides no basis for further analysis or prediction. This is one of the problems for which post hoc tests exist, that is, tests that do not rely on a priori hypotheses. To illustrate, consider the following experiment. Suppose 100 cards carry the numbers from 1 to 10. Putting all the cards into a hat, we draw 5 cards at random 20 times and compute the mean of each sample (the mean of the numbers written on the cards). Can we expect to find two samples whose means differ significantly? It is very likely! By choosing the two samples with the maximum and minimum means, one can obtain a difference of means that is very different from the difference between, say, the means of the first two samples. That difference could be examined, for example, with contrast analysis. Without going into details, there are several so-called a posteriori tests that are based exactly on the first scenario (taking the extreme means out of 20 samples), i.e., these tests are based on choosing the most different means when comparing all the means in the design. They are applied precisely so as not to obtain an artificial effect purely by chance, for example, so as not to declare a significant difference between means when there is none. The Analysis of variance module offers a wide choice of such tests. When unexpected results are encountered in an experiment involving several groups, a posteriori procedures are used to examine the statistical significance of those results.
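One widely used a posteriori procedure is Tukey's HSD test. A sketch using statsmodels follows; the three groups and their labels are illustrative values.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Illustrative observations and their group labels
values = np.array([2, 3, 1, 6, 7, 5, 4, 5, 3], dtype=float)
labels = np.array(["g1"] * 3 + ["g2"] * 3 + ["g3"] * 3)

# Tukey's HSD adjusts for the fact that all pairs of means are compared,
# so an extreme difference found after the fact is not declared significant too easily
result = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(result)
```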

Sum of squares type I, II, III and IV

Multiple regression and analysis of variance

There is a close relationship between multiple regression and analysis of variance: in both methods a linear model is studied. In short, almost every experimental design can be analyzed with multiple regression. Consider the following simple 2 x 2 between-group design.

DV A B AxB
3 1 1 1
4 1 1 1
4 1 -1 -1
5 1 -1 -1
6 -1 1 -1
6 -1 1 -1
3 -1 -1 1
2 -1 -1 1

Columns A and B contain codes characterizing the levels of factors A and B; column AxB contains the product of columns A and B. We can analyze these data using multiple regression, with DV defined as the dependent variable and the variables A through AxB as the independent variables. The significance tests for the regression coefficients coincide with the ANOVA significance tests for the main effects of factors A and B and for the interaction effect AxB.
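A sketch showing the coded regression for the 2 x 2 table above (Python/NumPy least squares on the DV, A, B, and AxB columns exactly as listed):

```python
import numpy as np

# Data exactly as in the table: DV and the effect codes for A, B and AxB
dv  = np.array([3, 4, 4, 5, 6, 6, 3, 2], dtype=float)
a   = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
b   = np.array([1, 1, -1, -1, 1, 1, -1, -1], dtype=float)
axb = a * b   # interaction column is the product of the A and B codes

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(dv), a, b, axb])

# Ordinary least squares: the coefficients for a, b and axb carry the
# main effects of A and B and the AxB interaction, just as in ANOVA
coef, *_ = np.linalg.lstsq(X, dv, rcond=None)
print(dict(zip(["const", "A", "B", "AxB"], np.round(coef, 3))))
```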

Unbalanced and balanced designs

If we compute the correlation matrix of all the variables, for example for the data shown above, we find that the main effects of factors A and B and the interaction effect AxB are uncorrelated. This property of the effects is called orthogonality; the effects A and B are said to be orthogonal, or independent, of each other. If all the effects in a design are mutually orthogonal, as in the example above, the design is said to be balanced.

Balanced designs have a "good property": the calculations needed to analyze them are very simple. All the calculations reduce to computing correlations between the effects and the dependent variables. Since the effects are orthogonal, partial correlations (as in full multiple regression) need not be computed. However, in real life designs are not always balanced.

Consider real data with an unequal number of observations in cells.

Factor A    Factor B
            B1          B2
A1          3           4, 5
A2          6, 6, 7     2

If we code these data as above and compute the correlation matrix of all the variables, we find that the design factors are correlated with each other. The factors are now not orthogonal, and such designs are called unbalanced. Note that in this example the correlation between the factors is entirely due to the difference in the frequencies of 1 and -1 in the columns of the data matrix. In other words, experimental designs with unequal cell sizes (more precisely, disproportionate cell sizes) are unbalanced, which means that the main effects and interactions are confounded. In this case, the full multiple regression must be computed to obtain the statistical significance of the effects. There are several strategies for doing so.

Sum of squares type I, II, III and IV

Type I and Type III sums of squares. To test the significance of each factor in a multi-factor model, one can compute the partial correlation of each factor given that all other factors are already included in the model. Alternatively, one can enter the factors into the model step by step, controlling for all the factors already entered and ignoring all the others. In general, this is the difference between Type III and Type I sums of squares (this terminology was introduced in SAS; see, for example, SAS, 1982; a detailed discussion can also be found in Searle, 1987, p. 461; Woodward, Bonett, and Brecht, 1990, p. 216; or Milliken and Johnson, 1984, p. 138).

Type II sums of squares. The next, "intermediate", model-building strategy consists of: controlling for all main effects when the significance of a single main effect is tested; controlling for all main effects and all pairwise interactions when the significance of a single pairwise interaction is tested; controlling for all main effects, all pairwise interactions, and all three-factor interactions when a single three-factor interaction is studied; and so on. The sums of squares for effects computed in this way are called Type II sums of squares. Thus, Type II sums of squares control for all effects of the same order and lower, ignoring all higher-order effects.
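As a sketch of how these strategies differ in practice, the following Python fragment (statsmodels) computes Type I, II, and III tables for the unbalanced 2 x 2 data shown earlier (cells 3; 4, 5; 6, 6, 7; 2); the column names y, A, and B are placeholders.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Unbalanced 2 x 2 data set from the earlier table
df = pd.DataFrame({
    "y": [3, 4, 5, 6, 6, 7, 2],
    "A": ["a1", "a1", "a1", "a2", "a2", "a2", "a2"],
    "B": ["b1", "b2", "b2", "b1", "b1", "b1", "b2"],
})

fit = smf.ols("y ~ C(A) * C(B)", data=df).fit()

# Type I: sequential - each effect is adjusted only for the effects entered before it
print(sm.stats.anova_lm(fit, typ=1))

# Type II: each effect is adjusted for all effects of the same or lower order
print(sm.stats.anova_lm(fit, typ=2))

# Type III: each effect is adjusted for all other effects, including interactions
# (sum-to-zero coding is used so that the Type III tests are meaningful)
fit_sum = smf.ols("y ~ C(A, Sum) * C(B, Sum)", data=df).fit()
print(sm.stats.anova_lm(fit_sum, typ=3))
```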

Type IV sums of squares. Finally, for some special designs with missing cells (incomplete designs), so-called Type IV sums of squares can be computed. This method is discussed later in connection with incomplete designs (designs with missing cells).

Interpretation of the Type I, II, and III sums-of-squares hypotheses

Type III sums of squares are the easiest to interpret. Recall that Type III sums of squares test effects after controlling for all other effects. For example, if a statistically significant Type III effect of factor A is found in the Analysis of variance module, we can say that there is a significant unique effect of factor A after all other effects (factors) have been included, and interpret it accordingly. Probably in 99% of all applications of analysis of variance this is the test the researcher is interested in. This type of sum of squares is computed in the Analysis of variance module by default, regardless of whether the Regression approach option is selected (the standard approaches adopted in the Analysis of variance module are discussed below).

Significant effects obtained with Type I or Type II sums of squares are not as easy to interpret; they are best interpreted in the context of stepwise multiple regression. If, using Type I sums of squares, the main effect of factor B is found to be significant (after factor A has been included in the model, but before the interaction of A and B is added), it can be concluded that there is a significant main effect of factor B, provided there is no interaction between factors A and B. (If factor B is also significant under the Type III criterion, we can conclude that there is a significant main effect of factor B after all other factors and their interactions have been entered into the model.)

In terms of marginal means, the Type I and Type II hypotheses usually do not have a simple interpretation. In those cases, the significance of the effects cannot be interpreted by looking at the marginal means alone; rather, the p-values obtained refer to a complex hypothesis that combines means and sample sizes. For example, the Type II hypotheses for factor A in the simple 2 x 2 design example discussed earlier would be (see Woodward, Bonett, and Brecht, 1990, p. 219):

nij — the number of observations in a cell

uij — the mean value in a cell

n.j — the marginal mean

Without going into details (for more details see Milliken and Johnson, 1984, chapter 10), it is clear that these are not simple hypotheses, and in most cases none of them is of particular interest to the researcher. However, there are cases where the Type I hypotheses may be of interest.

The default computational approach of the Analysis of variance module

By default, if the Regression approach option is not checked, the Analysis of variance module uses the cell means model. A characteristic of this model is that the sums of squares for the various effects are computed for linear combinations of cell means. In a full factorial experiment this gives sums of squares identical to the Type III sums of squares discussed earlier. However, with the Planned comparisons option (in the Analysis of variance results window), the user can state hypotheses about any linear combination of weighted or unweighted cell means. Thus the user can test not only Type III hypotheses but hypotheses of any type (including Type IV). This general approach is particularly useful when examining designs with missing cells (so-called incomplete designs).

For full factorial designs, this approach is also useful when one wants to analyze weighted marginal means. Suppose, for example, that in the simple 2 x 2 design considered earlier we want to compare the marginal means for factor A weighted by the levels of factor B. This is useful when the distribution of observations over the cells was not prepared by the experimenter but arose randomly, and this randomness is reflected in the distribution of the number of observations across the levels of factor B.

For example, suppose one factor is the age of widows. The available sample of respondents is divided into two groups: under 40 and over 40 (factor B). The second factor in the design (factor A) is whether or not the widows received social support from some agency (some widows were randomly selected to receive it; the others served as controls). In this case the age distribution of widows in the sample reflects the actual age distribution of widows in the population. An assessment of the effectiveness of the social support group for widows of all ages will then correspond to the weighted mean over the two age groups (with weights proportional to the number of observations in each group).

Planned comparisons

Note that the contrast coefficients entered need not sum to 0 (zero): the program automatically adjusts them so that the corresponding hypotheses are not confounded with the overall mean.

To illustrate, let us return to the simple 2 x 2 design discussed earlier. Recall that the cell counts of this unbalanced design are 1, 2, 3, and 1. Suppose we want to compare the weighted marginal means for factor A (weighted by the frequencies of the levels of factor B). One can enter the contrast coefficients:

Note that these coefficients do not sum to 0. The program will rescale them so that they sum to 0 while preserving their relative magnitudes, i.e.:

1/3 2/3 -3/4 -1/4

These contrasts compare the weighted marginal means for factor A.
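For the unbalanced data above, the rescaled coefficients do indeed reproduce the comparison of the weighted marginal means of factor A, as the following check shows (Python/NumPy; the cell order is A1B1, A1B2, A2B1, A2B2):

```python
import numpy as np

# Cell means and cell counts of the unbalanced 2 x 2 design
cell_means  = np.array([3.0, 4.5, 19.0 / 3.0, 2.0])   # A1B1, A1B2, A2B1, A2B2
cell_counts = np.array([1, 2, 3, 1])

# Contrast coefficients rescaled by the program
contrast = np.array([1/3, 2/3, -3/4, -1/4])

# The contrast value equals the difference of the weighted marginal means of A
weighted_mean_A1 = (1 * 3.0 + 2 * 4.5) / 3            # all A1 observations: 3, 4, 5
weighted_mean_A2 = (3 * (19.0 / 3.0) + 1 * 2.0) / 4    # all A2 observations: 6, 6, 7, 2

print(contrast @ cell_means)                 # -1.25
print(weighted_mean_A1 - weighted_mean_A2)   # -1.25
```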

Hypotheses about the grand mean. The hypothesis that the unweighted grand mean equals 0 can be tested with the coefficients:

The hypothesis that the weighted grand mean equals 0 is tested with the coefficients:

In neither case does the program adjust the contrast coefficients.

Analysis of designs with missing cells (incomplete designs)

Factorial designs containing empty cells (combinations of factor levels with no observations) are called incomplete. In such designs some factors are usually not orthogonal and some interactions cannot be computed. There is no single best method for analyzing such designs.

Regression Approach

In some older programs based on analyzing ANOVA designs via multiple regression, the factors of an incomplete design are coded by default in the usual way (as if the design were complete), and a multiple regression analysis is then carried out for these dummy-coded factors. Unfortunately, this method leads to results that are very difficult, if not impossible, to interpret, because it is unclear how each effect contributes to the linear combination of means. Consider the following simple example.

Factor A    Factor B
            B1          B2
A1          3           4, 5
A2          6, 6, 7     Missing

If a multiple regression of the form Dependent variable = Constant + Factor A + Factor B is fitted, then the hypotheses about the significance of factors A and B, in terms of linear combinations of means, look as follows:

Factor A: Cell A1,B1 = Cell A2,B1

Factor B: Cell A1,B1 = Cell A1,B2

This case is simple. In more complex designs it is impossible to determine what, exactly, is actually being tested.

Cell means, the analysis of variance approach, and Type IV hypotheses

The approach recommended in the literature, and apparently preferable, is to study meaningful (in terms of the research objectives) a priori hypotheses about the means observed in the cells of the design. A detailed discussion of this approach can be found in Dodge (1985), Heiberger (1989), Milliken and Johnson (1984), Searle (1987), or Woodward, Bonett, and Brecht (1990). Sums of squares associated with hypotheses about linear combinations of means in incomplete designs, which test estimates of only part of the effects, are also called Type IV sums of squares.

Automatic generation of Type IV hypotheses. When multi-factor designs have a complex pattern of missing cells, it is desirable to define orthogonal (independent) hypotheses whose testing is equivalent to testing main effects or interactions. Algorithmic (computational) strategies (based on the pseudo-inverse of the design matrix) have been developed to generate appropriate weights for such comparisons. Unfortunately, the resulting hypotheses are not uniquely determined: they depend on the order in which the effects were defined and are rarely easy to interpret. It is therefore recommended to study the pattern of missing cells carefully, then formulate the Type IV hypotheses that are most relevant to the objectives of the study, and then test these hypotheses with the Planned comparisons option in the results window. The easiest way to specify comparisons in this case is to enter a vector of contrasts for all factors together in the Planned comparisons window. After the Planned comparisons dialog box is invoked, all groups of the current design are shown and those that are missing are marked.

Missing cells and tests of specific effects

There are several types of designs in which the placement of the missing cells is not random but carefully planned, which allows the main effects to be analyzed simply, without affecting other effects. For example, when the required number of cells is not available, Latin square designs are often used to estimate the main effects of several factors with many levels. For example, a 4 x 4 x 4 x 4 factorial design requires 256 cells, whereas a Greco-Latin square allows the main effects to be estimated with only 16 cells in the design (the chapter Experiment planning, Volume IV, contains a detailed description of such designs). Incomplete designs in which the main effects (and some interactions) can be estimated by simple linear combinations of means are called balanced incomplete designs.

In balanced designs, the standard (default) method of generating contrasts (weights) for main effects and interactions produces an analysis of variance table in which the sums of squares for the respective effects are not confounded with one another. The Specific effects option of the results window generates the missing contrasts by writing zeros into the missing cells of the design. Immediately after the Specific effects option is requested for a hypothesis of interest, a results table with the actual weights appears. Note that in a balanced design the sums of squares of the respective effects are computed only if those effects are orthogonal (independent) to all other main effects and interactions; otherwise, use the Planned comparisons option to examine meaningful comparisons between means.

Missing cells and pooled effect/error terms

If the Regression approach option in the start panel of the Analysis of variance module is not selected, the cell means model is used when computing the sums of squares for the effects (the default setting). If the design is not balanced, then pooling non-orthogonal effects (see the discussion above under Missing cells and tests of specific effects) can produce a sum of squares consisting of non-orthogonal (overlapping) components. Results obtained in this way are usually not interpretable. One must therefore be very careful when choosing and implementing complex incomplete experimental designs.

Many books discuss different types of designs in detail (Dodge, 1985; Heiberger, 1989; Lindman, 1974; Milliken and Johnson, 1984; Searle, 1987; Woodward and Bonett, 1990), but this kind of information is outside the scope of this textbook. Nevertheless, the analysis of different types of designs is demonstrated later in this section.

Assumptions and the effects of their violation

Deviation from the assumption of normality

It is assumed that the dependent variable is measured on a numerical scale and that, within each group, it is normally distributed. The Analysis of variance module contains a wide range of graphs and statistics for checking this assumption.

Effects of violation. In general, the F-test is very robust to deviations from normality (see Lindman, 1974 for detailed results). If the kurtosis is greater than 0, the F statistic may become too small, so the null hypothesis is accepted although it may be false; the situation is reversed when the kurtosis is less than 0. Skewness of the distribution usually has little effect on the F statistic. If the number of observations per cell is large enough, deviations from normality matter little, thanks to the central limit theorem, according to which the distribution of the mean is close to normal regardless of the initial distribution. A detailed discussion of the robustness of the F statistic can be found in Box and Anderson (1955) or Lindman (1974).

Homogeneity of variance

Assumptions. It is assumed that the variances of the different groups of the design are equal. This assumption is called the homogeneity of variance assumption. Recall that at the beginning of this section, when describing the computation of the error sum of squares, we summed within each group. If the variances of two groups differ, adding them is not very natural and does not yield an estimate of a common within-group variance (since in that case no common variance exists at all). The Analysis of variance (ANOVA/MANOVA) module contains a large set of statistical tests for detecting deviations from the homogeneity of variance assumption.
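One common check of this assumption is Levene's test; a minimal sketch in Python follows (the three groups are illustrative values).

```python
from scipy import stats

# Illustrative groups whose variances we want to compare
group1 = [2, 3, 1, 2, 3]
group2 = [6, 7, 5, 6, 8]
group3 = [4, 9, 1, 12, 2]   # visibly more spread out

# Levene's test: H0 says all group variances are equal;
# a small p-value signals a violation of the homogeneity assumption
stat, p = stats.levene(group1, group2, group3)
print(f"W = {stat:.2f}, p = {p:.3f}")
```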

Effects of violation. Lindman (1974, p. 33) shows that the F-test is quite robust to violations of the homogeneity of variance assumption (heterogeneity of variance; see also Box, 1954a, 1954b; Hsu, 1938).

Special case: correlation between means and variances. There are cases in which the F statistic can be misleading. This happens when the means in the design cells are correlated with the variances. The Analysis of variance module lets you plot scatterplots of the variance or standard deviation against the means to detect such a correlation. The reason such a correlation is dangerous is the following. Imagine that the design has 8 cells, 7 of which have almost the same mean, while in one cell the mean is much larger than in the others. Then the F-test may detect a statistically significant effect. But suppose that in the cell with the large mean the variance is also much larger than elsewhere, i.e., the mean and the variance in the cells are dependent (the larger the mean, the larger the variance). In that case the large mean is unreliable, since it may be caused by large variance in the data. Nevertheless the F statistic, based on the pooled within-cell variance, will pick up the large mean, although tests based on the variance within each cell would not consider all the differences between the means significant.

Data of this kind (a large mean together with a large variance) are often encountered when there are outlying observations. One or two outliers can strongly shift the mean and greatly inflate the variance.

Homogeneity of variance and covariance

Assumptions. In multivariate designs with multivariate dependent measures, the homogeneity of variance assumption described earlier also applies. However, since there are several dependent variables, it is additionally required that their cross-correlations (covariances) be homogeneous across all cells of the design. The Analysis of variance module offers various ways of testing these assumptions.

Effects of violation. The multivariate analogue of the F-test is Wilks' lambda test. Not much is known about the robustness of Wilks' lambda with respect to violations of the above assumptions. However, since the interpretation of the results of the Analysis of variance module is usually based on the significance of the univariate effects (after the significance of the overall test has been established), the discussion of robustness chiefly concerns univariate analysis of variance. Therefore the significance of the univariate effects should be examined carefully.

Special case: analysis of covariance. Particularly severe violations of the homogeneity of variance/covariance assumption can occur when covariates are included in the design. In particular, if the correlation between the covariates and the dependent measures differs across the cells of the design, misinterpretation of the results may follow. Remember that in analysis of covariance a regression analysis is, in essence, carried out within each cell in order to isolate the part of the variance attributable to the covariate. The homogeneity of variance/covariance assumption implies that this regression analysis is performed under the constraint that all regression equations (slopes) are the same in all cells. If this is not the case, large errors can occur. The Analysis of variance module has several special tests of this assumption. It is advisable to use them to make sure that the regression equations for the different cells are approximately the same.

Sphericity and compound symmetry: reasons for using a multivariate repeated measures approach in analysis of variance

In designs containing repeated measures factors with more than two levels, univariate analysis of variance requires additional assumptions: the compound symmetry assumption and the sphericity assumption. These assumptions are rarely satisfied (see below). In recent years, therefore, multivariate analysis of variance has gained popularity for such designs (both approaches are combined in the Analysis of variance module).

Compound symmetry assumption. The compound symmetry assumption is that the variances (pooled within-group) and covariances (across groups) of the different repeated measures are homogeneous (equal). This is a sufficient condition for the univariate repeated measures F-test to be valid (i.e., for the reported F-values to follow, on average, the F distribution); it is not, however, a necessary condition.

Sphericity assumption. The sphericity assumption is a necessary and sufficient condition for the F-test to be valid. It states that, within groups, all observations are independent and identically distributed. The nature of these assumptions, and the impact of their violation, are usually not well described in books on analysis of variance; they are described in the following paragraphs, which also show that the results of the univariate approach can differ from those of the multivariate approach and explain what that means.

The need for independent hypotheses. The standard way of analyzing data in analysis of variance is model fitting. If, for the model corresponding to the data, there are a priori hypotheses, the variance is partitioned to test them (tests of main effects and interactions). Computationally, this approach generates a set of contrasts (a set of comparisons of means of the design). If the contrasts are not independent of one another, however, the partitioning of the variance becomes meaningless. For example, if two contrasts A and B are identical, and the corresponding part of the variance is extracted for each, the same part is extracted twice. It would be silly and pointless, say, to state the two hypotheses "the mean of cell 1 is higher than the mean of cell 2" and "the mean of cell 1 is higher than the mean of cell 2". So the hypotheses must be independent, or orthogonal.

Independent hypotheses with repeated measures. The general algorithm implemented in the Analysis of variance module tries to generate independent (orthogonal) contrasts for each effect. For a repeated measures factor these contrasts give rise to a set of hypotheses about differences between the levels of the factor. However, if these differences are correlated within subjects, the resulting contrasts are no longer independent. For example, in a course where learners are tested three times in one semester, it may happen that the change between the 1st and 2nd measurement is negatively correlated with the change between the 2nd and 3rd measurement: those who mastered most of the material between the 1st and 2nd measurements master a smaller part of it in the time between the 2nd and 3rd. In fact, in most cases where analysis of variance is applied to repeated measures, it can be assumed that changes across levels are correlated across subjects. When this is so, the compound symmetry and sphericity assumptions do not hold and independent contrasts cannot be computed.

The impact of violations and ways of correcting them. When the compound symmetry or sphericity assumption does not hold, analysis of variance can produce erroneous results. Before multivariate procedures were sufficiently developed, several corrections were proposed to compensate for violations of these assumptions (see, for example, Greenhouse and Geisser, 1959, and Huynh and Feldt, 1970). These methods are still widely used today (which is why they are available in the Analysis of variance module).

The multivariate ANOVA approach to repeated measures. In general, the problems of compound symmetry and sphericity stem from the fact that the sets of contrasts involved in testing the effects of repeated measures factors (with more than 2 levels) are not independent of one another. They do not need to be independent, however, if a multivariate test is used to assess simultaneously the statistical significance of two or more repeated measures contrasts. This is why multivariate ANOVA methods are increasingly used to test the significance of univariate repeated measures factors with more than 2 levels: the approach is widely applicable because it generally requires neither the compound symmetry assumption nor the sphericity assumption.

Cases in which the multivariate ANOVA approach cannot be used. There are designs to which the multivariate ANOVA approach cannot be applied: usually cases with few subjects in the design and many levels of the repeated measures factor, so that there are too few observations to carry out a multivariate analysis. For example, suppose there are 12 subjects, p = 4 repeated measures factors, and each factor has k = 3 levels. Then the interaction of the 4 factors "expends" (k-1)^p = 2^4 = 16 degrees of freedom; with only 12 subjects, the multivariate test cannot be performed in this example. The Analysis of variance module detects such situations automatically and computes only the univariate tests.

Differences between univariate and multivariate results. When the study includes a large number of repeated measures, there can be cases in which the univariate repeated measures ANOVA approach yields results that differ sharply from those of the multivariate approach. This means that the differences between the levels of the repeated measures are correlated across subjects. Sometimes this fact is of some independent interest.

Multivariate analysis of variance and structural equation modeling

In recent years, structural equation modeling has become popular as an alternative to multivariate analysis of variance (see, for example, Bagozzi and Yi, 1989; Bagozzi, Yi, and Singh, 1991; Cole, Maxwell, Arvey, and Salas, 1993). This approach makes it possible to test hypotheses not only about the means of different groups but also about the correlation matrices of the dependent variables. For example, one can relax the assumptions about homogeneity of variance and covariance and explicitly include in the model error terms for the variances and covariances of each group. The STATISTICA module Structural Equation Modeling (SEPATH) (see Volume III) allows such an analysis.

Coursework in mathematics

Introduction

The concept of analysis of variance

One-way analysis of variance (Practical implementation in IBM SPSS Statistics 20)

One-way analysis of variance (Practical implementation in Microsoft Office 2013)

Conclusion

List of sources used

Introduction

Relevance of the topic. The development of mathematical statistics begins with the work of the famous German mathematician Carl Friedrich Gauss in 1795 and continues to this day. Statistical analysis includes the parametric method of one-way analysis of variance. It is currently used in economics when conducting market research to compare results (for example, in surveys about the consumption of a product in different regions of the country, it is necessary to conclude how much the survey data differ or do not differ from one another); in psychology, when conducting various kinds of research; when compiling scientific comparison tests or studying social groups; and for solving problems in statistics.

Goal of the work. To get acquainted with such a statistical method as one-way analysis of variance, with its implementation on a PC in various programs, and to compare these programs.

To study the theory of one-way analysis of variance.

To study programs for solving problems for single-factor analysis.

To conduct a comparative analysis of these programs.

Achievements of the work. The practical part of the work was done entirely by the author: selection of programs, selection of tasks, and their solution on a PC, after which a comparative analysis was carried out. In the theoretical part, a classification of ANOVA variants was compiled. The work was presented as a report at the student scientific session "Selected Questions of Higher Mathematics and Methods of Teaching Mathematics".

Structure and scope of the work. The work consists of an introduction, conclusion, table of contents, and a bibliography including 4 titles. The total volume of the work is 25 printed pages. The work contains 1 example solved in 2 programs.

The concept of analysis of variance

Often there is a need to investigate the influence of one or more independent variables (factors) on one or more dependent variables (resultant features). Such problems can be solved by the methods of analysis of variance, authored by R. Fisher.

Analysis of variance (ANOVA) is a set of statistical data processing methods that make it possible to analyze the variability of one or more resultant features under the influence of controlled factors (independent variables). Here, a factor is understood as a quantity that determines the properties of the object or system under study, i.e. the cause of the final result. When conducting an analysis of variance, it is important to correctly choose the source and object of influence, i.e. to identify the dependent and independent variables.

Depending on the classification criterion, several groups of analysis of variance are distinguished (Table 1).

By the number of factors taken into account:
• Univariate (one-way) analysis — the influence of one factor is studied;
• Multivariate analysis — the simultaneous influence of two or more factors is studied.

By the presence of a connection between the samples of values:
• Analysis of unrelated (different) samples — carried out when there are several groups of research objects located in different conditions (the null hypothesis H0 is tested: the mean value of the dependent variable is the same in different measurement conditions, i.e. does not depend on the factor under study);
• Analysis of related (same) samples — carried out for two or more measurements taken on the same group of studied objects under different conditions (here the influence of an unaccounted-for factor is possible, which can be erroneously attributed to the change in conditions).

By the number of dependent variables affected by the factors:
• Univariate analysis (ANOVA, or ANCOVA — analysis of covariance) — the factors affect one dependent variable;
• Multivariate analysis (MANOVA — multivariate analysis of variance, or MANCOVA — multivariate analysis of covariance) — the factors affect several dependent variables.

According to the purpose of the study:
• Deterministic (fixed effects) — the levels of all factors are fixed in advance and it is their influence that is tested (the hypothesis H0 of no differences between the level means is tested);
• Random (random effects) — the levels of each factor are obtained as a random sample from the general population of factor levels (the hypothesis H0 that the dispersion of the mean response values across the levels of the factor is equal to zero is tested).

In one-way analysis of variance, the statistical significance of the differences between the sample means of two or more populations is tested. For this, hypotheses are first formulated.

Null hypothesis H0: the average values ​​of the effective feature in all conditions of the factor action (or factor gradations) are the same

Alternative hypothesis H1: the average values of the effective feature are not the same in all conditions of the factor action.

ANOVA methods can be applied to normally distributed populations (analogues of parametric tests) and to populations that do not have a definite distribution (analogues of nonparametric tests). In the first case, it must first be established that the distribution of the resulting feature is normal. To check the normality of the distribution of a feature, the skewness A and kurtosis E can be used:

A = Σ(xᵢ − x̄)³ / (n·σ³),  E = Σ(xᵢ − x̄)⁴ / (n·σ⁴) − 3,

where xᵢ and x̄ are the values of the resulting feature and its average value; σ is the standard deviation of the resulting feature; n is the number of observations; m_A and m_E are the representativeness errors of the measures A and E.

If the skewness and kurtosis indicators do not exceed their representativeness errors by more than 3 times, i.e. |A| < 3·m_A and |E| < 3·m_E, then the distribution can be considered normal. For normal distributions, the indicators A and E are equal to zero.
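A minimal sketch of this normality check in Python; the simple approximations m_A ≈ sqrt(6/n) and m_E ≈ 2·sqrt(6/n) used below are one common textbook choice and are an assumption here, since the text does not spell out the error formulas:

import numpy as np
from scipy import stats

def normality_by_moments(x, ratio=3.0):
    # Skewness A and excess kurtosis E as defined above (biased, n-based estimates).
    x = np.asarray(x, dtype=float)
    n = x.size
    A = stats.skew(x)               # sum((x - mean)**3) / (n * sigma**3)
    E = stats.kurtosis(x)           # sum((x - mean)**4) / (n * sigma**4) - 3
    m_A = np.sqrt(6.0 / n)          # assumed representativeness error of A
    m_E = 2.0 * np.sqrt(6.0 / n)    # assumed representativeness error of E
    looks_normal = abs(A) < ratio * m_A and abs(E) < ratio * m_E
    return A, E, looks_normal

sample = np.random.default_rng(0).normal(loc=10.0, scale=2.0, size=100)
print(normality_by_moments(sample))   # for normal data, A and E are close to zero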

Data relating to one condition of the factor (to one gradation) is called a dispersion complex. When conducting an analysis of variance, the equality of dispersions between complexes should be observed. In this case, the selection of elements should be carried out randomly.

In the second case, when the sampled populations have arbitrary distributions, nonparametric (rank) analogues of one-way analysis of variance are used (the Kruskal–Wallis and Friedman tests).

Consider a graphical illustration of the dependence of the rate of return on shares on the state of affairs in the country's economy (Fig. 1, a). Here, the factor under study is the level of the state of the economy (more precisely, three levels of its state), and the effective feature is the rate of return. The above distribution shows that this factor has a significant impact on profitability, i.e. As the economy improves, so does the return on stocks, which is not contrary to common sense.

Note that the chosen factor has gradations, i.e. its value changed during the transition from one gradation to another (from one state of the economy to another).

Fig. 1. The ratio of the influence of the factor and the intragroup spread: a — significant influence of the factor; b — insignificant influence of the factor

The group of gradations of a factor is only a special case, in addition, a factor can have gradations presented even in a nominal scale. Therefore, more often they speak not about the gradations of a factor, but about the various conditions of its action.

Let us now consider the idea of analysis of variance, which is based on the rule of adding variances: the total variance is equal to the sum of the intergroup variance and the average of the intragroup variances:

σ²_total = δ²_intergroup + σ̄²_intragroup,

where σ²_total is the total variance arising from the influence of all factors; δ²_intergroup is the intergroup variance caused by the influence of the grouping attribute (the factor); σ̄²_intragroup is the average intragroup variance caused by the influence of all other factors.

The influence of the grouping attribute is clearly seen in Fig. 1, a: since the influence of the factor is significant compared with the intragroup scatter, the intergroup variance is greater than the intragroup one (δ² > σ̄²). In Fig. 1, b the opposite picture is observed: the intragroup scatter prevails and the influence of the factor is practically absent.

The analysis of variance is built on the same principle, only it uses not the variances themselves but the mean squares (MS), which are unbiased estimates of the corresponding variances. They are obtained by dividing the sums of squared deviations by the corresponding numbers of degrees of freedom:

MS_total = SS_total / (N − 1) — for the aggregate as a whole;
MS_within = SS_within / (N − p) — the intragroup mean square;
MS_between = SS_between / (p − 1) — the intergroup mean square,

where N is the total number of measurements (over all groups), p is the number of gradations of the factor, x̄ is the overall average for all measurements (for all groups), and x̄_j is the group average for the j-th gradation of the factor.

The mathematical expectations of the intragroup and intergroup mean squares are, respectively,

E(MS_within) = σ²,  E(MS_between) = σ² + n·Σ F_j² / (p − 1)

(fixed factor model, with n observations at each of the p levels). If E(MS_between) = E(MS_within) = σ², the null hypothesis H0 of no differences between the means is confirmed and the factor under study has no significant effect (see Fig. 1, b). If the actual value of Fisher's F-test, F = MS_between / MS_within, is greater than the critical value F_cr at significance level α, the null hypothesis H0 is rejected and the alternative hypothesis H1 is accepted: the factor has a significant effect (Fig. 1, a).
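A minimal hand-rolled sketch of this scheme in Python (the three small groups below are hypothetical, purely for illustration):

import numpy as np
from scipy import stats

def one_way_anova(groups, alpha=0.05):
    # Split the total sum of squares into between-group and within-group parts
    # and form Fisher's F ratio, as described above.
    all_values = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    grand_mean = all_values.mean()
    N, p = all_values.size, len(groups)
    ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(((np.asarray(g, dtype=float) - np.mean(g)) ** 2).sum() for g in groups)
    ms_between = ss_between / (p - 1)          # intergroup mean square
    ms_within = ss_within / (N - p)            # intragroup mean square
    F = ms_between / ms_within
    F_crit = stats.f.ppf(1 - alpha, p - 1, N - p)
    return F, F_crit, F > F_crit               # True -> reject H0

# Three hypothetical groups with clearly different means:
print(one_way_anova([[1, 2, 3], [5, 6, 7], [9, 10, 11]]))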

One-way analysis of variance

An analysis of variance that considers only one factor is called one-way ANOVA.

There is a group of n objects of observation with measured values of some variable X under study. The variable is influenced by a qualitative factor with several levels (gradations) of impact. The measured values of the variable at the different levels of the factor are given in Table 2 (they can also be presented in matrix form).

Table 2.

Tabular form of setting initial data for univariate analysis

Observation object number (i) | Values of the variable at the level (gradation) of the factor: 1 (lowest), 2, …, p (highest); i = 1, 2, …, n.

Each level can contain a different number of responses measured at that level of the factor; in that case each column has its own number of observations n_j. It is required to evaluate the significance of the influence of this factor on the variable under study. To solve this problem, a one-factor model of analysis of variance can be used (a small simulated illustration of this model is sketched after the list of limitations below).

One-factor analysis of variance model:

x_ij = μ + F_j + ε_ij,

where x_ij is the value of the variable under study for the i-th object of observation at the j-th level of the factor; μ is the overall mean, and μ + F_j corresponds to the group average for the j-th level of the factor; F_j is the effect due to the influence of the j-th level of the factor; ε_ij is the random component, or perturbation, caused by the influence of uncontrollable factors. Let us highlight the main limitations of using ANOVA:

Equality to zero of the mathematical expectation of the random component: E(ε_ij) = 0.

The random components ε_ij, and hence the values x_ij, have a normal distribution.

The number of gradations of factors must be at least three.
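The simulated illustration promised above: data generated under the model x_ij = mu + F_j + eps_ij with hypothetical values of mu, the effects F_j and sigma (none of these numbers come from the text):

import numpy as np

rng = np.random.default_rng(42)

mu = 10.0                      # overall mean
effects = [-1.5, 0.0, 1.5]     # hypothetical effects F_j of the three factor levels
n_per_level = 8                # observations per level
sigma = 1.0                    # std. dev. of the random component eps_ij

# x_ij = mu + F_j + eps_ij, with E(eps_ij) = 0 and eps_ij normally distributed
data = {j: mu + F_j + rng.normal(0.0, sigma, n_per_level)
        for j, F_j in enumerate(effects, start=1)}

for j, x in data.items():
    print(f"level {j}: group mean = {x.mean():.2f}")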

This model, depending on the levels of the factor, using the Fisher F-test, allows you to test one of the null hypotheses.

When performing analysis of variance for related samples, it is also possible to test another null hypothesis, H0(u): individual differences between the objects of observation are expressed no more strongly than differences due to random causes.

One-way analysis of variance

(Practical implementation in IBM SPSS Statistics 20)

The researcher is interested in how a certain attribute changes under different conditions of action of a variable (factor). The effect of only one variable (factor) on the trait under study is examined. We have already considered an example from economics; now let us give an example from psychology: how the time to solve a problem changes under different conditions of motivation of the subjects (low, medium, high motivation), with different ways of presenting the task (orally, in writing, or as text with graphs and illustrations), or in different conditions of working on the task (alone, in a room with a teacher, in a classroom). In the first case the factor is motivation, in the second the degree of visual support, in the third the factor of publicity.

In this version of the method, different samples of subjects are exposed to the influence of each of the gradations. There must be at least three gradations of the factor.

Example 1. Three different groups of six subjects were given lists of ten words. Words were presented to the first group at a low rate of 1 word per 5 seconds, to the second group at an average rate of 1 word per 2 seconds, and to the third group at a high rate of 1 word per second. It was predicted that the performance of reproduction will depend on the speed of presentation of words (Table 3).

Table 3

Number of words reproduced

Subject | Group 1 (low speed) | Group 2 (medium speed) | Group 3 (high speed)

We formulate the hypotheses. H0: differences in the volume of word reproduction between the groups are no more pronounced than random differences within each group. H1: differences in the volume of word reproduction between the groups are more pronounced than random differences within each group.

We will carry out the solution in the SPSS environment according to the following algorithm

Let's run the SPSS program

Enter the numerical values in the Data window

Fig. 1. Entering values in SPSS

In the Variables window we describe all the initial data according to the conditions of the task

Figure 2 Variables window

For clarity, in the label column, we describe the name of the tables

In the Values column we describe (code) the number of each group

Figure 3 Value Labels

All this is done for clarity, i.e. these settings can be ignored.

In the Measure column, for the second variable, the value Nominal must be set

In the Data window, order a one-way analysis of variance using the menu Analyze → Compare Means → One-Way ANOVA…

Figure 4 One-Way ANOVA Function

In the One-Way ANOVA dialog box that opens, select the dependent variable and add it to the Dependent List, and add the grouping variable to the Factor field

Figure 5 highlighting the list of dependents and the factor

Set up some parameters for high-quality data output

Figure 6 Parameters for qualitative data inference

The calculation of the selected one-way ANOVA starts after clicking OK

At the end of the calculations, the results of the calculation are displayed in the viewing window.

Table 2. Descriptive statistics (columns: Group, N, Mean, Std. Deviation, Std. Error, 95% confidence interval for the mean, Minimum, Maximum)

The table Descriptive statistics shows the main indicators for speeds in groups and their total values.

N — the number of observations in each group and in total.

Mean — the arithmetic mean of the observations in each group and for all groups together.

Std. Deviation, Std. Error — the standard deviation and the standard error of the mean.

95% confidence interval for the mean — the limits within which the true mean is expected to lie, for each group and for all groups together.

Minimum, Maximum — the minimum and maximum numbers of words reproduced in each group.


Table. Test of homogeneity of variances (columns: Levene statistic, df1, df2, Sig.)

Levene's test of homogeneity is used to check the variances for homogeneity. In this case it confirms that the differences between the variances are not significant, since the significance value is 0.915, i.e. clearly greater than 0.05. Therefore, the results obtained using the analysis of variance are considered valid.

The One-way ANOVA table shows the results of the one-way analysis of variance.

The "between groups" sum of squares is the sum of the squares of the differences between the overall mean and the means in each group, weighted by the number of objects in the group

"Within groups" is the sum of the squared differences between the mean of each group and each value of that group

Column "St. St." contains the number of degrees of freedom V:

Intergroup (v=number of groups - 1);

Intragroup (v = number of objects − number of groups);

"mean square" contains the ratio of the sum of squares to the number of degrees of freedom.

Column "F" shows the ratio of the mean square between groups to the mean square within groups.

The "Sig." column contains the probability that the observed differences are random.

Table 4 Formulas

Graphs of averages

The graph shows that the mean number of reproduced words decreases as the presentation speed increases. It is also possible to determine from the F table, for k1 = 2 and k2 = 15, the tabulated value of the statistic: 3.68. By the rule, if F < F_cr, the null hypothesis is accepted; otherwise the alternative hypothesis is accepted. For our example 7.45 > 3.68, hence the alternative hypothesis is accepted. Thus, returning to the condition of the problem, we conclude that the null hypothesis is rejected and the alternative is accepted: differences in the volume of word reproduction between the groups are more pronounced than random differences within each group. That is, the speed of presentation of words affects the volume of their reproduction.
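The table lookup for F(0.05; 2, 15) can be reproduced in Python; the raw word counts of Table 3 are not listed in this text, so the observed F is only indicated schematically:

from scipy import stats

F_crit = stats.f.ppf(0.95, dfn=2, dfd=15)   # critical value F(0.05; 2, 15)
print(round(F_crit, 2))                      # approximately 3.68

# With the raw word counts of Table 3 available, the observed F and p-value
# could be obtained directly, e.g.:
# F_obs, p_value = stats.f_oneway(group_low, group_medium, group_high)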

One-way analysis of variance

(Practical implementation in Microsoft Office 2013)

In the same example, consider one-way analysis of variance in Microsoft Office 2013

Solving a problem in Microsoft Excel

Let's open Microsoft Excel.


Figure 1. Writing data to Excel

Let's convert the data to number format. To do this, on the Home tab there is a Format item with a Format Cells subitem. The Format Cells window appears on the screen (Fig. 2). Select the Number format, and the entered data are converted, as shown in Fig. 3.

Figure 2 Convert to Numeric Format

Figure 3 Result after conversion

On the Data tab there is a Data Analysis item; let's click on it.

Let's choose One-way analysis of variance

Figure 6 Data analysis

The One-way analysis of variance window will appear on the screen for conducting dispersion analysis of data (Fig. 7). Let's configure the parameters

Fig. 7. Setting parameters for one-way analysis of variance

Click in the Input Range field and select the range of cells B2:F9 whose data you want to analyze. The specified range appears in the Input Range field of the Input group of controls.

If the row by row switch is not set in the Input data control group, then select it so that the Excel program accepts data groups by row.

Optional Select the Labels in First Row check box in the Inputs controls group if the first column of the selected data range contains row names.

In the Alpha input field of the Input data control group, by default, the value 0.05 is displayed, which is associated with the probability of an error in the analysis of variance.

If the output interval switch is not set in the Output Parameters group of controls, then set it or select the new worksheet switch so that the data is transferred to a new sheet.

Click the OK button to close the One-Way ANOVA window. The results of the analysis of variance will appear (Fig. 8).

Figure 8 Data output

The range of cells A4:E7 contains the results of descriptive statistics. Row 4 contains the names of the parameters, rows 5–7 the statistical values calculated for each group. The "Count" column contains the number of measurements, the "Sum" column the sums of the values, the "Average" column the arithmetic means, and the "Variance" column the variances.

The results obtained show which group has the highest mean value and which groups have the largest variance.

The range of cells A10:G15 displays information regarding the significance of discrepancies between data groups. Line 11 contains the names of the analysis of variance parameters, line 12 - the results of intergroup processing, line 13 - the results of intragroup processing, and line 15 - the sum of the values ​​of these two lines.

The SS column contains the variation values, i.e. sums of squares over all deviations. Variation, like dispersion, characterizes the spread of data.

The df column contains the values ​​of the numbers of degrees of freedom. These numbers indicate the number of independent deviations over which the variance will be calculated. For example, the intergroup number of degrees of freedom is equal to the difference between the number of data groups and one. The greater the number of degrees of freedom, the higher the reliability of the dispersion parameters. The degrees of freedom data in the table show that the within-group results are more reliable than the between-group parameters.

The MS column contains the dispersion values, which are determined by the ratio of variation and the number of degrees of freedom. Dispersion characterizes the degree of scatter of data, but unlike the magnitude of variation, it does not have a direct tendency to increase with an increase in the number of degrees of freedom. The table shows that the intergroup variance is much larger than the intragroup variance.

Column F contains the value of the F-statistic, calculated by the ratio of the intergroup and intragroup variances.

The F-critical column contains the critical value of F calculated from the numbers of degrees of freedom and the value of Alpha. The F statistic and the critical F value refer to the Fisher–Snedecor distribution.

If the F statistic is greater than the F-critical value, it can be argued that the differences between the data groups are not random, i.e. at the significance level α = 0.05 (with a reliability of 0.95) the null hypothesis is rejected and the alternative is accepted: the speed of presentation of words affects the volume of their reproduction. The P-value column contains the probability that the difference between the groups is random; since this probability is very small in the table, the deviation between the groups is not random.

Comparison of IBM SPSS Statistics 20 and Microsoft Office 2013


Let's look at the outputs of the programs, for this we will look again at the screenshots.

One-way analysis of variance

                 Sum of Squares   df   Mean Square     F     Sig.
Between groups         31.444      2      15.722    7.447   .006
Within groups          31.667     15       2.111
Total                  63.111     17

Thus, IBM SPSS Statistics 20 produces a more informative output: it rounds the numbers, builds a visual graph (see the full solution) from which the answer can be determined, and describes in more detail both the conditions of the problem and their solution. Microsoft Office 2013 has its own advantages: first of all its prevalence, since Microsoft Office is installed on almost every computer; it displays F-critical, which is not provided in SPSS Statistics; and calculation in it is simple and convenient. Still, both programs are well suited for solving one-way ANOVA problems; each has its pros and cons, but for large problems with extensive conditions I would recommend SPSS Statistics.
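For completeness, the same one-way ANOVA table can also be reproduced in Python, for example with statsmodels; the 18 word counts below are placeholder values, not the actual data of Table 3:

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.DataFrame({
    "group": ["low"] * 6 + ["medium"] * 6 + ["high"] * 6,
    # placeholder word counts (hypothetical, not the values from Table 3)
    "words": [8, 7, 9, 5, 6, 7,  7, 6, 5, 7, 4, 6,  4, 5, 3, 6, 2, 4],
})
model = smf.ols("words ~ C(group)", data=df).fit()
print(anova_lm(model))   # sum_sq, df, F and PR(>F) for between/within groups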

Conclusion

Analysis of variance is applied in all areas of scientific research where it is necessary to analyze the influence of various factors on the variable under study. In the modern world there are many problems for one-way analysis of variance in economics, psychology, and biology. As a result of studying the theoretical material, it was found that the basis of the analysis of variance is the theorem on the addition of variances; from the many software packages in which the apparatus of analysis of variance is implemented, the best were selected and included in the work. Thanks to the advent of new technologies, each of us can conduct such research, spending less time and effort on the calculations by using computers. In the course of the work, the stated goal and tasks were achieved.

List of sources used

1. Sidorenko, E.V. Methods of Mathematical Processing in Psychology. St. Petersburg, 2011. 256 p.

2. Ermolaev, O.Yu. Mathematical Statistics for Psychologists. Moscow, 2009. 336 p.

3. Lecture 7. Analytical Statistics [Electronic resource]. Accessed: 14.05.2014.

4. Gmurman, V.E. Probability Theory and Mathematical Statistics. 2010. 479 p.

In this topic, only one-way analysis of variance, used for unrelated samples, will be considered. In terms of the basic concept of variance, this analysis is based on the calculation of variances of three types:

The total variance calculated for the entire set of experimental data;

Intragroup variance characterizing the variability of a trait in each sample;

Intergroup dispersion characterizing the variability of group means.

The main position of the analysis of variance says: the total variance is equal to the sum of the intragroup and intergroup variances.

This position can be written as an equation:

Σ_j Σ_i (x_ij − x̄)² = Σ_j Σ_i (x_ij − x̄_j)² + Σ_j n_j·(x̄_j − x̄)²,

where x_ij are the values of all variables obtained in the experiment; the index j varies from 1 to p, where p is the number of compared samples (there may be three or more); the index i corresponds to the number of the element in the sample (there may be two or more); x̄ is the overall average of the entire analyzed data set; x̄_j is the average of the j-th sample; N is the total number of all elements in the analyzed set of experimental data; p is the number of experimental samples.

Let's analyze this equation in more detail.

Let us have p groups (samples). In ANOVA, each sample is represented as a single column (or row) of numbers. Then, in order to be able to point to a specific group (sample), an index j is introduced, which varies from j = 1 to j = p. For example, if we have 5 groups (samples), then p = 5 and the index j varies from j = 1 to j = 5.

Now suppose we face the task of specifying a particular element (measured value) of a sample. To do this, we must know the number of this sample, for example 4, and the position of the element (measured value) within this sample. This element can be located anywhere in the sample, from the first value (first row) to the last (last row). Let our required element be located in the fifth row. Then its notation is x_54: the fifth element of the fourth sample is selected.

In the general case, the number of elements in each group (sample) can be different; therefore, we denote the number of elements in the j-th group (sample) by n_j. The values of the feature obtained in the experiment in the j-th group are denoted by x_ij, where i = 1, 2, …, n_j is the serial number of the observation in the j-th group.

It is advisable to carry out further reasoning based on table 35. Note, however, that for the convenience of further reasoning, the samples in this table are presented not as columns, but as rows (which, however, is not important).

In the final, last row of the table, the total volume of the entire sample N, the sum G of all the obtained values, and the overall average of the entire sample are given. This overall average is obtained as the sum of all elements of the analyzed data set, denoted above as G, divided by the number of all elements N, i.e. x̄ = G / N.


The rightmost column of the table shows the mean values for all samples. For example, for the j-th sample (the row of the table denoted by the symbol j), the value of the average over the entire j-th sample is x̄_j = (1 / n_j) · Σ_i x_ij.

Analysis of variance is a statistical method for assessing the relationship between factor and performance characteristics in various groups, selected randomly, based on the determination of differences (diversity) of feature values. The analysis of variance is based on the analysis of deviations of all units of the studied population from the arithmetic mean. As a measure of deviations, dispersion (B) is taken - the average square of deviations. Deviations caused by the influence of a factor attribute (factor) are compared with the magnitude of deviations caused by random circumstances. If the deviations caused by the factor attribute are more significant than random deviations, then the factor is considered to have a significant impact on the resulting attribute.

To calculate the variance, the deviation of each variant (each registered numerical value of the attribute) from the arithmetic mean is squared; this gets rid of the negative signs. Then these squared deviations are summed and divided by the number of observations, i.e. the deviations are averaged. This gives the value of the variance.

An important methodological value for the application of analysis of variance is the correct formation of the sample. Depending on the goal and objectives, selective groups can be randomly formed independently of each other (control and experimental groups to study some indicator, for example, the effect of high blood pressure on the development of stroke). Such samples are called independent.

Often, the results of exposure to factors are studied in the same sample group (for example, in the same patients) before and after exposure (treatment, prevention, rehabilitation measures), such samples are called dependent.

Analysis of variance, in which the influence of one factor is checked, is called one-factor analysis (univariate analysis). When studying the influence of more than one factor, multivariate analysis of variance (multivariate analysis) is used.

Factor signs are those signs that affect the phenomenon under study.

Effective signs are those signs that change under the influence of factor signs.

Conditions for the use of analysis of variance:

The task of the study is to determine the strength of the influence of one (up to 3) factors on the result or to determine the strength of the combined influence of various factors (gender and age, physical activity and nutrition, etc.).

The studied factors should be independent (unrelated) to each other. For example, one cannot study the combined effect of work experience and age, height and weight of children, etc. on the incidence of the population.

The selection of groups for the study is carried out randomly (random selection). The organization of a dispersion complex with the implementation of the principle of random selection of options is called randomization (translated from English - random), i.e. chosen at random.

Both quantitative and qualitative (attributive) features can be used.

When conducting a one-way analysis of variance, the following are recommended (necessary conditions of application):

1. The normality of the distribution of the analyzed groups or the correspondence of the sample groups to general populations with a normal distribution.

2. Independence (non-connectedness) of the distribution of observations in groups.

3. Presence of frequency (recurrence) of observations.

First, a null hypothesis is formulated, that is, it is assumed that the factors under study do not have any effect on the values ​​of the resulting attribute and the resulting differences are random.

Then we determine what is the probability of obtaining the observed (or stronger) differences, provided that the null hypothesis is true.

If this probability is small, then we reject the null hypothesis and conclude that the results of the study are statistically significant. This does not yet mean that the effect of the studied factors has been proven (this is primarily a matter of research planning), but it is still unlikely that the result is due to chance.

When all the conditions for applying the analysis of variance are met, the decomposition of the total variance mathematically looks like this:

D_total = D_fact + D_resid,

where D_total is the total variance of the observed values (variants), characterized by the spread of the variants around the overall mean. It measures the variation of the trait in the entire population under the influence of all the factors that caused this variation. The overall diversity is made up of the intergroup and intragroup diversity;

D_fact is the factorial (intergroup) variance, characterized by the difference of the means in each group; it depends on the influence of the studied factor, by which each group is differentiated. For example, in groups with different etiological factors of the clinical course of pneumonia, the average number of bed-days spent is not the same — intergroup diversity is observed;

D_resid is the residual (intragroup) variance, which characterizes the dispersion of the variants within the groups. It reflects random variation, i.e. the part of the variation that occurs under the influence of unspecified factors and does not depend on the trait-factor underlying the grouping. The variation of the trait under study here reflects the strength of the influence of unaccounted-for factors, both organized (set by the researcher) and random (unknown).

Therefore, the total variation (dispersion) is composed of the variation caused by organized (set) factors, called the factorial variation, and the variation caused by unorganized factors, i.e. the residual (random, unknown) variation.

For a sample size of n, the sample variance is calculated as the sum of the squared deviations from the sample mean divided by n-1 (sample size minus one). Thus, with a fixed sample size n, the variance is a function of the sum of squares (deviations), denoted, for brevity, SS (from the English Sum of Squares - Sum of Squares). In what follows, we often omit the word "selective", knowing full well that we are considering a sample variance or an estimate of the variance. The analysis of variance is based on the division of the variance into parts or components. Consider the following data set:

The means of the two groups are significantly different (2 and 6, respectively). The sum of the squared deviations within each group is 2. Adding them together, we get 4. If we now repeat these calculations without taking into account group membership, that is, if we calculate SS based on the total average of these two samples, we get a value of 28. In other words, the variance (sum squares) based on within-group variability results in much lower values ​​than those calculated based on total variability (relative to the overall mean). The reason for this is obviously the significant difference between the means, and this difference between the means explains the existing difference between the sums of squares.

          SS    df    MS      F       p
Effect   24.0    1   24.0   24.0    .008
Error     4.0    4    1.0

As can be seen from the table, the total sum of squares SS = 28 is divided into components: the sum of squares due to within-group variability (2 + 2 = 4; see the second row of the table) and the sum of squares due to the difference in the means between the groups (28 − (2 + 2) = 24; see the first row of the table). Note that MS in this table is the mean square, equal to SS divided by the number of degrees of freedom (df).

In the simple example above, you could immediately calculate the t-test for independent samples. The results obtained, of course, coincide with the results of the analysis of variance.

However, situations where some phenomenon is completely described by one variable are extremely rare. For example, if we are trying to learn how to grow large tomatoes, we should consider factors related to the genetic structure of plants, soil type, light, temperature, etc. Thus, when conducting a typical experiment, you have to deal with a large number of factors. The main reason why using ANOVA is preferable to re-comparing two samples at different factor levels using t-test series is that ANOVA is significantly more efficient and, for small samples, more informative.

Suppose that in the two-sample analysis example discussed above, we add another factor, such as Gender. Let each group now consist of 3 men and 3 women. The plan of this experiment can be presented in the form of a table:

Before doing the calculations, you can see that in this example, the total variance has at least three sources:

1) random error (intragroup variance),

2) variability associated with belonging to the experimental group

3) variability due to the sex of the objects of observation.

(Note that there is another possible source of variability — the interaction of factors — which we will discuss later.) What happens if we do not include gender as a factor in the analysis and calculate the usual t-test? If we calculate the sums of squares ignoring gender (i.e. combining objects of different sexes into one group when calculating the within-group variance, thus obtaining a sum of squares for each group equal to SS = 10 and a total sum of squares SS = 10 + 10 = 20), then we obtain a larger value of the intragroup variance than in a more accurate analysis with additional subgrouping by sex (with within-group sums of squares equal to 2 each and a total within-group sum of squares equal to SS = 2 + 2 + 2 + 2 = 8).

So, with the introduction of an additional factor: gender, the residual variance decreased. This is because the male mean is smaller than the female mean, and this difference in means increases the overall within-group variability if gender is not taken into account. Controlling the error variance increases the sensitivity (power) of the test.

This example shows another advantage of the analysis of variance compared to the usual two-sample t-test. Analysis of variance allows you to study each factor by controlling the values ​​of other factors. This, in fact, is the main reason for its greater statistical power (smaller sample sizes are required to obtain meaningful results). For this reason, analysis of variance, even on small samples, gives statistically more significant results than a simple t-test.
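A small numerical sketch of this effect (the numbers are hypothetical and chosen only so that men score systematically lower than women):

import numpy as np

def within_ss(groups):
    # Sum of squared deviations of each value from its own group mean.
    return sum(((np.asarray(g, dtype=float) - np.mean(g)) ** 2).sum() for g in groups)

# Hypothetical data: two experimental groups, each with 3 men and 3 women.
group1_men, group1_women = [1, 2, 3], [5, 6, 7]
group2_men, group2_women = [3, 4, 5], [7, 8, 9]

# Ignoring gender: only two groups -> larger residual (within-group) SS.
print(within_ss([group1_men + group1_women, group2_men + group2_women]))  # 56.0

# Taking gender into account: four subgroups -> smaller residual SS.
print(within_ss([group1_men, group1_women, group2_men, group2_women]))    # 8.0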

Analysis of variance is a set of statistical methods designed to test hypotheses about the relationship between certain features and the studied factors that do not have a quantitative description, as well as to establish the degree of influence of the factors and of their interaction. In the specialized literature it is often called ANOVA (from the English Analysis of Variance). This method was first developed by R. Fisher in 1925.

Types and criteria for analysis of variance

This method is used to investigate the relationship between qualitative (nominal) features and a quantitative (continuous) variable. In essence, it tests the hypothesis of equality of the arithmetic means of several samples; it can thus be considered a parametric criterion for comparing the centers of several samples at once. If this method is applied to two samples, the results of the analysis of variance are identical to the results of Student's t-test. However, unlike other criteria, this approach allows the problem to be studied in more detail.

Analysis of variance in statistics is based on the following law: the sum of squared deviations of the combined sample is equal to the sum of the squares of the intragroup deviations and the sum of the squares of the intergroup deviations. Fisher's test is used to establish the significance of the difference between the intergroup and intragroup variances. Necessary prerequisites for this, however, are normality of the distribution and homoscedasticity (equality of variances) of the samples. A distinction is made between one-dimensional (one-factor) analysis of variance and multivariate (multifactorial) analysis. The first considers the dependence of the value under study on one attribute, the second on many at once, and also makes it possible to identify the relationships between them.

Factors

Factors are controlled circumstances that affect the final result. A factor's level, or method of processing, is the value that characterizes a specific manifestation of this condition. These values are usually given on a nominal or ordinal measurement scale. Often the output values are measured on quantitative or ordinal scales, and the problem then arises of grouping the output data into series of observations corresponding to approximately equal numerical values. If the number of groups is taken too large, the number of observations in them may be insufficient to obtain reliable results. If it is taken too small, essential features of the influence on the system may be lost. The specific way of grouping the data depends on the volume and nature of the variation in the values. The number and size of the intervals in univariate analysis are most often determined by the principle of equal intervals or by the principle of equal frequencies (a small illustration of both principles is sketched below).
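The illustration promised above: both grouping principles are available, for example, in pandas (pd.cut for equal intervals, pd.qcut for equal frequencies); the generated values are arbitrary:

import numpy as np
import pandas as pd

values = pd.Series(np.random.default_rng(1).normal(50, 10, 200))

# Grouping a quantitative output variable into 4 classes:
equal_width = pd.cut(values, bins=4)    # principle of equal intervals
equal_freq = pd.qcut(values, q=4)       # principle of equal frequencies

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())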

Tasks of dispersion analysis

So, there are cases when you need to compare two or more samples. It is then that it is advisable to use the analysis of variance. The name of the method indicates that the conclusions are made on the basis of the study of the components of the variance. The essence of the study is that the overall change in the indicator is divided into components that correspond to the action of each individual factor. Consider a number of problems that a typical analysis of variance solves.

Example 1

The workshop has a number of machine tools - automatic machines that produce a specific part. The size of each part is a random value, which depends on the settings of each machine and random deviations that occur during the manufacturing process of the parts. It is necessary to determine from the measurements of the dimensions of the parts whether the machines are set up in the same way.

Example 2

During the manufacture of an electrical apparatus, various types of insulating paper are used: capacitor, electrical, etc. The apparatus can be impregnated with various substances: epoxy resin, varnish, ML-2 resin, etc. Leaks can be eliminated under vacuum, at elevated pressure, or when heated. Impregnation can be done by immersion in varnish, under a continuous stream of varnish, etc. The electrical apparatus as a whole is potted with a certain compound, of which there are several options. Quality indicators are the dielectric strength of the insulation, the overheating temperature of the winding in operating mode, and a number of others. During the development of the technological process for manufacturing the devices, it is necessary to determine how each of the listed factors affects the performance of the device.

Example 3

The trolleybus depot serves several trolleybus routes. They operate trolleybuses of various types, and 125 inspectors collect fares. The management of the depot is interested in the question: how to compare the economic performance of each controller (revenue) given the different routes, different types of trolleybuses? How to determine the economic feasibility of launching trolleybuses of a certain type on a particular route? How to establish reasonable requirements for the amount of revenue that the conductor brings on each route in various types of trolleybuses?

The task of choosing a method is to obtain maximum information about the impact of each factor on the final result and to determine the numerical characteristics of such an impact and their reliability, at minimal cost and in the shortest possible time. Methods of analysis of variance make it possible to solve such problems.

Univariate analysis

The study aims to assess the magnitude of the influence of a particular circumstance (factor) on the response being analyzed. Another task of univariate analysis may be to compare two or more circumstances with one another in order to determine the difference in their influence on the response. If the null hypothesis is rejected, the next step is to quantify the effect and build confidence intervals for the obtained characteristics. When the null hypothesis cannot be rejected, it is usually accepted and a conclusion is made about the nature of the influence.

A nonparametric analogue of one-way analysis of variance is the Kruskal–Wallis rank method, developed by the American mathematician William Kruskal and economist Wilson Allen Wallis in 1952. This test is intended to test the null hypothesis of equality of the effects of influence on the studied samples, with unknown but equal mean values. The number of samples must be more than two.

The Jonckheere (Jonckheere–Terpstra) test was proposed independently by the Dutch mathematician T. J. Terpstra in 1952 and the British psychologist A. R. Jonckheere in 1954. It is used when it is known in advance that the available groups of results are ordered by increasing influence of the factor under study, which is measured on an ordinal scale.

The Bartlett M test, proposed by the British statistician Maurice Stevenson Bartlett in 1937, is used to test the null hypothesis of equality of the variances of several normal populations from which the studied samples are drawn; in the general case the samples may have different sizes (the size of each sample must be at least four).

The Cochran G test, introduced by the American William Gemmell Cochran in 1941, is used to test the null hypothesis of equality of the variances of normal populations for independent samples of equal size.

The nonparametric Levene test, proposed by the American mathematician Howard Levene in 1960, is an alternative to the Bartlett test in conditions where there is no certainty that the studied samples obey normal distribution.

In 1974, American statisticians Morton B. Brown and Alan B. Forsythe proposed a test (the Brown-Forsyth test), which is somewhat different from the Levene test.
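Several of the criteria listed above have readily available implementations, for example in scipy.stats; a minimal sketch with three hypothetical independent samples:

from scipy import stats

# Three hypothetical independent samples
a = [12, 15, 14, 10, 13, 11]
b = [18, 17, 16, 19, 15, 20]
c = [14, 13, 15, 16, 12, 14]

print(stats.kruskal(a, b, c))                   # Kruskal-Wallis rank test
print(stats.bartlett(a, b, c))                  # Bartlett's test for equal variances
print(stats.levene(a, b, c, center='mean'))     # Levene's test
print(stats.levene(a, b, c, center='median'))   # Brown-Forsythe variant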

Two-way analysis

Two-way analysis of variance is used for linked, normally distributed samples. In practice, complex tables of this method are often used, in particular those in which each cell contains a set of data (repeated measurements) corresponding to fixed level values. If the assumptions necessary for applying two-way analysis of variance are not met, the nonparametric Friedman rank test (Friedman, Kendall and Smith), developed by the American economist Milton Friedman at the end of the 1930s, is used. This test does not depend on the type of distribution.

It is only assumed that the distribution of quantities is the same and continuous, and that they themselves are independent of each other. When testing the null hypothesis, the output data is presented in the form of a rectangular matrix, in which the rows correspond to the levels of factor B, and the columns correspond to levels A. Each cell of the table (block) can be the result of measurements of parameters on one object or on a group of objects with constant values ​​of the levels of both factors . In this case, the corresponding data are presented as the average values ​​of a certain parameter for all measurements or objects of the sample under study. To apply the output criterion, it is necessary to move from the direct results of measurements to their rank. The ranking is carried out for each row separately, that is, the values ​​are ordered for each fixed value.
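A minimal sketch of the Friedman rank test, for example via scipy.stats.friedmanchisquare; the repeated measurements below are hypothetical:

from scipy import stats

# Hypothetical related samples: the same 5 objects measured under
# three conditions (levels of factor A); rows = objects, columns = conditions.
cond1 = [8.2, 7.9, 9.1, 8.5, 8.8]
cond2 = [7.5, 7.2, 8.4, 8.0, 8.1]
cond3 = [6.9, 7.0, 7.8, 7.4, 7.7]

# The test ranks the values within each object (row) and checks
# whether the conditions differ.
print(stats.friedmanchisquare(cond1, cond2, cond3))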

The Page test (L-test), proposed by the American statistician E. B. Page in 1963, is designed to test the same kind of null hypothesis. For large samples an approximation of the Page statistic is used which, when the corresponding null hypotheses are true, follows the standard normal distribution. When the rows of the source table contain tied values, average ranks must be used; the accuracy of the conclusions then deteriorates as the number of such ties grows.

The Cochran Q test, proposed by W. Cochran, is used in cases where groups of homogeneous subjects are exposed to more than two influences and two response options are possible: conditionally negative (0) and conditionally positive (1). The null hypothesis is equality of the influence effects. Two-way analysis of variance makes it possible to determine the existence of treatment effects, but does not make it possible to determine for which columns this effect exists. To solve this problem, the Scheffé method of multiple comparisons for linked samples is used.

Multivariate analysis

The problem of multivariate analysis of variance arises when it is necessary to determine the influence of two or more conditions on a certain random variable. The study presupposes one dependent random variable, measured on an interval or ratio scale, and several independent variables, each expressed on a nominal or rank scale. Analysis of variance of data is a well-developed branch of mathematical statistics with many variants. The concept of the study is common to both univariate and multivariate cases: its essence is that the total variance is divided into components corresponding to a certain grouping of the data. Each grouping of the data has its own model. Here we consider only the main provisions necessary for understanding and practical use of the most commonly used variants.

Factorial analysis of variance requires careful attention to the collection and presentation of the input data, and especially to the interpretation of the results. In contrast to the one-factor case, whose results can be conditionally arranged in a certain sequence, the results of the two-factor case require a more complex presentation, and the situation becomes even more difficult when there are three, four, or more circumstances. Because of this, a model rarely includes more than three (four) conditions. Examples are the occurrence of resonance at a certain value of the capacitance and inductance of an electric circuit; the occurrence of a chemical reaction with a certain set of elements from which the system is built; the occurrence of anomalous effects in complex systems under a certain coincidence of circumstances. The presence of interaction can radically change the model of the system and sometimes lead to a rethinking of the nature of the phenomena the experimenter is dealing with.

Multivariate analysis of variance with repeated experiments

Measurement data can often be grouped not by two, but by more factors. So, if we consider the dispersion analysis of the service life of tires for trolleybus wheels, taking into account the circumstances (manufacturer and the route on which tires are used), then we can single out as a separate condition the season during which tires are used (namely: winter and summer operation). As a result, we will have the problem of the three-factor method.

When more conditions are present, the approach is the same as in two-way analysis. In all cases one tries to simplify the model. The interaction of two factors does not appear that often, and triple interaction occurs only in exceptional cases. One includes those interactions for which there is prior information and good reason to take them into account in the model. The process of isolating individual factors and taking them into account is relatively simple, so there is often a desire to include more circumstances. One should not get carried away with this: the more conditions, the less reliable the model becomes and the greater the probability of error. A model that includes a large number of independent variables becomes quite difficult to interpret and inconvenient for practical use.

General idea of ​​analysis of variance

Analysis of variance in statistics is a method of obtaining observational results that depend on various concurrent circumstances and assessing their influence. A controlled variable that corresponds to the method of influence on the object of study and acquires a certain value in a certain period of time is called a factor. They can be qualitative and quantitative. Levels of quantitative conditions acquire a certain value on a numerical scale. Examples are temperature, pressing pressure, amount of substance. Qualitative factors are different substances, different technological methods, apparatuses, fillers. Their levels correspond to the scale of names.

Qualitative factors also include the type of packaging material and the storage conditions of the dosage form. It is also rational to include the degree of grinding of the raw materials and the fractional composition of the granules, which have quantitative values but are difficult to regulate if a quantitative scale is used. The number of qualitative factors depends on the type of dosage form and on the physical and technological properties of the medicinal substances. For example, tablets can be obtained from crystalline substances by direct compression; in this case it is sufficient to select the glidants and lubricants.

Examples of quality factors for different types of dosage forms

  • Tinctures. Extractant composition, type of extractor, raw material preparation method, production method, filtration method.
  • Extracts (liquid, thick, dry). The composition of the extractant, the extraction method, the type of installation, the method of removing the extractant and ballast substances.
  • Tablets. Composition of excipients: fillers, disintegrants, binders, glidants and lubricants. The method of obtaining tablets, the type of technological equipment. Type of shell and its components: film formers, pigments, dyes, plasticizers, solvents.
  • Injection solutions. Type of solvent, filtration method, nature of stabilizers and preservatives, sterilization conditions, method of filling ampoules.
  • Suppositories. The composition of the suppository base, the method of obtaining suppositories, fillers, packaging.
  • Ointments. The composition of the base, structural components, method of preparation of the ointment, type of equipment, packaging.
  • Capsules. Type of shell material, method of obtaining capsules, type of plasticizer, preservative, dye.
  • Liniments. Production method, composition, type of equipment, type of emulsifier.
  • Suspensions. Type of solvent, type of stabilizer, dispersion method.

Examples of quality factors and their levels studied in the tablet manufacturing process

  • Disintegrant. Potato starch, white clay, a mixture of sodium bicarbonate with citric acid, basic magnesium carbonate.
  • Binding solution. Water, starch paste, sugar syrup, methylcellulose solution, hydroxypropyl methylcellulose solution, polyvinylpyrrolidone solution, polyvinyl alcohol solution.
  • Glidant. Aerosil, starch, talc.
  • Filler. Sugar, glucose, lactose, sodium chloride, calcium phosphate.
  • Lubricant. Stearic acid, polyethylene glycol, paraffin.

Models of dispersion analysis in the study of the level of competitiveness of the state

One of the most important criteria for assessing the condition of a state, used to judge its level of welfare and socio-economic development, is competitiveness, i.e. the set of properties inherent in the national economy that determine the state's ability to compete with other countries. Having determined the place and role of the state in the world market, it is possible to establish a clear strategy for ensuring economic security on an international scale, because this is the key to positive relations between Russia and all players in the world market: investors, creditors, and state governments.

To compare the level of competitiveness of states, countries are ranked using composite indices, which include various weighted indicators. These indices are based on key factors that affect the economic, political, and other aspects of the situation. The set of models for studying the competitiveness of the state involves methods of multivariate statistical analysis (in particular, analysis of variance, econometric modeling, and decision-making methods) and includes the following main stages:

  1. Formation of a system of indicators-indicators.
  2. Evaluation and forecasting of indicators of the competitiveness of the state.
  3. Comparison of indicators-indicators of competitiveness of states.

And now let's consider the content of the models of each of the stages of this complex.

At the first stage, with the help of expert study methods, a reasonable set of economic indicators for assessing the competitiveness of the state is formed, taking into account the specifics of its development, on the basis of international ratings and data from statistical departments reflecting the state of the system as a whole and of its processes. The choice of these indicators is justified by the need to select those that, from a practical point of view, most fully allow one to determine the level of the state, its investment attractiveness, and the possibility of relative localization of existing potential and actual threats.

The main indicators-indicators of international rating systems are indices:

  1. Global Competitiveness (GCC).
  2. Economic freedom (IES).
  3. Human Development (HDI).
  4. Perceptions of Corruption (CPI).
  5. Internal and external threats (IVZZ).
  6. Potential for International Influence (IPIP).

The second stage provides for the assessment and forecasting of indicators of the competitiveness of the state according to international ratings for the 139 states of the world studied.

The third stage provides for a comparison of the conditions for the competitiveness of states using the methods of correlation and regression analysis.

Using the results of the study, it is possible to determine the nature of the processes in general and for individual components of the competitiveness of the state; test the hypothesis about the influence of factors and their relationship at the appropriate level of significance.

The implementation of the proposed set of models will make it possible not only to assess the current level of competitiveness and investment attractiveness of states, but also to analyze shortcomings of governance, prevent erroneous decisions, and prevent the development of a crisis in the state.


