How to determine the optimal sample size for a mass survey. Survey sample size

After the research method has been chosen and the instrument developed, the sampling parameters are set: the type, composition and properties of the sample, and its size. To choose the type of sample, use the tables from the lectures: determine the size and properties of the general population, then choose a sampling model.

The sample size table gives the sample size for a predetermined reliability P and a predetermined allowable error e. P is the confidence probability: the share of cases in which the sample is expected to reproduce the properties of the population within the stated error (this is what makes the sample reliable). The error e is the maximum discrepancy allowed between the properties of the general population and the properties of the sample.

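Sample size table (sample size n at the intersection of reliability P and allowable error e)
Columns, e: 0.10, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01
Rows, P: 0.75, 0.80, 0.85, 0.90, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.965, 0.970, 0.975, 0.980, 0.985, 0.990, 0.991, 0.992, 0.993, 0.994, 0.995, 0.996, 0.997, 0.998, 0.999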


Let's say we want a reliability of at least 80% and allow an error of at most 10% in our study. At the same time, we know nothing about the values the variable under study can take, that is, we have no a priori information about the general population: neither the mean nor the possible variance. Then we simply look up the corresponding intersection in the table (P = 0.80, e = 0.10): the sample size is 41 people. The table is computed for the maximum possible variance of a dichotomous variable. The sample size grows rapidly as the required accuracy increases: in the case described it was 41 people, but for P = 95% and e = 5% (the standard for most studies) it is already 384 people. Therefore, the table should be used when the general population is relatively small and significant errors are permissible.
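The two table values quoted above can be reproduced with a short calculation. This is a minimal sketch, assuming the table is built from the usual formula n = z² · p(1 − p) / e² with p = 0.5 (maximum variance of a dichotomous variable), a two-sided normal quantile z for reliability P, and rounding to the nearest integer. The function name `table_sample_size` is hypothetical.

```python
from statistics import NormalDist


def table_sample_size(reliability: float, error: float) -> int:
    """Sample size for a dichotomous variable at its maximum variance."""
    # Two-sided normal quantile for reliability P (an assumption about
    # how the table was compiled).
    z = NormalDist().inv_cdf(0.5 + reliability / 2)
    # p * (1 - p) reaches its maximum, 0.25, at p = 0.5.
    return round(z ** 2 * 0.25 / error ** 2)


print(table_sample_size(0.80, 0.10))  # 41, as quoted in the text
print(table_sample_size(0.95, 0.05))  # 384, as quoted in the text
```

Other cells of the table may have been rounded differently, so values from this sketch should be treated as approximate.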

To keep the sample small for a relatively large population, the distribution parameters of the variable under study (the mean and the variance) must be known in advance. In this case the sample size can be found from a nomogram built for reliability P = 95%, which is quite enough. To use the nomogram, you need to know two quantities: the coefficient of variability v and the allowable error e. The coefficient of variability is defined as the coefficient of variation

$$v = \frac{\sigma}{\bar{x}} \cdot 100\%$$

that is, to determine it, you need to know the arithmetic mean $\bar{x}$ and the standard deviation $\sigma$ of the variable under study.

To simplify the calculation of the coefficient of variability, it is enough to know the range of variation, that is, the maximum and minimum values that the variable under study can reach. In this case v is calculated as

$$v = \frac{X_{max} - X_{min}}{A \cdot \bar{x}} \cdot 100\%$$

where Xmax and Xmin are the maximum and minimum values of the variable under study, $\bar{x}$ is its mean, and A is a positive real constant (usually chosen between 5 and 6).


Example 1. Suppose the coefficient of variability of the variable under study is known to be 6%. Find the sample size for an allowable error of 5%. On the left scale of the nomogram, marked v%, find the point 6. On the right scale, marked ε%, find the chosen error value, 5%. Mark these points and connect them with a straight line along a ruler. The line crosses the central scale, marked n1, at the point 6. The sample size is therefore 6 people.

Example 2. Suppose the coefficient of variability of the variable under study is known to be 16%. Find the sample size for an allowable error of 5%. 16% exceeds the 10% maximum marked on the v% scale, and the scales are logarithmic, so we divide 16 by 10 and find the point 1.6 on the v% scale. On the right scale, ε%, we find the chosen error value, 5%. Mark these points on the scales and connect them with a straight line along a ruler. The line crosses the central scale n1 at the point 0.4. Since we reduced 16% to 1.6%, that is, by a factor of 10, we multiply 0.4 by 100. The sample size is 40 people (compare with the sample of 384 people obtained above for P = 95% and e = 5% without using a specific value of the variance).

Example 3. The consumption of cigarettes by students is studied, and only those who smoke are included (the general population is smokers). The allowable error is 5%. It is known in advance (for example, from sources of secondary marketing information) that students smoke from one pack every three days to two packs per day, and that on average a student who smokes needs one pack per day. The corresponding values are then Xmax = 2, Xmin = 0.33, and the mean is 1. Taking A = 6, the coefficient of variability v is

$$v = \frac{2 - 0.33}{6 \cdot 1} \cdot 100\% \approx 27.8\%$$

Since this exceeds the 10% maximum of the v% scale, we divide it by 10, as in Example 2, and mark 2.8% on the left scale and 5% on the right scale. Connecting them gives a mark of 1.2 on the central scale of the nomogram; multiplied by 100, this means that the sample size should be 120 people.
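The read-offs in Examples 1 to 3 can be checked numerically. The nomogram itself is not reproduced here, so the sketch below assumes it encodes the standard relation n = (z · v / e)² with z ≈ 1.96 for P = 95%, and that the range-based estimate of v uses A = 6. Both are assumptions, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

# Normal quantile for reliability P = 95% (about 1.96).
Z_95 = NormalDist().inv_cdf(0.975)


def variability_from_range(x_max, x_min, mean, a=6.0):
    """Coefficient of variability in percent, estimated from the range."""
    return (x_max - x_min) / (a * mean) * 100.0


def nomogram_sample_size(v_percent, error_percent):
    """Sample size from n = (z * v / e)^2, rounded up to a whole unit."""
    return math.ceil((Z_95 * v_percent / error_percent) ** 2)


print(nomogram_sample_size(6, 5))    # 6   (Example 1)
print(nomogram_sample_size(16, 5))   # 40  (Example 2)
v = variability_from_range(2, 0.33, 1.0)
print(round(v, 1))                   # 27.8
print(nomogram_sample_size(v, 5))    # 120 (Example 3)
```

A graphical read-off is only accurate to a couple of significant digits, so small differences between the formula and the nomogram are to be expected.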

Example 4. Suppose that, in the previous example, there is no access to the target group (smokers), so both smokers and non-smokers must be included in the sample. The parameters for the calculation are then Xmax = 2, Xmin = 0. What is the mean? Computing it as (2+0)/2 = 1 is incorrect: the previous mean was calculated for smokers only, and this expression ignores the relative sizes of the groups of smokers and non-smokers. For example, if non-smokers make up 60% and smokers 40%, the mean is 0.4 · 1 + 0.6 · 0 = 0.4.
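The effect of mixing the two groups can be sketched as follows. The group shares are the ones from the example; the range-based formula with A = 6 is an assumption, and the function names are hypothetical.

```python
def mixture_mean(shares, means):
    """Mean of a population made up of groups with known shares and means."""
    return sum(share * mean for share, mean in zip(shares, means))


def variability_from_range(x_max, x_min, mean, a=6.0):
    """Coefficient of variability in percent, estimated from the range."""
    return (x_max - x_min) / (a * mean) * 100.0


# 40% smokers with a mean of 1 pack per day, 60% non-smokers with 0 packs.
mean_all = mixture_mean([0.4, 0.6], [1.0, 0.0])
print(mean_all)  # 0.4

# With Xmax = 2 and Xmin = 0 the variability is far higher than for smokers alone.
print(round(variability_from_range(2, 0, mean_all), 1))  # 83.3
```

Because the required sample size grows with the square of v, including the non-target group makes the sample considerably larger.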

Let's compare possible sample sizes and research errors:

If there is no data on the ratio of the target and non-target groups in the general population, the coefficient of variability is adjusted through the value of the constant A. As a rule, if the mean is calculated as (Xmax+Xmin)/2, A is reduced to 5 or less.

As you can see, simple random sampling requires significant volumes to achieve the required accuracy. The total sample size can be significantly reduced in two ways:

1) stratification (zoning), that is, identifying qualitatively different groups in the general population and placing the sample among the members of these groups;

2) nest (cluster) selection, that is, dividing the general population into a large number of identical parts and distributing the sample among these parts.

When conducting a stratified sample, you can proceed as follows (see the diagram below).

First, it is determined how much a priori information is available about the general population. For a properly executed stratified sample of minimum size, it is necessary to know the total size of the population N, the number of strata studied, the size of each stratum N i, and, within each stratum, the mean value of the variable under study and its variance. If all these parameters are known, the size of the stratified proportional sample can be calculated using the nomogram discussed above.

To do this, first determine the general variance of the variable under study as the sum of the intragroup and intergroup variances, then determine the general mean from the stratum means, then determine the coefficient of variability, and finally read the total sample size from the nomogram for the chosen allowable error.

The general variance is

$$\sigma^2 = \sigma_p^2 + \sigma_m^2$$

where $\sigma_p^2$ is the intragroup variance and $\sigma_m^2$ is the intergroup variance.

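The intragroup variance is determined from the known variances of the variable under study within each stratum:

$$\sigma_p^2 = \frac{\sum_i N_i \sigma_i^2}{\sum_i N_i}$$

where $N_i$ is the size of the i-th stratum and $\sigma_i^2$ is the variance within the i-th stratum.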

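The intergroup variance is determined from the known means of each stratum and the general mean calculated from them:

$$\sigma_m^2 = \frac{\sum_i N_i (\bar{x}_i - \bar{x})^2}{\sum_i N_i}, \qquad \bar{x} = \frac{\sum_i N_i \bar{x}_i}{\sum_i N_i}$$

where $\bar{x}_i$ is the mean of the i-th stratum, $N_i$ is its size, and $\bar{x}$ is the general mean.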
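The decomposition of the general variance into intragroup and intergroup parts can be sketched in a few lines. The weighting by stratum size is the standard form of this decomposition and is assumed here; the function name and the sample data are hypothetical.

```python
import math


def general_variance(sizes, means, variances):
    """Return (general mean, intragroup, intergroup, general variance)."""
    total = sum(sizes)
    # General mean: stratum means weighted by stratum sizes.
    overall_mean = sum(n * m for n, m in zip(sizes, means)) / total
    # Intragroup variance: size-weighted average of the stratum variances.
    within = sum(n * v for n, v in zip(sizes, variances)) / total
    # Intergroup variance: size-weighted spread of the stratum means.
    between = sum(n * (m - overall_mean) ** 2 for n, m in zip(sizes, means)) / total
    return overall_mean, within, between, within + between


# Two strata of equal size with means 1 and 3 and variance 1 in each.
mean, within, between, total = general_variance([50, 50], [1.0, 3.0], [1.0, 1.0])
print(mean, within, between, total)  # 2.0 1.0 1.0 2.0

# Coefficient of variability for the whole population, in percent.
print(round(math.sqrt(total) / mean * 100, 1))  # 70.7
```

The resulting coefficient of variability is the value that would then be taken to the nomogram together with the allowable error.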

If the number of strata is known, but their size (and/or the size of the general population) is unknown, then the total sample size is first calculated in the indicated way, and then it is divided by the number of strata so that each stratum contains the same proportion of the sample - this will be the stratified equal sample.

If the variances within the strata are unknown, then it is necessary to know the range of variation within each stratum, that is, the values Xmax and Xmin. The variances of the strata can then be calculated from the expression

$$\sigma_i^2 = \left(\frac{X_{max,i} - X_{min,i}}{A}\right)^2$$

where A is the same positive constant (usually between 5 and 6) that is used when the coefficient of variability is estimated from the range.

If the sizes of the strata are unknown, then the intragroup variance is calculated as a simple arithmetic mean of the variances of the strata.

If the means in each stratum are unknown, but the range of variation is known, then the mean within each stratum is taken as the midpoint between the extreme values of the variable under study.

If it is not known whether strata exist, but the mean, the variance and the spatial density of the observation units are known for the general population, then an area (district) sample is drawn by the nest method or the proportional method. If the observation units are distributed relatively evenly over the territory occupied by the general population (the coefficient of variation of the placement density is no more than 15-25%), then nests are used, each containing the same number of observation units. The nests are defined so that they have the same size (for example, area). The number of nests is proportional to the ratio of the total sample size n to the total number of observation units N. Only one observation unit is selected from each nest; the sample is placed within the nests by a uniform mechanical or a random method.

If the placement of observation units in the study area is uneven, then it is divided into regions with the same number of observation units in each - this is a region-by-region proportional sampling. To do this, the total sample size is calculated according to the nomogram, after which this sample is distributed among the regions in proportion to the number of observation units. Within the districts, in this case, the placement of the sample is carried out either by nesting or in another way, similar to the known procedures for placing samples.

Example 5. Let's use example 3, studying cigarette consumption. If there is no data on the possible parameters of the variable under study, then with the data P=95%, e=5%, the sample size will be 384 people. Let's single out two strata - men and women. Let it be known a priori (for example, from a pilot study) that the daily consumption of cigarette packs in men is Xmax=2, Xmin=0.33, in women Xmax=3, Xmin=0.1. Calculate the sample size in this case

Since we do not know anything about the ratio of the sizes of the strata, we assume that their numbers are equal and the shares of their numbers in the general population are 0.5 each. Then the intragroup variance will be

and intergroup

with general average

Then the general variance will be

and the coefficient of variability will be

According to the nomogram, with an allowable error of 5%, the sample size will be approximately 240 people (over 140 fewer than the 384 given by the table). This sample should be divided into 120 men and 120 women.

If this sample size is too large, then it is necessary to increase the number of strata, ensuring that the range of variation in each stratum is minimal, and the sizes of the strata are close, that is, strive to minimize the total variance.

When the size of the general population as a whole is known, the sample size can be adjusted for sampling without replacement as follows:

1) for known v% and e, the sample size n1 is found from the nomogram;

2) the given allowable error is adjusted for the size of the population N:

$$e_{corr} = \frac{e}{\sqrt{1 - n_1 / N}}$$

3) the new sample size n2 is found from the nomogram for the corrected error e_corr and v%.

Example 6. Assume that a study is conducted for a target segment of 1600 observation units with v% = 25% and e = 5%. According to the nomogram, the sample size is then 100 observation units. Correcting the error for the size of the population gives

$$e_{corr} = \frac{5\%}{\sqrt{1 - 100/1600}} \approx 5.2\%$$

According to the nomogram, the adjusted sample size (at v% = 25% and e = 5.2%) is 90 observation units.
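The correction in Example 6 can be sketched numerically. The form of the correction, e / sqrt(1 − n1/N), is an assumption chosen because it reproduces the 5.2% quoted in the example, and the nomogram is again approximated by n = (z · v / e)². The function names are hypothetical.

```python
import math
from statistics import NormalDist


def corrected_error(error_percent, n1, population):
    """Allowable error adjusted for sampling without replacement."""
    return error_percent / math.sqrt(1 - n1 / population)


def nomogram_sample_size(v_percent, error_percent, reliability=0.95):
    """Sample size from n = (z * v / e)^2, rounded up to a whole unit."""
    z = NormalDist().inv_cdf(0.5 + reliability / 2)
    return math.ceil((z * v_percent / error_percent) ** 2)


e_corr = corrected_error(5.0, 100, 1600)
print(round(e_corr, 1))  # 5.2

# Close to the 90 units read from the nomogram in the example.
print(nomogram_sample_size(25.0, e_corr))
```

The formula gives slightly different values from the graphical read-off (for instance, about 97 rather than 100 units before the correction), which is within the precision of a nomogram.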

CHAPTER 1.

In this part of the work, the student processes the data he has collected and draws a conclusion regarding the task: how to solve the problem.

For processing, a student can use MS Excel, SPSS, Statistica for Windows, MATLAB, Mathcad and other programs for processing large data arrays. The main tasks to be solved when using these tools:

data verification;

establishment of laws of distribution;

establishing relationships between data;

data classification and segmentation;

forecasting the development of events.

Study Data Processing Sequence

  1. calculation, as part of the analysis of bivariate distributions for each data table, of the coefficient of variation, the correlation ratio and the standard deviations;
  2. calculation of correlation and covariance matrices;
  3. selection of a data array according to predetermined conditions;
  4. calculation of distributions (taking into account the given conditions);
  5. recoding (correction of errors in data);
  6. introduction of new indicators (calculation of indices).

The table below describes the possible data analysis methods. Of course, you shouldn't use them all at once. The student chooses exactly those 1-2 methods that are most suitable for the disclosure of the problem.

Quantitative Methods for Analyzing Marketing Research Data

  1. Descriptive statistics compression methods
    1.1 Grouping
    1.2 Estimation of distribution parameters
    1.3 Covariance and correlation matrix
  2. Methods of analysis of scorecards
    2.1 Orientation to the integral qualitative characteristic
      2.1.1 Without a priori information about the trait under study
        2.1.1.1 Methods of peer review
        2.1.1.2 Analysis of the data matrix
          2.1.1.2.1 Factor analysis
          2.1.1.2.2 Latent structural analysis
          2.1.1.2.3 Cluster analysis
          2.1.1.2.4 Methods for assessing the significance of an indicator
      2.1.2 With a priori information about feature classes
        2.1.2.1 Methods for strengthening the nominal scale by the resulting attribute
        2.1.2.2 Assessing the materiality of system indicators
          2.1.2.2.1 Pattern recognition theory methods
          2.1.2.2.2 Methods of information theory
          2.1.2.2.3 Graph theory methods
      2.1.3 With a priori information about the increase (decrease) of the feature
        2.1.3.1 Strengthening the scale by the resulting attribute
        2.1.3.2 Assessment of the significance of the indicator (rank correlations)
    2.2 Quantitative orientation
      2.2.1 Analysis of variance
      2.2.2 Correlation-regression analysis
      2.2.3 Causal analysis

To determine the main characteristics, depending on the questions used, the following methods for analyzing measurements on scales in questions can be applied:

Statistical methods for identifying relationships

Scale of the resulting (final) feature | Factor scale (predictor) | Statistical processing method
Quantitative (I, O, A, P) | Quantitative (I, O, A, P) | Regressions; correlations
Quantitative (I, O, A, P) | Time (T) | Time series dynamics
Quantitative (I, O, A, P) | Non-quantitative (C, P) | Analysis of variance
Quantitative (I, O, A, P) | | Analysis of covariance; typological regression
Non-quantitative (K) | Quantitative (I, O, A, P) | Discriminant analysis; cluster analysis; taxonomy; splitting mixtures
Non-quantitative (P) | Non-quantitative (C, P) | Rank correlations; analysis of contingency tables
Quantitative and non-quantitative | Quantitative and non-quantitative | Logic decision functions

Types of scales in questions: I - interval, O - relative, A - absolute, P - difference, P - ordinal, K - classification (nominal)

For example, correlation analysis for consumer segmentation is performed as follows:

  1. the mean values, standard deviations, coefficient of variation, error of the mean and confidence interval are computed;
  2. the covariance and correlation matrices are calculated (for example, in MS Excel);
  3. the "proximity" of objects in the space of characteristics is calculated (for segmentation);
  4. paths of maximum correlation are computed in order to group variables;
  5. the paths of maximum distance are calculated by the distance matrix in order to classify objects;
  6. the closest groups are determined, which will be segments of consumers;
  7. a measure of the proximity of groups is checked (for example, a correlation ratio).
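The first two steps of this procedure can be sketched without any statistical package. The example below is an illustration only: the two variables and their values are hypothetical, and the confidence interval uses the normal approximation (z ≈ 1.96), which is an assumption that suits reasonably large samples.

```python
import math
from statistics import mean, stdev


def describe(values):
    """Mean, standard deviation, coefficient of variation, standard error, 95% CI."""
    m, s = mean(values), stdev(values)
    se = s / math.sqrt(len(values))
    return {
        "mean": m,
        "std": s,
        "cv_percent": s / m * 100,
        "std_error": se,
        # Normal approximation; for small samples a t quantile would be wider.
        "ci95": (m - 1.96 * se, m + 1.96 * se),
    }


def correlation(x, y):
    """Pearson correlation coefficient of two equally long series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def correlation_matrix(columns):
    """Correlation of every pair of named columns, as a nested dict."""
    return {a: {b: correlation(columns[a], columns[b]) for b in columns} for a in columns}


# Hypothetical survey answers: monthly income and monthly spending per respondent.
data = {"income": [20, 25, 30, 35, 50], "spending": [5, 7, 8, 9, 14]}
print(describe(data["income"]))
print(correlation_matrix(data))
```

Variables with a high mutual correlation would then be grouped together in step 4.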

At the end of this chapter, the student describes the results of the data analysis so that his solutions to the tasks set for the work, the final conclusions and their formulations are clear.

Conclusion

In this section, the student formulates a complete solution to the problem posed at the beginning of his work.

Bibliography

The list of sources used (list of references) should be performed at the end of the text of the work in accordance with GOST 7.1-84, for example:

Zinnurov U. G. Fundamentals of marketing research: Tutorial/ U. G. Zinnurov; Ufimsk. state aviation tech. un-t. Ufa, 1996.- 110 p.

Sources in the list are located in alphabetical order. References must be made to all of the listed sources in the work. Page footnotes are not allowed.

If the source is a website, give the full address of the page (copied from the address bar) where the specific information was obtained, together with the date the page was last accessed.

When you ask the question, "How many respondents do I need for a survey?", you are really asking, "How large does my sample need to be to accurately estimate my population?" Given the complexity of these concepts, we have broken down the process into 5 steps, making it easy for you to calculate your ideal sample size and ensure the accuracy of your survey results.

5 steps to make sure your sample accurately estimates the population:

Step 1

What is your general population?

By the term "general population" we mean the whole group of people whose opinion you are going to ask (the sample will consist of members of this population who will actually take part in the survey).

For example, if you want to understand how to find a market for toothpaste in France, your population will be the people of France. And if you're trying to determine how many vacation days people who work for a toothpaste company would like to have, then your population is the employees of that company.

Whether it is a country or a company, establishing a population is an important first step. Once you have decided on the population, set (approximately) its size. For example, France has about 65 million people, but a toothpaste company likely has far fewer employees.

Did you get the right number? Okay, then let's move on...

Step 2

What is the required accuracy?

This step is a kind of assessment of how much risk you are willing to take regarding the possibility of inaccurate survey responses due to the fact that you are not surveying the entire population. Therefore, you should answer two questions:

  1. How sure do you need to be that the responses you receive reflect the opinions of the general population?
    This is your margin of error. Say 90% of the sample members like grape-flavored chewing gum. A margin of error of 5% adds 5% on each side of that number, meaning that actually 85-95% of the population likes grape-flavored gum. 5% is the most commonly used margin of error, but you can set it to between 1% and 10% depending on the survey. It is not recommended to raise this figure above 10%.
  2. How confident do you need to be that the sample accurately represents the population?
    This is your confidence level. The confidence level is the likelihood that the result you obtained is not an accident of the particular sample you drew. It is usually understood as follows: if you randomly drew 30 more samples from the same population, how often would the result from your one sample differ significantly from the results of those other 30 samples? A confidence level of 95% means that the results would agree 95% of the time. 95% is the most commonly used value, but you can set it to 90% or 99% depending on the survey. Lowering the confidence level below 90% is not recommended.

Step 3

What sample size do I need?

In the table below, select an approximate target population size and a margin of error for determining the number of completed interviews required.

Now that you have your Step 1 and Step 2 values, use the handy table below to determine the size of the required sample...

Population      Margin of error          Confidence level
                10%     5%      1%       90%     95%     99%
100             50      80      99       74      80      88
500             81      218     476      176     218     286
1000            88      278     906      215     278     400
10 000          96      370     4900     264     370     623
100 000         96      383     8763     270     383     660
1 000 000+      97      384     9513     271     384     664

Note. The data is provided as a guideline only. Also, for populations over 1 million, figures can be rounded to the nearest hundred.
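The figures in the table can be reproduced with a short calculation. This is a minimal sketch, assuming the table is based on the usual formula for a proportion with maximum variance (p = 0.5) and a finite population correction, with results rounded up. The margin-of-error columns are taken at a 95% confidence level and the confidence-level columns at a 5% margin of error, which is what the matching "80" and "278" entries suggest. The function name `required_sample_size` is hypothetical.

```python
import math
from statistics import NormalDist


def required_sample_size(population, margin_of_error, confidence=0.95):
    """Completed responses needed for a population of the given size."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Sample size for an effectively infinite population, with p = 0.5.
    n0 = z ** 2 * 0.25 / margin_of_error ** 2
    # Finite population correction, rounded up to a whole respondent.
    return math.ceil(n0 / (1 + (n0 - 1) / population))


print(required_sample_size(100, 0.05))         # 80
print(required_sample_size(1000, 0.05))        # 278
print(required_sample_size(1000, 0.01))        # 906
print(required_sample_size(100, 0.05, 0.90))   # 74
print(required_sample_size(100, 0.05, 0.99))   # 88
```

For population sizes or accuracy levels that are not in the table, the same function gives an intermediate value.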

Step 4

How responsive will people be?

Unfortunately, not everyone you send a survey to will respond.

The percentage of people who complete the survey form they receive is referred to as the "response rate." Estimating the response rate for your survey will help you determine the total number of survey invitations that must be sent out to receive the required number of responses.

The response rate directly depends on a number of factors, such as relationship with the target audience, the length and complexity of the survey, the incentives offered, and the topic of the survey. For online surveys where no relationship has been established with recipients in advance, response rates of 20-30% are considered very high. A more conservative and likely value is 10-14%, if you have not previously conducted a survey in this population.

Step 5

So how many people should you send the survey to?

This is an easy step!

Simply divide the number you got in step 3 by the number you got in step 4. This is your magic number.

For example, if you want 100 women who use shampoo to complete a survey and you think that 10% of the women you send the survey to will complete it, you need to send the survey to 1000 women (100/10%)!
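The division in Step 5 can be written as a one-line helper. The function name is hypothetical, and the response rate is passed as a percentage to keep the arithmetic exact for typical values.

```python
import math


def invitations_needed(required_responses, response_rate_percent):
    """Number of surveys to send so that enough completed responses come back."""
    return math.ceil(required_responses * 100 / response_rate_percent)


print(invitations_needed(100, 10))  # 1000, as in the shampoo example
print(invitations_needed(384, 12))  # 3200
```

Rounding up matters here: sending one invitation too few leaves the expected number of responses just short of the target.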

Statistical population - a set of units characterized by mass character, typicality, qualitative homogeneity and the presence of variation.

A statistical population consists of materially existing objects (employees, enterprises, countries, regions) and is itself an object of study.

Population unit - each individual unit of the statistical population.

One and the same statistical population can be homogeneous in one attribute and heterogeneous in another.

Qualitative homogeneity - the similarity of all units of the population in some attribute and their dissimilarity in all the others.

In a statistical population, the differences between one unit and another are most often quantitative. Quantitative changes in the value of an attribute across the units of the population are called variation.

Attribute variation - the quantitative change of an attribute (for a quantitative attribute) in passing from one unit of the population to another.

Attribute - a property or characteristic of units, objects and phenomena that can be observed or measured. Attributes are divided into quantitative and qualitative. The diversity and variability of the value of an attribute across the individual units of the population is called variation.

Qualitative (attributive) attributes cannot be expressed numerically (composition of the population by sex). Quantitative attributes have a numerical expression (composition of the population by age).

Indicator - a generalizing quantitative and qualitative characteristic of some property of units or populations under specific conditions of time and place.

Scorecard (system of indicators) - a set of indicators that comprehensively reflect the phenomenon under study.

For example, consider salary:
  • Attribute - wages
  • Statistical population - all employees
  • Population unit - each employee
  • Qualitative homogeneity - accrued salary
  • Attribute variation - a series of numbers

General population and sample from it

The basis is a set of data obtained by measuring one or more attributes. The actually observed set of objects, statistically represented by a series of observations of a random variable X, is the sample, and the hypothetically existing (conceptual) set is the general population. The general population can be finite (number of observations N = const) or infinite (N = ∞), while a sample from the general population is always the result of a limited number of observations. The number of observations that make up a sample is called the sample size. If the sample size is large enough (n → ∞), the sample is considered large; otherwise it is called a sample of limited size. A sample is considered small if, when a one-dimensional random variable is measured, the sample size does not exceed 30 (n ≤ 30), or if, when several (k) attributes are measured simultaneously in a multidimensional space, the ratio of n to k is less than 10 (n/k < 10). The sample forms a variation series if its members are order statistics, i.e. the sample values of the random variable X are sorted in ascending order (ranked); the values of the attribute are then called variants.

Example. One and the same randomly selected set of objects, for instance the commercial banks of one administrative district of Moscow, can be regarded as a sample from the general population of all commercial banks in that district, as a sample from the general population of all commercial banks in Moscow, as a sample of the commercial banks of the country, and so on.

Basic sampling methods

The reliability of statistical conclusions and the meaningful interpretation of results depend on the representativeness of the sample, i.e. on how completely and adequately it reflects the properties of the general population in relation to which the sample can be considered representative. The statistical properties of a population can be studied in two ways: by complete (continuous) observation and by sample (non-continuous) observation. Complete observation examines all units of the population under study; sample observation examines only part of it.

There are five main ways to organize sampling:

1. simple random selection, in which objects are drawn at random from the general population (for example, using a table of random numbers or a random number generator), and every possible sample has an equal probability of being drawn. Such samples are called properly random;

2. simple selection by a regular procedure, carried out using a mechanical rule (for example, dates, days of the week, apartment numbers, letters of the alphabet, etc.); samples obtained in this way are called mechanical;

3. stratified selection, in which the general population of size N is subdivided into subsets, or layers (strata), of sizes N_1, N_2, ..., N_k so that N_1 + N_2 + ... + N_k = N. Strata are groups of objects that are homogeneous in their statistical characteristics (for example, the population is divided into strata by age group or social class; enterprises, by industry). In this case the samples are called stratified (also layered, typical, zoned);

4. serial selection, used to form serial or nested samples. It is convenient when a "block" or a series of objects has to be examined at once (for example, a consignment of goods, products of a certain series, or the population of a territorial-administrative division of the country). The series can be selected randomly or mechanically. The selected batch of goods or territorial unit (a residential building or a block) is then surveyed completely;

5. combined (multistage) selection, which can combine several selection methods at once (for example, stratified and random, or random and mechanical); such a sample is called combined.
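The first three selection methods can be sketched with the standard `random` module. The function names, the fixed seeds and the proportional allocation rule for strata are illustrative assumptions rather than a prescribed procedure.

```python
import random


def simple_random_sample(units, n, seed=0):
    """Properly random selection without replacement."""
    return random.Random(seed).sample(units, n)


def mechanical_sample(units, n, seed=0):
    """Every k-th unit, starting from a random position within the first step."""
    step = len(units) // n
    start = random.Random(seed).randrange(step)
    return [units[start + i * step] for i in range(n)]


def stratified_sample(strata, n, seed=0):
    """Random selection within each stratum, proportional to stratum size."""
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    return {
        name: rng.sample(members, round(n * len(members) / total))
        for name, members in strata.items()
    }


population = list(range(1000))
print(len(simple_random_sample(population, 50)))  # 50
print(mechanical_sample(population, 50)[:3])      # three units spaced 20 apart

strata = {"under 30": list(range(600)), "30 and over": list(range(600, 1000))}
sizes = {name: len(chosen) for name, chosen in stratified_sample(strata, 50).items()}
print(sizes)  # {'under 30': 30, '30 and over': 20}
```

With proportional allocation, rounding can make the stratum samples add up to slightly more or less than n when the shares are not round numbers.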

Selection types

By type, selection is individual, group or combined. In individual selection, individual units of the general population are selected into the sample; in group selection, qualitatively homogeneous groups (series) of units are selected; combined selection is a combination of the first two types.

By method, selection is repeated (with replacement) or non-repeated (without replacement).

In non-repeated selection, a unit that has entered the sample is not returned to the original population and takes no part in further selection; the number of units N in the general population decreases during selection. In repeated selection, a unit that has entered the sample is returned to the general population after registration and so retains an equal chance, along with the other units, of being selected again; the number of units N in the general population remains unchanged (this method is rarely used in socio-economic studies). However, for large N (N → ∞) the formulas for non-repeated selection approach those for repeated selection, and in practice the latter are used more often (N = const).

The main characteristics of the parameters of the general and sample population

The statistical conclusions of a study rest on the distribution of a random variable X; the observed values (x 1, x 2, ..., x n) are called realizations of the random variable X (n is the sample size). The distribution of a random variable in the general population is theoretical and ideal in nature, and its sample analogue is the empirical distribution. Some theoretical distributions are specified analytically, i.e. their parameters determine the value of the distribution function at each point in the space of possible values of the random variable. For a sample it is difficult, and sometimes impossible, to determine the distribution function, so the parameters are estimated from empirical data and then substituted into the analytical expression describing the theoretical distribution. The assumption (or hypothesis) about the type of distribution may be statistically correct or erroneous. In either case, the empirical distribution reconstructed from the sample characterizes the true one only roughly. The most important distribution parameters are the expected value and the variance.

By nature, distributions are continuous or discrete. The best known continuous distribution is the normal distribution. The sample analogues of its parameters (the expected value and the variance) are the sample mean and the empirical variance. Among discrete distributions, the one most commonly used in socio-economic studies is the alternative (dichotomous) distribution. The expected value of this distribution expresses the relative share of population units that have the attribute under study (it is denoted by the letter p); the share of the population that does not have this attribute is denoted by the letter q (q = 1 - p). The variance of the alternative distribution also has an empirical analogue.

Depending on the type of distribution and on the method of selecting population units, the characteristics of the distribution parameters are calculated differently. The main ones for the theoretical and empirical distributions are given in Table. 9.1.

Sampling fraction k_n is the ratio of the number of units in the sample to the number of units in the general population:

k_n = n/N.

Sample proportion w is the ratio of the number of sample units that have the attribute under study, n_n, to the sample size n:

w = n_n/n.

Example. In a batch of goods containing 1000 units, a 5% sampling fraction k_n corresponds to a sample of 50 units (n = N*0.05); if 2 defective products are found in this sample, then the sample proportion w is 0.04 (w = 2/50 = 0.04, or 4%).
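The two ratios in this example are easy to confuse, so a short sketch may help keep them apart. The function names are hypothetical; the numbers are those of the example.

```python
def sampling_fraction(sample_size, population_size):
    """Share of the general population that enters the sample."""
    return sample_size / population_size


def sample_proportion(units_with_attribute, sample_size):
    """Share of sample units that have the attribute under study."""
    return units_with_attribute / sample_size


population_size = 1000
sample_size = round(population_size * 0.05)   # 5% sample: 50 units
print(sample_size)                                   # 50
print(sampling_fraction(sample_size, population_size))  # 0.05
print(sample_proportion(2, sample_size))             # 0.04
```

The sampling fraction describes how the sample was drawn, while the sample proportion is an estimate of the share p in the general population.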

Since the sample population is different from the general population, there are sampling errors.

Table 9.1 Main parameters of the general and sample populations

Sampling errors

With any observation (complete or sample), errors of two types can occur: registration errors and representativeness errors. Registration errors can be random or systematic. Random errors arise from many different uncontrollable causes, are unintentional, and usually cancel each other out in aggregate (for example, changes in instrument readings due to temperature fluctuations in the room).

Systematic errors are biased, as they violate the rules for selecting objects in the sample (for example, deviations in measurements when changing the settings of the measuring device).

Example. To assess the social status of the population in the city, it is planned to examine 25% of families. If, however, the selection of every fourth apartment is based on its number, then there is a danger of selecting all apartments of only one type (for example, one-room apartments), which will introduce a systematic error and distort the results; the choice of the apartment number by lot is more preferable, since the error will be random.

Representativeness errors are inherent only in sample observation. They cannot be avoided, and they arise because the sample does not fully reproduce the general population. The values of indicators obtained from the sample differ from the values of the same indicators in the general population (or obtained by complete observation).

Sampling error is the difference between the value of a parameter in the general population and its sample value. For the mean of a quantitative attribute it is the difference between the sample mean and the general mean, and for the share (alternative attribute) it is the difference between the sample share w and the general share p.

Sampling errors are inherent only in sample observations. The larger these errors, the more the empirical distribution differs from the theoretical one. The parameters of the empirical distribution (the sample mean and the sample share) are random variables; therefore sampling errors are also random variables that take different values for different samples, and it is customary to calculate the average error.

The average sampling error is a value expressing the standard deviation of the sample mean from the mathematical expectation. Under random selection it depends primarily on the sample size n and on the degree of variation of the trait: the larger n and the smaller the variation of the trait (hence, the variance), the smaller the average sampling error m. The relation between the variances of the general and sample populations is expressed by the formula:

$\sigma^2 = s^2 \cdot \frac{n}{n-1}$,

i.e. for sufficiently large n we can assume that $\sigma^2 \approx s^2$. The average sampling error shows the possible deviations of the sample parameter from the parameter of the general population. Table 9.2 gives expressions for calculating the average sampling error for different methods of organizing observation.

Table 9.2 Mean error (m) of sample mean and proportion for different sample types

The notation in Table 9.2 includes:

  • the average of the intragroup sample variances for a continuous feature;
  • the average of the intragroup variances of the share;
  • the number of series selected and the total number of series;
  • the between-series variance of the mean, defined through the mean of the i-th series and the general mean over the entire sample for a continuous feature;
  • the between-series variance of the share, defined through the share of the trait in the i-th series and the total share of the trait over the entire sample.

However, the magnitude of the average error can only be judged with a certain probability P (P ≤ 1). A. M. Lyapunov proved that the distribution of sample means, and hence of their deviations from the general mean, approximately obeys the normal distribution law for a sufficiently large number of observations, provided that the general population has a finite mean and limited variance.

Mathematically, this statement for the mean is expressed as:

$P\left(|\bar{x} - \mu| \le \Delta\right) = \Phi(t)$,

and for the share it takes the form:

$P\left(|w - p| \le \Delta\right) = \Phi(t)$,

where $\mu$ is the general mean, p is the general share, and Δ = t·m is the marginal sampling error, a multiple of the average sampling error m. The multiplier t is Student's coefficient (the "confidence factor"), proposed by W. S. Gosset (pseudonym "Student"); its values for different sample sizes are tabulated.

The values ​​of the function Ф(t) for some values ​​of t are:

Therefore, this expression can be read as follows: with probability P = 0.683 (68.3%) it can be stated that the difference between the sample mean and the general mean will not exceed one mean error m (t = 1); with probability P = 0.954 (95.4%), that it will not exceed two mean errors (t = 2); and with probability P = 0.997 (99.7%), that it will not exceed three mean errors (t = 3). Thus, the probability that this difference exceeds three times the mean error determines the error level and is no more than 0.3%.
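The three probabilities quoted above can be reproduced with a short sketch. It assumes that Ф(t) is the probability that a standard normal value falls within [-t, t], which is what the figures in the text suggest; the function name `laplace_phi` is illustrative only.

```python
import math


def laplace_phi(t: float) -> float:
    """Probability that a standard normal value lies within [-t, t]."""
    return math.erf(t / math.sqrt(2))


for t in (1, 2, 3):
    # Confidence level P and the complementary error level 1 - P
    p = laplace_phi(t)
    print(f"t = {t}: P = {p:.3f}, error level = {1 - p:.3f}")
```

For t = 3 the error level comes out just under 0.003, which matches the "no more than 0.3%" statement.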

Table 9.3 gives formulas for calculating the marginal sampling error.

Table 9.3 Marginal sampling error (Δ) for the mean and the share (p) for different types of sampling

Extending Sample Results to the Population

The ultimate goal of sample observation is to characterize the general population. For small sample sizes, empirical estimates of the parameters (the sample mean and the sample share) may deviate significantly from their true values (the general mean and the general share). It is therefore necessary to establish the boundaries within which the true values lie for given sample values of the parameters.

The confidence interval of a parameter θ of the general population is a random range of values of this parameter that contains the true value of the parameter with a probability close to 1 (the reliability).

The marginal sampling error Δ makes it possible to determine the limiting values of the characteristics of the general population and their confidence intervals.

The lower bound of the confidence interval is obtained by subtracting the marginal error from the sample mean (share), and the upper bound by adding it.

This means that with a given probability P, which is called the confidence level and is uniquely determined by the value of t, it can be stated that the true value of the mean lies in the range from $\bar{x} - \Delta$ to $\bar{x} + \Delta$, and the true value of the share lies in the range from $w - \Delta$ to $w + \Delta$.

When calculating the confidence interval for the three standard confidence levels P = 95%, P = 99% and P = 99.9%, the value of t is taken from Student's table in the Appendix, depending on the number of degrees of freedom. If the sample size is large enough, the values of t corresponding to these probabilities are 1.96, 2.58 and 3.29.
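For large samples, the t values quoted above coincide with quantiles of the standard normal distribution. The sketch below recovers them with Python's standard library; it assumes a two-sided interval, and the helper name `t_for_confidence` is illustrative. For small samples Student's table should still be used.

```python
from statistics import NormalDist


def t_for_confidence(p: float) -> float:
    """Large-sample multiplier t for a two-sided confidence level p."""
    # A two-sided level p leaves (1 - p) / 2 in each tail.
    return NormalDist().inv_cdf((1 + p) / 2)


for level in (0.95, 0.99, 0.999):
    print(f"P = {level}: t = {t_for_confidence(level):.2f}")
```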

Extending the results of sample observation to the general population in socio-economic studies has its own particular features, since it requires that all types and groups of the population be fully represented. The basis for judging whether such an extension is possible is the calculation of the relative error:

where Δ% is the relative marginal sampling error.

There are two main methods for extending a sample observation to the population: direct conversion and the method of coefficients.

The essence of direct conversion is to multiply the sample mean $\bar{x}$ by the size of the population N.

Example. Suppose the average number of toddlers per young family in the city has been estimated by a sampling method as 1.2. If there are 1000 young families in the city, then the number of places required in the municipal nursery is obtained by multiplying this average by the size of the general population N = 1000, i.e. 1200 places.

The method of coefficients is advisable when sample observation is carried out in order to refine the data of a complete observation.

In doing so, the formula is used:

where all variables are the size of the population:

Required sample size

Table 9.4 Required sample size (n) for different types of sampling organization

When planning a sample survey with a predetermined allowable sampling error, it is necessary to estimate the required sample size correctly. This size can be determined from the allowable error of the sample observation and a given probability that guarantees an acceptable error level (taking into account how the observation is organized). Formulas for the required sample size n follow directly from the formulas for the marginal sampling error. Thus, from the expression for the marginal error of repeated sampling:

$\Delta = t \sqrt{\frac{\sigma^2}{n}}$

the sample size n is obtained directly:

$n = \frac{t^2 \sigma^2}{\Delta^2}$

This formula shows that as the marginal sampling error Δ decreases, the required sample size increases sharply; it is proportional to the variance and to the square of Student's coefficient t.

For a specific method of organizing observation, the required sample size is calculated according to the formulas given in Table 9.4.

Practical Calculation Examples

Example 1. Calculation of the mean value and confidence interval for a continuous quantitative characteristic.

To assess the speed of settlement with creditors, a bank took a random sample of 10 payment documents. Their values (in days) were: 10; 3; 15; 15; 22; 7; 8; 1; 19; 20.

With probability P = 0.954, determine the marginal error Δ of the sample mean and the confidence limits of the average settlement time.

Solution. The mean value is calculated by the formula from Table 9.1 for the sample population

The variance is calculated by the formula from Table 9.1.

The standard deviation (in days) follows from the variance.

The error of the mean is calculated by the formula:

i.e. the mean value is $\bar{x}$ ± m = 12.0 ± 2.3 days.

The reliability of the mean was

The marginal error is calculated by the formula from Table 9.3 for repeated sampling, since the size of the population is unknown, at the confidence level P = 0.954.

Thus, the mean value is $\bar{x}$ ± Δ = $\bar{x}$ ± 2m = 12.0 ± 4.6, i.e. its true value lies in the range from 7.4 to 16.6 days.

Using Student's table in the Appendix, we conclude that for n - 1 = 10 - 1 = 9 degrees of freedom the obtained value is reliable at a significance level α ≤ 0.001, i.e. the resulting mean value is significantly different from 0.
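The figures of Example 1 can be checked with a minimal sketch. The formulas of Table 9.1 are not reproduced in the text, so the sketch assumes that the standard deviation uses the n - 1 denominator and that the mean error is m = s/√n; under that assumption the quoted values 12.0, 2.3 and 4.6 are recovered.

```python
import math
import statistics

# Settlement times in days for the 10 sampled payment documents
days = [10, 3, 15, 15, 22, 7, 8, 1, 19, 20]

n = len(days)
mean = statistics.mean(days)
s = statistics.stdev(days)   # assumption: n - 1 denominator
m = s / math.sqrt(n)         # mean error of the sample mean
t = 2                        # multiplier for P = 0.954
delta = t * m                # marginal error
lower, upper = mean - delta, mean + delta

print(f"mean = {mean:.1f}, m = {m:.1f}, delta = {delta:.1f}")
print(f"confidence interval: {lower:.1f} to {upper:.1f} days")
```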

Example 2. Estimation of the probability (general share) p.

In a mechanical sample survey of the social status of 1000 families, the share of low-income families was found to be w = 0.3 (30%); the sample was 2%, i.e. n/N = 0.02. At the confidence level P = 0.997, determine the share p of low-income families in the whole region.

Solution. From the values of the function Ф(t) given above, we find t = 3 for the confidence level P = 0.997. The marginal error of the share w is determined by the formula from Table 9.3 for non-repeated sampling (mechanical sampling is always non-repeated):

The marginal relative sampling error in % will be:

The general share of low-income families in the region is p = w ± Δw, and the confidence limits for p follow from the double inequality:

w - Δw ≤ p ≤ w + Δw, i.e. the true value of p lies within:

0.3 - 0.043 < p < 0.3 + 0.043, namely from 25.7% to 34.3%. Here the mean error of the share is m = √(0.3 · 0.7 / 1000 · (1 - 0.02)) ≈ 0.014, and the marginal error is Δw = t·m = 3 · 0.014 ≈ 0.043.

Thus, with a probability of 0.997, it can be stated that the share of low-income families among all families in the region ranges from 25.7% to 34.3%.
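The arithmetic of Example 2 can be checked with a short sketch. Table 9.3 is not reproduced in the text, so the sketch assumes the usual formula for non-repeated sampling, Δw = t·√(w(1 - w)/n · (1 - n/N)); the function name is illustrative. Note that the mean error m is about 0.014, while the marginal error at t = 3 is three times that.

```python
import math


def share_errors(w: float, n: int, sampling_fraction: float, t: float):
    """Return (mean error, marginal error) of a sample share.

    Assumes non-repeated sampling, hence the finite-population
    correction (1 - n/N) under the square root.
    """
    m = math.sqrt(w * (1 - w) / n * (1 - sampling_fraction))
    return m, t * m


m, delta = share_errors(w=0.3, n=1000, sampling_fraction=0.02, t=3)
lower, upper = 0.3 - delta, 0.3 + delta
print(f"m = {m:.3f}, delta = {delta:.3f}")
print(f"p lies between {lower:.1%} and {upper:.1%}")
```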

Example 3 Calculation of the mean value and confidence interval for a discrete feature specified by an interval series.

Table 9.5 gives the distribution of production orders by the time the enterprise takes to complete them.

Table 9.5 Distribution of observations by time of occurrence

Solution. The average order completion time is calculated by the formula:

The average time will be:

= (3*20 + 9*80 + 24*60 + 48*20 + 72*20)/200 = 23.1 months

We get the same answer if we use the data on p i from the penultimate column of Table. 9.5 using the formula:

Note that the middle of the interval for the last gradation is found by artificially supplementing it with the width of the interval of the previous gradation equal to 60 - 36 = 24 months.
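The weighted mean of the interval series can be reproduced as a sketch. Table 9.5 itself is not available here, so the interval midpoints and frequencies below are taken from the arithmetic shown above; the variable names are illustrative.

```python
# Interval midpoints (months) and order counts, as used in the calculation above
midpoints = [3, 9, 24, 48, 72]
counts = [20, 80, 60, 20, 20]

total = sum(counts)

# Mean weighted by absolute frequencies
mean_time = sum(x * f for x, f in zip(midpoints, counts)) / total

# The same mean computed from relative frequencies p_i = f_i / total
shares = [f / total for f in counts]
mean_from_shares = sum(x * p for x, p in zip(midpoints, shares))

print(f"total orders = {total}")
print(f"mean time = {mean_time:.1f} months")
print(f"mean from shares = {mean_from_shares:.1f} months")
```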

The variance is calculated by the formula

where x_i is the midpoint of the i-th interval of the series.

Therefore $\sigma^2 = \frac{20^2 + 14^2 + 1^2 + 25^2 + 49^2}{4}$, and the standard deviation σ is the square root of this value.

The error of the mean is calculated by the formula and is expressed in months, i.e. the mean is $\bar{x}$ ± m = 23.1 ± 13.4.

The marginal error is calculated by the formula from Table 9.3 for repeated sampling, because the population size is unknown, at the 0.954 confidence level:

So the mean is:

i.e. its true value lies in the range from 0 to 50 months.

Example 4. To determine the speed of settlements with creditors for the N = 500 enterprises of a corporation, a commercial bank needs to conduct a sample study using random non-repeated selection. Determine the required sample size n so that, with probability P = 0.954, the error of the sample mean does not exceed 3 days, if trial estimates showed a standard deviation s of 10 days.

Solution. To determine the required number of observations n, we use the formula for non-repeated selection from Table 9.4:

$n = \frac{t^2 s^2 N}{\Delta^2 N + t^2 s^2}$

In it, the value of t is determined from the values of Ф(t) for the confidence level P = 0.954; it is equal to 2. The standard deviation is s = 10, the population size is N = 500, and the marginal error of the mean is Δ = 3. Substituting these values into the formula, we get:

$n = \frac{2^2 \cdot 10^2 \cdot 500}{3^2 \cdot 500 + 2^2 \cdot 10^2} = \frac{200000}{4900} \approx 41$

i.e. a sample of 41 enterprises is enough to estimate the required parameter, the speed of settlements with creditors.
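The sample size in Example 4 can be reproduced with a minimal sketch. Table 9.4 is not reproduced in the text, so the sketch assumes the standard formulas n = t²s²N / (Δ²N + t²s²) for non-repeated selection and n = t²s²/Δ² for repeated selection; the function names are illustrative. The result is rounded up, since a fractional unit cannot be sampled.

```python
import math


def sample_size_repeated(t: float, s: float, delta: float) -> int:
    """Required n for repeated (with replacement) random selection."""
    return math.ceil(t**2 * s**2 / delta**2)


def sample_size_non_repeated(t: float, s: float, delta: float, N: int) -> int:
    """Required n for non-repeated selection from a population of size N."""
    n = (t**2 * s**2 * N) / (delta**2 * N + t**2 * s**2)
    return math.ceil(n)


n_non_repeated = sample_size_non_repeated(t=2, s=10, delta=3, N=500)
n_repeated = sample_size_repeated(t=2, s=10, delta=3)
print(f"non-repeated selection: n = {n_non_repeated}")
print(f"repeated selection:     n = {n_repeated}")
```

The finite-population correction reduces the requirement slightly compared with repeated selection, because each selected enterprise removes some uncertainty about a population of only 500.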

General population

The totality of objects of observation (people, households, enterprises, settlements, etc.) that share a certain set of characteristics (gender, age, income, number, turnover, etc.) and are limited in space and time. Examples of populations:

  • All residents of Moscow (10.6 million people according to the 2002 census)
  • Muscovite men (4.9 million according to the 2002 census)
  • Russian legal entities (2.2 million at the beginning of 2005)
  • Retail outlets selling food products (20 thousand at the beginning of 2008), etc.

Sample (Sample population)

Part of the objects from the population selected for study in order to draw a conclusion about the entire population. In order for the conclusion obtained by studying the sample to be extended to the entire population, the sample must have the property of being representative.

Sample representativeness

The property of the sample to correctly reflect the general population. The same sample may or may not be representative of different populations.
Example:

  • A sample consisting entirely of Muscovites who own a car does not represent the entire population of Moscow.
  • The sample of Russian enterprises with up to 100 employees does not represent all enterprises in Russia.
  • The sample of Muscovites making purchases in the market does not represent the purchasing behavior of all Muscovites.

At the same time, these samples (subject to other conditions) can perfectly represent Muscovite car owners, small and medium-sized Russian enterprises and buyers making purchases in the markets, respectively.
It is important to understand that sample representativeness and sampling error are different phenomena. Representativeness, unlike error, does not depend on sample size.
Example:
No matter how much we increase the number of surveyed Muscovites-car owners, we will not be able to represent all Muscovites with this sample.

Sampling error (confidence interval)

The deviation of the results obtained with the help of sample observation from the true data of the general population.
There are two types of sampling error: statistical and systematic. The statistical error depends on the sample size. The larger the sample size, the lower it is.
Example:
For a simple random sample of 400 units, the maximum statistical error (with 95% confidence) is 5%, for a sample of 600 units - 4%, for a sample of 1100 units - 3% .
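These figures can be reproduced with a short sketch. It assumes the worst case of a dichotomous variable (share p = 0.5, which gives the maximum variance 0.25) and a population large enough that no finite-population correction is needed; the function name is illustrative.

```python
import math
from statistics import NormalDist


def max_margin_of_error(n: int, confidence: float = 0.95) -> float:
    """Maximum statistical error of a share for a simple random sample."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    # p * (1 - p) is largest at p = 0.5, i.e. 0.25
    return z * math.sqrt(0.25 / n)


for n in (400, 600, 1100):
    print(f"n = {n}: error = {max_margin_of_error(n):.1%}")
```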
The systematic error depends on various factors that have a constant impact on the study and bias the results of the study in a certain direction.
Example:

  • Any probability sample underestimates the share of high-income people who lead an active lifestyle, because such people are much harder to find in any particular place (for example, at home).
  • Respondents who refuse to answer questions (the share of refusals in Moscow ranges from 50% to 80%, depending on the survey).

In some cases, when true distributions are known, bias can be leveled out by introducing quotas or reweighting the data, but in most real studies, even estimating it can be quite problematic.

Sample types

Samples are divided into two types:

  • probability
  • non-probability

1. Probability samples
1.1 Random sampling (simple random selection)
Such a sample assumes that the general population is homogeneous, that all elements are equally accessible, and that a complete list of all elements is available. Elements are usually selected using a table of random numbers.
1.2 Mechanical (systematic) sampling
A kind of random sample in which the population is ordered by some attribute (alphabetical order, phone number, date of birth, etc.). The first element is selected randomly, and then every k-th element is selected, i.e. with step k, until n elements are chosen. The size of the general population is then N = n*k.
1.3 Stratified (zoned) sampling
It is used when the general population is heterogeneous. The general population is divided into groups (strata), and within each stratum selection is carried out randomly or mechanically.
1.4 Serial (nested or cluster) sampling
With serial sampling, the units of selection are not the objects themselves but groups (clusters or nests). Groups are selected randomly, and all objects within the selected groups are surveyed.
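The four probability designs described above can be sketched in a few lines. This is an illustration under simplifying assumptions (a numbered list stands in for the sampling frame, strata are sampled proportionally, and the function names and toy data are hypothetical), not a production sampling tool.

```python
import random


def simple_random_sample(population, n, rng):
    """Every unit has the same chance; requires a complete list."""
    return rng.sample(population, n)


def systematic_sample(population, n, rng):
    """Random start, then every k-th unit, where k = N // n."""
    k = len(population) // n
    start = rng.randrange(k)
    return [population[start + i * k] for i in range(n)]


def stratified_sample(strata, fraction, rng):
    """Random selection within each stratum, proportional to its size."""
    sample = []
    for units in strata.values():
        size = max(1, round(len(units) * fraction))
        sample.extend(rng.sample(units, size))
    return sample


def cluster_sample(clusters, n_clusters, rng):
    """Select whole groups at random and survey every unit inside them."""
    chosen = rng.sample(list(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]


rng = random.Random(0)
population = list(range(1000))
strata = {"urban": population[:600], "rural": population[600:]}
clusters = {i: population[i * 10:(i + 1) * 10] for i in range(100)}

srs = simple_random_sample(population, 50, rng)
systematic = systematic_sample(population, 50, rng)
stratified = stratified_sample(strata, 0.05, rng)
clustered = cluster_sample(clusters, 5, rng)
print(len(srs), len(systematic), len(stratified), len(clustered))
```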

2. Non-probability samples
The selection in such a sample is carried out not according to the principles of chance, but according to subjective criteria - accessibility, typicality, equal representation, etc.
2.1. Quota sampling
First, a certain number of groups of objects are defined (for example, men aged 20-30, 31-45 and 46-60; persons with an income of up to 30 thousand rubles, from 30 to 60 thousand rubles, and more than 60 thousand rubles). For each group, the number of objects to be surveyed is specified, most often either in proportion to the previously known share of the group in the general population or equal for each group. Within the groups, objects are selected randomly. Quota sampling is used quite often.
2.2. Snowball Method
The sample is constructed as follows. Each respondent, starting with the first, is asked to contact his friends, colleagues, acquaintances who would fit the selection conditions and could take part in the study. Thus, with the exception of the first step, the sample is formed with the participation of the objects of study themselves. The method is often used when it is necessary to find and interview hard-to-reach groups of respondents (for example, respondents with a high income, respondents belonging to the same professional group, respondents who have some similar hobbies / passions, etc.)
2.3 Spontaneous sampling
The most accessible respondents are surveyed. Typical examples of spontaneous samples are questionnaires printed in newspapers and magazines for respondents to complete themselves, and most Internet surveys. The size and composition of a spontaneous sample are not known in advance and are determined by a single parameter: the activity of the respondents.
2.4 Sample of typical cases
Units of the general population are selected that have an average (typical) value of the attribute. This raises the problem of choosing a feature and determining its typical value.


One of the main components of a well-designed study is defining the sample and understanding what makes it representative. Think of a cake: you do not need to eat the whole dessert to know how it tastes. A small piece is enough.

The cake is the general population (that is, all respondents who qualify for the survey). It can be defined by territory (for example, only residents of the Moscow region), by gender (only women), or by age (Russians over 65).

Counting the general population is difficult: it requires data from a population census or from preliminary assessment surveys. Therefore the general population is usually estimated, and the sampling frame, or sample, is calculated from the resulting number.

What is a representative sample?

A sample is a well-defined number of respondents. Its structure should match the structure of the general population as closely as possible in terms of the main selection characteristics.

For example, if the potential respondents are the entire population of Russia, of whom 54% are women and 46% are men, then the sample should contain the same percentages. If the parameters match, the sample can be called representative. This means that inaccuracies and errors in the study are minimized.

The sample size is determined by balancing the requirements of accuracy and economy, which pull in opposite directions: the larger the sample, the more accurate the result, but the higher the cost of the study. Conversely, the smaller the sample, the less it costs, but the less accurately and the more randomly it reproduces the properties of the general population.

Therefore, sociologists use a formula, and calculators built on it, to determine the sample size.

Confidence level and confidence error

What do the terms "confidence level" and "confidence error" mean? The confidence level is a measure of how reliable the measurements are. The confidence error is the possible error in the results of the study. For example, for a general population of more than 500,000 people (for example, the residents of Novokuznetsk), the sample will be 384 people at a confidence level of 95% and an error of 5% (that is, a confidence interval of ±5% at 95% confidence).

What follows from this? If 100 studies are conducted with such a sample (384 people), then in 95 percent of cases the answers received will, according to the laws of statistics, lie within ±5% of the true value. We thus obtain a representative sample with a minimal probability of statistical error.
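The figure of 384 respondents can be reproduced with a minimal sketch of the kind of formula such calculators typically use. It assumes a very large population (no finite-population correction) and the worst-case share p = 0.5; the function name and defaults are illustrative, not the formula of any particular calculator.

```python
from statistics import NormalDist


def sample_size_for_share(confidence: float, error: float, p: float = 0.5) -> float:
    """Unrounded sample size needed to estimate a share within +/- error."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return z**2 * p * (1 - p) / error**2


n = sample_size_for_share(confidence=0.95, error=0.05)
print(f"n = {n:.1f}, reported as {round(n)}")
```

The unrounded value is slightly above 384; the text reports the nearest whole number, whereas rounding up would give one respondent more.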
