Basic characteristics of random variables. Introduction to correlation analysis. Basics of regression analysis

Closeness of the linear relationship between random variables

Having determined the equation of the theoretical regression line, it is necessary to quantify the closeness of the relationship between the two series of observations. The regression lines drawn in Fig. 4.1, b and c are the same, but in Fig. 4.1, b the points lie much closer to the regression line than in Fig. 4.1, c.

In correlation analysis, it is assumed that both factors and responses are random in nature and obey the normal distribution law.

The closeness of the connection between random variables is characterized by the correlation ratio ρ xy. Let us take a closer look at the physical meaning of this indicator. To do this, we introduce some new concepts.

The residual variance s²res characterizes the scatter of the experimentally observed points relative to the regression line and represents an indicator of the error in predicting the parameter y from the regression equation (Fig. 4.6):

s²res = ∑(y i − ŷ i)² / (n − 2),

where ŷ i are the values of y predicted by the regression equation.
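As an illustration, here is a minimal Python sketch (with made-up data) that fits a line by least squares and computes the residual variance with the n − 2 denominator used above:

import numpy as np

# Hypothetical data: observations of y at several values of x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# Fit y = b0 + b1*x by least squares; polyfit returns [slope, intercept].
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

# Residual variance: scatter of observed points around the regression line.
n = len(y)
s2_res = np.sum((y - y_hat) ** 2) / (n - 2)
print(f"s^2_res = {s2_res:.4f}")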
The influence of measurement errors on the value of the correlation coefficient. Suppose we want to estimate the degree of closeness of the correlation between the components of a two-dimensional normal random variable (ξ, η), but we can observe them only with some random measurement errors ε ξ and ε η, respectively (see the diagram of the D2 dependence in the introduction). Therefore, the experimental data (x i, y i), i = 1, 2, ..., n, are in fact sample values of the distorted two-dimensional random variable (ξ′, η′), where ξ′ = ξ + ε ξ and η′ = η + ε η.

The characteristics of the distorted scheme (ξ′, η′) can differ significantly from the corresponding characteristics of the original (undistorted) scheme (ξ, η). So, for example, below (see section 1.1.4) it is shown that the imposition of random normal errors on the original two-dimensional normal scheme (ξ, η) always reduces the absolute value of the regression coefficient in relation (B.15), and also weakens the degree of closeness of the connection (i.e., it reduces the absolute value of the correlation coefficient r).

The method of regression analysis consists of deriving a regression equation (including estimating its parameters), with the help of which the average value of a random variable is found if the value of another variable (or of several others, in the case of multiple or multivariate regression) is known. (In contrast, correlation analysis is used to find and express the strength of relationships between random variables.)

In the study of the correlation of signs that are not associated with a consistent change over time, each sign changes under the influence of many causes, taken as random. In time series, a change over time of each series is added to them. This change leads to so-called autocorrelation - the influence of changes in the levels of previous series on subsequent ones. Therefore, the correlation between the levels of time series correctly shows the closeness of the connection between the phenomena reflected in the time series only if there is no autocorrelation in each of them. In addition, autocorrelation distorts the mean square errors of the regression coefficients, which makes it difficult to construct confidence intervals for the regression coefficients, as well as to test their significance.

The theoretical and sample correlation coefficients determined by relations (1.8) and (1.8), respectively, can be formally calculated for any two-dimensional observation system; they are measures of the degree of closeness of the linear statistical relationship between the analyzed characteristics. However, only in the case of a joint normal distribution of the random variables under study ξ and η does the correlation coefficient r have a clear meaning as a characteristic of the degree of closeness of the connection between them. In particular, in this case the equality r = ±1 indicates a purely functional linear relationship between the quantities under study, and the equality r = 0 indicates their complete mutual independence. In addition, the correlation coefficient, together with the means and variances of the random variables ξ and η, constitutes those five parameters that provide exhaustive information about the two-dimensional normal system under study.

Regression analysis

Processing of experimental results by the method of regression analysis

When studying the processes of functioning of complex systems, one has to deal with a whole series of simultaneously acting random variables. To understand the mechanism of phenomena, the cause-and-effect relationships between elements of the system, etc., we try, based on the observations obtained, to establish the relationships between these quantities.

In mathematical analysis, the dependence between two quantities is expressed by the concept of a function

y = f(x),

where each value of one variable corresponds to only one value of the other. Such a dependence is called functional.

The situation with the concept of dependence of random variables is much more complicated. As a rule, between the random variables (random factors) that determine the functioning of complex systems there is a connection in which, as one quantity changes, the distribution of the other changes. This connection is called stochastic, or probabilistic. The magnitude of the change in the random factor Y corresponding to a change in the value X can be broken down into two components. The first is related to the dependence of Y on X, and the second to the influence of the "own" random components of Y and X. If the first component is missing, then the random variables Y and X are independent. If the second component is missing, then Y and X depend functionally. If both components are present, the relationship between them determines the strength, or closeness, of the connection between the random variables Y and X.

There are various indicators that characterize particular aspects of a stochastic relationship. Thus, a linear relationship between the random variables X and Y is characterized by the correlation coefficient

r xy = M[(X − m x)(Y − m y)] / (σ x · σ y),

where m x, m y are the mathematical expectations of the random variables X and Y, and σ x, σ y are the standard deviations of the random variables X and Y.
A linear probabilistic dependence of random variables means that when one random variable increases, the other tends to increase (or decrease) according to a linear law. If the random variables X and Y are connected by a strict linear functional dependence, for example,

y = b 0 + b 1 x,

then the correlation coefficient is equal to ±1, and its sign corresponds to the sign of the coefficient b 1. If the values X and Y are connected by an arbitrary stochastic dependence, then the correlation coefficient varies within

−1 < r xy < +1.

It should be emphasized that for independent random variables the correlation coefficient is equal to zero. However, the correlation coefficient as an indicator of the dependence between random variables has serious drawbacks. Firstly, the equality r = 0 does not imply independence of the random variables X and Y (except for random variables subject to the normal distribution law, for which r = 0 also means the absence of any dependence). Secondly, the extreme values are also not very informative, since they correspond not to just any functional dependence, but only to a strictly linear one.
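The first drawback is easy to demonstrate numerically. In the Python sketch below (illustrative, with simulated data), Y = X² is completely determined by X, yet the sample correlation coefficient is near zero:

import numpy as np

rng = np.random.default_rng(0)

# X is symmetric around zero; Y is a deterministic function of X.
x = rng.normal(0.0, 1.0, 100_000)
y = x ** 2

# The sample correlation coefficient is near zero, although Y is
# completely determined by X - r = 0 does not imply independence.
r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.4f}")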



A full description of the dependence of Y on X, expressed, moreover, in exact functional relationships, can be obtained by knowing the conditional distribution function.

It should be noted that in this case one of the observed variables is considered non-random. When the values of the two random variables X and Y are fixed simultaneously and compared, all errors can be attributed to the value Y alone. Thus, the observation error consists of the own random error of the value Y and of the comparison error arising because the value of Y is compared with a value of X that is not exactly the one that actually occurred.

However, finding the conditional distribution function, as a rule, turns out to be a very difficult task. The relationship between X and Y is easiest to investigate when Y is normally distributed, since a normal distribution is completely determined by its mathematical expectation and variance. In this case, to describe the dependence of Y on X there is no need to build the conditional distribution function; it is enough to indicate how the mathematical expectation and the variance of the quantity Y change as the parameter X changes.

Thus, we come to the need to find only two functions: the conditional mathematical expectation M[Y|x] and the conditional variance D[Y|x] as functions of x.

The dependence of the conditional variance D[Y|x] on the parameter x is called the skedastic dependence. It characterizes the change in the accuracy of the observation technique as the parameter changes and is used rather rarely.

The dependence of the conditional mathematical expectation M[Y|x] on x is called the regression; it gives the true dependence between the quantities X and Y, devoid of all random overlays. Therefore, the ideal goal of any study of dependent variables is to find the regression equation, while the variance is used only to assess the accuracy of the result obtained.

The purpose of correlation analysis is to obtain an estimate of the strength of the connection between random variables (features) that characterize some real process.
Problems of correlation analysis:
a) Measuring the degree of coherence (closeness, strength, severity, intensity) of two or more phenomena.
b) Selection of factors that have the most significant impact on the resulting attribute, based on measuring the degree of connectivity between phenomena. Factors that are significant in this aspect are used further in regression analysis.
c) Detection of unknown causal relationships.

The forms of manifestation of relationships are very diverse. The most common types are functional (complete) and correlation (incomplete) connection.
Correlation manifests itself on average over mass observations, when a given value of the independent variable corresponds to a whole series of probable values of the dependent variable. A relationship is called correlational if each value of the factor characteristic corresponds to a well-defined average value of the resultant characteristic.
A visual representation of a correlation table is the correlation field. It is a graph where X values ​​are plotted on the abscissa axis, Y values ​​are plotted on the ordinate axis, and combinations of X and Y are shown by dots. By the location of the dots, one can judge the presence of a connection.
Indicators of connection closeness make it possible to characterize the dependence of the variation of the resulting trait on the variation of the factor trait.
A more refined indicator of the closeness of a correlation connection is the linear correlation coefficient. When calculating this indicator, not only the signs of the deviations of individual values of a characteristic from the average are taken into account, but also the magnitudes of these deviations.

The key questions of this topic are: the regression equation between the resulting characteristic and the explanatory variable, the least squares method for estimating the parameters of the regression model, analysis of the quality of the resulting regression equation, and the construction of confidence intervals for predicting the values of the resulting characteristic using the regression equation.

Example 2


System of normal equations:
a·n + b·∑x = ∑y
a·∑x + b·∑x² = ∑y·x
For our data, the system of equations has the form
30a + 5763 b = 21460
5763 a + 1200261 b = 3800360
From the first equation we express a and substitute it into the second equation.
We get b = -3.46, a = 1379.33
Regression equation:
y = -3.46 x + 1379.33
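The same coefficients can be obtained by solving the normal equations directly from the totals stated in this example; a short Python sketch using numpy (only the sums given above are assumed):

import numpy as np

# Totals given in this example: n = 30, sum(x) = 5763, sum(x^2) = 1200261,
# sum(y) = 21460, sum(x*y) = 3800360.
n, sx, sxx = 30, 5763, 1200261
sy, sxy = 21460, 3800360

# Normal equations:  a*n  + b*sum(x)   = sum(y)
#                    a*sx + b*sum(x^2) = sum(x*y)
A = np.array([[n, sx], [sx, sxx]], dtype=float)
a, b = np.linalg.solve(A, np.array([sy, sxy], dtype=float))
print(f"y = {b:.2f} x + {a:.2f}")  # y = -3.46 x + 1379.33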

2. Calculation of regression equation parameters.
Sample means.



Sample variances:


Standard deviation


1.1. Correlation coefficient
Covariance.

We calculate the indicator of connection closeness. This indicator is the sample linear correlation coefficient, which is calculated by the formula:

r xy = cov(x, y) / (S x · S y).

The linear correlation coefficient takes values ​​from –1 to +1.
Connections between characteristics can be weak and strong (close). Their criteria are assessed on the Chaddock scale:
0.1 < r xy < 0.3: weak;
0.3 < r xy < 0.5: moderate;
0.5 < r xy < 0.7: noticeable;
0.7 < r xy < 0.9: high;
0.9 < r xy < 1: very high.
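A small Python sketch of such a classifier (an illustration of the scale above, not part of the original solution):

def chaddock(r: float) -> str:
    """Classify the closeness of a connection by |r| on the Chaddock scale."""
    a = abs(r)
    if a < 0.1:
        return "practically absent"
    for bound, label in [(0.3, "weak"), (0.5, "moderate"), (0.7, "noticeable"),
                         (0.9, "high"), (1.0, "very high")]:
        if a < bound:
            return label
    return "functional"

print(chaddock(-0.74))  # "high"; the sign of r shows the direction (inverse here)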
In our example, the relationship between trait Y and factor X is high and inverse.
In addition, the linear pair correlation coefficient can be determined through the regression coefficient b:

r xy = b · (S x / S y).

1.2. Regression equation (estimation of the regression equation).

The linear regression equation is y = -3.46 x + 1379.33

The coefficient b = -3.46 shows the average change in the effective indicator (in units of measurement of y) with an increase or decrease in the value of the factor x per unit of its measurement. In this example, with an increase of x by 1 unit, y decreases by 3.46 on average.
The coefficient a = 1379.33 formally shows the predicted level of y at x = 0, but only if x = 0 is close to the sample values of x.
If x = 0 is far from the sample values of x, a literal interpretation may lead to incorrect results; even if the regression line describes the observed sample values fairly accurately, there is no guarantee that this will also be the case when extrapolating to the left or to the right.
By substituting the appropriate x values ​​into the regression equation, we can determine the aligned (predicted) values ​​of the performance indicator y(x) for each observation.
The sign of the regression coefficient b determines the nature of the relationship between y and x (if b > 0, the relationship is direct; otherwise it is inverse). In our example, the relationship is inverse.
1.3. Elasticity coefficient.
It is not advisable to use regression coefficients (in our example, b) for a direct assessment of the influence of factors on the resultant characteristic if the units of measurement of the resulting indicator y and of the factor characteristic x differ.
For these purposes, elasticity coefficients and beta coefficients are calculated.
The average elasticity coefficient E shows by what percentage, on average, the result y changes from its average value when the factor x changes by 1% of its average value.
The elasticity coefficient is found by the formula:

E = b · (x̄ / ȳ).


The absolute value of the elasticity coefficient is less than 1. Therefore, if X changes by 1%, Y will change by less than 1%. In other words, the influence of X on Y is not significant.
The beta coefficient shows by what part of its standard deviation the average value of the resulting characteristic changes when the factor characteristic changes by its standard deviation, with the remaining independent variables fixed at a constant level:

β = b · (S x / S y).
That is, an increase in x by the standard deviation S x will lead to a decrease in the average value of Y by 0.74 of the standard deviation S y.
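A short Python sketch of both calculations (the slope b = -3.46 is from this example; the means and standard deviations below are placeholder values, chosen only so that the results agree with those quoted in the text):

# The slope b = -3.46 is taken from this example; the means and standard
# deviations below are hypothetical placeholders.
b = -3.46
x_mean, y_mean = 192.1, 715.3   # hypothetical sample means
s_x, s_y = 30.0, 140.0          # hypothetical standard deviations

E = b * x_mean / y_mean          # average elasticity coefficient, E = b*x_mean/y_mean
beta = b * s_x / s_y             # beta coefficient, beta = b*S_x/S_y
print(f"E = {E:.3f}, beta = {beta:.3f}")  # |E| < 1; beta = -0.74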
1.4. Approximation error.
Let us evaluate the quality of the regression equation using the mean approximation error - the average relative deviation of the calculated values from the actual ones:

A = (1/n) · ∑ |(y i − y(x i)) / y i| · 100%.

Since the error is less than 15%, this equation can be used as a regression model.
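A minimal Python sketch of this error measure (the actual and fitted values here are hypothetical):

import numpy as np

# Hypothetical actual values y and fitted values y(x).
y = np.array([1200.0, 1150.0, 980.0, 900.0, 850.0])
y_hat = np.array([1185.0, 1120.0, 1010.0, 930.0, 820.0])

# Mean approximation error: average relative deviation, in percent.
A = np.mean(np.abs((y - y_hat) / y)) * 100
print(f"A = {A:.2f}%")  # below 15% -> the equation is usable as a regression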
Analysis of variance.
The purpose of analysis of variance is to analyze the variance of the dependent variable:
∑(y i − ȳ)² = ∑(y(x i) − ȳ)² + ∑(y i − y(x i))²
where
∑(y i − ȳ)² - total sum of squared deviations;
∑(y(x i) − ȳ)² - sum of squared deviations due to regression ("explained" or "factorial");
∑(y i − y(x i))² - residual sum of squared deviations.
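This decomposition can be verified numerically; a Python sketch with hypothetical data:

import numpy as np

# Hypothetical data to verify the decomposition TSS = ESS + RSS.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.2, 4.1, 3.9, 2.8, 2.1])

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

tss = np.sum((y - y.mean()) ** 2)      # total sum of squared deviations
ess = np.sum((y_hat - y.mean()) ** 2)  # "explained" (factorial) sum
rss = np.sum((y - y_hat) ** 2)         # residual sum
print(f"TSS = {tss:.4f}, ESS + RSS = {ess + rss:.4f}")  # the two coincide
print(f"R^2 = {ess / tss:.4f}")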
The theoretical correlation ratio for a linear connection is equal to the correlation coefficient r xy.
For any form of dependence, the closeness of the connection is determined using the multiple correlation coefficient:

R = √(1 − ∑(y i − y(x i))² / ∑(y i − ȳ)²).
This coefficient is universal, as it reflects the closeness of the relationship and the accuracy of the model, and can also be used for any form of connection between variables. When constructing a one-factor correlation model, the multiple correlation coefficient is equal to the pair correlation coefficient r xy.
1.6. Coefficient of determination.
The square of the (multiple) correlation coefficient is called the coefficient of determination, which shows the proportion of variation in the resultant attribute explained by the variation in the factor attribute.
Most often, when interpreting the coefficient of determination, it is expressed as a percentage.
R² = (−0.74)² = 0.5413
That is, in 54.13% of cases changes in x lead to changes in y; in other words, 54.13% of the variation in y is explained by the variation in x, so the accuracy of the fit of the regression equation is average. The remaining 45.87% of the change in Y is explained by factors not taken into account in the model.


The company employs 10 people. Table 2 shows data on their work experience and monthly salary.

Using these data, calculate:

  • the value of the sample covariance estimate;
  • the value of the sample Pearson correlation coefficient;
  • estimate the direction and strength of the connection from the obtained values;
  • determine how legitimate it is to say that this company uses the Japanese management model, which assumes that the more time an employee spends in a given company, the higher his salary should be.

Based on the correlation field, we can hypothesize (for the population) that the relationship between all possible values ​​of X and Y is linear.

To calculate the regression parameters, we will build a calculation table.

Sample means.

Sample variances:

The estimated regression equation will be

y = bx + a + e,

where e i are the observed values (estimates) of the errors ε i, and a and b are the estimates of the parameters α and β of the regression model, which are to be found.

To estimate the parameters α and β, the least squares method is used.

System of normal equations:

a·n + b·∑x = ∑y
a·∑x + b·∑x² = ∑y·x

For our data, the system of equations has the form

  • 10a + 307 b = 33300
  • 307 a + 10857 b = 1127700

Let's multiply equation (1) of the system by (-30.7), we get a system that we solve by the method of algebraic addition.

  • -307a -9424.9 b = -1022310
  • 307 a + 10857 b = 1127700

We get:

1432.1 b = 105390

From which b = 73.5912.

Now let’s find the coefficient “a” from equation (1):

  • 10a + 307 b = 33300
  • 10a + 307 * 73.5912 = 33300
  • 10a = 10707.49

We obtain empirical regression coefficients: b = 73.5912, a = 1070.7492

Regression equation (empirical regression equation):

y = 73.5912 x + 1070.7492
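As a cross-check, the same system can be solved directly from the totals; a Python sketch using numpy (only the sums given above are assumed):

import numpy as np

# Totals for this example: n = 10, sum(x) = 307, sum(x^2) = 10857,
# sum(y) = 33300, sum(x*y) = 1127700.
n, sx, sxx = 10, 307, 10857
sy, sxy = 33300, 1127700

A = np.array([[n, sx], [sx, sxx]], dtype=float)
a, b = np.linalg.solve(A, np.array([sy, sxy], dtype=float))
print(f"y = {b:.4f} x + {a:.4f}")  # y = 73.5912 x + 1070.7492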

Covariance.

In our example, the connection between trait Y and factor X is high and direct.

Therefore, we can safely say that the more time an employee works in a given company, the higher his salary.

4. Testing statistical hypotheses. When solving this problem, the first step is to formulate the hypothesis to be tested and the alternative to it.

Testing the equality of population proportions.

A study was conducted on student performance at two faculties. The results for the options are given in Table 3. Is it possible to say that both faculties have the same percentage of excellent students?

Simple arithmetic average

We test the hypothesis of the equality of the population proportions:

Let's find the experimental value of the Student's criterion:

Number of degrees of freedom

f = n x + n y − 2 = 2 + 2 − 2 = 2

We determine the critical value t cr from the Student distribution table.

Using the table of critical points of the Student distribution at the significance level α = 0.05 and the given number of degrees of freedom, we find:

t cr(f; α/2) = t cr(2; 0.025) = 4.303

Since t obs > t cr, the null hypothesis is rejected: the population proportions of the two samples are not equal.
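Table 3 is not reproduced here, so as an illustration only, here is a Python sketch of a standard two-proportion test in its pooled z-statistic formulation (a common alternative to the Student criterion used above; the counts are hypothetical):

from math import sqrt

# Hypothetical counts: m1 of n1 students are excellent at the first faculty,
# m2 of n2 at the second (Table 3 is not reproduced here).
m1, n1 = 12, 80
m2, n2 = 25, 100

p1, p2 = m1 / n1, m2 / n2
p = (m1 + m2) / (n1 + n2)  # pooled proportion under the null hypothesis
z = (p1 - p2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
print(f"z = {z:.3f}")  # compare |z| with the critical value (1.96 at significance 0.05)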

Testing the uniformity of the population distribution.

University officials want to find out how the popularity of the humanities department has changed over time. The number of applicants who applied to this faculty was analyzed in relation to the total number of applicants in the corresponding year. (Data are given in Table 4). If we consider the number of applicants to be a representative sample of the total number of school graduates of the year, can we say that the interest of schoolchildren in the specialties of this faculty does not change over time?

Option 4

Solution: Table for calculating indicators.

(The calculation table lists, for each interval: the middle of the interval x i, the accumulated frequency S, and the relative frequency f i/n.)

To evaluate the distribution series, we find the following indicators:

Weighted average

The range of variation is the difference between the maximum and minimum values of the characteristic of the primary series:

R = 2008 − 1988 = 20

The variance characterizes the measure of scatter of the data around the average value (a measure of dispersion, i.e., deviation from the average).

Standard deviation (average sampling error).

Each value of the series differs from the average value 2002.66 by 6.32 on average.

Testing the hypothesis about the uniform distribution of the population.

In order to test the hypothesis that X is uniformly distributed, i.e. according to the law f(x) = 1/(b − a) on the interval (a, b), it is necessary to:

Estimate the parameters a and b - the ends of the interval in which the possible values of X were observed - using the formulas (the * sign denotes a parameter estimate):

a* = x̄ − √3·S, b* = x̄ + √3·S

Find the probability density of the expected distribution f(x) = 1/(b* - a*)

Find theoretical frequencies:

n 1 = n·P 1 = n·[1/(b* − a*)]·(x 1 − a*)

n 2 = n 3 = ... = n s−1 = n·[1/(b* − a*)]·(x i − x i−1)

n s = n·[1/(b* − a*)]·(b* − x s−1)

Compare the empirical and theoretical frequencies using the Pearson χ² criterion, taking the number of degrees of freedom k = s − 3, where s is the number of initial sampling intervals; if small frequencies (and therefore the intervals themselves) have been combined, then s is the number of intervals remaining after the combination. Let us find the estimates a* and b* of the parameters of the uniform distribution using the formulas above:

a* = 2002.66 − √3·6.32 = 1991.71, b* = 2002.66 + √3·6.32 = 2013.62

Let us find the density of the assumed uniform distribution:

f(x) = 1/(b* - a*) = 1/(2013.62 - 1991.71) = 0.0456

Let's find the theoretical frequencies:

n 1 = n·f(x)·(x 1 − a*) = 0.77 · 0.0456 · (1992 − 1991.71) = 0.0102

n i = n·f(x)·(x i − x i−1) for the intermediate intervals

n 5 = n·f(x)·(b* − x 4) = 0.77 · 0.0456 · (2013.62 − 2008) = 0.2

Since the Pearson statistic measures the difference between the empirical and theoretical distributions, the greater its observed value χ² obs, the stronger the argument against the main hypothesis.

Therefore, the critical region for this statistic is always right-tailed.
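The whole procedure can be collected into a short Python sketch (the interval data below are hypothetical; the estimates a*, b* and the theoretical frequencies follow the formulas above):

import numpy as np
from math import sqrt

# Hypothetical grouped data: interval edges and empirical frequencies.
edges = np.array([1988.0, 1992.0, 1996.0, 2000.0, 2004.0, 2008.0])
freqs = np.array([3.0, 5.0, 6.0, 5.0, 4.0])
n = freqs.sum()
mids = (edges[:-1] + edges[1:]) / 2

# Weighted mean and standard deviation of the grouped series.
x_bar = np.sum(mids * freqs) / n
s = sqrt(np.sum((mids - x_bar) ** 2 * freqs) / n)

# Method-of-moments estimates of the interval ends and the uniform density.
a_star = x_bar - sqrt(3) * s
b_star = x_bar + sqrt(3) * s
f = 1 / (b_star - a_star)

# Theoretical frequencies according to the scheme above.
t = n * f * (edges[1:] - edges[:-1])  # intermediate intervals
t[0] = n * f * (edges[1] - a_star)    # first interval starts at a*
t[-1] = n * f * (b_star - edges[-2])  # last interval ends at b*

chi2 = np.sum((freqs - t) ** 2 / t)
print(f"chi^2_obs = {chi2:.3f}, degrees of freedom k = {len(freqs) - 3}")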