Problems of multivariate statistical analysis

When carrying out a statistical analysis, it is first necessary to define which objects we want to analyze and for what purpose, i.e. to formulate the goal of the statistical research. If information about the objects has not yet been collected, it is also necessary to decide which objects to choose and how to choose them, which characteristics of the objects are important, and how to obtain information about these characteristics.

A statistical set of objects consists of those objects, phenomena, events, etc. that fall within the researcher's sphere of interest when solving a specific analysis problem. It may be, for example, the set of all enterprises of some branch of industry, or all patients suffering from some illness.

A sample of objects is the part of a statistical set about which information is known to the researcher. Most often, the sample o1,...,oN is formed by random selection of some representatives of the set. The number N of these representatives is called the sample size (the volume of the sample).
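The random selection described above can be sketched in Python as follows. This is only an illustration: the population of 1000 enterprises, the names `population` and `sample`, and the seed are assumptions made for the example.

```python
import random

# Hypothetical statistical set: 1000 enterprises identified by a label.
population = [f"enterprise_{i}" for i in range(1000)]

N = 50                  # sample size (volume of the sample)
rng = random.Random(0)  # fixed seed so that the selection is reproducible

# Random selection without replacement of N representatives o1,...,oN.
sample = rng.sample(population, N)

print(len(sample))  # 50
```

Selection without replacement guarantees that no object enters the sample twice.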

A characteristic of an object is a property on the basis of which objects can be described and distinguished. Examples are the number of employees of an enterprise or the age of a patient. Equivalent names such as parameter, attribute and factor are frequently used; in mathematical statistics the term variable is used. To describe a characteristic, it is necessary to give its name and its set of values. There are the following basic types of characteristics: quantitative, qualitative and ordered.

With values of quantitative characteristics it is possible to carry out various arithmetic operations: addition, multiplication, division, etc. With qualitative characteristics such operations cannot be carried out, but it is possible to check whether two values coincide. For example, there is no sense in trying to divide one profession by another. With ordered characteristics it is additionally possible to compare values according to the given order. For example, it is possible to compare the grades of various products or to tell which of two military officers holds the higher rank.

For convenience of analysis, the values of an ordered characteristic can be ranked, i.e. each value is assigned a rank (an ordinal number) in increasing or decreasing order.
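A minimal sketch of such a ranking in Python is given below. The grade names and their order are hypothetical; rank 1 is assigned to the lowest grade, i.e. ranks follow increasing order.

```python
# Hypothetical ordered characteristic: product grade, from lowest to highest.
grade_order = ["third", "second", "first", "premium"]

# Assign each value its rank (ordinal number), starting from 1.
rank_of = {grade: rank for rank, grade in enumerate(grade_order, start=1)}

observed = ["first", "third", "premium", "second", "first"]
ranks = [rank_of[g] for g in observed]

print(ranks)  # [3, 1, 4, 2, 3]
```

After ranking, values can be compared with the usual numeric comparisons, although arithmetic on ranks still has no direct meaning.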

The set of characteristics contains the various characteristics X1,...,Xn by which objects are described. The set X={X1,...,Xn} may contain characteristics of one type or characteristics of different types (for example, both quantitative and qualitative).

The set of possible values that X may take is called the space of characteristics. The set of characteristics may also include dependent characteristics Y1,...,Ym, i.e. characteristics each of which depends on the other characteristics X. We shall consider the case m=1, i.e. where there is one dependent characteristic Y.

The term observation can be understood in two ways: as the process of measuring characteristics and as the result of this process. If the result of observing the characteristics of some object can change by chance, we speak of a random observation. For any object o, the set of observations of its characteristics is the tuple x=x(o)=(x1(o),...,xn(o)), where xj(o) denotes the value of characteristic Xj for object o.

The set of observations is the set of measurements of characteristics for the objects of the sample. It is usually represented as a data table with N rows and n columns: Data={xij}, where the value xij, located at the intersection of the i-th row and the j-th column, is the observation of the j-th characteristic of the i-th object: xij=xj(oi).
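The data table can be sketched in Python as a list of rows. The three enterprises, the two characteristics and the helper `x` are hypothetical and serve only to show how the indices i and j address a value.

```python
# Hypothetical sample of N = 3 enterprises described by n = 2 characteristics:
# X1 = number of employees (quantitative), X2 = branch (qualitative).
data = [
    [120, "textile"],  # object o1
    [45, "food"],      # object o2
    [300, "textile"],  # object o3
]

N = len(data)     # number of rows (objects)
n = len(data[0])  # number of columns (characteristics)


def x(i, j):
    """Return x_ij using the 1-based indices of the text."""
    return data[i - 1][j - 1]


print(N, n)     # 3 2
print(x(2, 1))  # 45
```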

For various reasons, some observations of a characteristic may remain unknown. In this case we say that the data table contains missing values. Such missing values are coded by a special number or a special symbol.
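One possible coding of missing values is sketched below in Python. Using NaN as the special code is an assumption of this example, not a convention fixed by the text; the ages are invented.

```python
import math

MISSING = float("nan")  # assumed special code for a missing value

# Hypothetical observations of the characteristic "age" for five patients.
ages = [34.0, MISSING, 51.0, 47.0, MISSING]

# Separate the known observations from the missing ones.
known = [a for a in ages if not math.isnan(a)]
n_missing = len(ages) - len(known)

# Statistics are computed over the known observations only.
mean_known = sum(known) / len(known)

print(n_missing)   # 2
print(mean_known)  # 44.0
```

NaN must be detected with `math.isnan`, because a NaN value does not compare equal to itself.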

A time series is a set of observations of a characteristic of one object at various moments of time t1,...,tN. A multivariate time series is a set of observations of several characteristics of one object at these moments.

The main goal of statistical analysis is to use the given set of observations to detect latent statistical regularities in the data, to establish the influences between the given random characteristics and other characteristics, and to construct a model of dependence. The given set of observations is also called the training sample (by analogy with the process of training a person).

A model of dependence is a mathematical description of how one or several characteristics depend on other characteristics. The model can be expressed as a formula, an equation or system of equations, a set of logical statements, or graphically as a decision tree. The model can be used for forecasting the values of a characteristic from the values of other characteristics. In this way a correspondence between the sets of values of these characteristics is established; this correspondence is referred to as a decision function.
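As an illustration of a model written as a set of logical statements, the following Python sketch encodes a small decision tree as a decision function. The characteristics, the thresholds 100 and 250, and the class names are hypothetical and are not derived from any data.

```python
def decision_function(employees, branch):
    """Hypothetical decision tree mapping X1 = employees (quantitative)
    and X2 = branch (qualitative) to a value of the dependent characteristic Y."""
    if branch == "textile":
        # Assumed threshold for the textile branch.
        if employees > 100:
            return "large"
        return "small"
    # Assumed threshold for all other branches.
    if employees > 250:
        return "large"
    return "small"


print(decision_function(120, "textile"))  # large
print(decision_function(120, "food"))     # small
```

Each path from the root of the tree to a leaf corresponds to one logical statement of the form "if ... and ... then Y = ...".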

Let us consider the following basic problem statements of statistical analysis.

Regression Analysis (RA). In this kind of statistical analysis, it is required to find a model of the dependence of a quantitative characteristic Y on other characteristics X1,...,Xn.
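The simplest instance of this problem can be sketched in Python as follows, assuming a single characteristic X and a linear model Y = a*X + b fitted by ordinary least squares. The choice of a linear model, the function name `fit_line` and the data are assumptions of the example; the text itself does not prescribe a model form.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training sample generated exactly by y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # forecast of Y for a new value X = 5
```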

Pattern Recognition Problem (PRP) (synonyms: discriminant analysis, supervised classification). In this case the dependent characteristic is qualitative, and its values are called classes (patterns). It is necessary to find a model that allows the class to be predicted from the values of the other characteristics. We will consider the variant of the recognition problem in which the cost of an error is not taken into account, i.e. it does not matter to which wrong class an observation is assigned instead of its true class; only the fact of an error is important. Sometimes a recognition problem with two classes is treated as a regression analysis problem in which the probability of a class (i.e. a quantitative characteristic) is predicted instead of the class itself.
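When only the fact of an error matters, the quality of a recognition model can be measured by the share of misclassified observations. A minimal Python sketch, with hypothetical class labels and the assumed function name `error_rate`:

```python
def error_rate(true_classes, predicted_classes):
    """Share of observations assigned to a wrong class.

    Every error counts as 1, regardless of which wrong class was chosen."""
    errors = sum(t != p for t, p in zip(true_classes, predicted_classes))
    return errors / len(true_classes)


# Hypothetical true and predicted classes for four objects.
true_classes = ["a", "b", "b", "c"]
predicted_classes = ["a", "c", "b", "a"]

print(error_rate(true_classes, predicted_classes))  # 0.5
```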

If the predicted characteristic is ordered, the problem can also be posed as a regression analysis problem; in this case the ranks are obtained by rounding the values of the predicted quantitative characteristic.
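The rounding step can be sketched in Python as below. Two details are assumptions of this example rather than part of the text: halves are rounded upward, and the result is clipped to the range of valid ranks (here 1 to 4).

```python
import math


def to_rank(predicted_value, lowest=1, highest=4):
    """Convert a predicted quantitative value into a rank."""
    # Round to the nearest integer, with halves rounded upward.
    rank = math.floor(predicted_value + 0.5)
    # Assumed clipping so that the result is always a valid rank.
    return max(lowest, min(highest, rank))


print(to_rank(2.6))  # 3
print(to_rank(0.2))  # 1
print(to_rank(5.3))  # 4
```

The built-in `round` is avoided here because it rounds halves to the nearest even integer, so `round(2.5)` gives 2.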

Time series analysis. It is required to predict the value of some (quantitative or qualitative) characteristic at a future moment of time from the values of all characteristics in the past.
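One common way to prepare a time series for such prediction is to pair each value with a fixed number of preceding values. The Python sketch below assumes a univariate series and a window of length `lag`; both the window approach and the name `lagged_pairs` are illustrative assumptions.

```python
def lagged_pairs(series, lag):
    """Build (past values, next value) pairs from a time series.

    Each pair holds the `lag` most recent observations and the
    observation that follows them."""
    return [(series[t - lag:t], series[t]) for t in range(lag, len(series))]


# Hypothetical observations at the moments t1,...,t5.
series = [10, 12, 13, 15, 18]
pairs = lagged_pairs(series, 2)

print(pairs)  # [([10, 12], 13), ([12, 13], 15), ([13, 15], 18)]
```

The resulting pairs form a training sample in which the past values play the role of X and the next value plays the role of Y.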

Cluster analysis (synonyms: automatic grouping, unsupervised classification). In this kind of analysis there is no dependent characteristic; it is necessary to form groups of objects so that the elements within each group are as close to each other as possible, while the elements of different groups are as far from each other as possible.
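The grouping criterion can be illustrated in Python by comparing distances within groups and between groups for a given grouping. The sketch does not search for the groups; it only evaluates one hypothetical grouping of objects described by a single quantitative characteristic, using the absolute difference as an assumed distance.

```python
from itertools import combinations, product

# Hypothetical grouping of five objects by one quantitative characteristic.
groups = {"A": [1.0, 1.5, 2.0], "B": [8.0, 9.0]}

# Distances between objects belonging to the same group.
within = [
    abs(p - q)
    for members in groups.values()
    for p, q in combinations(members, 2)
]

# Distances between objects belonging to different groups.
between = [
    abs(p - q)
    for g1, g2 in combinations(groups.values(), 2)
    for p, q in product(g1, g2)
]

mean_within = sum(within) / len(within)
mean_between = sum(between) / len(between)

print(mean_within)   # 0.75
print(mean_between)  # 7.0
```

A grouping is good in the sense of the text when the mean distance within groups is small compared with the mean distance between groups.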