Problems of multivariate statistical analysis

When carrying out a statistical analysis, it is first necessary to define which objects we want to analyze and for what purpose, i.e. to formulate the purpose of the statistical research. If information about the objects of analysis has not yet been collected, it is also necessary to decide which objects to select and how to select them, which characteristics of the objects are important, and how to obtain information about these characteristics.

*A statistical set of objects* consists of those objects, phenomena, events, etc. that fall within the researcher's sphere of interest when solving a specific analysis problem. It may be, for example, the set of all enterprises in some industry, or all patients suffering from some illness.

*A sample of objects* is the part of a statistical set about which information is known to the researcher. Most often, the sample *o*^{1},...,*o*^{N} is formed by random selection of some representatives of the set. The number of objects in the sample is denoted *N*.

*A characteristic of an object* is a property on the basis of which objects can be described and distinguished. For example, it may be the number of employees of an enterprise or the age of a patient. Equivalent names such as parameter, attribute and factor are frequently used; in mathematical statistics the term *variable* is used. To describe a characteristic, it is necessary to give its name and its set of values. There are the following basic types of characteristics: *quantitative*, *qualitative* and *ordered*.

With the values of *quantitative* characteristics, it is possible to carry out various arithmetic operations: addition, multiplication, division, etc. With *qualitative* characteristics such operations cannot be carried out, but it is possible to check whether two values coincide. For example, it makes no sense to divide one profession by another. The values of *ordered* characteristics can be compared according to a given order. For example, it is possible to compare the grades of various products or to tell which of two military officers holds the higher rank.

For convenience of analysis, ordered characteristics can be ranked, i.e. each value is assigned a rank (ordinal number) in increasing or decreasing order.
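As a minimal sketch of ranking, assume a hypothetical ordered characteristic "product grade" whose values and order (`GRADE_ORDER`) are invented here for illustration. Each value is replaced by its position in the given order:

```python
# Hypothetical ordered characteristic: product grade, listed from lowest to highest.
GRADE_ORDER = ["third", "second", "first", "premium"]

# The rank of a value is its position in the given order, starting from 1.
RANK = {value: position for position, value in enumerate(GRADE_ORDER, start=1)}


def to_ranks(values):
    """Replace each value of the ordered characteristic with its rank."""
    return [RANK[v] for v in values]


observed = ["first", "third", "premium", "second", "first"]
ranks = to_ranks(observed)
```

After ranking, values can be compared with the usual `<` and `>` operators, although differences between ranks still carry no arithmetic meaning.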

*The set of characteristics* contains the various characteristics *X*_{1},...,*X*_{n} by which objects are described.

The set of possible values that *X* may take is called the *space of characteristics*. The set of characteristics may also include dependent characteristics *Y*_{1},...,*Y*_{m}.

*Observation* can be understood in two ways: as the process of measuring characteristics and as the result of this process. If the result of observing the characteristics of some object can change by chance, we speak of a random observation. For any object *o*, the observation of its characteristics is the set *x* = *x*(*o*) = (*x*_{1}(*o*),...,*x*_{n}(*o*)).

*The set of observations* is the set of measurements of characteristics for the objects in the sample. This set is usually represented as a data table *Data* with *N* rows and *n* columns: *Data* = {*x*^{i}_{j}}, *i* = 1,...,*N*, *j* = 1,...,*n*.

For various reasons, some observations of some characteristics may remain unknown. In this case we say that the data table contains *missing values*. These *missing values* are coded by a special number or a special symbol.
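The data table and the coding of missing values can be sketched as follows. The table contents, the choice of `None` as the special symbol, and the helper names are assumptions made for this illustration only:

```python
MISSING = None  # assumed special symbol for an unknown observation

# Hypothetical data table: N = 4 objects (rows), n = 3 characteristics (columns).
# Columns: number of employees (quantitative), branch (qualitative), grade rank (ordered).
data = [
    [120, "textile", 2],
    [45, MISSING, 1],
    [MISSING, "food", 3],
    [310, "textile", 3],
]

N = len(data)     # number of objects in the sample
n = len(data[0])  # number of characteristics


def count_missing(table):
    """Count the missing values in each column of the table."""
    return [sum(row[j] is MISSING for row in table) for j in range(len(table[0]))]


def column_mean(table, j):
    """Mean of a quantitative column, skipping missing values."""
    known = [row[j] for row in table if row[j] is not MISSING]
    return sum(known) / len(known)
```

Skipping missing values when averaging is only one possible treatment; the appropriate handling depends on why the values are missing.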

*A time series* is a set of observations of a characteristic of one object at various moments of time *t*^{1},...,*t*^{N}. A multivariate time series is a set of observations of several characteristics of one object.

The main goal of statistical analysis is to use the given set of observations to detect latent statistical regularities in the data, to establish the influences between the given random characteristics and other characteristics, and to construct a model of dependence. The given set of observations is also called the training sample (by analogy with the process of training a person).

*A model of dependence* is a mathematical description of how one or several characteristics depend on other characteristics. The model can be written as a formula, an equation or a system of equations, or a set of logical statements, or it can be represented graphically as a decision tree. The model can be used to forecast the values of a characteristic from the values of other characteristics. In this way a correspondence between the sets of values of these characteristics is established; this correspondence is referred to as a *decision function*.
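To make the notion of a decision function concrete, here is a minimal sketch of a model written as a set of logical statements, which is equivalent to a small decision tree. The characteristics, thresholds and class names are hypothetical and are not derived from any data:

```python
def decision_function(employees, branch):
    """Hypothetical decision tree: predict an enterprise size class
    from a quantitative characteristic (employees) and a qualitative one (branch)."""
    if employees < 50:
        return "small"
    if branch == "textile" and employees < 250:
        return "medium"
    return "large"


# The function maps every combination of characteristic values to a predicted value.
prediction = decision_function(employees=100, branch="textile")
```

In practice the structure and thresholds of such a tree would be chosen from the training sample rather than fixed by hand.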

Let us consider the following basic types of problem statements in statistical analysis.

*Regression analysis (RA).* In this kind of statistical analysis, it is required to find a model of the dependence of a quantitative characteristic *Y* on other characteristics *X*_{1},...,*X*_{n}.
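One of the simplest instances of this problem is a straight-line model for a single characteristic, fitted by ordinary least squares. The sketch below assumes this particular model form and uses invented data; the text itself does not prescribe any specific regression method:

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training sample lying exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)


def predict(x):
    """Decision function of the fitted model."""
    return slope * x + intercept
```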

*Pattern recognition problem (PRP)* (synonyms: discriminant analysis, supervised classification). In this case the dependent characteristic is qualitative, and its values are called classes (patterns). It is necessary to find a model that allows the class to be predicted from the values of the other characteristics. We will consider a variant of the recognition problem in which the cost of an error is not taken into account, i.e. it does not matter to which class an observation is assigned instead of its true class; only the fact of an error matters. Sometimes a recognition problem with two classes is treated as a regression analysis problem in which the probability of a class (a quantitative characteristic) is predicted instead of the class itself.
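The sketch below illustrates the recognition problem with equal error costs. The nearest-neighbour rule is used only as a simple stand-in for a recognition model and is an assumption of this example, as are the data; the error rate counts every misclassification equally, regardless of which wrong class was chosen:

```python
def nearest_neighbour(train, x):
    """Predict the class of x as the class of the closest training observation."""
    closest = min(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], x)),
    )
    return closest[1]


def error_rate(true_classes, predicted_classes):
    """Fraction of observations assigned to a wrong class (all errors cost the same)."""
    errors = sum(t != p for t, p in zip(true_classes, predicted_classes))
    return errors / len(true_classes)


# Hypothetical training sample: (observation, class).
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((4.0, 4.2), "B"), ((3.8, 4.0), "B")]

# Hypothetical new observations with known true classes.
new_points = [(0.9, 1.1), (4.1, 3.9), (2.0, 2.0)]
true_classes = ["A", "B", "B"]

predicted = [nearest_neighbour(train, x) for x in new_points]
rate = error_rate(true_classes, predicted)
```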

If the predicted characteristic is ordered, the problem can also be posed as a regression analysis problem; in this case the ranks are obtained by rounding the values of the predicted quantitative characteristic.
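A minimal sketch of the rounding step follows. Clipping the result to the range of valid ranks, and the range 1 to 4 itself, are assumptions added here so that predictions outside the range still yield a valid rank:

```python
import math


def to_rank(predicted_value, lowest_rank=1, highest_rank=4):
    """Round a predicted quantitative value to the nearest rank.

    Halves are rounded up; the result is clipped to the assumed valid range.
    """
    rank = math.floor(predicted_value + 0.5)
    return max(lowest_rank, min(highest_rank, rank))


predicted_values = [2.4, 2.5, 5.7, 0.2]
predicted_ranks = [to_rank(v) for v in predicted_values]
```

`math.floor(value + 0.5)` is used instead of the built-in `round`, which rounds halves to the nearest even integer.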

*Time series analysis.* It is required to predict the value of some (quantitative or qualitative) characteristic at a future moment of time from the values of all characteristics in the past.
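One common way to set up such a prediction, sketched here as an assumption rather than as the method intended by the text, is to turn the series into training pairs in which a window of past values is the input and the next value is the target. The function name, window length and data are hypothetical:

```python
def make_lagged_pairs(series, window):
    """Build (past values, next value) training pairs from a time series."""
    pairs = []
    for t in range(window, len(series)):
        past = series[t - window:t]  # the `window` most recent observations
        target = series[t]           # the value to be predicted
        pairs.append((past, target))
    return pairs


series = [10, 12, 13, 15, 16]
pairs = make_lagged_pairs(series, window=2)
```

The resulting pairs can then be passed to a regression or recognition method, depending on whether the predicted characteristic is quantitative or qualitative.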

*Cluster analysis* (synonyms: automatic grouping, unsupervised classification). In this kind of analysis there is no dependent characteristic; it is necessary to form groups of objects so that the elements inside each group are as close to each other as possible, while the elements of different groups are as far from each other as possible.
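As an illustration of forming such groups, the sketch below uses the k-means procedure. It is one of many clustering methods and is chosen here only as an example; the data, the initial centres and the fixed number of iterations are assumptions:

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two observations."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def k_means(points, centres, iterations=10):
    """Assign each point to its nearest centre, then move each centre
    to the mean of its group; repeat a fixed number of times."""
    for _ in range(iterations):
        groups = [[] for _ in centres]
        for p in points:
            nearest = min(
                range(len(centres)),
                key=lambda k: squared_distance(p, centres[k]),
            )
            groups[nearest].append(p)
        # An empty group keeps its previous centre.
        centres = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centre
            for g, centre in zip(groups, centres)
        ]
    return groups, centres


# Hypothetical observations forming two visibly separate groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.9, 5.3)]
groups, centres = k_means(points, centres=[points[0], points[3]])
```

No dependent characteristic is used anywhere in the procedure: the groups are determined only by the distances between observations.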