Synthesis Mathematical Models

Problem statement: Mathematical modeling of natural and technical objects and processes is one of the most important directions that needs high-performance computing with large memory. To reduce computational time and expense, the calculations need to be carried out on specialized subunits. Approach: We described a self-organizing approximation method and introduced a new methodology for the structural synthesis of specialized parallel processing subunits that realize group method of data handling algorithms. Results: The design procedure of the parallel subunit, together with the selection of the computing units for this device, was introduced. Conclusion/Recommendations: The Group Method of Data Handling proved most effective for solving small and medium-sized problems with continuous output. It was tested on a wide range of artificial and real-world problems.


INTRODUCTION
One of the most common problems in engineering design and control is the problem of mathematical modeling. Consider the object under investigation as a "black box" with several input variables (inputs) and one output variable (output). The purpose of modeling is to find some means of predicting the output value for any values of the inputs, based on a set of learning data.
One of the mathematical modeling methods used for this purpose is the Group Method of Data Handling (GMDH) (Ivakhnenko, 1971; Farlow, 1984; Ivakhnenko et al., 1994; Dolenko et al., 1996).
Many papers and several books have been devoted to the group method of data handling and its applications. GMDH can be considered a further propagation of inductive self-organizing methods to the solution of more complex practical problems (Ivakhnenko and Ivakhnenko, 1995). This method involves sorting, that is, successive testing of models selected out of a set of candidate models according to a specified criterion. Nearly all known GMDH algorithms use polynomial support (reference) functions. The general connection between input and output variables can be found in the form of a functional Volterra series, whose discrete analogue is known as the Kolmogorov-Gabor polynomial (Madala and Ivakhnenko, 1994).
In the iterative multilayered GMDH algorithm the iteration rule remains unchanged for the whole sequence of layers. As shown in Fig. 1, the first layer tests the models that can be derived from the information contained in any two columns of the sample. The second layer uses information from four columns, the third from any eight columns, and so forth. The exhaustive-search termination rule is that in each layer the optimal models are selected by the minimum of an external criterion, where:
$E_k^{1}$ = Selection criterion for the $k$th partial description of the first layer
$y_{2i}$ = The value of the function $f(x_1, x_2)$ at the $2i$th point of the initial experimental data
$m$ = Number of testing points
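For reference, the Kolmogorov-Gabor polynomial mentioned above is commonly written in the following general form. The symbols $n$ (number of inputs) and $a_0, a_i, a_{ij}, a_{ijk}$ (coefficients) are notation introduced here for illustration and are not taken from the equations of this paper:

```latex
y = a_0 + \sum_{i=1}^{n} a_i x_i
        + \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} x_i x_j
        + \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} a_{ijk} x_i x_j x_k
        + \cdots
```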

Basics of the method:
The idea of GMDH is the following: we try to build an analytical function (called a "model") whose predicted output value is as close as possible to the actual value. For many applications such an analytical model is much more convenient than the "distributed knowledge" representation typical of the neural network approach. The most common way to deal with this problem is the linear regression approach. In this approach we first introduce a set of basis functions. The answer is then sought as a linear combination of the basis functions. For example, powers of the input variables along with their double and triple cross-products may be chosen as basis functions.
To obtain the best solution, we should try all possible combinations of terms and choose those that give the best predictions. The quality of each model must be judged using some numeric criterion. (The accurate choice of the criterion is a separate problem.) However, full testing for a problem with many inputs and a wide set of basis functions is practically impossible, as it would take too much time and require too much computer memory. To reduce computational expense, one should reduce the number of basis functions (and the number of input variables) used to build the tested models. To do that, one must change from a one-stage procedure of model selection to a multi-stage procedure.
Let us take two input variables and combine a set of basis functions. For example, if we denote the input variables as $x_1$ and $x_2$, let the set of basis functions be $\{1, x_1, x_2, x_1 x_2\}$ (1 corresponds to a constant bias and must always be included in the set). Now we check $2^4 - 1 = 15$ possible models and choose the one that is the best.
(Any one of the tested models is often called a partial description, or PD.) After that, we take another pair of input variables and repeat the operation, resulting in one more PD with its own value of the criterion. Doing the same for each possible pair of $n$ input variables, we obtain $n(n-1)/2$ PDs, each with its own value of the criterion used.
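The search over one pair of inputs can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of NumPy least squares, the split into `train` and `check` index arrays, and the mean squared error as the external criterion are all assumptions made here. The enumeration follows the count of 15 given in the text, that is, all non-empty subsets of the four basis terms.

```python
from itertools import combinations

import numpy as np


def basis(x1, x2):
    """Columns of the basis set {1, x1, x2, x1*x2} for one pair of inputs."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])


def count_models(n_terms=4):
    """Number of non-empty subsets of the basis terms (2**n_terms - 1)."""
    return sum(1 for size in range(1, n_terms + 1)
               for _ in combinations(range(n_terms), size))


def count_pairs(n):
    """Number of input pairs, n*(n-1)/2, i.e. PDs per layer."""
    return n * (n - 1) // 2


def best_partial_description(x1, x2, y, train, check):
    """Try every non-empty subset of basis terms for the pair (x1, x2).

    Coefficients are fitted on the `train` rows (assumed learning subsample);
    the criterion (assumed: mean squared error) is computed on `check` rows.
    Returns (criterion value, indices of the terms used, coefficients).
    """
    B = basis(x1, x2)
    best = None
    for size in range(1, 5):
        for terms in combinations(range(4), size):
            cols = list(terms)
            coef, *_ = np.linalg.lstsq(B[train][:, cols], y[train], rcond=None)
            err = float(np.mean((B[check][:, cols] @ coef - y[check]) ** 2))
            if best is None or err < best[0]:
                best = (err, terms, coef)
    return best
```

Fitting on one subsample and scoring on another is what makes the criterion "external": a model with more terms does not automatically win, because it is judged on points it was not fitted to.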
Then we compare these values and choose the several PDs that give the best approximation of the output variable. Usually we select a pre-defined number F of best PDs to be preserved for the next step of the algorithm.
The values predicted by the preserved PDs (called survivors) serve at the next iteration as input variables, along with the initial input variables of the whole system. All the described actions are repeated with the broadened set of input variables, then the next iteration follows, and so on.
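The selection of survivors and the layer-by-layer iteration described above can be sketched as below. This is a simplified illustration under stated assumptions: each PD uses the full basis {1, u, v, u·v} instead of a subset search, the criterion is the mean squared error on a checking subsample, and the loop stops when the best criterion value no longer decreases. The stopping rule and all function names are choices made here, not details given in the text.

```python
from itertools import combinations

import numpy as np


def fit_pd(u, v, y, train, check):
    """Fit one PD  y ~ c0 + c1*u + c2*v + c3*u*v  and score it on `check`."""
    B = np.column_stack([np.ones_like(u), u, v, u * v])
    coef, *_ = np.linalg.lstsq(B[train], y[train], rcond=None)
    pred = B @ coef
    err = float(np.mean((pred[check] - y[check]) ** 2))
    return err, pred


def gmdh_layer(X, y, train, check, F):
    """Build a PD for every pair of columns of X and keep the F best."""
    scored = [fit_pd(X[:, i], X[:, j], y, train, check)
              for i, j in combinations(range(X.shape[1]), 2)]
    scored.sort(key=lambda item: item[0])      # smallest criterion first
    survivors = scored[:F]
    best_err = survivors[0][0]
    new_inputs = np.column_stack([pred for _, pred in survivors])
    return best_err, new_inputs


def gmdh(X0, y, train, check, F=3, max_layers=5):
    """Iterate layers; survivors join the initial inputs at the next layer."""
    X, best = X0, np.inf
    for _ in range(max_layers):
        err, new_inputs = gmdh_layer(X, y, train, check, F)
        if err >= best:                        # assumed stopping rule
            break
        best = err
        X = np.column_stack([X0, new_inputs])  # broadened set of inputs
    return best
```

With n inputs and F survivors, every layer after the first examines (n+F)(n+F-1)/2 pairs, which is the workload the specialized subunits discussed in the next section are meant to carry.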

GMDH algorithms realization:
Parallel computing can be used to realize any algorithm that has a multilayered structure, and many different multiprocessor systems have been designed for this purpose, such as multisection and two-section pipeline architectures (Dmitrienko et al., 1998). To get the greatest gain in productivity from pipeline systems, this work recommends carrying out the calculations in specialized processing units (subunits) by adding extra hardware to the ALU structure: multiplication units, division units, addition units, subtraction units and cache memory.
The function of each subunit is as follows. On each $i$th layer ($i>1$) it forms a system of Gauss equations for all learning subsample points and solves the system of normal linear Gauss equations (Eq. 4) for each of the partial descriptions (2), where $m_1$ is the number of checking subsample points. The system of normal Gauss equations (Eq. 4) for the subunit can be written in the form (10), where:
$N_1$ = Number of learning points
$i$ = Number of the layer ($i>1$)
$y_j$ = The value of the function at the $j$th point of the initial data learning subsample
$y_{p(i-1)}$, $y_{q(i-1)}$ = The values of the best partial descriptions at the $j$th point of the initial data on the $(i-1)$st layer
The next important function of the subunit is solving the system of Eq. 10. After determining the coefficients (16), (17) we get the model (18) of the $i$th selection layer, which is evaluated on the checking subsample points by a mean-square error criterion.
The design of parallel subunit:
We must pass from the formal representations (10-19) of the working algorithm of the subunit to its parallel-tier form, as presented by Voevodin (1986). We assume that the parallel system has five processing elements that perform binary multiplication, two of which also perform division, in addition to five processing elements that perform addition and subtraction.
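The work of one subunit can be sketched in software as follows. This is an illustration under an explicit assumption: the partial description is taken to be linear in two survivors of the previous layer, y ≈ a1·y_p + a2·y_q, a form chosen here because it reproduces the operation counts quoted in this paper (five products per learning point, six multiplication/division operations and three subtractions to solve the 2×2 normal system). The function name and argument layout are hypothetical.

```python
def subunit(y_train, yp_train, yq_train, y_check, yp_check, yq_check):
    """Fit y ~ a1*yp + a2*yq on the learning points, score on checking points.

    Assumed model form; returns (a1, a2, mean-square error).
    """
    # Five running sums over the N1 learning points: 5*N1 multiplications.
    s_pp = sum(p * p for p in yp_train)
    s_pq = sum(p * q for p, q in zip(yp_train, yq_train))
    s_qq = sum(q * q for q in yq_train)
    s_yp = sum(y * p for y, p in zip(y_train, yp_train))
    s_yq = sum(y * q for y, q in zip(y_train, yq_train))

    # Normal equations:
    #   a1*s_pp + a2*s_pq = s_yp
    #   a1*s_pq + a2*s_qq = s_yq
    # Elimination with 3 divisions, 3 multiplications and 3 subtractions.
    r = s_pq / s_pp              # these two divisions are independent
    t = s_yp / s_pp              # and could share one tier
    a2 = (s_yq - s_pq * t) / (s_qq - s_pq * r)
    a1 = t - r * a2

    # Residuals on the N2 checking points, then the mean of their squares.
    deltas = [y - a1 * p - a2 * q
              for y, p, q in zip(y_check, yp_check, yq_check)]
    mse = sum(d * d for d in deltas) / len(deltas)
    return a1, a2, mse
```

The sketch does not guard against a singular system (for example, when the two survivors are proportional); a hardware or production version would need that check.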

RESULTS AND DISCUSSION
The first $(N_1+1)$ tiers form the equation system (10). On the first and second tiers, only the first two summands of equations (11-15) are computed. On the tiers from the third to the $N_1$th, the corresponding summands are computed and also added to the summands from the previous tiers. This addition ends on the $(N_1+1)$st tier, which completes equations (11-15) and produces the coefficients of system (10).
The tiers from the third to the $N_1$th have the maximum width of the parallel form of the algorithm, equal to 10, and require five processor elements that perform multiplication and five two-input addition units. On the first $N_1$ tiers, $5N_1$ multiplication operations are performed. Because these multiplications require only the initial data, under the highest performance requirements the first $(N_1+1)$ tiers could be replaced by two: the first tier performs all the multiplications with the help of $5N_1$ multiplication units, and the second performs the simultaneous addition of the $N_1$ summands. The tiers from the $(N_1+2)$nd to the $(N_1+7)$th, corresponding to equations (16) and (17), determine the coefficients $a_1^k$, $a_2^k$ of equations (4.9) that are synthesized on the $i$th tier from the learning subsample points. On these tiers no more than two processor elements are used; however, on the $(N_1+2)$nd tier two division operations are performed.
The tiers from the $(N_1+8)$th to the last determine the mean-square error of the model (18) on the checking subsample points. In forming these tiers we assumed that the number of checking subsample points $N_2$ is a multiple of 5 (that is, $N_2 = 5m$, where $m$ is an integer). This assumption does not limit generality: if $N_2 \neq 5m$ the same parallel form of the algorithm can be used, but some of the processor elements that are used with $N_2 = 5m$ will stay idle.
With $N_1 = 5(m-1)$ the height of the parallel form of the algorithm can be decreased by 1 by performing a part of the addition operations of the $(N_1+3m+9)$th and $(N_1+3m+10)$th tiers on the $(N_1+3m+8)$th tier. When necessary, all multiplication operations related to the values of the model (18) on the checking subsample points, which occupy tiers $(N_1+8)$ to $(N_1+3m+6)$, could be performed on one tier, but this requires $2N_2$ processor elements. With the availability of an additional $N_2$ three-input addition units, necessary for computing the values $\delta_1, \delta_2, \ldots, \delta_{N_2}$, and an $N_2$-input addition unit to produce the sum $\sum_{j=1}^{N_2} \delta_j$, the last $3m+5$ tiers can be replaced by 5 tiers. The minimal height of the parallel form of the algorithm, taking into account the two-tier replacement of the first $(N_1+1)$ tiers, is then 13. However, the load of such a system will be very small, because no more than two processor elements are used on seven tiers of the algorithm.
The number of addition and multiplication/division operations on each tier for an algorithm of height $N_1+3m+12$ is shown in Table 1.
With $N_1 = N_2 = 15$ we have 36 tiers of the parallel algorithm. On these tiers 240 operations are performed: 127 multiplication and division operations and 113 addition and subtraction operations.
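These figures can be checked with a few lines of arithmetic. The sketch below uses the height $N_1+3m+12$ (with $N_2 = 5m$) and the per-stage operation counts stated in this section: $5N_1+6$ multiplication/division and $5(N_1-1)+3$ addition operations on the first $N_1+7$ tiers, then $3N_2+1$ multiplication/division and $2N_2+10$ addition operations on the remaining tiers. The function names are illustrative only.

```python
def tiers(n1, n2):
    """Height of the parallel form, N1 + 3m + 12, assuming N2 = 5*m."""
    if n2 % 5 != 0:
        raise ValueError("this formula assumes N2 is a multiple of 5")
    m = n2 // 5
    return n1 + 3 * m + 12


def operation_counts(n1, n2):
    """Return (multiplications and divisions, additions and subtractions)."""
    mul_div = (5 * n1 + 6) + (3 * n2 + 1)       # fitting stage + checking stage
    add_sub = (5 * (n1 - 1) + 3) + (2 * n2 + 10)
    return mul_div, add_sub


def minimal_height():
    """Two tiers for the sums, six for the coefficients, five for the error."""
    return 2 + 6 + 5


mul_div, add_sub = operation_counts(15, 15)
print(tiers(15, 15), mul_div, add_sub, mul_div + add_sub)  # 36 127 113 240
print(minimal_height())                                     # 13
```

The same functions give the height and operation totals for other values of $N_1$ and $N_2$, as long as $N_2$ is a multiple of 5.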
Because multiplication and addition operations are computed in parallel, the two operations must be timed against each other using two types of computing devices.
Another question is whether to implement one or two universal mul/div devices, or a multiplier with one or two built-in dividers. With one divider, the number of tiers increases by 1, because the two division operations on the $(N_1+2)$nd tier are computed sequentially. With two dividers, one of them is used only once, on the $(N_1+2)$nd tier, and the second on four tiers. This selection must be based on the requirements for the problem-oriented computing devices, taking into account both the performance and the cost of the system. In this work we use two universal mul/div devices, which need approximately the same time to perform both operations.
Multiplication and division operations need the most time and addition operations need the least, so the selection of optimal multiplication devices directly affects the subunit characteristics. A study of the available multiplication and division units shows that the best in terms of performance are matrix ones (Kung, 1991; Veshinchouk and Cherkasky, 1990). Since some tiers perform only addition operations (Table 1a, b and c), the best adders are parallel ones (Gex, 1971; Saveliev, 1987).
Knowing the times $\tau_{mul}$, $\tau_{div}$, $\tau_{add}$ ($\tau_{mul} \approx \tau_{div} = \tau$, $\tau_{add} < 0.1\tau$) needed to perform the arithmetic operations (multiplication, division and addition) in the parallel subunit, we can evaluate its performance and the load coefficients of the processing units according to their functional algorithm.
From the parallel-tier algorithm of the subunit and Table 1a, b and c we can see that the computational units of the subunit have to perform no fewer than five addition operations in the time $\tau$. In this case the total number of algorithm tiers $n$ is determined by $N_1$ and $N_2$. On the first $N_1+7$ tiers, $5N_1+6$ multiplication and division operations and $5(N_1-1)+3$ addition operations are performed; on the tiers from $N_1+8$ to $n$, $3N_2+1$ multiplication and division operations and $2N_2+10$ addition operations are performed. Ignoring other operations, we can calculate $k_{per}$, the performance coefficient of the parallel subunit in comparison with serial computers. The numerical characteristics of the parallel subunit, which performs on each algorithm tier no fewer than five multiplication operations or two division operations and five addition operations, are shown in Table 2 for different values of $N_1$, $N_2$ and $\tau_{add}/\tau$.
The first seven rows of Table 2 show $N_{add}$ and $N_{mul}$, the numbers of addition and of multiplication or division operations; $n$, the total number of algorithm tiers; and $n_{mul}$, the number of algorithm tiers with addition operations, corresponding to $N_1$ and $N_2$, the numbers of learning and testing subsequence points.
The other five rows give $k_{mul}$ and $k_{add}$, the load coefficients of the multipliers and adders; $T_{ser}$ and $T_{SUBUNIT}$, the relative algorithm execution times on serial and parallel computing devices; and $k_{per}$, the performance coefficient of the designed subunit in comparison with serial devices.
From the values of the coefficients $k_{mul}$, $k_{add}$, $k_{per}$ shown in Table 2 for different ratios $\tau_{add}/\tau$ and various numbers $N_1$, $N_2$, it is clear that the multipliers play the most important role in maximizing the subunit performance. The adder is loaded less than half of the time with $\tau_{add}/\tau = 0.1$ and 20% more with $\tau_{add}/\tau = 0.05$, so if the number of parallel multipliers is increased to ten or to twenty, the 10 or 20 addition operations needed in the time $\tau$ could be performed by one adder. The final structure of the parallel subunit is shown in Fig. 2.

CONCLUSION
A technique for the structural synthesis of specialized computing devices that carry out parallel calculations has been developed for the hardware-software realization of self-organizing algorithms at the level of separate mathematical models. This technique can be used to synthesize specialized computing devices for any known functional-oriented computing system that processes data by the group method of data handling.