Technical Article

# Evolving a Sorting Program and Symbolic Regression

December 24, 2019 by Greg Schmidt

## Learn about symbolic regression in genetic programming, one of the breaking fields of research for algorithm development that can be used in the control systems of tomorrow.

Genetic programming represents a revolution in algorithm development wherein systems could literally author themselves. In this series, Greg Schmidt, a software engineer that develops products for industrial control systems, goes over the terminology of genetic programming and examples of how it could work.

Catch up on the rest of the series here:

In this article, we'll go over a simple sorting program and how to evolve it according to the principles of genetic programming.

### Evolving a Sorting Program

We would like to evolve a simple sorting program capable of sorting the values within a four element integer vector into ascending order.

We begin by choosing an appropriate function and terminal set.

#### The Conditional Exchange (cexchg) Function

Since we will need to access each element of the vector while performing a sort, the terminal set will reasonably consist of the set of possible integer indices {0, 1, 2, 3}.

We define a single function named cexchg (conditional exchange). The conditional exchange function accepts two arguments which are treated as indices into the integer vector and thus refer to some pair of values within the vector.

A conditional exchange function is evaluated by comparing the value referenced by the first argument with the value referenced by the second argument. If the first value is less than the second value, the two values are exchanged; otherwise no action is taken.

The closure property is enforced by arbitrarily defining the return value of the conditional exchange function to be the value of its second argument. Thus evaluating the first "cexchg" in the expression (cexchg (cexchg 1 2) (cexchg 3 4)) is equivalent to evaluating (cexchg 2 4)

#### Fitness Function and Termination Criterion

To complete our problem specification, we must specify a fitness function and the termination criterion.

The zero-one principle tells us that if a comparison sort is capable of sorting all 2n binary sequences of zeros and ones into nondecreasing order, it will sort any arbitrary sequence of n numbers into nondecreasing order (in actuality only 2n - 2 sequences are required, since the two sequences consisting of all zeros and all ones can obviously be eliminated).

This principle can be used to our advantage in constructing a fitness function. We therefore choose the sequences of zeros and ones represented by the binary numbers within the range of 1 to 14 to be our set of fitness cases.

Our fitness function shall be the number of correctly sorted vectors an individual produces when applied to each of the 14 test cases. A perfect score of 14 attained by any individual yields an appropriate termination criterion.

Listing 3 shows a perfect solution.

A perfect solution to the sorting problem.

(defun program  ()
(cexchg   (cexchg 3
(cexchg 0 1)
)
(cexchg (cexchg 0 2)
(cexchg 3 0)
)
)
)

Sorting Example:
Evaluate the above program on the input vector v = [4, 3, 2, 1] to put the elements in ascending order.



 Condition Values compared Exchange? New vector v(0) < v(1) (4 < 3) No [4, 3, 2, 1] v(3) < v(1) (1 < 3) Yes [4, 1, 2, 3] v(0) < v(2) (4 < 2) No [4, 1, 2, 3] v(3) < v(0) (3 < 4) Yes [3, 1, 2, 4] v(2) < v(0) (2 < 3) Yes [2, 1, 3, 4] v(1) < v(0) (1 < 2) Yes [1, 2, 3, 4]

### Symbolic Regression

In a symbolic regression problem, we have a set of vectors of the form [Y,X1,X2,X3...Xn], consisting of observed data for a dependent variable Y and the n independent Xi variables. We wish to find a mathematical equation that correlates the dependent variable with the independent variables.

Standard regression analysis techniques require the form of the regression equation be specified in advance and is commonly restricted to a linear equation of the form:

Y = C1 X1 + C2 X2 + C3 X3 + ... + Cn Xn

Regression calculates a set of coefficients (C1, C2, C3, ... Cn) such that the sum of the squares of errors between the calculated and the observed value for the independent variable Y, for all vectors, is minimized.

The goal of symbolic regression is similar to that of standard regression in that an equation correlating the dependent variable to the independent variables with minimum error is desired. This approach incorporates the sum of the error across all vectors into the fitness function.

#### Benefits of Symbolic Regression

An advantage of symbolic regression is that the form of the regression equation need not be specified in advance. We need only specify the allowable mathematical operations as our function set and the dependent variables as our terminal set, and rely on artificial evolution to discover an appropriate representation of the regression equation. The lack of restriction on form provides us with the potential for discovering solutions with a variety of interesting nonlinearities that may aid in finding good solutions.

#### Drawbacks of Symbolic Regression

One problem which arises from the unrestricted form is that we may inadvertently divide by zero or take the square root of a negative number. This can be rectified by defining "protected" versions of these operations.

For example, we may choose to define protected division as returning the value of one if the divisor is zero. Protected square root may be defined as returning the square root of the absolute value of its argument.

#### Random Constants

Also, the notion of random constants can aid the process. The idea is to provide a special terminal that creates a new random constant, within some predetermined range, when it is first placed into an individual. Thus, a variety of random constants are introduced during the initial population and through new subtrees generated via mutation. As new expressions are produced, existing constants may combine in novel ways to form new ones as is the case for the expression (*5.137 4.621). It is interesting to note that, in effect, useful constants may be produced even though this technique is not used. An example of this is effect is the constant two produced by the expression (/ (+ X X) X).

Suppose we wish to discover a polynomial approximation to sin(x) within the interval (0 <= x <= pi/2). Our terminal set contains the dependent variable x and random constants within the interval (-1, 1). The function set consists of the mathematical operations {+, -, *, /} where '/' is protected division. Fitness is defined as the sum of the errors between the actual and calculated value of sin(x) for an individual across twenty evenly spaced points within the interval. Termination will occur after fifty generations. Listing 4 shows one approximation discovered. Note that only the functions {+, *} appear. The evolutionary process often discovers an appropriate solution that does not require the full expressiveness provided by the complete function and terminal set.

A polynomial approximation to sin(x) in the interval (0 <= x <= pi/2)

(defun program ()
(+ (* (* (* X X) X)
(* 0.2283 -0.6535))
X)
)

Simplifying the above program yields the following equivalent equation:

polysin(x) = -.1492 x3 + x

Results:

 x sin(x) polysin(x) 0.000 0.000 0.000 0.078 0.077 0.077 0.156 0.155 0.155 0.234 0.231 0.232 0.312 0.306 0.307 0.390 0.380 0.381 0.468 0.451 0.452 0.546 0.519 0.521 0.624 0.584 0.587 0.702 0.645 0.650 0.780 0.703 0.709 0.858 0.756 0.763 0.936 0.805 0.813 1.014 0.848 0.858 1.092 0.887 0.897 1.170 0.920 0.931 1.248 0.948 0.957 1.326 0.970 0.978 1.404 0.986 0.991 1.482 0.996 0.996 1.560 0.999 0.993
##### Listing 4. Symbolic regression

In this article, we saw an example of symbolic regression as it behaves in genetic programming. In the final installment of this short series, Schmidt will discuss the practical considerations and applications of genetic programming. 