Weasel Ware Evolution Simulation - The Evolutionary Informatics Lab

Weasel Ware - Evolutionary Simulation

Introduction

Unassisted search will not work for even moderately sized problems [1, 2]. Monkeys at a typewriter eventually generating the complete works of Shakespeare is simply impossible. Simple combinatorics show the universe as modeled by science today is not sufficiently old nor big enough to support the search for even a hand full of pages [3, 4]. Neither are the 10¹⁰⁰⁰ parallel universes hypothesized by M-theory [1].

Rather than illustrating the evolutionary reproduction of all of the works of Shakespeare, Professor Richard Dawkins opted for a single phrase from Shakespeare's Hamlet [5].

METHINKS*IT*IS*LIKE*A*WEASEL

The phrase has L = 28 letters from an alphabet of N = 27 characters: 26 letters and a space. The space is shown here with an asterisk.

After admission that random search will not work, Dr. Dawkins proposes an algorithm rich in active information that guides the search to a quick success. In evolutionary computing, active information must be imposed on the search algorithm to allow even moderately sized search algorithms to work. In evolutionary computing, as in Dr. Dawkins' example, active information is supplied by the programmer. The principle of conservation of information requires it [2].

From Whence the WEASEL Cometh?

To illustrate evolution, Dawkins starts with the phrase [5]

New phrases 'breed from' this random phase. "It duplicates it repeatedly, but with a certain chance of random error - 'mutation' - in the copying." Every 10th was recorded and the results were [5]

The target phrase,

was reached in 64 generations.

Without a detailed explanation of the algorithm by Dawkins, the interpretation of the procedure for generating these phrases is open to interpretation. Notice, for example, once the correct letter appears, it stays in place. This led Dembski and Marks [2] to conclude that Dawkins was using a partitioned search. Once a letter is identified, it is locked into place for the remainder of the search. Only one child was used per generation and only those letters not yet correctly identified were mutated.

This interpretation of Dawkins's simulation was, however, wrong. Concerning generations, Dawkins writes [5]

The computer examines the 'progeny' of the original phrases, and chooses the one, however slightly, most resembles the target phrase...

Since "phrases" is plural, Dawkins uses more than one child per generation.

So what was the search algorithm used by Dawkins? Unfortunately, Richard Dawkins no longer has the code he used for the simulation. Copies of the code were apparently distributed to others. Denyse O'Leary, author and blogger extraordinaire, initiated a contest to find the original code [9]. Over 375 responses were received, but no definitive code was identified. There was general consensus that, rather than partitioned search, Dawkins used a Hamming oracle [11].

In response, Ewert, Montañez, Dembski and Marks published a paper where various implementations of Hamming oracle solutions to the WEASEL problem are analyzed [11]. They also proposed a deterministic search algorithm, FOOHOA, that outperforms the evolutionary search. You can examine them all with the Weasel Ware Simulation.

The mathematical details for the partitioned search are here [10] and those for the Hamming oracle are here [11].

Ewert et al. [11] also proposed a "search for the search" for the optimal algorithm to extract active information from the Hamming oracle.

The bottom line in all these implementations is the same. The source of the active information in the WEASEL problem is not the evolutionary algorithm. The source of active information is the oracle - either the oracle used in the partitioned search or the Hamming oracle. In fact, the evolutionary search uses the oracle poorly. On a per query basis, active information is seen to be extracted much more efficiently using other means.

Are we sure one of these is the algorithm used by Dawkins? No. But we've covered a lot of bases. If you runs across Dawkins's original code, please let us know.

Oracles

An oracle is a source of active information [3]. Asking an oracle a question is a "query". Oracles, when queried, provide information rich responses.

The game of 20 Questions illustrates the idea of an oracle. The 20 Questions oracle, who has some specific animal, vegetable or mineral in mind, is queried with questions answered with a "yes" or a "no". The search is complete when the object the oracle has in mind is identified.

Answers to queries typically come at a cost. In 20 Questions, each query is valuable in the sense that, after the twentieth question, you have spent all of your queries. In evolutionary computing, answering queries usually uses most of the computing resources. The cost here is thus computer time. The number of queries required for a successful completion to a search is a measure of how much active information is provided by the oracle and how wisely the questioner has used the oracle.

Some oracles are smarter than others. For the WEASEL example, we consider three oracles. The first is a stupid "needle in a haystack" oracle that, when presented with a 28 character string of letters will respond either "no", the character string is not the weasel phrase or "yes", the phrase is the correct answer. If the phrase has only one incorrect letter, the oracle responds with a "no". The oracle provides no indication of whether or not we are close in some sense to the right answer.

A smarter "divide and conquer" oracle, when presented with a string of letters, returns the phrase submitted to it with some of the letters colored red. For example


						MPOIUYTRCDDXDSPONHUZARTYUXEL

means "the five letters that are red are correct. The others are wrong." The divide and conquer oracle can provide significantly more active information than a needle in a haystack oracle [6].

The third oracle, a Hamming oracle, is examined below.

An oracle can be used in different ways, some more wisely than others. In the game of 20 Questions, for example, an initial query of "Is it the missing right arm on Michelangelo's statue of David?" will, on average, generate a lot less information that the question "Is it vegetable?" [7].

Let's imagine a search where letters are first chosen randomly. For each letter, we can envision spinning a roulette wheel and randomly selecting a letter. Once a letter hits at a location, we keep it. For incorrect letters, we spin the roulette wheel again. We will call this use of the divide and conquer oracle a "partitioned search" [1].

A wiser use of the divide and conquer oracle is "deterministic search." Here, we step through the alphabet one letter at a time. We submit the letter A to the oracle and the oracle tells us all the places in the phrase containing the letter A. Next, we query for the letter B and identify all of the places in the phrase that contain B. Then C, D. ... all the way through Z and the space, * . We are guaranteed to identify the phrase in no more than 27 queries [8].

The deterministic search reveals the enormous information available from the divide and conquer oracle. Using a deterministic search, every letter and space in all of Shakespeare's works can be identified in 27 queries. All of the letters in all of the books in the Library of Congress can be identified in 27 queries! If we wish to distinguish lower from upper case letters, 26 additional queries are required. If we add numbers, punctuation, and other characters to the search, the alphabet increases to some value of N. Independent of the length, L, of the phrase, all of these characters can be identified in at least N queries.

In an Evolutionary Search such as the one proposed by Dr. Dawkins [5], a population of strings, each L letters long, are chosen randomly. Each string is then presented to the Hamming oracle [11] which assigns the string a rank, based on its proximity to a desired target. (Such a ranking function is generally referred to as a fitness function in evolutionary computing.) The string with the closest proximity to the target is selected and chosen as the parent of the next generation of strings, where each child string contains a certain number of mutations. The process is repeated until the target is reached. (In the simulation, we call such an algorithm a "Proximity Reward Search", due to the fact that the rank of each string correlates directly with proximity to target.)

But the rank does not have to do so. For example, we can imagine an Evolutionary Search where rank is based on some internal properties of the strings themselves, such as number of vowels, rather than on external proximity to target. There are many such ways to assign fitness rankings to strings. Some of them are demonstrated in the Simulation GUI and explained in the accompanying documentation for version 2.0. We call these type of searches "Proximity Neutral" searches, since the ranking of strings is done without direct regard to an intentional long-term goal.

As is expected, the perfomance for proximity neutral searching degrades considerably, since we do not provide as much target-specific active information to our search. Simply having replication, mutation, and selection is not enough to ensure convergence on a target; the correct ranking function, specific to a goal target, is also necessary to have a reasonable chance of success. Proximity reward search provides the correct fitness functions needed for the targets (and in doing so provides large amounts of active information), whereas proximity neutral searches typically do not.

We also include "Partially Neutral" searches that do include some target-specific information in chosing ranks for strings, but only applies it in an indirect way. The performance of such searches are better than completely neutral searches, but not as good as proximity reward searches. The improvement in performance is directly proportional to the amount of target-specific active information encoded in and provided by the fitness function.

The probability model for the various searches for

METHINKS*IT*IS*LIKE*A*WEASEL

can be modeled using fundamental properties of probability. For those interested, we have a page that covers the math details. For the first three searches, the median number of queries needed to generate a success can be calculated. For the WEASEL phrase, these are as follows.

Partitioned search = 98 Queries

Deterministic search = 26 Queries

Random Search = 8.30 x 10³⁹ Queries

= 8,300,000,000,000,000,000,000,000,000,000,000,000,000 Queries

The active information supplied by the "divide and conquer" oracle is necessary to perform the search as is the active information encoded in and extracted from the Hamming oracle in Evolutionary Search.

In evolutionary computing, the active information can be generated by a programmer carefully selecting an appropriate fitness function and skillfully querying an information rich oracle.

References & Footnotes

[1] William A. Dembski, No Free Lunch: Why Specified Complexity Cannot Be Purchased without Intelligence, Rowman & Littlefield Publishers, Inc. (2001).
[2] William A. Dembski and Robert J. Marks II, "Conservation of Information in Search: Measuring the Cost of Success"
[3] William A. Dembski and Robert J. Marks II, "Judicious Use of Computational Resources in Evolutionary Search"
[4] William A. Dembski and Robert J. Marks II, "Horizontal and Vertical No Free Lunch for Active Information in Assisted Searches"
[5] Richard Dawkins, The Blind Watchmaker: Why the Evidence of Evolution Reveals a Universe Without Design, W. W. Norton, (1996).
[6] Are there even more information rich oracles? Yes. An even smarter oracle, for example, can respond with MPOIUYTTNJDMDSPONHUOALTDUXEL where, in addition to the red letters, the pink letters indicate the letter is one letter off from being correct. It is not difficult to concoct even smarter oracles that, when properly used, complete the search in a single query!
[7] Classic information theory says a query made concerning a given set of answers will generate the largest average information when the chances of getting a "yes" is 50-50. See, for example, Thomas M. Cover and Joy A. Thomas, Elements of Information Theory, 2nd Edition, Wiley-Interscience, 2006.
[8] In the implementation of the deterministic search on the simulation, the search is not done sequentially from A to Z and *. Visualize the characters being printed on 27 cards. The deck is well shuffled and placed face down. The character on the first card is searched first. The character on the second card second, etc. This is the way the deterministic search is performed on the simulation. A fresh shuffle occurs for each implementation.
[9] Denyse O'Leary, "Uncommon Descent Contest Question 10: Provide the Code for Dawkins’ WEASEL Program." Uncommon Descent, August 26, 2009.
[10] Winston Ewert, William A. Dembski and R.J. Marks II, "Evolutionary Synthesis of NAND Logic: Dissecting a Digital Organism," Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics. San Antonio, TX, USA - October 2009, pp. 3047-3053.
[11] Winston Ewert, George Montañez, William A. Dembski, Robert J. Marks II, "Efficient Per Query Information Extraction from a Hamming Oracle," Proceedings of the the 42nd Meeting of the Southeastern Symposium on System Theory, IEEE, University of Texas at Tyler, March 7-9, 2010, pp.290-297.

Usage Instructions

Simulation Description

The Weasel GUI (graphic user interface) has three modes. The first mode is English Phrase. Begin by typing any phrase using only letters and spaces. The number of characters appears below the phrase. For

METHINKS*IT*IS*LIKE*A*WEASEL

there are 28 letters.

English Mode - Weasel Ware Evolution Simulation

When "Start Search" is clicked, the median number of queries as derived on the math page, is displayed. Queries for each search continues until the target is reached or the "Stop" button is clicked.

Simulation Interface - Weasel Ware Evolution Simulation

When the search is stopped, the history of a search can be reviewed by clicking on the "View History" button. Click on the X at the bottom to return to the main display.

Search History - Weasel Ware Evolution Simulation

Unless you are very patient, the Unassisted Random Search will not be successful. In the following display, the median number of queries is 8.30e+39. The number 8.00e+39 is equal to 8 followed by 39 zeros. For a trillion queries per second, this would be achieved in 2.63 billion billion centuries.

Unassisted Random Search - Weasel Ware Evolution Simulation

Unassisted Random Search will work when the number of characters in the search is very very small. Here is a search result using three letters.

Unassisted Random Search - Three Letters

If all three searches are successful, the GUI will display the following message.

The Binary Mode

The "Binary Mode" works the same as the "English Phrase" mode except that the alphabet now has only two letters: 0 and 1. Enter an arbitrary sequence of ones and zeros, and a search will be performed.

The deterministic search will always converge in at most two queries.

The Nucleotide Mode

The "Nucleotide Mode" works the same as the "English Phrase" mode except that the alphabet only four letters. They are A,C,G, and T corresponding to the four nucleotides present in DNA. The deterministic search will always converge in at most four queries.

Additional Features: Version 2.0

For a full explanation of the new features, see the documentation.

Weasel Ware 2.0 now features two additional search modes: Proximity Reward Search and Proximity Neutral Search.

These searches model Evolutionary Strategies, and you can therefore select the number of offspring in your population as well as the mutation rate (in percentage.) A mutation rate of 4.0% will result in roughly one changed letter in each child, for a phrase 28 letters long.

Additionally, you can select a fixed mutation rate of exactly one mutation per offspring by checking the "Fixed Single Mutation" checkbox. This ensures one and only one mutation per child per generation, rather than the probabilistic number of mutations per child based on the "Mutation Rate" percentage.

Proximity Neutral Search: Reward Functions

Proximity Neutral Search allows you to select different neutral and partially-neutral reward functions, as well as offering you the opportunity to design your own fitness functions using javascript and the "Custom Fitness Function" option.

A reward function (usually called a fitness function) is a method of assigning numerical values to strings of letters in order to rank them. For example, we may assign the string LAOETMNPQOKYRUVTPYBYKOAQRWLG a value of 0, and the string MEOETMNPQOKYRUVTPYBYKOAQRWLG a value of 2. We can then rank the two strings according to their numerical value.

The Proximity Reward Search assigns values based on how "far" a string is from the desired target; hence, it is "rewarded" based on mere proximity. However, we may wish to remove this teleological (goal-driven) aspect from our simulation and assign values in a way unrelated or only weakly related to the target sequence. Instead, we could assign values based on internal properties of the strings themselves, unrelated to the target string.

Partially Neutral: Anagram allows you to rank the strings based on how far away they are from being an anagram of the target. Since this does use some target specific information in assigning values, it is designated "Partially Neutral". Note, to have a successful search using this method, you will need to choose a short string for your target (unless you're very patient) and have a mutation rate that allows for at least two letters to be changed at a time (allowing for the swapping letters within an anagram.)

Partially Neutral: Simple Sum Distance allows you to rank the strings based on the sum of ASCII values in the string, plugged into a math function and multiplied by the string length to get a value. This value is compared with the target string's value after being run through the same function. This applies a layer between the target string and the information we can get concerning it. We then select based on the string with the closest value to the target's value.

Partially Neutral: CRC32 allows you to rank the strings based on the CRC32 checksum of the string, again plugged into a math function and multiplied by the string length to get a value. This value is compared with the target string's value after being run through the same function. This also applies a layer between the target string and the information we can get concerning it, and we again select based on the string with the closest value to the target's value.

Simple Sum allows you to rank the strings based simply on the sum of ASCII values in the string. We do not take into account the ASCII sum of the target, nor anything else concerning the target; we simply select the string with the lowest normalized ASCII sum. Therefore, fitness is not correlated to any future goal we might have.

CRC32 allows you to rank the strings based solely on the normalized CRC32 checksum of the string. Again, for this search method we do not take into account any target (goal) related information. By removing the teleological aspect of the search, we find that search perfomance declines to the point of failure, even though we still have replication, mutation and selection in operation. Doing this allows us to measure the relative contribution of selection aspects (replication, mutation, selection) to search performance versus the contribution of designed teleological aspects (ranking values based on mere similarity to a future desired goal.) We see that selection does little in the absence of teleologically chosen fitness assignments.

Wave Interference allows you to rank the strings based on the interference pattern created by two sine waves of different wavelength and phase. The lowest value represents the best string. Again, this method uses replication, mutation and selection, but does not use any information concerning the desired target (except for the normalization process.)

Custom Fitness Functions

In addition to the above mentioned functions, you can test any other fitness function you can come up with by using the Custom Fitness Function mode. A button labeled "Edit Custom Code" appears when that mode is selected, where users can enter valid javascript to assign two values used for comparison: aError and bError. Whichever has the lowest value becomes the "fittest" string, and is selected.

Click the "Edit Custom Code" function to open the custom code edit window.

You have access to any standard javascript function and can also code your own functions, as the default examples show. (The default code loaded is a javascript snippet for a Proximity Reward function.) The two strings you will compare are available as variables "a" and "b" in the javascript code, and the target string is simply the variable "target".

Custom Functions - Proximity Reward - Weasel Ware

You must hit "Update" to save the changes to your code. Hit "Cancel" to close the custom function window without saving any changes (or just close the window without hitting Update.)

Your custom code will be evaluated using eval() and if an exception occurs it will be trapped and a message displayed in red text. (On exceptions, aError and bError are both assigned a fitness value of 0.)

For additional tips on creating custom fitness, click on the "Tips for Creating Custom Fitness Functions" link on the bottom of the custom code edit window.

Multi-Run Mode

The new Multi-Run Mode allows you to run the simulation many times in a row automatically, in order to get information about average performance and to run experiments of your own. To begin, you enter in the box the number of times you want the simulation to run for (1000 is the default) and then click the "Mutliple Runs (Start)" button. You can pause the batch run at any time by hitting the "Stop" button and can reset the entire batch by clicking the "Reset" button.

Note, a "run" is a full search, which may result in many queries for each single run (iteration.) If you include Unassisted Random Search by not disabling it, you may never be able to go through a single run (since you will not find the target using that search, for sufficicently long strings.) A run must successfully find the target for every non-disabled search before it can move on to the next iteration.

After your runs are complete, you can click the "View Results" button to see summarized results for each run.

The columns in the displayed table are the recorded performances for each type of search enabled. For example, in the above table Deterministic Search found the target in 22 queries for the first run, 25 queries for the second run and 27 queries for the third run; the Proximity Reward Search took 1,750 queries for the first run and 950 queries for the third run.

The Mathematics

First, let's look at the partitioned search. Assuming uniformity, the probability of successfully identifying a specified letter with sample replacement at least once in Q queries is

The probability of identifying all L characters in Q queries is therefore

Solving for

gives

Setting

gives the median number of queries required for success using partitioned search. This is the number displayed on the Weasel GUI under partitioned search. If we ran a large number of simulations, half the number of queries needed to achieve success would be above the median and half below.

For L = 28 and N = 27, we obtain a median number of queries for partitioned search equal to

For example, the display below lists this number as 98.

For unassisted random search, the probability of success in Q queries is

Solving for

gives

The number

however, can be so small that most computers will calculate

as being identically zero. We thus find useful the approximation that, for

we can write

The number of queries for unassisted random search then becomes

Setting

gives the median number of queries required for success using unassisted random search. For L = 28 and N = 27, we obtain a median number of queries for partitioned search equal to

For example, the display below lists this number as 8.30e+39.

For deterministic search, the probability of success for a single letter in Q queries is

The probability of that all L letters being correct in Q queries is therefore

Solving for

gives

For L = 28 and N = 27, we obtain a median number of queries for the deterministic search is equal to

In the GUI, this is displayed as 26 queries.

In the binary and nucleotide modes, the same formulas are used except that N = 2 and N = 4, respectively.

Search Type:	English Phrase Binary Nucleotides (ACGT)
Search Phrase:
Number of Characters:	0

Current:
Median # of Queries:
History:	Disabled:

Current:
Median # of Queries:
History:	Disabled:

Current:
Median # of Queries:
History:	Disabled:

Current:
Options:	Offspring: Mutation Rate: % Fixed Single Mutation:
History:	Disabled:

Evolutionary Informatics

Investigating how information makes evolution possible

Projects

Quote

Weasel Ware

Weasel Ware - Evolutionary Simulation

Introduction

From Whence the WEASEL Cometh?

Oracles

References & Footnotes

Weasel Ware 2.0

Usage Instructions

Simulation Description

The Binary Mode

The Nucleotide Mode

Additional Features: Version 2.0

Proximity Neutral Search: Reward Functions

Custom Fitness Functions

Multi-Run Mode

The Mathematics

Current:
Reward Function:
Options:	Offspring: Mutation Rate: %
History:	Disabled: