In my previous post, cross-entropy is presented as a *cost function*, measuring the difference between given inputs and outputs in supervised learning. It is also an important concept from the Information Theory.

Before we talk about what cross-entropy is, let me introduce John, an imaginary freshman entering Westtown. He loves dessert and talks a lot about it. In fact, all he says are four words: ice-cream, chocolate, cookie, and yogurt. When he is back home after school, however, John only uses binary code to text with his classmates. So, the texts his classmates receive look like this:

To understand what the messages mean, John and his classmates need an established code system, a way to map sequences of binary bits into words. Here is a simple example:

With the code system in hand, John and his classmates can encode text messages by simply substituting words for codewords and vice versa.

It turns out that John does not use all of his four words as often. As an ice-cream fan, John shares his ice-cream-loving moments all the time. He sometimes mentions his time eating chocolate ice-cream with his family, and rarely does he mention anything else. So, his word frequency chart looks like the following:

Now we can plot out John’s word frequency against the number of binary bits each symbol matches. The area formed represents John’s average length of messages we use to send each word. These areas are formally named *entropy*.

Since typing and sending text message takes up time for John and his classmates, they want to reduce average codeword length for each word so they could spend minimum time texting the same amount of words. This is when *variable-length code* comes into play, mapping commonly used words (like “ice-cream”) to shorter codewords and less-frequently used ones to longer codewords (like “cookie” or “yogurt”). So, we have a new mapping between words and codewords:

Side note: the mapping is not randomly picked — it is designed as a function of the word frequency so it is uniquely decodable and not cause confusion when splitting the message into codewords.

Again, we could plot the word frequency against the number of binary bits. Here is how it looks like:

The result is amazing: we successfully reduce the average length to 1.75 bits! It will be the most optimized code in this case. Encoding words with it will take up a minimum number of bits and the text messages will be the most concise.

Very soon after the school started, John meets Juliet, another freshman at Westtown. Juliet is a chocolate-lover. She talks about chocolate all day, and her favourite is chocolate chip cookies. Juliet despises ice-cream though, and mentions it only when necessary. Despite this, they share their obsession with dessert and, interestingly, the same limited vocabulary size.

When Juliet started to use John’s code, unfortunately, the text messages she sends are much longer than John’s, since they have different word frequencies. As we plot Juliet’s entropy graph, we could see her average message length is as long as 2.25 bits! We call Juliet’s average message length using John’s code system the *cross-entropy*.

So, why do we care about cross-entropy in supervised machine learning? Well, cross-entropy provides us a way to measure the difference between the result our model produces and the provided outcome. Since both the result and the provided outcome could be both expressed in the form of frequency charts or possibility charts, cross-entropy fits its role as the cost function well. With the cost calculated, we could adjust the parameters to achieve a best-fit model for the given outcomes.

See you next week!

Work Cited

“I like this Maple Application – Vibration of Mindlin rectangular plates.” *Vibration of Mindlin rectangular plates – Application Center*, http://www.maplesoft.com/applications/view.aspx?SID=35302&view=html.

What is bit (binary digit)? – Definition from WhatIs.com. (n.d.). Retrieved October 14, 2017, from http://whatis.techtarget.com/definition/bit-binary-digit