Category Archives: Math and Technology

Moving below the surface (2): Cross-entropy — William

In my previous post, cross-entropy was presented as a cost function, measuring the difference between the outputs a model produces and the given outputs in supervised learning. It is also an important concept in information theory.

Before we talk about what cross-entropy is, let me introduce John, an imaginary freshman entering Westtown. He loves desserts and talks about them a lot. In fact, he says nothing but four words: ice-cream, chocolate, cookie, and yogurt. When he is back home after school, however, John only uses binary code to text his classmates. So, the texts his classmates receive look like this:


To understand what the messages mean, John and his classmates need an established code system, a way to map sequences of binary bits into words. Here is a simple example:


With the code system in hand, John and his classmates can encode text messages by substituting codewords for words, and decode them by doing the reverse.
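As a rough sketch of how such a code system could work (the specific 2-bit codewords below are my own assumption; any four distinct patterns would do):

```python
# A hypothetical fixed-length code: each of the four words maps to 2 bits.
code = {"ice-cream": "00", "chocolate": "01", "cookie": "10", "yogurt": "11"}
decode_table = {bits: word for word, bits in code.items()}

def encode(words):
    # Replace each word with its codeword and concatenate.
    return "".join(code[w] for w in words)

def decode(bits):
    # Split the bit string into 2-bit chunks and look each one up.
    return [decode_table[bits[i:i + 2]] for i in range(0, len(bits), 2)]

message = ["ice-cream", "chocolate", "ice-cream"]
print(encode(message))          # "000100"
print(decode(encode(message)))  # back to the original words
```

Decoding is easy here because every codeword has the same length; variable-length codes, which come up later in this post, need more care.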


It turns out that John does not use all four of his words equally often. As an ice-cream fan, John shares his ice-cream-loving moments all the time. He sometimes mentions eating chocolate ice-cream with his family, and rarely does he mention anything else. So, his word frequency chart looks like the following:


Now we can plot John’s word frequency against the number of binary bits each word’s codeword uses. The area under this plot represents the average number of bits per word in John’s messages. The minimum possible average length for a given frequency distribution is formally named entropy.
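As a sketch of this average-length calculation (the frequencies below are my assumption of what John’s chart shows, chosen to be consistent with the 1.75-bit figure later in the post):

```python
# Assumed word frequencies for John (not given numerically in the post).
freq = {"ice-cream": 0.5, "chocolate": 0.25, "cookie": 0.125, "yogurt": 0.125}

# Under the fixed-length code, every codeword is 2 bits long.
codeword_length = {word: 2 for word in freq}

# Average bits per word = sum over words of (frequency x codeword length).
average_length = sum(freq[w] * codeword_length[w] for w in freq)
print(average_length)  # 2.0 bits per word
```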


Since typing and sending text messages takes time, John and his classmates want to reduce the average codeword length so that they can spend the minimum time texting the same number of words. This is where variable-length code comes into play: it maps commonly used words (like “ice-cream”) to shorter codewords and less frequently used ones (like “cookie” or “yogurt”) to longer codewords. So, we have a new mapping between words and codewords:


Side note: the mapping is not randomly picked. It is designed as a function of the word frequencies, and it is uniquely decodable (no codeword is the beginning of another), so splitting a message back into codewords causes no confusion.

Again, we can plot the word frequency against the number of binary bits. Here is how it looks:


The result is amazing: we have reduced the average length to 1.75 bits! This is the optimal code in this case: encoding words with it takes the minimum possible number of bits, and the text messages are as concise as they can be.
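We can verify the 1.75-bit figure directly. Both the codewords and the frequencies below are my own assumptions (a standard optimal prefix code for a 1/2, 1/4, 1/8, 1/8 distribution), chosen to match the numbers in this post:

```python
import math

# Assumed word frequencies and a matching variable-length prefix code.
freq = {"ice-cream": 0.5, "chocolate": 0.25, "cookie": 0.125, "yogurt": 0.125}
code = {"ice-cream": "0", "chocolate": "10", "cookie": "110", "yogurt": "111"}

# Average codeword length under John's own word frequencies.
average_length = sum(freq[w] * len(code[w]) for w in freq)

# Entropy H(p) = -sum p * log2(p): the theoretical minimum average length.
entropy = -sum(p * math.log2(p) for p in freq.values())

print(average_length)  # 1.75
print(entropy)         # 1.75 -- this code achieves the minimum
```

The two numbers agree because each codeword length here equals log2(1/frequency), which is exactly the condition for a code to hit the entropy bound.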

Soon after school starts, John meets Juliet, another freshman at Westtown. Juliet is a chocolate-lover: she talks about chocolate all day, and her favourite is chocolate chip cookies. Juliet despises ice-cream, though, and mentions it only when necessary. Still, the two share an obsession with desserts and, interestingly, the same limited four-word vocabulary.


When Juliet starts to use John’s code, unfortunately, the text messages she sends come out much longer than John’s, since the two have different word frequencies. If we plot Juliet’s graph, we can see that her average message length is as long as 2.25 bits! Juliet’s average message length under John’s code system is called the cross-entropy.
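To reproduce the 2.25-bit figure, we need Juliet’s frequencies; the post does not give them numerically, so the values below (and the codewords) are assumptions chosen to be consistent with the stated result:

```python
# Assumed frequencies for Juliet: mostly chocolate, hardly any ice-cream.
juliet_freq = {"chocolate": 0.5, "cookie": 0.25,
               "yogurt": 0.125, "ice-cream": 0.125}

# An assumed variable-length code optimized for John, not for Juliet.
john_code = {"ice-cream": "0", "chocolate": "10",
             "cookie": "110", "yogurt": "111"}

# Cross-entropy: Juliet's word frequencies weighted by John's codeword lengths.
cross_entropy = sum(juliet_freq[w] * len(john_code[w]) for w in juliet_freq)
print(cross_entropy)  # 2.25 bits per word on average
```

John’s shortest codeword belongs to a word Juliet rarely uses, while her frequent words sit on his longer codewords, which is exactly why her messages blow up.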

So, why do we care about cross-entropy in supervised machine learning? Well, cross-entropy gives us a way to measure the difference between the result our model produces and the provided outcome. Since both the model’s result and the provided outcome can be expressed as frequency charts, or probability distributions, cross-entropy fits the role of cost function well. With the cost calculated, we can adjust the parameters to reach a best-fit model for the given outcomes.
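As a sketch of cross-entropy as a cost function (the distributions below are made up purely for illustration):

```python
import math

def cross_entropy(true_dist, predicted_dist):
    # H(p, q) = -sum_i p_i * log2(q_i): how many bits it costs, on average,
    # to encode outcomes from true_dist with a code built for predicted_dist.
    return -sum(p * math.log2(q) for p, q in zip(true_dist, predicted_dist))

# A made-up 3-class example: the true label is class 0 (one-hot encoded).
label = [1.0, 0.0, 0.0]
good_guess = [0.9, 0.05, 0.05]
bad_guess = [0.2, 0.4, 0.4]

# The closer the prediction is to the label, the lower the cost.
print(cross_entropy(label, good_guess))  # ~0.15
print(cross_entropy(label, bad_guess))   # ~2.32
```

Training then amounts to nudging the model’s parameters in whatever direction lowers this number.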

See you next week!


Game Theory in Tennis–Summer


Today, I’m going to explore with you, through the lens of game theory, strategies within tennis.

As we know, in a tennis match, players are trying to score points against each other by hitting the ball into a corner of the opponent’s court where s/he is not able to hit it back.

Let’s focus our attention on a passing exchange between two tennis players. The server, player 1, has two choices: hitting towards the receiver’s left (backhand side) or right (forehand side, assuming right-handedness). The receiver at the net, player 2, also has two choices: leaning left or leaning right to catch the ball. Because each player is differently skilled at backhand and forehand, and each player’s satisfaction depends on his probability of scoring (or preventing his opponent from scoring), we have the following payoff matrix:

Player 1 (rows) \ Player 2 (columns)    Left      Right
Left                                    50, 50    80, 20
Right                                   90, 10    20, 80

In this matrix, we cannot find any strictly or weakly dominated strategy, nor a traditional (pure-strategy) Nash Equilibrium. In fact, this observation coincides with our real-world experience: a tennis match is never good if the players always lean to one specific side!

In this tennis game, however, a Nash Equilibrium still exists, but in it both players are playing what game theory calls mixed strategies. A mixed strategy is a randomization over the available strategies that assigns each strategy (left/right) a probability p such that 0 < p < 1 and all the probabilities add up to 1. For example, a possible mixed strategy for player 1 is hitting left 20% of the time and hitting right 80% of the time.

The payoff of a mixed strategy is given by the sum, over the individual strategies, of each strategy’s probability times its expected payoff against the opponent’s strategy. For instance, if player 1 plays the mixed strategy (0.2, 0.8) and player 2 plays the mixed strategy (0.7, 0.3), then the payoff of player 1’s mixed strategy is:
0.2 × (0.7 × 50 + 0.3 × 80) + 0.8 × (0.7 × 90 + 0.3 × 20) = 0.2 × 59 + 0.8 × 69 = 67
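This expected-payoff calculation can be sketched in code, using the payoff matrix above:

```python
# Player 1's payoffs: rows = player 1 (Left, Right), columns = player 2.
payoff_p1 = [[50, 80],
             [90, 20]]

def expected_payoff(p1_mix, p2_mix, payoffs):
    # Probability-weighted sum of payoffs over all four pure-strategy pairs.
    return sum(p1_mix[i] * p2_mix[j] * payoffs[i][j]
               for i in range(2) for j in range(2))

# Player 1 plays (0.2, 0.8); player 2 plays (0.7, 0.3).
print(expected_payoff([0.2, 0.8], [0.7, 0.3], payoff_p1))  # ~67
```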
So, given all these definitions, how do we find the mixed-strategy equilibrium of this game? We have to reason one step further.

If the two players are at a Nash Equilibrium, then both of them are playing their best responses to the other’s strategy. If a mixed strategy is a best response, then the individual strategies within the mix must all be best responses too, or else we could build a better response by taking the lower-payoff strategy out of the mix. Thus, the only way both hitting left and hitting right can be best responses is if their expected payoffs are equal.

Let’s denote player 1’s equilibrium mixed strategy as (p, 1-p) and player 2’s as (q, 1-q). Given our inference, if we want to calculate player 1’s equilibrium mix, we can look at player 2’s expected payoffs. Because player 2 is playing a mixed-strategy best response, his expected payoffs from leaning left and leaning right should be equal, as we concluded above. Thus, we have:

50p + 10(1-p) = 20p + 80(1-p)      (player 2’s Left payoff = Right payoff)

We can solve this equation: p = 0.7, 1-p = 0.3. Thus, player 1’s mixed strategy at the Nash Equilibrium is (0.7, 0.3).

Similarly, we can solve for player 2’s Nash Equilibrium mixed strategy with player 1’s payoff.

50q+80(1-q)=90q+20(1-q)      q= 0.6, 1-q=0.4

Thus, the game has a mixed strategy Nash Equilibrium at {(0.7,0.3), (0.6,0.4)}.
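We can sanity-check this equilibrium numerically: at (0.7, 0.3) and (0.6, 0.4), each player should be indifferent between his two pure strategies. A quick sketch using the payoff matrix above:

```python
payoff_p1 = [[50, 80], [90, 20]]  # rows: P1 Left/Right; cols: P2 Left/Right
payoff_p2 = [[50, 20], [10, 80]]  # same layout, player 2's payoffs

p, q = 0.7, 0.6  # each player's equilibrium probability of Left

# Player 2's expected payoff from each pure strategy against P1's (p, 1-p):
p2_left = p * payoff_p2[0][0] + (1 - p) * payoff_p2[1][0]
p2_right = p * payoff_p2[0][1] + (1 - p) * payoff_p2[1][1]

# Player 1's expected payoff from each pure strategy against P2's (q, 1-q):
p1_left = q * payoff_p1[0][0] + (1 - q) * payoff_p1[0][1]
p1_right = q * payoff_p1[1][0] + (1 - q) * payoff_p1[1][1]

print(p2_left, p2_right)  # both ~38 -- player 2 is indifferent
print(p1_left, p1_right)  # both ~62 -- player 1 is indifferent
```

Since shifting probability toward either side gains nothing, neither player has an incentive to deviate, which is exactly what makes this a Nash Equilibrium.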

This model draws heavily on the concepts I’ve introduced in my previous posts and involves slightly more calculation and more variables than my previous examples. Yet, despite being slightly more complex, it is by far the most accurate representation of the choices we face in sports and in life.

See you next week!






Moving below the surface: Artificial Neural Networks — William

As mentioned before, we will be discussing artificial neural networks in this blog. A significant subject in the field of supervised machine learning, neural networks excel at solving classification problems and, when combined with convolutional layers, are the most popular model for image classification tasks. Continue reading

The Game Theory of Community Weekend Event–Summer

Today, I’m going to examine with you, through the lens of game theory, some of the most memorable times at Westtown: Community Weekend Events!

As we all know, boarders at Westtown are required to attend four community weekend events each year. In these events, we have the common goal of having a great time and building a tighter community. But what makes these events fun? Continue reading

Simple Linear Regression in Machine Learning – Kevin

You might remember linear regression from statistics as a method for producing a linear equation that models the relationship between two variables. Not surprisingly, linear regression is quite similar in machine learning, except that the focus is on prediction rather than the interpretation of data. Regression is a supervised learning algorithm (if you remember from my previous blog) that predicts a real-valued output when given an input. In this blog post, I will discuss the model representation of simple linear regression and introduce its cost function.

Continue reading

Wetting the Feet: An Introduction to Machine Learning — William

As we discussed in the previous post, machine learning is one of the main branches of artificial intelligence, in which we aim to build a rational agent. Machine learning is essential to the implementation of artificial intelligence, for it allows agents to adapt to different scenarios and to predict changes in the evolving environment around them. Continue reading

Progress – Cleo

Through the last summer and over the course of the past few weeks, music has become an increasingly huge part of my life. It started for fun and has developed slowly (but not that slowly) into absolute devotion. In this time, I’ve made 8 songs. And as each one passes, I see myself growing tremendously. Just two weeks ago, I decided to bet on myself, I invested in my own work, and, with my mother’s assistance, upgraded my music equipment to a more professional level. Continue reading

Diving Deep Into Deep Learning: Dipping a Toe in the Water — William

Since its birth in the mid-20th century, artificial intelligence has been present in various aspects of popular culture, from Isaac Asimov’s Three Laws of Robotics in Handbook of Robotics and HAL in 2001: A Space Odyssey to The Terminator and The Matrix, and in recent years it has even influenced our way of life, through AlphaGo and Siri. But what is artificial intelligence? Or, in other words, what is the ultimate goal of artificial intelligence? Continue reading

A Brief Introduction to Machine Learning – Kevin

There are two widely accepted definitions of machine learning. The phrase was first coined in 1959 by computer scientist Arthur Lee Samuel, who trained a computer program to play checkers against humans. He later described his work as “the field of study that gives computers the ability to learn without being explicitly programmed.” Decades later, Professor Tom Mitchell gave a more modern and formal definition: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
Continue reading