Ever wonder how services like Netflix or Pandora choose media to suggest to you? If you’ve been reading this blog for a while, you’re familiar — at least a little bit — with recommender engines. In our post “How Machine Learning Will Affect Your Next Vacation,” we talked about the impact machine-learning recommender engines have on regular consumers. But here, we want to dive deeper and talk about the math and science behind recommender engines.
First, for those of you less familiar, recommender engines drive everything that delivers an item or a suggestion based on your previous behavior as well as a number of other factors. This item or suggestion could be a song, a movie, a coupon, a person to connect with on social media — anything at all. And every engine requires a unique set of techniques and variables to make it run properly.
What seems like a huge web to non-scientists is actually a large mathematical matrix that needs to be filled in. Scientists do this by applying knowledge-discovery techniques to the challenge of making personalized recommendations. More specifically, they use collaborative filtering, which forms the basis of many popular recommender algorithms. They also use Hadoop MapReduce, which is the contemporary means by which large amounts of data are processed for recommenders to make accurate and timely predictions. These two invaluable tools are explained in depth below.
Collaborative Filtering to the Rescue
Collaborative filtering is a popular methodology used with recommendation engines. The basic idea behind collaborative filtering is predicting the interests of one user by collecting and analyzing the interests of many users. Suppose we sell groceries online and have three users of our website. We wish to make recommendations of fruit items for User 3. First, we need to find a group of other users who possess likes and dislikes that are similar to User 3. Once identified, this group is called a neighborhood. Then we look at other fruit that is liked within the neighborhood and recommend those items to User 3. This process is illustrated in the figure below.
User 1 purchased an orange, an apple, and a banana. User 2 purchased an orange and a banana. User 3 purchased only an apple. To recommend an item to User 3, we observe that users who purchased an apple also purchased an orange, so we should recommend an orange to User 3.
While this is the basic premise, companies need a mechanism that can accomplish this across millions of users. They do this with a data structure called the user-item matrix, which allows them to find relationships between users and items and present an ordered list of recommendations to the user. Under real-world conditions, companies pull large volumes of historical data into the matrix to improve the accuracy of the recommendations.
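The grocery example above can be sketched as a small user-item matrix. This is a minimal illustration, not Mahout's implementation: rows are the three users, columns are the three fruits, and a co-purchase count is used to score items for User 3 (note that orange and banana tie here, and the tie is broken toward the first column).

```python
import numpy as np

# Toy user-item matrix for the grocery example: rows are users,
# columns are fruits (orange, apple, banana). A 1 marks a purchase.
items = ["orange", "apple", "banana"]
R = np.array([
    [1, 1, 1],  # User 1: orange, apple, banana
    [1, 0, 1],  # User 2: orange, banana
    [0, 1, 0],  # User 3: apple only
])

def recommend(user_idx, R, items):
    """Score unpurchased items by how often they co-occur with the
    user's purchases across all users, and return the top item."""
    cooccurrence = R.T @ R              # item-by-item co-purchase counts
    np.fill_diagonal(cooccurrence, 0)   # ignore self-co-occurrence
    scores = cooccurrence @ R[user_idx]
    scores[R[user_idx] == 1] = 0        # don't re-recommend owned items
    return items[int(np.argmax(scores))]

print(recommend(2, R, items))  # → orange
```

In practice the matrix is enormous and sparse, which is why distributed processing (discussed below) becomes necessary.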
The crux of collaborative filtering is identifying the neighborhood using the concept of similarity among users in terms of their behavior (e.g. selecting movies). User behavior forms the basis of what to recommend. A statistical learning algorithm called k-nearest neighbors (KNN) performs this determination. KNN uses a distance metric to measure similarity; common metrics include Euclidean distance, Manhattan distance, and the Jaccard similarity coefficient.
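As a sketch of how a neighborhood might be found, the snippet below uses the Jaccard similarity coefficient over hypothetical binary "liked it" vectors and keeps the k most similar users. The user names and vectors are made up for illustration.

```python
import numpy as np

# Hypothetical behavior vectors: 1 means the user liked that movie.
users = {
    "u1": np.array([1, 1, 0, 1, 0]),
    "u2": np.array([1, 0, 0, 1, 0]),
    "u3": np.array([0, 1, 1, 0, 1]),
    "target": np.array([1, 1, 0, 1, 1]),
}

def jaccard(a, b):
    """Jaccard similarity: |intersection| / |union| of liked items."""
    inter = np.sum((a == 1) & (b == 1))
    union = np.sum((a == 1) | (b == 1))
    return inter / union if union else 0.0

def knn_neighborhood(target, others, k):
    """Return the k users most similar to `target` -- the neighborhood."""
    ranked = sorted(others, reverse=True,
                    key=lambda u: jaccard(users[target], users[u]))
    return ranked[:k]

print(knn_neighborhood("target", ["u1", "u2", "u3"], k=2))  # → ['u1', 'u2']
```

Swapping `jaccard` for a Euclidean or Manhattan distance (with the sort order reversed, since smaller distances mean more similar) changes the metric without changing the overall procedure.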
Hadoop-Based Recommenders Provide the Power
One popular way to implement a recommender engine is to use an algorithm from the open source, scalable, machine-learning library Apache Mahout, which is implemented on top of Apache Hadoop and uses the MapReduce distributed processing paradigm. We can use Mahout’s matrix algebra functionality to get us from user-behavior histories to useful indicators for recommendation. For this purpose, we build three matrices (i.e. data structures): the user-item matrix, the co-occurrence matrix, and the indicator matrix. As mentioned above, the user-item matrix contains pairs of users and items, which is then converted to an item-by-item matrix called the co-occurrence matrix (explained and demonstrated below). The indicator matrix is the end result that serves as the basis for recommendations.
Mahout’s MapReduce uses a co-occurrence matrix to capture “item similarity,” which provides some degree of similarity between items. The idea is that the more frequently any two items co-occur, the more similar they are likely to be. In essence, this means we are counting the number of times each pair of items occurs together across users. We can then use a statistical test in Mahout called the log-likelihood ratio test to determine which co-occurrences are anomalous enough to qualify as indicators. Mahout performs this task in parallel, which highlights the power of MapReduce.
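The log-likelihood ratio test for a pair of items can be computed from a 2x2 contingency table of counts. Below is a compact Python sketch following Ted Dunning's entropy-based formulation, which Mahout's implementation is based on; it is illustrative, not Mahout's actual (Java) code.

```python
import math

def x_log_x(x):
    """x * ln(x), with the 0 * ln(0) = 0 convention."""
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    total = sum(counts)
    return x_log_x(total) - sum(x_log_x(c) for c in counts)

def log_likelihood_ratio(k11, k12, k21, k22):
    """G-squared statistic for a 2x2 co-occurrence table:
    k11 = users with both items, k12/k21 = one item without the
    other, k22 = neither. Large values flag co-occurrences that are
    unlikely to be chance, i.e. candidate indicators."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

print(log_likelihood_ratio(1, 1, 1, 1))    # independent counts → 0.0
print(log_likelihood_ratio(10, 0, 0, 10))  # strong co-occurrence → large
```

Co-occurrences whose score crosses a chosen threshold are kept in the indicator matrix; the rest are discarded as noise.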
Once we know how often items co-occur, we can predict the user’s preference toward a small subset of items the user has not seen before. Our goal for machine learning is to predict how likely an item is to be purchased based on the information gathered from that user, other users, or both. In the case of Mahout’s distributed recommender, we multiply the co-occurrence matrix by the user’s history vector (as a column vector), then sum the resulting products to come up with a recommendation score.
For instance, in the chart above, we’ve created a co-occurrence matrix for a movie recommender. The goal is to identify movies to recommend to Daniel. The matrix lists four movies, one of which Daniel has already purchased (The Monuments Men, as indicated by the “1” in the “Daniel’s Purchase History” column). We’re comparing the remaining three to predict which one he’d most likely purchase, based on their similarity scores to The Monuments Men. To do this, we multiply each row of the matrix (whose values were produced by Mahout’s distributed recommender) element by element against Daniel’s purchase-history column, then add the products together to create a recommendation score (so for American Hustle, we’re adding 0 + 0 + 3 + 0 = 3). American Hustle has the best score, so that is the recommendation.
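The arithmetic from the chart can be reproduced in a few lines. Note that only The Monuments Men, American Hustle, and American Hustle's score of 3 come from the original chart; the other two titles and the remaining co-occurrence counts are placeholder values invented for this sketch.

```python
import numpy as np

# Four movies; Daniel has purchased only The Monuments Men.
# "Movie B" / "Movie C" and all counts except American Hustle's 3
# are hypothetical stand-ins for the chart's values.
movies = ["American Hustle", "Movie B", "The Monuments Men", "Movie C"]
history = np.array([0, 0, 1, 0])  # Daniel's purchase-history column

# Item-by-item co-occurrence counts (symmetric, zero diagonal).
C = np.array([
    [0, 1, 3, 2],  # American Hustle co-occurs with Monuments Men 3 times
    [1, 0, 1, 0],  # Movie B (hypothetical)
    [3, 1, 0, 2],  # The Monuments Men
    [2, 0, 2, 0],  # Movie C (hypothetical)
])

scores = C @ history        # row-by-row products summed per movie
scores[history == 1] = 0    # skip movies Daniel already owns
best = movies[int(np.argmax(scores))]
print(best, scores.tolist())  # → American Hustle [3, 1, 0, 2]
```

American Hustle’s score of 0 + 0 + 3 + 0 = 3 beats the others, matching the walkthrough above.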
While you’re not quite ready to become a data scientist just by reading this blog post, if you’ve made it all the way to this point, you’re definitely closer than you were before. And recommender engines are an excellent place to start, as they’re one of the most popular ways companies use machine learning to improve business. Whether you’re a budding data scientist, a business executive looking to better understand the science behind the tools you’re using or considering, or even just a regular consumer, it behooves you to have a good grasp of how products, services, media, and more are pushed to end users. Recommender engines will only become more powerful and accurate as time goes on.
Want to learn more about recommender engines? Read our case study about how we developed one to help Expedia recommend hotels to its users.
Daniel D. Gutierrez is a Los Angeles–based data scientist working for a broad range of clients through his consultancy AMULET Analytics.