## A recommendation webservice in 10 minutes

This post will give a preview of a simple-to-use, configurable webservice for the recommenders of Apache Mahout, which I developed in cooperation with the Berlin-based company Plista.

The project’s name is kornak-api (kornak is the Polish word for mahout) and it provides a simple servlet-based recommender application available under the Apache license.

kornak-api aims to be a simple-to-setup and easy-to-use solution for small to medium-scale recommendation scenarios. You configure the recommenders you want to use, populate your database with user-item-interactions and train the recommenders afterwards. None of this requires you to write a single line of Java code!

As an example, this article will show you how to set up a recommendation webservice using the famous Movielens 1M dataset, which contains about a million ‘five-star’ ratings that 6000 users gave to 4000 movies. After you have ingested the data and trained your recommenders, they will be able to recommend new movies to the users in the dataset.

Please be aware that this article assumes that you are (at a general level) familiar with the concepts of recommendation mining and collaborative filtering. If you are looking for a practical introduction to those topics, I suggest you read the extremely well-written book Mahout in Action.

### Prerequisites

Make sure you have Java 6, Maven, git and MySQL installed.

The first thing you should do is check out the application from github.com/plista/kornakapi via git:
```shell
git clone https://github.com/plista/kornakapi.git kornakapi
```
After that, you need to create a database called movielens in MySQL with two tables. The table taste_preferences will contain all of your interaction data in the form of (user,item,rating) triples. The second table taste_candidates will be used later to allow you to filter and constrain the recommendations for certain use-cases.
```sql
CREATE DATABASE movielens;
USE movielens;

CREATE TABLE taste_preferences (
  user_id bigint(20) NOT NULL,
  item_id bigint(20) NOT NULL,
  preference float NOT NULL,
  PRIMARY KEY (user_id, item_id),
  KEY item_id (item_id)
);

CREATE TABLE taste_candidates (
  label varchar(255) NOT NULL,
  item_id bigint(20) NOT NULL,
  PRIMARY KEY (label, item_id)
);
```

### Configuration

In this example, we want to use two different recommenders: one that computes similarities between items based on the way they were rated, and another that uses a mathematical technique called matrix factorization to find highly preferable items.

To set up those recommenders, you have to create a file named movielens.conf with the following content:
```xml
<configuration>

  <modelDirectory>/tmp/</modelDirectory>
  <numProcessorsForTraining>2</numProcessorsForTraining>

  <storageConfiguration>
    <jdbcDriverClass>com.mysql.jdbc.Driver</jdbcDriverClass>
    <jdbcUrl>jdbc:mysql://localhost/movielens</jdbcUrl>
  </storageConfiguration>

  <itembasedRecommenders>
    <itembasedRecommender>
      <name>itembased</name>
      <similarityClass>org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity</similarityClass>
      <similarItemsPerItem>25</similarItemsPerItem>
      <retrainAfterPreferenceChanges>1000</retrainAfterPreferenceChanges>
      <retrainCronExpression>0 0 1 * * ?</retrainCronExpression>
    </itembasedRecommender>
  </itembasedRecommenders>

  <factorizationbasedRecommenders>
    <factorizationbasedRecommender>
      <name>weighted-mf</name>
      <usesImplicitFeedback>false</usesImplicitFeedback>
      <numberOfFeatures>20</numberOfFeatures>
      <numberOfIterations>10</numberOfIterations>
      <lambda>0.065</lambda>
      <retrainAfterPreferenceChanges>1000</retrainAfterPreferenceChanges>
      <retrainCronExpression>0 0 1 * * ?</retrainCronExpression>
    </factorizationbasedRecommender>
  </factorizationbasedRecommenders>

</configuration>
```

This file contains all the information necessary to set up the recommendation webservice. The modelDirectory is the place where the recommenders will save their trained models to. Make sure this directory is writable, and don’t use /tmp in a real setup. The recommenders will be trained in the background while the application still serves requests; with numProcessorsForTraining, you can configure how many cores should be used for the training. In storageConfiguration, the database connection is configured, which should be pretty self-explanatory.

Next up is the most important thing: the recommenders! Each recommender has a name and will be automatically retrained after a certain number of new datapoints have been added to the application (retrainAfterPreferenceChanges) or at a certain point in time (retrainCronExpression).

In itembasedRecommenders, you can create a couple of recommenders that use item similarity for computing recommendations. You need to set the similarity measure to use via similarityClass; the available measures can be found in Mahout’s JavaDoc. Furthermore, you need to tell the recommender how many similar items to compute per item (similarItemsPerItem).

In factorizationbasedRecommenders, you set up the recommenders that use matrix factorization. They need to know whether the data is implicit feedback such as clicks/pageviews or explicit feedback such as ratings (usesImplicitFeedback). Furthermore, you need to set the number of features to use (numberOfFeatures), the number of iterations for the training (numberOfIterations) and the regularization parameter (lambda), which you have to determine experimentally.

That’s it with preparation and configuration, now it’s time for the fun part!

### Start the webservice

Go to the directory where you checked out the source code and fire up a local Tomcat server with the following command, in which you have to insert the correct path to your configuration file:
```shell
mvn -Dkornakapi.conf=/path/to/movielens.conf tomcat:run
```
You should see the recommender application starting up now.

### Ingestion

Now it’s time to feed the running service with some data! Download and unzip the Movielens 1M dataset from http://www.grouplens.org/node/12. Open a second shell and convert the data to Mahout’s standard input format with the following command:
```shell
cat ratings.dat | sed s/::/,/g | cut -d, -f1,2,3 > movielens1M.csv
```
After that is done, you can push the data file via HTTP POST to the recommender application:
```shell
curl -F "file=@movielens1M.csv" "http://localhost:8080/kornakapi/batchSetPreferences?batchSize=1000"
```
The data will now be inserted into MySQL in a streaming, batched manner. Wait a few moments until you see the following console output:
```
INFO storage.MySqlStorage: imported 1000209 records in batch. done.
```

### Training

The recommenders need to be trained before we can finally request recommendations. We kickstart this manually for each recommender by opening these URLs in a browser:

http://localhost:8080/kornakapi/train?recommender=itembased

http://localhost:8080/kornakapi/train?recommender=weighted-mf

Note that the training will be queued in the background, so you won’t see any output in the browser. Look at the terminal where you started the application to watch the training progress. The recommenders and their caches will be refreshed automatically after the training has completed.

### Recommendations

Once the training has completed, you can request recommendations by invoking the following URL with the name of the recommender to use, the ID of the user to compute recommendations for and the number of recommendations you want to get:

http://localhost:8080/kornakapi/recommend?recommender=weighted-mf&userID=12&howMany=5

It will respond with a simple JSON answer like this:
```
[{itemID:557,value:5.988698},{itemID:578,value:5.0461025},{itemID:1149,value:4.9268165},{itemID:572,value:4.9265957},{itemID:3245,value:4.8139095}]
```

### Filtering

It is common in recommendation scenarios that in certain situations only certain items should be recommended. Imagine, for example, an online shop where products might be out of stock, or a forum where posts might be outdated.

kornak-api offers a very simple solution for that. It lets you create so-called candidate sets, which you can think of as sets of itemIDs stored under a chosen label.

We will create a set called testing and add the item 557 to it:
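The actual call for this step was not shown in the original post. Since the taste_candidates table we created earlier simply stores (label, item_id) pairs, one way to illustrate the idea is a direct insert via the MySQL client (note: this is an assumption of mine and bypasses the webservice; check the kornak-api API documentation for the proper HTTP endpoint):

```shell
# hypothetical direct insert; kornak-api may offer a dedicated HTTP endpoint for this
mysql movielens -e "INSERT INTO taste_candidates (label, item_id) VALUES ('testing', 557);"
```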

If we add the label to the recommendation request, only items contained in the candidate set will be returned:

http://localhost:8080/kornakapi/recommend?recommender=weighted-mf&userID=12&label=testing

```
[{itemID:557,value:5.988698}]
```

### Outlook

That’s it, I hope you liked this first preview of kornak-api! In the next weeks, more information and a detailed description of its API will be posted (and its source code will be commented).

## Announcement: GameDuell TechTalk

I have been invited to give a TechTalk at GameDuell on the topic of Large Scale Graph Processing with Apache Giraph.

Anyone interested in attending can find further information in the announcement on the GameDuell blog and has to register via XING.

## Introducing Apache Giraph for Large Scale Graph Processing

On Wednesday I gave a talk titled “Introducing Apache Giraph for Large Scale Graph Processing” at the Berlin Hadoop Get Together. Thanks again to David and Thomas for the organization; it was my second time speaking at this Get Together and it’s always great to see how well it is organized.

I gave a high-level introduction to Pregel/Giraph; here are the video and the slides for anyone who missed it:

## Giraph and small clusters

Working with Hadoop on a small cluster is really frustrating. It’s very common in research to work with less than ten machines and I always had the impression that Hadoop is simply not aimed at such small clusters. When monitoring my jobs, I always saw that they couldn’t fully utilize the hardware. The Stratosphere guys at my department could even beat Hadoop’s running times by an order of magnitude for some jobs on comparable hardware.

Recently I’ve turned towards using Apache Giraph, a loose implementation of Google’s Pregel system. Giraph ‘abuses’ Hadoop to schedule a non-stopping map-only job, loads the input graph into the aggregate main memory of the cluster and starts its computations afterwards. It’s obvious that this will give a much better utilization of the hardware.

When developing graph algorithms, I usually test them using the Wikipedia page link graph (its undirected version has about 200 million edges and 6 million vertices; note that Giraph aims at processing much larger graphs). My cluster has 6 machines with 16 cores and 32GB memory each, so I leave one core per machine for the OS and run 6 * 15 = 90 Giraph workers on my cluster. The result is that executing iterative algorithms with Giraph is about 10 to 15 times faster than applying the same algorithm with Hadoop! Finding all connected components of the Wikipedia graph takes less than 5 minutes, for example!

Here’s a screenshot of Giraph’s utilization on my cluster (note that cloud-11 doesn’t run a tasktracker):

## Why Apache Giraph is more than a graph processing system

The Apache Giraph project is a fault-tolerant in-memory distributed graph processing system which runs on top of a standard Hadoop installation. It is capable of running any standard Bulk Synchronous Parallel (BSP) operation over any dataset which can be represented as a graph, and is a loose implementation of Google’s Pregel system.

Intuition suggests that Giraph will be usable for graphs only, yet this assumption is completely wrong! A graph is a simple, intuitive and powerful mathematical model, but it is only a model and the underlying data can always also be represented using different models, e.g. by matrices and vectors. I’ll demonstrate the connection between graphs, Giraph and Linear Algebra on the example of a simplified PageRank implementation.

PageRank is a famous algorithm for measuring the authority of the vertices of a graph based on the underlying link structure. In its simplest form, PageRank is defined recursively: each vertex distributes its authority to its neighbors through its edges in equal proportions. In mathematical terms that means that the pagerank $$p_i$$ of a vertex $$i$$ is the sum of the pageranks $$p_j$$ of all vertices $$j$$ that link to it, where each $$p_j$$ has to be divided by the outdegree $$d_j$$ of vertex $$j$$. This relation is captured in the following recursive formula:

$$p_i = \sum_{j \,:\, (j,i) \in E} \frac{p_j}{d_j}$$

Giraph offers a simple, vertex-centric programming model which makes it super easy to implement PageRank. In each but the first superstep, we recompute our pagerank from the pagerank portions we received from our neighbors. Then we send our own pagerank portions out to our neighbors. If a maximum number $$k$$ of supersteps has been executed, we halt the algorithm:

```java
class PageRankVertex extends LongDoubleFloatDoubleVertex {

  @Override
  public void compute(Iterator<DoubleWritable> messages) {
    if (getSuperstep() > 0) {
      // recompute our pagerank from the portions sent by our neighbors
      double pageRank = 0;
      while (messages.hasNext()) {
        pageRank += messages.next().get();
      }
      setVertexValue(new DoubleWritable(pageRank));
    }

    if (getSuperstep() < k) {
      // distribute our pagerank to our neighbors in equal proportions
      sendMsgToAllEdges(
          new DoubleWritable(getVertexValue().get() / getNumOutEdges()));
    } else {
      voteToHalt();
    }
  }
}
```

So far so good, let us get to the Linear Algebra part of the picture now. I claim that from a Linear Algebra point of view, the code above is executing a series of matrix vector multiplications! Would you believe that?

Let us use a simple graph with three vertices as example and walk through the math step by step:

Each graph can be represented by its so-called adjacency matrix. A cell $$a_{i,j}$$ of that matrix is 1 if the edge $$(i,j)$$ exists in the graph and 0 otherwise. The adjacency matrix of our example graph looks like this:

$$A = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}$$

We need a slightly different matrix for computing pagerank: the so-called transition matrix. A cell $$t_{i,j}$$ of that matrix holds the probability of getting to vertex $$i$$ from vertex $$j$$ if we randomly walked the graph and followed each edge with equal chance. To obtain the transition matrix, we only need to transpose the adjacency matrix and normalize the columns so that they sum up to 1:

$$T = \begin{bmatrix} 0 & 0 & .5 \\ 1 & 0 & .5 \\ 0 & 1 & 0 \end{bmatrix}$$

The transition matrix is a so-called Markov matrix. That means we transformed our graph into a Markov chain, with the vertices representing states and $$T$$ describing the transition probabilities between them. It can be shown that the largest eigenvalue of a Markov matrix is 1 and that its corresponding eigenvector $$r$$ describes the steady state of the underlying Markov chain (the long-term probabilities of being in particular states). This is exactly the vector consisting of the pageranks of all vertices. It satisfies the following equation:

$$Tr = r$$

The easiest way to find the eigenvector corresponding to the largest eigenvalue of a matrix is the so-called power iteration. We create an initial pagerank vector $$r_0$$ in which each vertex is reached with equal probability and multiply it by the $$k$$-th power of $$T$$. If $$k$$ is large enough, the result converges to the pagerank vector $$r$$:

$$r = T^k r_0$$

Let's go back to our example and look at the numbers. We have three vertices, so our initial pagerank vector $$r_0$$ would look like this:

$$r_0 = \begin{bmatrix} 0.33\\ 0.33 \\ 0.33 \end{bmatrix}$$

Instead of computing the $$k$$-th power of $$T$$, it's much cheaper to simply apply $$T$$ $$k$$-times to the vector $$r_0$$.

$$r = T^k r_0 = T(...(Tr_0))$$

Let's take a detailed look at a single multiplication of this algorithm; we choose the first one ($$Tr_0$$) as an example. It results in a vector $$r_1$$ that describes the pageranks of the vertices after one iteration:

$$r_1 = Tr_0 = \begin{bmatrix} 0 & 0 & 0.5 \\ 1 & 0 & 0.5 \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} 0.33\\ 0.33 \\ 0.33 \end{bmatrix}$$

So how exactly do we get to the values of $$r_1$$? The $$i$$-th component of $$r_1$$ consists of the dot product of the $$i$$-th row of $$T$$ and the vector $$r_0$$ which represents the previous pageranks:

$$r_1 = Tr_0 = \begin{bmatrix} 0 \cdot 0.33 + 0 \cdot 0.33 + 0.5 \cdot 0.33\\ 1 \cdot 0.33 + 0 \cdot 0.33 + 0.5 \cdot 0.33 \\ 0 \cdot 0.33 + 1 \cdot 0.33 + 0 \cdot 0.33 \end{bmatrix}$$

Let's have a look at the first component of $$r_1$$ which is identical to the pagerank of vertex 1 after one iteration:

The dot product $$0 \cdot 0.33 + 0 \cdot 0.33 + 0.5 \cdot 0.33$$ tells us the following: in order to get the new pagerank of vertex 1, take zero times the previous pagerank of vertex 1, add zero times the previous pagerank of vertex 2 and add half of the previous pagerank of vertex 3.

Do you notice that this is identical to the formulation we started with? The vertex implementation we created for Giraph computes exactly this sparse dot product in each iteration! We never see the first two addends because they are zero: there are no edges to vertex 1 from vertices 1 and 2.
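To convince yourself of the numbers, here is a tiny, self-contained plain-Java sketch of the power iteration on the example matrix (no Giraph involved; the class and method names are mine):

```java
public class PowerStep {

    // one power-iteration step: multiply the transition matrix T by the vector r
    static double[] step(double[][] T, double[] r) {
        double[] next = new double[r.length];
        for (int i = 0; i < T.length; i++) {
            for (int j = 0; j < r.length; j++) {
                next[i] += T[i][j] * r[j];
            }
        }
        return next;
    }

    public static void main(String[] args) {
        double[][] T = { { 0, 0, 0.5 }, { 1, 0, 0.5 }, { 0, 1, 0 } };
        double[] r = { 1.0 / 3, 1.0 / 3, 1.0 / 3 };
        for (int k = 0; k < 50; k++) {
            r = step(T, r);
        }
        // r approaches the steady-state pagerank vector
        System.out.printf("%.2f %.2f %.2f%n", r[0], r[1], r[2]);
    }
}
```

A single step reproduces $$r_1$$ from above; after enough iterations the vector converges to the steady state (0.2, 0.4, 0.4).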

We can do lots of unexpected things if we can compute matrix-vector products! In fact, we can make Giraph load a graph that represents any matrix. I hope this proves my initial claim that Giraph is not limited to graph algorithms; its computational model can be applied to a wide variety of problems. I plan to try my luck on a bunch of the more tricky ones during the next months!

Finally, I recommend keeping in mind what my colleague Jerome chose as the core statement of his doctoral dissertation:

## Running Giraph’s unit tests in pseudo-distributed mode

The Apache Giraph project is a fault-tolerant in-memory distributed graph processing system which runs on top of a standard Hadoop installation. It is capable of running any standard Bulk Synchronous Parallel (BSP) operation over any dataset which can be represented as a graph, and is a loose implementation of Google’s Pregel system.

I’m currently digging into the code and have started to implement some data mining algorithms on top of Giraph. At the same time, I’m trying to contribute back fixes for bugs I stumble upon and tools I develop. Giraph is a young project driven by several highly motivated core developers. As in any good open source project, one requirement for getting patches accepted is that you don’t break the unit tests. However, Giraph has the very nice property that a couple of its unit tests can also be run on a real Hadoop cluster, and the developers ask contributors to do this.

It took me a while to get my environment set up for this, so I’d like to provide a small guide here in order to spare others the work of figuring it out. I use Ubuntu 9.10 and, although I have not tested other operating systems, I suppose this guide should work on any Unix-like platform.

Here are the steps:

1. Giraph uses Apache Maven 3 as its build tool. Grab apache-maven-3.0.3-bin.tar.gz from a download mirror and unpack it into a local directory.
2. Follow this guide to set up a pseudo-distributed single-node Hadoop cluster.
3. Giraph’s code assumes that you can run at least 4 mappers at once; unfortunately, the default configuration allows only 2. Therefore we have to update mapred-site.xml:

    ```xml
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
    </property>

    <property>
      <name>mapred.map.tasks</name>
      <value>4</value>
    </property>
    ```

4. The last piece we need is Giraph’s source code. Check it out from http://svn.apache.org/repos/asf/incubator/giraph/trunk

When all parts are set up, we can issue the commands to get things started. First, we clean up the “distributed” filesystem:
```shell
rm -rf /tmp/hadoop-<username>
/path/to/hadoop/bin/hadoop namenode -format
```

Then we start the local hadoop instance:
```shell
/path/to/hadoop/bin/start-all.sh
```

And finally we run Giraph’s unit tests:
```shell
cd /path/to/giraph
/path/to/maven3/bin/mvn clean test -Dprop.mapred.job.tracker=localhost:9001
```

Now you can open a browser, point it to http://localhost:50030 and watch the Giraph tests running on your local Hadoop instance!

## Apache Mahout in research

I’m currently in Koblenz at the yearly review meeting for the ROBUST research project I’m involved in. The project deals with analyzing large scale business communities and is a perfect fit for applying scalable data mining techniques. I’m very glad that the project sees contributing to open source projects as an important dissemination activity and encourages the use of Apache projects such as Hadoop and Mahout.

The main contributions from my colleague Christoph and me consist of several graph mining algorithms implemented in MapReduce, as well as improvements to Mahout’s collaborative filtering implementation. We also plan to publish a paper about our findings in the next weeks.

In the demo and poster session, we presented Mahout’s collaborative filtering capabilities:

We provided a short terminal demo called ‘Using Hadoop for Collaborative Filtering and Link Prediction’ where we demonstrated Mahout’s parallel item-based collaborative filtering algorithms:

## RowSimilarityJob on Steroids

I’m currently working on improving RowSimilarityJob, one of my first contributions to Mahout. It’s a MapReduce job that computes the pairwise similarities between the row vectors of a sparse matrix. While this problem has quadratic worst-case runtime, one can achieve linear scalability when certain sparsity constraints of the matrix are fulfilled and appropriate downsampling is used.

This is part of my work for the ROBUST research project where this algorithm can be used to find near-duplicates of user posts in forums or to predict missing links in social graphs.

Here’s a picture of my current approach, more details to come:

## A plot of my twitter follower network

Social Network Analysis · 7 May 2011

I’m currently diving into the research area of social network analysis. As a first exercise I have mined and plotted my twitter follower network:

## Deploying a massively scalable recommender system with Apache Mahout

### Introduction

The purpose of this post is to explain how to use Apache Mahout to deploy a massively scalable, high-throughput recommender system for a certain class of usecases.

I’ll describe the shape of the usecases covered and give a step-by-step guide to setting up such a recommender system. Be aware that this guide is intended for readers already familiar with Collaborative Filtering and recommender systems who are evaluating Mahout as a foundation for their production systems. The focus here is on making the right engineering decisions rather than on explaining algorithms.

If you only want to learn about recommendation mining and try out Mahout, having a look at Mahout in Action might be a more suitable starting point.

### Anatomy of covered usecases

- the number of users is orders of magnitude larger than the number of items
- your users browse your site anonymously most of the time. More than 95% of all requests to your recommender system have to be answered with items similar to things the users are currently viewing, stuff they’ve put in a shopping cart, or their history of “last seen items”
- only a tiny fraction of the users log in and need personalized recommendations
- your item churn rate is relatively low; items are available for weeks or months, and it’s OK to have a waiting time of half a day or more until new items are included in the recommendations

Which scenarios or business models meet these conditions? I’d say most e-commerce sites and a lot of video portals have similar requirements, so keep those in mind as we continue.

### Some numbers

Here are some typical numbers for scenarios that the setup I’ll be describing shortly is able to handle:

- millions of users
- hundreds of thousands of items
- several hundred million user-item-interactions (so-called preferences)
- more than 100 requests per second to the recommender system

### Algorithmic approach

We’ll use a very common neighborhood-based method called Itembased Collaborative Filtering [1], which also forms the basis of the recommender systems deployed by Amazon and YouTube. Our data will very likely come from implicit user feedback such as viewed videos, purchased products, shopping carts or other signals, depending on your usecase. Inform your users about what data you record and use; this not only provides transparency regarding privacy, users have also been shown to adopt transparent systems more readily than systems they don’t understand.

The preferences that form our input data consist of simple user-item-pairs where each pair expresses that a user likes an item. From the hundreds of millions of those, we need to compute a matrix of item-item-similarities (where similarity refers not to content but to behavior!). The item-similarities will be used to answer questions like “users who like this also like …” and are furthermore necessary for computing personalized recommendations.

These item-similarities are relatively static, and it has been shown that it’s sufficient to precompute the 20 to 50 most similar items per item offline from time to time.

### Have Hadoop do the heavy lifting

We will use Mahout’s ItemSimilarityJob to precompute the item-item-similarities with Apache Hadoop. If you don’t have a cluster of your own, you can temporarily rent the necessary infrastructure from Amazon’s Elastic MapReduce service.

The preferences need to be available in HDFS or S3 in a simple text format where each line represents a single preference in the form

You need to pick a similarity measure (loglikelihood from [2] is a good choice for implicit feedback) as well as the maximum number of similar items per item to compute:
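The exact command was not preserved in this post; as a sketch, an invocation of ItemSimilarityJob might look roughly like this (the paths are placeholders, and the flag names follow Mahout’s 0.x releases, so they may differ in your version):

```shell
# input: one 'userID,itemID' line per preference (placeholder HDFS paths)
hadoop jar mahout-core-job.jar \
  org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob \
  --input /user/me/preferences.csv \
  --output /user/me/similarities \
  --similarityClassname SIMILARITY_LOGLIKELIHOOD \
  --maxSimilaritiesPerItem 50 \
  --booleanData true
```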

The resulting file contains the pairwise similarities in the form:

A small hint: decrease the value of the parameter maxCooccurrencesPerItem, or sample down users with an extreme number of preferences, if the computation takes too long.

If you run the computation on Elastic MapReduce, consider using the excellent Python library boto, which lets you easily upload data and start Hadoop jobs in the Amazon cloud.

### Setting up the infrastructure for the live recommender system

We have a way to precompute the item-similarities now, so let’s continue to plan the setup for the live recommender machines.

The first thing we need is a datastore for the preference data and the item-similarities. Mahout supports MySQL out of the box, so we’ll use it in this example, but it is not too hard to create customized implementations for other datastores if you’re no MySQL fan.

The preference data will be managed by MySQLBooleanPrefJDBCDataModel; you need to create the table as described in its JavaDoc and insert your data.

The item-similarities will also reside in the database and will be managed by Mahout’s MySQLJDBCInMemoryItemSimilarity; create the table for this and make sure it always holds the latest result of the Hadoop computation.

### Putting the puzzle together

Pick the servlet container and web framework of your choice and create a small Java application for the recommender system. To get a minimal working recommender, we just need to instantiate and put together a handful of objects as described here:

Which components exactly do we use here?

We set up a DataSource to connect to MySQL (make sure it’s pooled in production systems) and use it to create a DataModel that accesses the preference data stored in the table taste_preferences.

We also create an implementation of ItemSimilarity; the one we use loads all the precomputed similarities from taste_item_similarity into memory on application start.

In the last two lines, we instantiate implementations of CandidateItemsStrategy and MostSimilarItemsCandidateItemsStrategy, which decide how to find the initial set of items that could possibly be recommended, and use all components to set up our ItembasedRecommender.
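The original code listing is not reproduced in this post, but the wiring just described might look roughly like the following sketch (Mahout 0.x Taste APIs; the connection settings, database name and class name are placeholder assumptions of mine):

```java
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.model.jdbc.MySQLBooleanPrefJDBCDataModel;
import org.apache.mahout.cf.taste.impl.recommender.AllSimilarItemsCandidateItemsStrategy;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.jdbc.MySQLJDBCInMemoryItemSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

import com.mysql.jdbc.jdbc2.optional.MysqlDataSource;

public class RecommenderFactory {

  public static ItemBasedRecommender create() throws TasteException {
    // use a pooled DataSource in production!
    MysqlDataSource dataSource = new MysqlDataSource();
    dataSource.setServerName("localhost");
    dataSource.setDatabaseName("recommender");
    dataSource.setUser("user");
    dataSource.setPassword("secret");

    // preference data from the table taste_preferences
    DataModel dataModel = new MySQLBooleanPrefJDBCDataModel(dataSource);

    // precomputed similarities from taste_item_similarity, held in memory
    ItemSimilarity similarity = new MySQLJDBCInMemoryItemSimilarity(dataSource);

    // restricts the candidate items to those with precomputed similarities
    AllSimilarItemsCandidateItemsStrategy strategy =
        new AllSimilarItemsCandidateItemsStrategy(similarity);

    return new GenericBooleanPrefItemBasedRecommender(
        dataModel, similarity, strategy, strategy);
  }
}
```

recommender.mostSimilarItems(itemID, 10) then serves the anonymous requests and recommender.recommend(userID, 10) the personalized ones.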

### Computing most-similar-items

As we already said at the beginning, more than 95% of requests to the recommender system will go to recommender.mostSimilarItems(…), so we have to make sure these calls are blazingly fast.

Fortunately, we don’t have to do anything extra for that: MySQLJDBCInMemoryItemSimilarity has already loaded all the item-similarities into memory, and AllSimilarItemsCandidateItemsStrategy ensures that all possibly similar items are read only from this in-memory representation. So not a single database call or disk access is required for most-similar-items computations.

As the system is only doing some memory lookups, a most-similar-items computation should finish within milliseconds. If you have complicated application code wrapped around your recommender, or make use of a Rescorer that applies filters that might take some time, you could also put a simple cache like Ehcache in front of your recommender application.

If you benchmark your application, you should see that a single machine can easily handle 100 most-similar-items requests per second with this setup.

### Personalized recommendations

Recommendations for logged-in users (calls to recommender.recommend(…)) only account for a tiny fraction of our overall requests. We didn’t want our hundreds of millions of preferences lying around in RAM, so we moved them to the database. In order to compute personalized recommendations, we need to make exactly one database call that fetches the user’s preferences from MySQL; the rest of the computation again consists only of memory lookups.

Typically, the time for computing recommendations for one user is dominated by the database call over the network, yet the whole process should not take longer than 50 to 100 ms overall.

### Incorporating new data

New preferences as well as new users can simply be added to the table taste_preferences in real time, so personalized recommendations will always be up to date.

New items, however, will only become visible after the table taste_item_similarity has been updated with a result of the Hadoop computation that included interaction data for those items. Therefore the inclusion time for new items depends on how fast you can gather interaction data for them and how often you run the offline computation. As stated at the beginning, your usecase should allow one or two days for that.

### Some final thoughts on scalability

What exactly constitutes the scalability of this setup? Let’s have a look at the building blocks of the recommender system and discuss what can be done to handle more data or more requests.

1. Increase of data in the similarity computation
This is pretty much a no-brainer, as Hadoop already offers the possibility to add more machines to decrease running time. ItemSimilarityJob should scale linearly, though in some cases users with an extreme number of preferences need to be sampled down.
2. Increase of requests to the live machines
To handle increasing traffic to the live recommenders, simply add more machines. The live recommenders are stateless, so you can just load-balance the traffic across a pool of them.
3. Increase of data and users
The user preferences need to be stored in the database in a way that allows the preferences of a single user to be retrieved very fast as a whole. A single MySQL machine should be able to hold lots of this data. If the data does not fit on a single machine anymore, it shouldn’t be too hard to partition the table across several machines and create a custom DataModel for that.
4. Increase of items
The bottleneck of this setup is certainly the number of items, as the whole item-similarity-matrix has to fit into the memory of each live recommender machine and cannot be partitioned. However, as we only store something like the 20 to 50 most similar items per item, the number of items has to grow well into the millions before that matrix takes more memory than is available in today’s lower-end server machines.

### Sources

[1] Sarwar, Karypis, Konstan, Riedl: “Item-Based Collaborative Filtering Recommendation Algorithms”, WWW 2001.
[2] Dunning: “Accurate Methods for the Statistics of Surprise and Coincidence”, Computational Linguistics, 1993.
