In the Introduction to Recommender Systems course from Coursera that I am taking, there was one entire module dedicated to evaluation of different recommender systems. The course is built around the very comprehensive Lenskit recommender library, which provides a Groovy DSL to perform large scale evaluations. The assignment for this module consisted of evaluating 9 algorithms, each rated against 6 metrics either individually or across 10 different user neighborhood sizes (where the performance of the algorithm is dependent on neighborhood size), then answering some questions based on the data collected.
This sounds like a formidable task, but its really just a single Groovy DSL, and the script was mostly already provided. The real work was to analyze the data and answer the questions. The recommended tool for the analysis was R, but I am more comfortable with Python, and I wanted to check out Pandas, a Python library that provides R-like abstractions for data analysis. I didn't know Pandas, but figured it would be a good opportunity to learn it for this assignment. What follows is my attempt to use Pandas to analyze the data from the evaluation.
To learn about Pandas, I watched this 3 hour long video on Youtube. In it, Wes McKinney, the creator of Pandas, presents the library at Pycon 2013 at Santa Clara. Its a bit of time investment, but I thought it was well worth it, because it gives you a feel for the library and what you can do with it using a bunch of case studies. There is also a Pandas book, also by Wes McKinney, and there is a reference guide on the Pandas site (I used this also to figure out things I got stuck on).
Before we start on the analysis, we need to understand how the data is structured. We do this using a bit of exploratory analysis at the Python shell. As we can see, there are 15 columns in our dataset, of which the last 11 are metrics. Of these 11 metrics, we are going to be concerned only (because of the questions that follow) with only 5 of these - TestTime, RMSE.ByUser, nDCG, TopN.nDCG and TagEntropy@10.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | >>> import pandas as pd
>>> df = pd.read_csv("eval-results.csv")
>>> df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 270 entries, 0 to 269
Data columns (total 15 columns):
Algorithm 270 non-null values
Partition 270 non-null values
DataSet 270 non-null values
NNbrs 250 non-null values
BuildTime 270 non-null values
TestTime 270 non-null values
NUsers 270 non-null values
NAttempted 270 non-null values
NGood 270 non-null values
Coverage 265 non-null values
RMSE.ByRating 265 non-null values
RMSE.ByUser 265 non-null values
nDCG 265 non-null values
TopN.nDCG 265 non-null values
TagEntropy@10 265 non-null values
dtypes: float64(7), int64(6), object(2)
|
We see that there are 11 algorithms being tested. Of these, some are tested across user neighborhoods varying in size from 5 to 100, and the others are not evaluated with user neighborhoods at all (as can be seen from the nan in the result for df.NNbrs.unique()). This latter class contains GlobalMean, Popular, ItemMean and PersMean, which are not affected by changes in the size of the user neighborhood. We also see that the evaluation uses 5-fold cross validation (from the result of df.Partition.unique()) and that the DataSet column has only 1 value and can be safely ignored.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | >>> df.Algorithm.unique()
array(['GlobalMean', 'Popular', 'ItemMean', 'PersMean', 'Lucene',
'LuceneNorm', 'UserUserCosine', 'UserUser', 'UserUserNorm'],
dtype=object)
>>> df.NNbrs.unique()
array([ nan, 5., 10., 15., 20., 25., 30., 40., 50.,
75., 100.])
>>> df.ix[pd.isnull(df.NNbrs)]["Algorithm"].unique()
array(['GlobalMean', 'Popular', 'ItemMean', 'PersMean'], dtype=object)
>>> df.ix[-pd.isnull(df.NNbrs)]["Algorithm"].unique()
array(['Lucene', 'LuceneNorm', 'UserUserCosine', 'UserUser', 'UserUserNorm'],
dtype=object)
>>> df.Partition.unique()
array([0, 1, 2, 3, 4])
>>> df.DataSet.unique()
array(['RecSysMOOC'], dtype=object)
|
Now that we have some sense of how the data is set up, lets move on to question answering! I plan on using as much built-in Pandas functionality as possible (or as much as I know how to). To keep things simple, each question is answered using analysis that stands on its own, at the cost of some code duplication. This is the same way I did the actual analysis, localizing each analysis to its own method and calling it one by one from the main() method.
Q2: How many neighbors are needed for normalized Lucene to beat personalized mean on RMSE?
Here "RMSE" implies "RMSE.ByUser" (this was clarified later by course staff). Also "normalized Lucene" is called LuceneNorm and "personalized mean" is called PersMean in the data. As we noted earlier, LuceneNorm was run multiple times per dataset with varying NNbr (number of neighbors), while PersMean was run only once. So if we were to plot the mean RMSE.ByUser across the datasets for each value of NNbr for LuceneNorm and a horizontal line representing the mean RMSE.ByUser across the datasets for PersMean, the intersection would give us the answer to our question.
The code is shown below. We first pull out only the rows and columns we care about for LuceneNorm and PersMean, group LuceneNorm data by NNbrs and generate the mean RMSE.ByUser, and compute the mean RMSE.ByUser for PersMean. We then set them up in a new "plotting" DataFrame so we can use Pandas's DataFrame.plot() method to draw a chart with all the default bells and whistles (much easier than doing this manually via Matplotlib).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | def q2():
df = pd.read_csv("eval-results.csv")
# select out LuceneNorm rows and only NNbrs and RMSE.ByUser columns
df_lucenenorm = df.ix[df.Algorithm == "LuceneNorm",
["NNbrs", "RMSE.ByUser"]]
# select out PersMean rows and only NNbrs and RMSE.ByUser columns
# in case of PersMean, neighborhood size does not matter so no
# experiments were performed with varying NNbrs, so our plot will
# be a horizontal line across all NNbrs values.
df_persmean = df.ix[df.Algorithm == "PersMean", ["RMSE.ByUser"]]
# we want average RMSE.ByUser values per NNbr in our plot so we
# create a new DataFrame based on the mean RMSE.ByUser value per NNbr
df_plot = pd.DataFrame(df_lucenenorm.groupby(["NNbrs"]).mean())
df_plot.rename(columns={"RMSE.ByUser" : "RMSE.ByUser.LuceneNorm"},
inplace=True)
persmean_mean = df_persmean["RMSE.ByUser"].mean()
# add this column to the df_plot
df_plot["RMSE.ByUser.PersMean"] = persmean_mean
df_plot.plot()
plt.show()
|
The graph generated by this method is shown below. As you can see the answer is approximately a neighborhood size of 43.
Q3: How many neighbors are needed for normalized Lucene to beat personalized mean on prediction nDCG?
This is similar to Q2, except that we are asked to consider a different metric. The code is also almost identical, as shown below:
1 2 3 4 5 6 7 8 9 10 | def q3():
df = pd.read_csv("eval-results.csv")
df_lucenenorm = df.ix[df.Algorithm == "LuceneNorm", ["NNbrs", "nDCG"]]
df_persmean = df.ix[df.Algorithm == "PersMean", ["nDCG"]]
df_plot = pd.DataFrame(df_lucenenorm.groupby(["NNbrs"]).mean())
df_plot.rename(columns={"nDCG" : "nDCG.LuceneNorm"}, inplace=True)
persmean_mean = df_persmean["nDCG"].mean()
df_plot["nDCG.PersMean"] = persmean_mean
df_plot.plot()
plt.show()
|
The graph for this analysis is shown below. As you can see, the answer is a neighborhood size of approximately 73 users.
Q4: When is Lucene CBF the best algorithm? Choices are: (a) Never; (b) On nDCG, when it is normalized and has many neighbors; (c) On RMSE with a medium number of neighbors.
There are 2 questions here, one for the RMSE.ByUser and another for the nDCG metric, and data for all algorithms has to be analyzed for each metric. If we plot the metric value over NNbr for all the algorithms, one plot per metric, then we can have our answer. Building up this plot is a bit complex but is basically just an extension of the approach in Q2 and Q3.
For convenience, we move the plotting code into its own function, because answers to subsequent questions involve similar functionality. We first split up the algorithms into the ones that are "NNbrs-aware" vs ones that are not. We happen to know that the NNbrs-aware ones are always run for the same NNbrs set. We loop through this set, pulling out rows and the columns "NNbrs" and the metric column (RMSE.ByUser in the first case, nDCG in the second), grouping by NNbrs and computing the mean across all datasets, then joining each set of algorithm means to a "plotting" DataFrame by NNbrs. We then loop through the non-NNbrs-aware algorithms, computing their mean across all datasets, and sticking a column containing this mean value to our plotting DataFrame. At each stage, we rename the metric column by suffixing the algorithm name to it, and finally we just call plot() on the resulting DataFrame. Here is the code.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 | def plot_all_algos(metric, ref_algos=None, report_file=None):
df = pd.read_csv("eval-results.csv")
# Before starting, split algos into neighbor aware and neighbor unaware
non_nnbr_algos = df.ix[pd.isnull(df.NNbrs)]["Algorithm"].unique()
nnbr_algos = df.ix[-pd.isnull(df.NNbrs)]["Algorithm"].unique()
df_plot = None
for algo in nnbr_algos:
if ref_algos is not None and algo not in ref_algos: continue
dfa = df.ix[df.Algorithm == algo, ["NNbrs", metric]]
if df_plot is None:
df_plot = pd.DataFrame(dfa.groupby(["NNbrs"]).mean())
else:
df_plot = df_plot.join(dfa.groupby(["NNbrs"]).mean())
df_plot.rename(columns={metric : "%s.%s" % (metric, algo)}, inplace=True)
# now add in the columns representing the non-nnbr algos
for algo in non_nnbr_algos:
if ref_algos is not None and algo not in ref_algos: continue
algo_mean = df.ix[df.Algorithm == algo, [metric]].mean()[0]
df_plot["%s.%s" % (metric, algo)] = algo_mean
if report_file is not None:
f = open(report_file, 'wb')
f.write(df_plot.transpose().to_html())
f.close()
df_plot.plot(title=metric)
plt.show()
|
For question 4, we can create two plots, one for RMSE.ByUser and one for nDCG, using the code below:
1 2 3 | def q4():
for metric in ["RMSE.ByUser", "nDCG"]:
plot_all_algos(metric)
|
The plots are shown below. Inspecting them tells us that the correct answer is (b), ie, on nDCG, when it is normalized and has many neighbors.
Q5: What algorithm produces the most diverse Top-10 lists (by our entropy metric) (a) Popular Movies; (b) Unnormalized Lucene; (c) Item/Pers. Mean; (d) Normalized Lucene; (e) User-User CF.
The metric of interest in this question is the TagEntropy@10 metric. The "most diversity" corresponds to highest values of TagEntropy@10. To answer the question, we can call the plot_all_algos() function we built earlier. The plot shows that the most diverse spot is claimed by two algorithms, whose horizontal plots are overlaid on each other. A plot is thus not enough to tell us which algorithm produces the most diverse top-10 list. So we add an extra parameter to our plot_all_algos function to optionally print out an HTML table with the values just before plotting it. For readability, we transpose the tables, so each algorithm is represented by a row in this table.
1 2 | def q5():
plot_all_algos("TagEntropy@10", report_file="/tmp/entropy.html")
|
NNbrs | 5.0 | 10.0 | 15.0 | 20.0 | 25.0 | 30.0 | 40.0 | 50.0 | 75.0 | 100.0 |
---|---|---|---|---|---|---|---|---|---|---|
TagEntropy@10.Lucene | 167.0 | 167.1 | 169.3 | 171.4 | 175.0 | 178.5 | 183.9 | 185.9 | 186.6 | 186.5 |
TagEntropy@10.LuceneNorm | 214.9 | 220.6 | 223.9 | 225.2 | 225.6 | 225.8 | 225.6 | 224.8 | 223.9 | 223.9 |
TagEntropy@10.UserUserCosine | 214.7 | 220.3 | 223.3 | 225.0 | 226.3 | 227.2 | 228.5 | 229.2 | 230.7 | 231.4 |
TagEntropy@10.UserUser | 209.9 | 217.9 | 222.0 | 224.1 | 225.7 | 227.0 | 228.2 | 229.5 | 230.9 | 231.7 |
TagEntropy@10.UserUserNorm | 207.1 | 215.1 | 219.6 | 222.0 | 223.9 | 225.2 | 226.8 | 228.2 | 230.0 | 231.1 |
TagEntropy@10.GlobalMean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
TagEntropy@10.Popular | 193.7 | 193.7 | 193.7 | 193.7 | 193.7 | 193.7 | 193.7 | 193.7 | 193.7 | 193.7 |
TagEntropy@10.ItemMean | 237.8 | 237.8 | 237.8 | 237.8 | 237.8 | 237.8 | 237.8 | 237.8 | 237.8 | 237.8 |
TagEntropy@10.PersMean | 237.8 | 237.8 | 237.8 | 237.8 | 237.8 | 237.8 | 237.8 | 237.8 | 237.8 | 237.8 |
From the table, it's quite clear that the correct answer is (c) Item/Pers. Mean.
Q6: Does increasing neighborhood size generally increase or decrease the tag entropy of top-10 lists? (a) Increase; (b) Decrease.
This can be answered from the graph for the previous question. Where the graph is not horizontal, the general trend seems to be an increase in TagEntropy@10 with increase in NNbrs.
Q8: In practice, recommenders cost money to run and it isn't worthwhile to run recommenders that take a lot of computational power and provide little benefit. Based on this experiment, what algorithm would be best to deploy for recommending items (in ranked lists) from this data set? (a) Popular; (b) Normalized Lucene; (c) User-user CF; (d) Mean Rating; (e) Unnormalized Lucene.
We want to find recommenders that are not too expensive to run, so we use TestTime as our metric. Generating the plot of TestTime across all algorithms is once again simply a matter of calling plot_all_algos(). As in Q5, however, the algorithms with low runtimes are all clumped together, so its difficult to differentiate, so as before we generate the corresponding HTML table. The calling code, chart and data are shown below:
1 2 | def q8():
plot_all_algos("TestTime", report_file="/tmp/testtime.html")
|
NNbrs | 5.0 | 10.0 | 15.0 | 20.0 | 25.0 | 30.0 | 40.0 | 50.0 | 75.0 | 100.0 |
---|---|---|---|---|---|---|---|---|---|---|
TestTime.Lucene | 5931.8 | 4896.0 | 5178.2 | 4994.0 | 5493.2 | 5626.0 | 6005.6 | 6070.2 | 6252.4 | 6358.6 |
TestTime.LuceneNorm | 6465.4 | 7224.0 | 7190.6 | 7106.4 | 7550.8 | 7624.0 | 7927.4 | 7755.4 | 7921.2 | 8172.4 |
TestTime.UserUserCosine | 197427.8 | 199182.4 | 203132.0 | 204139.2 | 203278.0 | 205560.0 | 209793.6 | 209759.2 | 212834.6 | 206995.0 |
TestTime.UserUser | 216341.6 | 219634.6 | 222249.0 | 223269.4 | 222814.6 | 224616.6 | 226187.8 | 229578.2 | 231668.0 | 233883.6 |
TestTime.UserUserNorm | 228488.8 | 230338.8 | 233950.4 | 236373.4 | 234727.8 | 235640.0 | 240503.6 | 241874.4 | 245372.0 | 239330.6 |
TestTime.GlobalMean | 329.2 | 329.2 | 329.2 | 329.2 | 329.2 | 329.2 | 329.2 | 329.2 | 329.2 | 329.2 |
TestTime.Popular | 5512.8 | 5512.8 | 5512.8 | 5512.8 | 5512.8 | 5512.8 | 5512.8 | 5512.8 | 5512.8 | 5512.8 |
TestTime.ItemMean | 7355.8 | 7355.8 | 7355.8 | 7355.8 | 7355.8 | 7355.8 | 7355.8 | 7355.8 | 7355.8 | 7355.8 |
TestTime.PersMean | 7569.4 | 7569.4 | 7569.4 | 7569.4 | 7569.4 | 7569.4 | 7569.4 | 7569.4 | 7569.4 | 7569.4 |
From the data above, it is clear that the winner on TestTime is GlobalMean. However, GlobalMean is more of a baseline, and we should choose the next best one, which is (a) Popular.
Q9: Ignoring entropy, what user-user configuration generally performs the best? (a) Normalized w/ Cosine; (b) Unnormalized; (c) Normalized
To answer this question, we should look at the 4 ratings RMSE.ByRating, RMSE.ByUser, nDCG and TopN.nDCG against UserUserCosine, UserUser and UserUserNorm algorithms. We add another parameter to our plot_all_algos to only plot for a restricted set of algorithms if such a set is passed in. After this, we can plot the four metrics using the following code:
1 2 3 4 5 6 | def q9():
ref_algos=["UserUserCosine", "UserUser", "UserUserNorm"]
plot_all_algos("RMSE.ByRating", ref_algos=ref_algos)
plot_all_algos("RMSE.ByUser", ref_algos=ref_algos)
plot_all_algos("nDCG", ref_algos=ref_algos)
plot_all_algos("TopN.nDCG", ref_algos=ref_algos)
|
From the set of 4 charts generated from this code, it is clear that UserUserCosine, aka (a) Normalized w/Cosine, is the clear winner in this respect.
Q10: One algorithm has increasing entropy for low neighborhood sizes, and then entropy starts going down. Which is it? (a) Unnormalized Lucene; (b) Lucene with Normalization; (c) User-user CF with Cosine; (d) Unnormalized User-User CF
This question can be answered from the graph in Q5. It is LuceneNorm, aka (b) Lucene with Normalization, represented by the green line with TagEntropy@10 between 220 and 230.
Q11: What algorithms beat Popular Items in recommendation list entropy? (a) Normalized Lucene; (b) Item/Pers. Mean; (c) User-user CF; (d) Unnormalized Lucene.
Once again, this can be answered from the graph and data for Q5. Popular has higher TagEntropy@10 for all the algorithms except for Lucene (aka Unnormalized Lucene), so the answer is (a), (b) and (c).
Q12: What is the Top-N nDCG of Popular?
To answer this question, we isolate the TopN.nDCG column for the algorithm Popular and return the mean. The code is shown below, the answer is 0.475202.
1 2 3 4 | def q12():
df = pd.read_csv("eval-results.csv")
print("TopN.nDCG(Popular) = %f" %
(df.ix[df.Algorithm == "Popular", ["TopN.nDCG"]].mean()[0]))
|
The code is structured as a single Python script, and can be found, along with the data, at my GitHub page for this project.
I had lots of fun applying my newly acquired knowledge of Pandas to some real-world (well real-world-ish) data, and I hope you found this interesting as well. Being able to analyze data and extract information from it is an important skill, and Pandas/R can help you do that easily and efficiently. If you are reasonably proficient with Python, then learning Pandas may be easier than learning R.