Saturday, December 14, 2013

Using Lucene Similarity in Item-Item Recommenders


Last week, I implemented four (of five) recommenders from the Programming Assignments of the Introduction to Recommender Systems course on Coursera, using Apache Mahout and Scala instead of Lenskit and Java. This week, I implement an Item-Item Collaborative Filtering Recommender that uses Lucene (more specifically, Lucene's More Like This query) as the item similarity provider.

By default, Lucene stores an inverted index that maps terms to documents, but it can also be configured to store per-document term vectors by setting the field attribute TermVector.YES. In the case of text documents, words (or terms) are the features used to compute similarity between documents. I am using the same dataset as last week, where movies (items) correspond to documents and movie tags correspond to the words. So we build a movie "document" by preprocessing the tags to form individual tokens and concatenating them into a tags field in the index.
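
To make the preprocessing concrete, here is a minimal sketch (the tag value is a made-up example; the transformation mirrors the one in the indexing code further down):

// each movie-tags.csv row is "itemID,tag"; multi-word tags are
// collapsed into single tokens so the whitespace analyzer keeps
// each tag intact as one feature
val tag = "science fiction"                        // hypothetical tag value
val token = tag.replaceAll(" ", "_").toLowerCase() // => "science_fiction"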

Three scenarios are covered. The first two are similar to the scenarios covered by last week's item-item collaborative filtering recommender, where the user is on a movie page, and we need to (a) predict the rating the user would give the specified movie and (b) find movies similar to the given movie. The third scenario is recommending movies to a given user. We describe each algorithm briefly, and how Lucene fits in.

Find Movies Similar to a given Movie: This is just content-based filtering and is implemented as a simple MLT query. Given the itemID, we look up the docID of the source movie, then get the top N movies that are most like it. We then return a List of (itemID, similarity score) tuples, excluding the original movie itself.
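
Condensed from the full listing below (see the similarItems method for the complete version with edge cases), the core of this lookup looks like this:

// docID is the Lucene doc ID of the source movie (from getFromIndex),
// and topN is the desired number of similar movies
val mlt = new MoreLikeThis(indexReader)
mlt.setMinTermFreq(0)
mlt.setMinDocFreq(0)
mlt.setFieldNames(Array[String]("tags"))
mlt.setAnalyzer(analyzer)
val tags = indexReader.document(docID).getValues("tags").mkString(" ")
val mltq = mlt.like(new StringReader(tags), null)
// topN + 1 because the source movie itself will be a hit, and is dropped
val rhits = indexSearcher.search(mltq, topN + 1).scoreDocs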

Predict a User's Rating for a Movie: This is the prediction functionality of an item-item CF recommender. If the user has already rated the movie, we just return that rating. Otherwise, the prediction is based on how the user has rated other movies similar to this one: we calculate the weighted average of the ratings of already rated items, where the weights are the similarities between the target item and each rated item. Similarity between two items is calculated via the MLT query, using a simplifying assumption - an item outside the source item's neighborhood has 0 similarity with it. If we did not use this assumption, we would have to use approaches such as the TermFreqVector API in Lucene 3.x or the Fields API in Lucene 4.x to compute individual doc-doc similarities.
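
In symbols, the predicted rating of user u for item i is:

  pred(u, i) = sum_j [ sim(i, j) * r(u, j) ] / sum_j [ sim(i, j) ]

where j ranges over the items the user has already rated, and sim(i, j) is 0 for items outside i's neighborhood.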

Recommend Movies to a User: This is the top-N recommender functionality of an item-item CF recommender. We recommend movies that are similar to ones the user has already rated, weighted by the similarity between each candidate item and the rated item. We use the algorithm outlined in Mahout in Action, § 4.4.1, shown below. The third and fourth lines of the algorithm are essentially the rating prediction task we described above. Essentially, we calculate the prediction for all items not yet rated by the user, and return them sorted in descending order of predicted rating.

    for item i that u has no preference for yet
      for every item j that u has a preference for
        compute a similarity s between i and j
        add u's preference for j, weighted by s, to a running average
    return the top items, ranked by weighted average

Code


Here is the code for the Lucene based Item-Item Collaborative Filtering Recommender. The openIndex() method builds the Lucene index from the movie-tags.csv file if the index does not already exist, creating two fields - itemID, a keyword (not analyzed) field, and tags, a whitespace-tokenized text field with the TermVector attribute set to YES so MLT can work on it. The rest of the code implements the algorithms explained above.

// Source: src/main/scala/com/mycompany/mia/recsys/LuceneIICFRecommender.scala
package com.mycompany.mia.recsys

import java.io.File
import java.io.StringReader

import scala.Array.canBuildFrom
import scala.collection.JavaConversions.asScalaIterator
import scala.collection.JavaConversions.iterableAsScalaIterable
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

import org.apache.lucene.analysis.core.WhitespaceAnalyzer
import org.apache.lucene.document.Document
import org.apache.lucene.document.Field
import org.apache.lucene.document.Field.Index
import org.apache.lucene.document.Field.Store
import org.apache.lucene.document.Field.TermVector
import org.apache.lucene.index.DirectoryReader
import org.apache.lucene.index.IndexReader
import org.apache.lucene.index.IndexWriter
import org.apache.lucene.index.IndexWriterConfig
import org.apache.lucene.index.Term
import org.apache.lucene.queries.mlt.MoreLikeThis
import org.apache.lucene.search.IndexSearcher
import org.apache.lucene.search.TermQuery
import org.apache.lucene.store.SimpleFSDirectory
import org.apache.lucene.util.Version
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel

/**
 * TopN Item Item Collaborative Filtering Recommender that
 * uses Lucene as its source for Item-Item Similarity.
 */
class LuceneIICFRecommender(
    modelfile: File, tagfile: File, indexdir: File) {

  val model = new FileDataModel(modelfile)

  val analyzer = new WhitespaceAnalyzer(Version.LUCENE_43)
  val indexReader = openIndex(tagfile, indexdir)
  val indexSearcher = new IndexSearcher(indexReader)

  /**
   * Given a user, return the topN items that the user
   * may be interested in. Do not include items the user has
   * rated already.
   * @param user the userID
   * @param topN the number of items to recommend.
   * @return a List of Pairs of itemID and similarity score.
   */
  def recommend(user: Long, topN: Int): List[(Long,Double)] = {
    model.getItemIDs()
      .map(item => (item.toLong, predict(user, item))) 
      .filter(p => p._2 > 0.0D)
      .toList
      .sortWith((a,b) => a._2 > b._2)
      .slice(0, topN)
  }

  /**
   * Predict the rating that a user would give an item.
   * If the user has already rated the item, we return
   * the actual rating.
   * @param user the user ID.
   * @param item the item ID.
   * @return the predicted rating the user would rate 
   *         the item.
   */
  def predict(user: Long, item: Long): Double = {
    val ratedItems = getRatedItems(user)
    if (ratedItems.contains(item)) 
      model.getPreferenceValue(user, item).toDouble
    else {
      val nds = ratedItems.map(j => {
        val simIJ = similarity(item, j, 20)
//        val simIJ = similarity(item, j, this.cosine(_, _))
//        val simIJ = similarity(item, j, this.tanimoto(_, _))
        val rUJ = model.getPreferenceValue(user, j)
        (simIJ * rUJ, simIJ)
      })
      val numer = nds.map(_._1).foldLeft(0.0D)(_ + _)
      val denom = nds.map(_._2).foldLeft(0.0D)(_ + _)
      numer / denom
    }
  }

  /**
   * Return a set of items that have been rated by
   * this user.
   * @param user the user ID.
   * @return Set of items rated by this user.
   */
  def getRatedItems(user: Long): Set[Long] = {
    model.getPreferencesFromUser(user)
      .map(pref => pref.getItemID())
      .toSet
  }

  /**
   * Returns the similarity between two items, limited
   * to the specified neighborhood size. If item is too
   * dissimilar (ie out of the specified item neighborhood
   * size) then the similarity returned is 0.0.
   * @param itemI the item ID for the i-th item.
   * @param itemJ the item ID for the j-th item.
   * @param nnbrs item neighborhood size
   * @return similarity between itemI and itemJ.
   */
  def similarity(itemI: Long, itemJ: Long, nnbrs: Int): Double = {
    val simItemScores = similarItems(itemI, nnbrs)
      .filter(itemScore => itemScore._2 > 0.0D)
      .toMap
    simItemScores.getOrElse(itemJ, 0.0D)
  }
  
  /**
   * Find a neighborhood of items of size nnbrs which are most
   * similar to the item specified. Uses Lucene MoreLikeThis
   * query to calculate the similarity.
   * @param item the source item.
   * @param nnbrs the neighborhood size.
   * @return a List of (item ID, similarity) tuples representing
   *         the item neighborhood.
   */
  def similarItems(item: Long, nnbrs: Int): List[(Long,Double)] = {
    val docID = getFromIndex(item)
    if (docID < 0) List()
    else {
      val mlt = new MoreLikeThis(indexReader)
      mlt.setMinTermFreq(0)
      mlt.setMinDocFreq(0)
      mlt.setFieldNames(Array[String]("tags"))
      mlt.setAnalyzer(analyzer)
      val doc = indexReader.document(docID)
      val tags = doc.getValues("tags").mkString(" ")
      val mltq = mlt.like(new StringReader(tags), null)
      val rhits = indexSearcher.search(mltq, nnbrs + 1).scoreDocs
      rhits.map(rhit => {
        val rdoc = indexReader.document(rhit.doc)
        (rdoc.get("itemID").toLong, rhit.score.toDouble)
      })
      .toList
      .filter(docsim => docsim._1 != item)
    }
  }

  /**
   * Calculate similarity between two items specified
   * by item ID using the specified similarity function.
   * @param itemI the item ID for the first item.
   * @param itemJ the item ID for the second item.
   * @param simfunc the similarity function to use.
   * @return the similarity between itemI and itemJ.
   */
  def similarity(itemI: Long, itemJ: Long, 
      simfunc: (Map[String,Long], Map[String,Long]) => Double): 
      Double = {
    simfunc.apply(termVector(itemI), termVector(itemJ))
  }
  
  /**
   * Extract the term vector for an item as a sparse
   * map of tags to raw tag frequencies.
   * @param item the item ID
   * @return the term vector for the item.
   */
  def termVector(item: Long): Map[String,Long] = {
    val docID = getFromIndex(item)
    val terms = indexReader.getTermVector(docID, "tags")
    val termsEnum = terms.iterator(null)
    Stream.continually(termsEnum.next())
      .takeWhile(term => term != null)
      .map(term => (term.utf8ToString(), termsEnum.totalTermFreq()))
      .toMap
  }

  
  /**
   * Implementation of cosine similarity using Maps.
   * @param vecA Map representation of sparse vector
   *             for itemA
   * @param vecB Map representation of sparse vector
   *             for itemB.
   * @return the cosine similarity between vecA and
   *             vecB (normalized by Euclidean norm).
   */
  def cosine(vecA: Map[String,Long], 
      vecB: Map[String,Long]): Double = {
    val dotProduct = vecA.keySet.intersect(vecB.keySet)
      .map(key => vecA(key) * vecB(key))
      .foldLeft(0.0D)(_ + _)
    val normA = scala.math.sqrt(vecA.values
      .map(v => scala.math.pow(v, 2.0D))
      .foldLeft(0.0D)(_ + _))
    val normB = scala.math.sqrt(vecB.values
      .map(v => scala.math.pow(v, 2.0D))
      .foldLeft(0.0D)(_ + _))
    dotProduct / (normA * normB)
  }
  
  /**
   * Implementation of Tanimoto coefficient using Maps.
   * @param vecA Map representation of sparse vector
   *             for itemA
   * @param vecB Map representation of sparse vector
   *             for itemB.
   * @return the Tanimoto coefficient between vecA and
   *             vecB.
   */
  def tanimoto(vecA: Map[String,Long], 
      vecB: Map[String,Long]): Double = {
    val num = vecA.keySet.intersect(vecB.keySet).size.toDouble
    val den = vecA.keySet.union(vecB.keySet).size.toDouble
    num / den
  }

  /**
   * Convenience method to get a docID from the Lucene
   * index by item ID.
   * @param item the itemID for the item.
   * @return the corresponding docID from Lucene.
   */
  def getFromIndex(item: Long): Int = {
    val hits = indexSearcher.search(
      new TermQuery(new Term("itemID", item.toString)), 1)
    if (hits.totalHits == 0) -1 else hits.scoreDocs(0).doc 
  }

  /**
   * Create a Lucene index from the movie tags file if it does
   * not exist already, then return a handle to the IndexReader.
   * @param tagfile the File representing the movie-tags.csv
   * @param indexdir the Lucene index directory.
   * @return the reference to the IndexReader.
   */
  def openIndex(tagfile: File, indexdir: File): IndexReader = {
    if (! indexdir.exists()) {
      // build index from data
      indexdir.mkdirs()
      val iwconf = new IndexWriterConfig(Version.LUCENE_43, 
        analyzer)
      iwconf.setOpenMode(IndexWriterConfig.OpenMode.CREATE)
      val indexWriter = new IndexWriter(
        new SimpleFSDirectory(indexdir), iwconf)
      var prevItemID = -1L
      var tagBuf = ArrayBuffer[String]()
      Source.fromFile(tagfile)
        .getLines()
        .foreach(line => {
           val Array(itemID, tag) = line.split(",")
           if (itemID.toInt == prevItemID || prevItemID < 0L) {
             tagBuf += tag.replaceAll(" ", "_").toLowerCase()
           } else {
             val doc = new Document()
             doc.add(new Field("itemID", prevItemID.toString, 
               Store.YES, Index.NOT_ANALYZED))
             doc.add(new Field("tags", tagBuf.mkString(" "), 
               Store.YES, Index.ANALYZED, TermVector.YES))
             indexWriter.addDocument(doc)
             tagBuf.clear
             tagBuf += tag.replaceAll(" ", "_").toLowerCase()
           }
           prevItemID = itemID.toInt
        })
      val doc = new Document()
      doc.add(new Field("itemID", prevItemID.toString, 
        Store.YES, Index.NOT_ANALYZED))
      doc.add(new Field("tags", tagBuf.mkString(" "), 
        Store.YES, Index.ANALYZED, TermVector.YES))
      indexWriter.addDocument(doc)
      indexWriter.commit()
      indexWriter.close()
    }
    DirectoryReader.open(new SimpleFSDirectory(indexdir))
  }
}

Unit Test


I don't have reference results to validate my implementation against, so I just ran some cases to make sure the results looked believable. The JUnit test below does some very basic checks and dumps out results for human validation.

// Source: src/test/scala/com/mycompany/mia/recsys/LuceneIICFRecommenderTest.scala
package com.mycompany.mia.recsys

import java.io.File

import scala.io.Source

import org.junit.Assert
import org.junit.Test

class LuceneIICFRecommenderTest {

  val liicf = new LuceneIICFRecommender(
    new File("data/recsys/ratings.csv"), 
    new File("data/recsys/movie-tags.csv"),
    new File("data/recsys/itemindex"))
  val titles = new File("data/recsys/movie-titles.csv")
  val movieNames = Source.fromFile(titles)
    .getLines()
    .map(line => {
      val Array(movieID, title) = line.split(",")
      (movieID.toLong, title)
    }).toMap
    
  @Test def testRecommendItemsGivenUser(): Unit = {
    val scores = liicf.recommend(15L, 20)
    Assert.assertEquals(scores.size, 20)
    Console.println("Recommendations for user(15)")
    scores.foreach(score => {
      Console.println("%5.3f %5d %s".format(
        score._2, score._1, movieNames(score._1)))
    })
  }
  
  @Test def testRecommendItemsGivenItem(): Unit = {
    val recommendedItems = liicf.similarItems(77L, 10)
    Assert.assertEquals(recommendedItems.size, 10)
    Console.println("recommendations for movie(%d): %s"
      .format(77L, movieNames(77L)))
    recommendedItems.foreach(docsim => {
      Console.println("%7.4f %5d %s"
        .format(docsim._2, docsim._1, movieNames(docsim._1)))
    })
  }

  @Test def testPredictRatingForItem(): Unit = {
    val predictedRating = liicf.predict(2048L, 393L)
    Console.println("prediction(2048,393) = " + predictedRating)
    Assert.assertEquals(predictedRating, 3.69D, 0.01D)
    val predictRatingForRatedItem = liicf.predict(2048L, 77L)
    Console.println("prediction(2048,77) = " + predictRatingForRatedItem)
    Assert.assertEquals(predictRatingForRatedItem, 4.5D, 0.1D)
  }
}

Results


And finally, the results of the test:

Recommendations for user(15)
5.000    98 Gladiator (2000)
5.000   120 The Lord of the Rings: The Fellowship of the Ring (2001)
5.000   122 The Lord of the Rings: The Return of the King (2003)
5.000   180 Minority Report (2002)
5.000   280 Terminator 2: Judgment Day (1991)
5.000   603 The Matrix (1999)
5.000   604 The Matrix Reloaded (2003)
5.000   640 Catch Me If You Can (2002)
5.000   808 Shrek (2001)
5.000  1891 Star Wars: Episode V - The Empire Strikes Back (1980)
5.000  2502 The Bourne Supremacy (2004)
4.674   146 Crouching Tiger Hidden Dragon (Wo hu cang long) (2000)
4.634    11 Star Wars: Episode IV - A New Hope (1977)
4.634  1637 Speed (1994)
4.613  5503 The Fugitive (1993)
4.570  1894 Star Wars: Episode II - Attack of the Clones (2002)
4.549    24 Kill Bill: Vol. 1 (2003)
4.510   955 Mission: Impossible II (2000)
4.500    13 Forrest Gump (1994)
4.500    85 Raiders of the Lost Ark (1981)

recommendations for movie(77): Memento (2000)
 0.3967   141 Donnie Darko (2001)
 0.3199    38 Eternal Sunshine of the Spotless Mind (2004)
 0.2358   629 The Usual Suspects (1995)
 0.2260   807 Seven (a.k.a. Se7en) (1995)
 0.1976   550 Fight Club (1999)
 0.1609   745 The Sixth Sense (1999)
 0.1079    63 Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
 0.0841   680 Pulp Fiction (1994)
 0.0609   393 Kill Bill: Vol. 2 (2004)
 0.0585   274 The Silence of the Lambs (1991)

prediction(2048,393) = 3.691436473173808
prediction(2048,77) = 4.5

I considered using Lucene's Fields API to compute document vectors and Apache Mahout vectors to compute the cosine similarity, instead of the neighborhood approximation that let me use MLT, but then it wouldn't have been any different from the DataModel approach, so I stuck with MLT. Variants of this approach can be used to build user-user CF recommenders as well, as sketched below.
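
A minimal sketch of what the user-user variant's indexing could look like, under the assumption that each user becomes a "document" whose tokens are the IDs of the items they rated (userID and ratedItemIDs are hypothetical names):

// index a user as a bag of rated item IDs; MLT queries against the
// items field would then surface similar users instead of similar items
val doc = new Document()
doc.add(new Field("userID", userID.toString,
  Store.YES, Index.NOT_ANALYZED))
doc.add(new Field("items", ratedItemIDs.mkString(" "),
  Store.YES, Index.ANALYZED, TermVector.YES))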

The code for this post is available in my mia-scala-examples repository on my GitHub page. The Source comment at the start of each code block indicates the file location.

Update 2013-12-18: Based on Ravi's comment below about using the Tanimoto Coefficient, I started to rethink my approach of sticking to MLT for Lucene (convenient, but suddenly too restrictive :-)). If we can model our item as a bag of features (bag of words for documents, bag of tags for movies in our example, etc.), then Lucene's Fields API provides an alternative way of composing term vectors for each document. I have created a termVector method that does this, a new overloaded similarity method that takes two item IDs and a similarity function reference and calculates a similarity, and the Cosine Similarity and Tanimoto Coefficient functions. Using this, I was able to get results for these two similarity metrics (by switching the computation of simIJ in the predict method of the main code); the results are shown below. In this case, the ordering seems to be fairly consistent across the various similarity computations.
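
For example, with liicf instantiated as in the unit test, the alternative similarities can also be invoked directly:

// direct doc-doc similarity via stored term vectors, bypassing the
// MLT neighborhood entirely; item IDs are the ones used in the tests
val simCos = liicf.similarity(77L, 393L, liicf.cosine(_, _))
val simTan = liicf.similarity(77L, 393L, liicf.tanimoto(_, _))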

==== Cosine Similarity (without neighborhood cliff) ====

Recommendations for user(15)
5.000    98 Gladiator (2000)
5.000   120 The Lord of the Rings: The Fellowship of the Ring (2001)
5.000   122 The Lord of the Rings: The Return of the King (2003)
5.000   180 Minority Report (2002)
5.000   280 Terminator 2: Judgment Day (1991)
5.000   603 The Matrix (1999)
5.000   604 The Matrix Reloaded (2003)
5.000   640 Catch Me If You Can (2002)
5.000   808 Shrek (2001)
5.000  1891 Star Wars: Episode V - The Empire Strikes Back (1980)
5.000  2502 The Bourne Supremacy (2004)
4.500    13 Forrest Gump (1994)
4.500    85 Raiders of the Lost Ark (1981)
4.500  1892 Star Wars: Episode VI - Return of the Jedi (1983)
4.500  2164 Stargate (1994)
4.500  2501 The Bourne Identity (2002)
4.500 36955 True Lies (1994)
4.417   954 Mission: Impossible (1996)
4.394  1637 Speed (1994)
4.389   955 Mission: Impossible II (2000)

recommendations for movie(77): Memento (2000)
 0.3967   141 Donnie Darko (2001)
 0.3199    38 Eternal Sunshine of the Spotless Mind (2004)
 0.2358   629 The Usual Suspects (1995)
 0.2260   807 Seven (a.k.a. Se7en) (1995)
 0.1976   550 Fight Club (1999)
 0.1609   745 The Sixth Sense (1999)
 0.1079    63 Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
 0.0841   680 Pulp Fiction (1994)
 0.0609   393 Kill Bill: Vol. 2 (2004)
 0.0585   274 The Silence of the Lambs (1991)

prediction(2048,393) = 4.11945476427452
prediction(2048,77) = 4.5

==== Tanimoto Coefficient Similarity ====

Recommendations for user(15)
5.000    98 Gladiator (2000)
5.000   120 The Lord of the Rings: The Fellowship of the Ring (2001)
5.000   122 The Lord of the Rings: The Return of the King (2003)
5.000   180 Minority Report (2002)
5.000   280 Terminator 2: Judgment Day (1991)
5.000   603 The Matrix (1999)
5.000   604 The Matrix Reloaded (2003)
5.000   640 Catch Me If You Can (2002)
5.000   808 Shrek (2001)
5.000  1891 Star Wars: Episode V - The Empire Strikes Back (1980)
5.000  2502 The Bourne Supremacy (2004)
4.500    13 Forrest Gump (1994)
4.500    85 Raiders of the Lost Ark (1981)
4.500  1892 Star Wars: Episode VI - Return of the Jedi (1983)
4.500  2164 Stargate (1994)
4.500  2501 The Bourne Identity (2002)
4.500 36955 True Lies (1994)
4.288  1894 Star Wars: Episode II - Attack of the Clones (2002)
4.258   955 Mission: Impossible II (2000)
4.238  8358 Cast Away (2000)

recommendations for movie(77): Memento (2000)
 0.3967   141 Donnie Darko (2001)
 0.3199    38 Eternal Sunshine of the Spotless Mind (2004)
 0.2358   629 The Usual Suspects (1995)
 0.2260   807 Seven (a.k.a. Se7en) (1995)
 0.1976   550 Fight Club (1999)
 0.1609   745 The Sixth Sense (1999)
 0.1079    63 Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
 0.0841   680 Pulp Fiction (1994)
 0.0609   393 Kill Bill: Vol. 2 (2004)
 0.0585   274 The Silence of the Lambs (1991)

prediction(2048,393) = 4.175819889778504
prediction(2048,77) = 4.5

19 comments (moderated to prevent spam):

Anonymous said...

Nice example Sujit. Have you tried the Tanimoto coefficient for similarity? For our item-item recommenders (aka collaborative filtering) it produced better results than cosine.

Thanks,
Ravi Kiran Bhaskar

Sujit Pal said...

Thanks Ravi. I haven't tried it here, but I did notice this also in another application. I think using binarized feature vectors results in greater spread and makes for better results. Can't do that here with MLT though...need to extract vectors and do that manually.

Anonymous said...

Hmm... seeing your results almost being the same for cosine and tanimoto is confusing, and raises the question of whether Tanimoto is a bit data hungry?? :-)

Anonymous said...

I am confused how you explain both cosine and tanimoto giving the same results... is cosine an edge case of tanimoto, or vice versa??

Ravi Kiran Bhaskar

Sujit Pal said...

I thought of it like this - since they are both similarity measures, they are bringing up similar movies, but with some difference because they are not identical. To Anonymous's point, it would probably make sense to try this with another dataset to see if the results are this close - intuitively I would have expected more difference also, but it could just be quirky data. To Ravi's point, I guess you could consider Tanimoto's numerator a dot product of binarized ratings, so in that sense it is similar to Cosine.

Anonymous said...

You have put together a great framework for benchmarking. I am interested in expanding this by adding SVD, but first I want to tweak your item-item and user-user code to run on Spark. Any thoughts on that? Could you provide a few pointers / examples on how to convert this to run on Spark while using Mahout? Thanks.

Sujit Pal said...

Thanks. Did you perhaps mean to comment on my previous post? As you probably know, in case you just want to use it to solve a problem, Spark MLLib already provides an ALS implementation to discover a latent "taste" space, similar to SVD, and Mahout provides user-user and item-item algorithms that run on Spark. I haven't thought too deeply about the implementation details, but I figure that user-user treats users as vectors of items and item-item treats items as vectors of users, and the idea is to compute similarities between these vectors, which are essentially dot products, so I would just start from there.

Anonymous said...

Yes, I know about ALS and came across Mahout's built-in functions. For publication purposes, I thought the Spark framework lacks recommendation algorithms built in a way people can understand (the way you have built them using Mahout: https://github.com/sujitpal/mia-scala-examples/tree/master/src/main/scala/com/mycompany/mia/recsys). I thought I would start moving this code to Spark, starting with item-item, and build Funk-SVD etc. so researchers can benefit.

Sujit Pal said...

Sure, it makes sense to do that now that I understand your purpose. I thought maybe you just wanted to use it, in which case it would be simpler to just use what's available.

Anonymous said...

I was wondering if you have any pointers on converting your item-item code to run on Spark while still using Mahout 0.9 or later. For example, if I only have libraryDependencies += "org.apache.mahout" % "mahout-spark_2.10" % "0.10.0" in my build.sbt, I don't have access to "import org.apache.mahout.cf.taste.impl.model.file.FileDataModel". Let me know if you have any thoughts, and sorry for the trouble :)

Sujit Pal said...

No trouble at all :-). I haven't used Mahout since 0.9, after which there has been lots of activity to build new algorithms to run on Spark, so my thoughts may not be relevant or correct. One possible package you may be able to use is Mahout Math. I haven't used it on Spark myself though; nowadays I use either the linear algebra stuff that MLLib provides or breeze. For FileDataModel, IIRC all it provides are methods that exploit the "standard" triple/quad format that the collaborative filtering code requires - if you are building a Spark version, it may be better to just use a Scala case class instead of pulling this across from the JAR (not sure if it's Serializable; you will need it to be to use it on the Spark workers).

Anonymous said...

Yeah, I think that's a good idea. Also, how do you actually run your JUnit tests in Mahout? Let's say I have created a package using sbt - how do I run this? I want to run it so I can benchmark the error rate against the Spark implementation afterwards. Thanks.

Sujit Pal said...

I just ran the job locally (non-HDFS) on Hadoop with a very small slice of the data and verified it worked correctly. I believe there are JUnit extensions to handle both Hadoop and Spark jobs, but I never got around to using them.

Anonymous said...

I see in the blog post you have used sbt 'run-main.....', but when I looked at your RecommenderEvaluator I don't see you calling ItemItemCollaborativeFilteringRecommender.scala. Also, do I need to do anything extra to make it run on Hadoop other than "sbt run-main com.mycompany.mia.cf.RecommenderEvaluator.scala"? Not sure how Scala works on Hadoop. Thanks in advance.

Sujit Pal said...

For Hadoop you need to do "sbt assembly", which will create a fat jar with all dependencies including the Scala compiler and runtime. Then you just use the fat jar the same as a Java jar on Hadoop. I am afraid I might have misled you two comments up, sorry about that - I just realized that the code in the blog post here uses the Mahout math libraries only in local mode; it wasn't built to run on Hadoop (or take advantage of the distributed computation you would expect from code running on Hadoop or Spark). I have some examples of running Scala code on Amazon EMR with Spark that show how to run Scala code on Spark (I don't have examples of running Scala with Hadoop directly; before Spark I mostly used Scalding with Scala).

Anonymous said...

I was trying to execute ItemItemCollaborativeFilteringRecommenderTest using java -cp ... but I don't think that's the way to do it. Could you advise on how to run it?

Sujit Pal said...

For projects with sbt, I generally do something like this. This pulls in all the JAR files listed in the dependencies, so there is no need to explicitly specify them via -cp.

unix_prompt$ sbt
sbt> test-only full.class.name.of.JUnitTest

Anonymous said...

I see, but can it be executed as a jar file on a single machine?

Sujit Pal said...

Yes, definitely. You would have to create an Object that extends App as the driver (analogous to the main method in a Java class), then call it using java -cp.