Last week, I implemented 4 (of 5) recommenders from the Programming Assignments of the Introduction to Recommender Systems course on Coursera, but using Apache Mahout and Scala instead of Lenskit and Java. This week, I implement an Item-Item Collaborative Filtering Recommender that uses Lucene (more specifically, Lucene's MoreLikeThis query) as the item similarity provider.
By default, Lucene stores an inverted index that maps terms to documents, but it can also be configured to store per-document term vectors by setting the field attribute TermVector.YES. In the case of text documents, words (or terms) are the features used to compute similarity between documents. I am using the same dataset as last week, where movies (items) correspond to documents and movie tags correspond to the words. So we build a movie "document" by preprocessing the tags into individual tokens and concatenating them into a tags field in the index.
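To make this concrete, here is a minimal sketch of the document construction, condensed from the openIndex() method in the full listing below (the helper name buildMovieDoc, and the assumption that the itemID and its raw tag strings are already in hand, are mine for illustration):

import org.apache.lucene.document.Document
import org.apache.lucene.document.Field
import org.apache.lucene.document.Field.{Index, Store, TermVector}

// Hypothetical helper: turn one movie (its itemID plus raw tag strings) into a Lucene document.
def buildMovieDoc(itemID: Long, tags: Seq[String]): Document = {
  val doc = new Document()
  // itemID is a keyword (not analyzed) field, used later to look up the docID for an item
  doc.add(new Field("itemID", itemID.toString, Store.YES, Index.NOT_ANALYZED))
  // multi-word tags become single tokens (spaces -> underscores, lowercased) and are
  // concatenated into one whitespace-delimited field; TermVector.YES makes the
  // per-document term vector available to MoreLikeThis and the Fields API
  doc.add(new Field("tags",
    tags.map(_.replaceAll(" ", "_").toLowerCase()).mkString(" "),
    Store.YES, Index.ANALYZED, TermVector.YES))
  doc
}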
Three scenarios are covered. The first two are similar to the scenarios covered with the item-item collaborative filtering recommender from last week, where the user is on a movie page, and we need to (a) predict the rating a user would give a specified movie and (b) find movies similar to a given movie. The third scenario is recommending movies to a given user. We describe each algorithm briefly, and how Lucene fits in.
Find Movies Similar to given Movie: This is just content-based filtering and is implemented as a simple MLT query. Given the itemID, we look up the docID of the source movie, then get the top N movies that are most like it. We then return a List of (itemID, similarity score) tuples for the similar movies, excluding the source movie itself.
Predict a User's Rating for a Movie: This is the prediction functionality of an item-item CF recommender. The prediction is based on how the user has rated other movies similar to this one. If the user has already rated the movie, we just return that rating. Otherwise, we calculate the weighted average of the ratings of the items the user has already rated, where the weights are the similarities between the target item and each rated item. Similarity between two items is calculated using the MLT query with a simplifying assumption - a target item outside the item neighborhood has 0 similarity with the source item. If we did not use this assumption, we would have to use approaches such as the TermFreqVector API in Lucene 3.x or the Fields API in Lucene 4.x to compute individual doc-doc similarities.
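Written out as a formula (this is just a restatement of the description above, with s(i, j) the MLT-based similarity, taken to be 0 outside the item neighborhood, and r(u, j) the user's rating for item j):

$$p(u, i) = \frac{\sum_{j \in \mathrm{rated}(u)} s(i, j)\, r(u, j)}{\sum_{j \in \mathrm{rated}(u)} s(i, j)}$$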
Recommend Movies to a User: This is the topN recommender functionality of an item-item CF recommender. We recommend movies that are similar to ones the user has already rated, weighted by the similarity between each candidate item and the rated item. We use the algorithm outlined in Mahout in Action, § 4.4.1, detailed below. The 3rd and 4th lines of the algorithm are essentially the rating prediction task we described above. Essentially, we calculate the prediction for all items not rated so far by the user, and return them sorted in descending order of predicted rating.
for every item i that u has no preference for yet
  for every item j that u has a preference for
    compute a similarity s between i and j
    add u's preference for j, weighted by s, to a running average
return the top items, ranked by weighted average
Code
Here is the code for the Lucene based Item-Item Collaborative Filtering Recommender. The openIndex() method builds the Lucene index off the movie-tags.csv file if the index does not already exist, creating two fields - itemID, which is a keyword (not analyzed) field, and tags, which is a whitespace tokenized text field with the TermVector attribute set to YES so MLT can work on it. The rest of the code just implements the algorithms explained above.
// Source: src/main/scala/com/mycompany/mia/recsys/LuceneIICFRecommender.scala
package com.mycompany.mia.recsys
import java.io.File
import java.io.StringReader
import scala.Array.canBuildFrom
import scala.collection.JavaConversions.asScalaIterator
import scala.collection.JavaConversions.iterableAsScalaIterable
import scala.collection.mutable.ArrayBuffer
import scala.io.Source
import org.apache.lucene.analysis.core.WhitespaceAnalyzer
import org.apache.lucene.document.Document
import org.apache.lucene.document.Field
import org.apache.lucene.document.Field.Index
import org.apache.lucene.document.Field.Store
import org.apache.lucene.document.Field.TermVector
import org.apache.lucene.index.DirectoryReader
import org.apache.lucene.index.IndexReader
import org.apache.lucene.index.IndexWriter
import org.apache.lucene.index.IndexWriterConfig
import org.apache.lucene.index.Term
import org.apache.lucene.queries.mlt.MoreLikeThis
import org.apache.lucene.search.IndexSearcher
import org.apache.lucene.search.TermQuery
import org.apache.lucene.store.SimpleFSDirectory
import org.apache.lucene.util.Version
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel
/**
* TopN Item Item Collaborative Filtering Recommender that
* uses Lucene as its source for Item-Item Similarity.
*/
class LuceneIICFRecommender(
modelfile: File, tagfile: File, indexdir: File) {
val model = new FileDataModel(modelfile)
val analyzer = new WhitespaceAnalyzer(Version.LUCENE_43)
val indexReader = openIndex(tagfile, indexdir)
val indexSearcher = new IndexSearcher(indexReader)
/**
* Given a user, return the topN items that the user
* may be interested in. Do not include items user has
* rated already.
* @param user the userID
* @param topN the number of items to recommend.
* @return a List of Pairs of itemID and similarity score.
*/
def recommend(user: Long, topN: Int): List[(Long,Double)] = {
model.getItemIDs()
.map(item => (item.toLong, predict(user, item)))
.filter(p => p._2 > 0.0D)
.toList
.sortWith((a,b) => a._2 > b._2)
.slice(0, topN)
}
/**
* Predict the rating that a user would give an item.
* If the user has already rated the item, we return
* the actual rating.
* @param user the user ID.
* @param item the item ID.
* @return the predicted rating the user would rate
* the item.
*/
def predict(user: Long, item: Long): Double = {
val ratedItems = getRatedItems(user)
if (ratedItems.contains(item))
model.getPreferenceValue(user, item).toDouble
else {
val nds = ratedItems.map(j => {
val simIJ = similarity(item, j, 20)
// val simIJ = similarity(item, j, this.cosine(_, _))
// val simIJ = similarity(item, j, this.tanimoto(_, _))
val rUJ = model.getPreferenceValue(user, j)
(simIJ * rUJ, simIJ)
})
val numer = nds.map(_._1).foldLeft(0.0D)(_ + _)
val denom = nds.map(_._2).foldLeft(0.0D)(_ + _)
numer / denom
}
}
/**
* Return a set of items that have been rated by
* this user.
* @param user the user ID.
* @return Set of items already rated by this user.
*/
def getRatedItems(user: Long): Set[Long] = {
model.getPreferencesFromUser(user)
.map(pref => pref.getItemID())
.toSet
}
/**
* Returns the similarity between two items, limited
* to the specified neighborhood size. If item is too
* dissimilar (ie out of the specified item neighborhood
* size) then the similarity returned is 0.0.
* @param itemI the item ID for the i-th item.
* @param itemJ the item ID for the j-th item.
* @param nnbrs item neighborhood size
* @return similarity between itemI and itemJ.
*/
def similarity(itemI: Long, itemJ: Long, nnbrs: Int): Double = {
val simItemScores = similarItems(itemI, nnbrs)
.filter(itemScore => itemScore._2 > 0.0D)
.toMap
simItemScores.getOrElse(itemJ, 0.0D)
}
/**
* Find a neighborhood of items of size nnbrs which are most
* similar to the item specified. Uses Lucene MoreLikeThis
* query to calculate the similarity.
* @param item the source item.
* @param nnbrs the neighborhood size.
* @return a List of (item ID, similarity) tuples representing
* the item neighborhood.
*/
def similarItems(item: Long, nnbrs: Int): List[(Long,Double)] = {
val docID = getFromIndex(item)
if (docID < 0) List()
else {
val mlt = new MoreLikeThis(indexReader)
mlt.setMinTermFreq(0)
mlt.setMinDocFreq(0)
mlt.setFieldNames(Array[String]("tags"))
mlt.setAnalyzer(analyzer)
val doc = indexReader.document(docID)
val tags = doc.getValues("tags").mkString(" ")
val mltq = mlt.like(new StringReader(tags), null)
val rhits = indexSearcher.search(mltq, nnbrs + 1).scoreDocs
rhits.map(rhit => {
val rdoc = indexReader.document(rhit.doc)
(rdoc.get("itemID").toLong, rhit.score.toDouble)
})
.toList
.filter(docsim => docsim._1 != item)
}
}
/**
* Calculate similarity between two items specified
* by item ID using the specified similarity function.
* @param itemI the item ID for the first item.
* @param itemJ the item ID for the second item.
* @param simfunc the similarity function to use.
* @return the similarity between itemI and itemJ.
*/
def similarity(itemI: Long, itemJ: Long,
simfunc: (Map[String,Long], Map[String,Long]) => Double):
Double = {
simfunc.apply(termVector(itemI), termVector(itemJ))
}
/**
* Extract the term vector for an item as a sparse
* map of tags to raw tag frequencies.
* @param item the item ID
* @return the term vector for the item.
*/
def termVector(item: Long): Map[String,Long] = {
val docID = getFromIndex(item)
val terms = indexReader.getTermVector(docID, "tags")
val termsEnum = terms.iterator(null)
Stream.continually(termsEnum.next())
.takeWhile(term => term != null)
.map(term => (term.utf8ToString(), termsEnum.totalTermFreq()))
.toMap
}
/**
* Implementation of cosine similarity using Maps.
* @param vecA Map representation of sparse vector
* for itemA
* @param vecB Map representation of sparse vector
* for itemB.
* @return the cosine similarity between vecA and
* vecB (normalized by Euclidean norm).
*/
def cosine(vecA: Map[String,Long],
vecB: Map[String,Long]): Double = {
val dotProduct = vecA.keySet.intersect(vecB.keySet)
.map(key => vecA(key) * vecB(key))
.foldLeft(0.0D)(_ + _)
val normA = scala.math.sqrt(vecA.values
.map(v => scala.math.pow(v, 2.0D))
.foldLeft(0.0D)(_ + _))
val normB = scala.math.sqrt(vecB.values
.map(v => scala.math.pow(v, 2.0D))
.foldLeft(0.0D)(_ + _))
dotProduct / (normA * normB)
}
/**
* Implementation of Tanimoto coefficient using Maps.
* @param vecA Map representation of sparse vector
* for itemA
* @param vecB Map representation of sparse vector
* for itemB.
* @return the Tanimoto coefficient between vecA and
* vecB.
*/
def tanimoto(vecA: Map[String,Long],
vecB: Map[String,Long]): Double = {
val num = vecA.keySet.intersect(vecB.keySet).size.toDouble
val den = vecA.keySet.union(vecB.keySet).size.toDouble
num / den
}
/**
* Convenience method to get a docID from the Lucene
* index by item ID.
* @param item the itemID for the item.
* @return the corresponding docID from Lucene.
*/
def getFromIndex(item: Long): Int = {
val hits = indexSearcher.search(
new TermQuery(new Term("itemID", item.toString)), 1)
if (hits.totalHits == 0) -1 else hits.scoreDocs(0).doc
}
/**
* Create a Lucene index from the movie tags file if it does
* not exist already, then return a handle to the IndexReader.
* @param tagfile the File representing the movie-tags.csv
* @param indexdir the Lucene index directory.
* @return the reference to the IndexReader.
*/
def openIndex(tagfile: File, indexdir: File): IndexReader = {
if (! indexdir.exists()) {
// build index from data
indexdir.mkdirs();
val iwconf = new IndexWriterConfig(Version.LUCENE_43,
analyzer)
iwconf.setOpenMode(IndexWriterConfig.OpenMode.CREATE)
val indexWriter = new IndexWriter(
new SimpleFSDirectory(indexdir), iwconf)
var prevItemID = -1L
var tagBuf = ArrayBuffer[String]()
Source.fromFile(tagfile)
.getLines()
.foreach(line => {
val Array(itemID, tag) = line.split(",")
if (itemID.toInt == prevItemID || prevItemID < 0L) {
tagBuf += tag.replaceAll(" ", "_").toLowerCase()
} else {
val doc = new Document()
doc.add(new Field("itemID", prevItemID.toString,
Store.YES, Index.NOT_ANALYZED))
doc.add(new Field("tags", tagBuf.mkString(" "),
Store.YES, Index.ANALYZED, TermVector.YES))
indexWriter.addDocument(doc)
tagBuf.clear
tagBuf += tag.replaceAll(" ", "_").toLowerCase()
}
prevItemID = itemID.toInt
})
val doc = new Document()
doc.add(new Field("itemID", prevItemID.toString,
Store.YES, Index.NOT_ANALYZED))
doc.add(new Field("tags", tagBuf.mkString(" "),
Store.YES, Index.ANALYZED, TermVector.YES))
indexWriter.addDocument(doc)
indexWriter.commit()
indexWriter.close()
}
DirectoryReader.open(new SimpleFSDirectory(indexdir))
}
}
Unit Test
I don't have expected results against which to test my implementation, so I just ran some cases to make sure the results looked believable. The JUnit test below does some very basic checks and dumps out results for human validation.
// Source: src/test/scala/com/mycompany/mia/recsys/LuceneIICFRecommenderTest.scala
package com.mycompany.mia.recsys
import java.io.File
import scala.io.Source
import org.junit.Assert
import org.junit.Test
class LuceneIICFRecommenderTest {
val liicf = new LuceneIICFRecommender(
new File("data/recsys/ratings.csv"),
new File("data/recsys/movie-tags.csv"),
new File("data/recsys/itemindex"))
val titles = new File("data/recsys/movie-titles.csv")
val movieNames = Source.fromFile(titles)
.getLines()
.map(line => {
val Array(movieID, title) = line.split(",")
(movieID.toLong, title)
}).toMap
@Test def testRecommendItemsGivenUser(): Unit = {
val scores = liicf.recommend(15L, 20)
Assert.assertEquals(scores.size, 20)
Console.println("Recommendations for user(15)")
scores.foreach(score => {
Console.println("%5.3f %5d %s".format(
score._2, score._1, movieNames(score._1)))
})
}
@Test def testRecommendItemsGivenItem(): Unit = {
val recommendedItems = liicf.similarItems(77L, 10)
Assert.assertEquals(recommendedItems.size, 10)
Console.println("recommendations for movie(%d): %s"
.format(77L, movieNames(77L)))
recommendedItems.foreach(docsim => {
Console.println("%7.4f %5d %s"
.format(docsim._2, docsim._1, movieNames(docsim._1)))
})
}
@Test def testPredictRatingForItem(): Unit = {
val predictedRating = liicf.predict(2048L, 393L)
Console.println("prediction(2048,393) = " + predictedRating)
Assert.assertEquals(predictedRating, 3.69D, 0.01D)
val predictRatingForRatedItem = liicf.predict(2048L, 77L)
Console.println("prediction(2048,77) = " + predictRatingForRatedItem)
Assert.assertEquals(predictRatingForRatedItem, 4.5D, 0.1D)
}
}
Results
And finally, the results of the test:
Recommendations for user(15)
5.000 98 Gladiator (2000)
5.000 120 The Lord of the Rings: The Fellowship of the Ring (2001)
5.000 122 The Lord of the Rings: The Return of the King (2003)
5.000 180 Minority Report (2002)
5.000 280 Terminator 2: Judgment Day (1991)
5.000 603 The Matrix (1999)
5.000 604 The Matrix Reloaded (2003)
5.000 640 Catch Me If You Can (2002)
5.000 808 Shrek (2001)
5.000 1891 Star Wars: Episode V - The Empire Strikes Back (1980)
5.000 2502 The Bourne Supremacy (2004)
4.674 146 Crouching Tiger Hidden Dragon (Wo hu cang long) (2000)
4.634 11 Star Wars: Episode IV - A New Hope (1977)
4.634 1637 Speed (1994)
4.613 5503 The Fugitive (1993)
4.570 1894 Star Wars: Episode II - Attack of the Clones (2002)
4.549 24 Kill Bill: Vol. 1 (2003)
4.510 955 Mission: Impossible II (2000)
4.500 13 Forrest Gump (1994)
4.500 85 Raiders of the Lost Ark (1981)
recommendations for movie(77): Memento (2000)
0.3967 141 Donnie Darko (2001)
0.3199 38 Eternal Sunshine of the Spotless Mind (2004)
0.2358 629 The Usual Suspects (1995)
0.2260 807 Seven (a.k.a. Se7en) (1995)
0.1976 550 Fight Club (1999)
0.1609 745 The Sixth Sense (1999)
0.1079 63 Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
0.0841 680 Pulp Fiction (1994)
0.0609 393 Kill Bill: Vol. 2 (2004)
0.0585 274 The Silence of the Lambs (1991)
prediction(2048,393) = 3.691436473173808
prediction(2048,77) = 4.5
I considered using Lucene's Fields API to compute document vectors and Apache Mahout vectors to compute the cosine similarity, instead of the neighborhood approximation I adopted so I could use MLT, but then it wouldn't have been any different from the DataModel approach, so I stuck with MLT. Variants of this approach can also be used to build user-user CF recommenders.
The code for this post is available in my mia-scala-examples repository on GitHub. The Source comment at the start of each code block indicates the file location.
Update 2013-12-18: Based on Ravi's comment below about using the Tanimoto Coefficient, I started to rethink my approach of sticking to MLT for Lucene (convenient, but suddenly too restrictive :-)). If we can model our item as a bag of features (bag of words for documents, bag of tags for movies in our example, etc), then Lucene's Fields API provides an alternative way of composing term vectors for each document. I have created a termVector method that does this, a new overloaded similarity method which takes two item IDs and a similarity function reference and calculates a similarity, as well as Cosine Similarity and Tanimoto Coefficient functions. Using this, I was able to get results for these two similarity metrics (by switching around the computation for simIJ in the predict method of the main code), which are shown below. In this case, the ordering seems to be fairly consistent across the various similarity computations.
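For reference, the two similarity functions compute the following over the sparse tag-frequency vectors A and B (this is just a restatement of the cosine() and tanimoto() methods in the listing above):

$$\mathrm{cosine}(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} \qquad \mathrm{tanimoto}(A, B) = \frac{|T_A \cap T_B|}{|T_A \cup T_B|}$$

where T_A and T_B are the sets of tags with non-zero frequency in each vector - so the Tanimoto variant effectively works on binarized vectors and ignores the tag frequencies.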
==== Cosine Similarity (without neighborhood cliff) ====
Recommendations for user(15)
5.000 98 Gladiator (2000)
5.000 120 The Lord of the Rings: The Fellowship of the Ring (2001)
5.000 122 The Lord of the Rings: The Return of the King (2003)
5.000 180 Minority Report (2002)
5.000 280 Terminator 2: Judgment Day (1991)
5.000 603 The Matrix (1999)
5.000 604 The Matrix Reloaded (2003)
5.000 640 Catch Me If You Can (2002)
5.000 808 Shrek (2001)
5.000 1891 Star Wars: Episode V - The Empire Strikes Back (1980)
5.000 2502 The Bourne Supremacy (2004)
4.500 13 Forrest Gump (1994)
4.500 85 Raiders of the Lost Ark (1981)
4.500 1892 Star Wars: Episode VI - Return of the Jedi (1983)
4.500 2164 Stargate (1994)
4.500 2501 The Bourne Identity (2002)
4.500 36955 True Lies (1994)
4.417 954 Mission: Impossible (1996)
4.394 1637 Speed (1994)
4.389 955 Mission: Impossible II (2000)
recommendations for movie(77): Memento (2000)
0.3967 141 Donnie Darko (2001)
0.3199 38 Eternal Sunshine of the Spotless Mind (2004)
0.2358 629 The Usual Suspects (1995)
0.2260 807 Seven (a.k.a. Se7en) (1995)
0.1976 550 Fight Club (1999)
0.1609 745 The Sixth Sense (1999)
0.1079 63 Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
0.0841 680 Pulp Fiction (1994)
0.0609 393 Kill Bill: Vol. 2 (2004)
0.0585 274 The Silence of the Lambs (1991)
prediction(2048,393) = 4.11945476427452
prediction(2048,77) = 4.5
==== Tanimoto Coefficient Similarity ====
Recommendations for user(15)
5.000 98 Gladiator (2000)
5.000 120 The Lord of the Rings: The Fellowship of the Ring (2001)
5.000 122 The Lord of the Rings: The Return of the King (2003)
5.000 180 Minority Report (2002)
5.000 280 Terminator 2: Judgment Day (1991)
5.000 603 The Matrix (1999)
5.000 604 The Matrix Reloaded (2003)
5.000 640 Catch Me If You Can (2002)
5.000 808 Shrek (2001)
5.000 1891 Star Wars: Episode V - The Empire Strikes Back (1980)
5.000 2502 The Bourne Supremacy (2004)
4.500 13 Forrest Gump (1994)
4.500 85 Raiders of the Lost Ark (1981)
4.500 1892 Star Wars: Episode VI - Return of the Jedi (1983)
4.500 2164 Stargate (1994)
4.500 2501 The Bourne Identity (2002)
4.500 36955 True Lies (1994)
4.288 1894 Star Wars: Episode II - Attack of the Clones (2002)
4.258 955 Mission: Impossible II (2000)
4.238 8358 Cast Away (2000)
recommendations for movie(77): Memento (2000)
0.3967 141 Donnie Darko (2001)
0.3199 38 Eternal Sunshine of the Spotless Mind (2004)
0.2358 629 The Usual Suspects (1995)
0.2260 807 Seven (a.k.a. Se7en) (1995)
0.1976 550 Fight Club (1999)
0.1609 745 The Sixth Sense (1999)
0.1079 63 Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
0.0841 680 Pulp Fiction (1994)
0.0609 393 Kill Bill: Vol. 2 (2004)
0.0585 274 The Silence of the Lambs (1991)
prediction(2048,393) = 4.175819889778504
prediction(2048,77) = 4.5
Comments
Nice example Sujit. Have you tried the Tanimoto coefficient for similarity? For our item-item recommenders (aka collaborative filtering) it produced better results than cosine.
Thanks,
Ravi Kiran Bhaskar
Thanks Ravi. I haven't tried it here, but I did notice this also in another application. I think using binarized feature vectors results in greater spread and makes for better results. Can't do that here with MLT though...need to extract vectors and do that manually.
Hmm... seeing your results being almost the same for cosine and tanimoto is confusing, and raises the question of whether Tanimoto is a bit data hungry?? :-)
I am confused - how do you explain both cosine and tanimoto giving the same results... is cosine an edge case of tanimoto or vice versa??
Ravi Kiran Bhaskar
I thought of it like this - since they are both similarity measures, they are bringing up similar movies, but with some differences because they are not identical. To Anonymous's point, it would probably make sense to try this with another dataset to see if the results are this close - intuitively I would have expected more difference also, but it could just be quirky data. To Ravi's point, I guess you could consider Tanimoto's numerator a dot product of binarized ratings, so in that sense it is similar to Cosine.
You have put together a great framework for benchmarking. I am interested in expanding this by adding SVD, but first I want to tweak your item-item and user-user recommenders to run on Spark. Any thoughts on that? Could you provide me a few pointers / examples on how to convert this to run on Spark while using Mahout? Thanks.
Thanks. Did you perhaps mean to comment on my previous post? As you probably know, in case you just want to use it to solve a problem, Spark MLLib already provides an ALS implementation to discover a latent "taste" space, similar to SVD, and Mahout provides user-user and item-item algorithms that run on Spark. I haven't thought too deeply about the implementation details, but I figure that user-user treats users as vectors of items and item-item treats items as vectors of users, and the idea is to compute similarities between these vectors, which are essentially dot products, so I would just start from there.
Yes, I know about ALS and came across Mahout's built-in functions. For publication purposes, I thought the Spark framework lacks built-in recommendation algorithms written in a way people can understand (the way you have built them using Mahout: https://github.com/sujitpal/mia-scala-examples/tree/master/src/main/scala/com/mycompany/mia/recsys). I thought I could start moving this code to Spark, beginning with a Spark version of item-item, and then build Funk-SVD etc so researchers can benefit.
Sure, it makes sense to do that now that I understand your purpose. I thought maybe you just wanted to use it, in which case it would be simpler to just use what's available.
I was wondering if you have any pointers on converting your item-item code to run in Spark while still using Mahout 0.9 or later. For example, if we only have "libraryDependencies += "org.apache.mahout" % "mahout-spark_2.10" % "0.10.0"" in build.sbt, we don't have access to "import org.apache.mahout.cf.taste.impl.model.file.FileDataModel". Let me know if you have any thoughts, and sorry for the trouble :)
No trouble at all :-). I haven't used Mahout since 0.9, after which there has been a lot of activity to build new algorithms that run on Spark, so my thoughts may not be relevant or correct. One possible package you may be able to use is Mahout Math. I haven't used it on Spark myself though; nowadays I use either the linear algebra stuff that MLLib provides or breeze. For FileDataModel, IIRC all it provides are methods that exploit the "standard" triple/quad format that the collaborative filtering code requires - if you are building a Spark version, it may be better to just use a Scala case class instead of pulling this across from the JAR (I am not sure if it is Serializable, and you will need it to be Serializable to use it on the Spark workers).
Yeah, I think that's a good idea. Also, how do you actually run your JUnit tests in Mahout? Let's say I have created a package using sbt - how do I run this? I want to run this so I can benchmark the error rate against the Spark implementation afterwards. Thanks,
I just ran the job locally (non-HDFS) on Hadoop with a very small slice of the data and verified it worked correctly. I believe there are JUnit extensions to handle both Hadoop and Spark jobs, but I never got around to using them.
I see in the blog post you have used sbt 'run-main.....', but when I looked at your RecommenderEvaluator I don't see you calling ItemItemCollaborativeFilteringRecommender.scala. Also, do I need to do anything extra to make it run on Hadoop other than "sbt run-main com.mycompany.mia.cf.RecommenderEvaluator.scala"? Not sure how Scala works on Hadoop. Thanks in advance.
For Hadoop you need to do "sbt assembly", which will create a fat jar with all dependencies including the Scala compiler and runtime. Then you just use the fat jar the same as a Java jar on Hadoop. I am afraid I might have misled you two comments up, sorry about that - I just realized that the code in the blog post here uses the Mahout math libraries only in local mode; it wasn't built to run on Hadoop (or take advantage of the distributed computation one would expect from code running on Hadoop or Spark). I have some examples of running Scala code on Amazon EMR with Spark that show how to run Scala code on Spark (I don't have examples of running Scala with Hadoop directly; before Spark I used Scalding with Scala mostly).
I was trying to execute ItemItemCollaborativeFilteringRecommenderTest using java -cp ..., but I don't think that's the way to do it. Could you advise on how to run it?
For projects with sbt, I generally do something like this. This pulls in all the JAR files listed in the dependencies so no need to explicitly specify them via -cp.
ReplyDeleteunix_prompt$ sbt
sbt> test-only full.class.name.of.JUnitTest
I see, but can it be executed as a jar file on a single machine?
Yes, definitely. You would have to create an object that extends App as the driver (analogous to the main method in a class in Java), then call it using java -cp.
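Something along these lines (the driver object name and what it prints are hypothetical, just to illustrate the shape):

package com.mycompany.mia.recsys

import java.io.File

// Hypothetical driver object - extending App gives it a main() entry point,
// so the assembled jar can be invoked with java -cp.
object LuceneIICFRecommenderDriver extends App {
  val recommender = new LuceneIICFRecommender(
    new File("data/recsys/ratings.csv"),
    new File("data/recsys/movie-tags.csv"),
    new File("data/recsys/itemindex"))
  // print top 20 recommendations for an example user
  recommender.recommend(15L, 20).foreach { case (item, score) =>
    println("%5.3f %d".format(score, item))
  }
}

and then invoke it with something like java -cp <assembly jar> com.mycompany.mia.recsys.LuceneIICFRecommenderDriver.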