The Event Recommendation Engine Challenge on Kaggle asks for a model that can match events to users given user and event metadata and some demographic information. I've been a Kaggle member for a while, but this was the very first time I actually submitted a solution. I came in 87th on the leaderboard when I submitted, with a MAP (Mean Average Precision) of 0.56252, compared to the baseline solution with MAP 0.51382 at position 125-126, and the top solution with MAP 0.72809.
So if you are looking for a better model, you should probably read this post by dolaameng. For my part, I did not expect to do too well, and I was pretty stoked that I actually reached the finish line - there were times in the last 20 days when I had doubts about that. In any case, I describe my solution here, hopefully some of you will help me by pointing out where I could do better and some others will find it helpful as a source of ideas.
The tools I used for the solution are described in my previous post - Python, Numpy, Scipy, Scikit-Learn, Pandas and Cython. There were quite a few false starts, and at one point my code was running so slowly that I would have missed the public leaderboard deadline. However, all's well that ends well, and here I am. So on to the code...
Data
The challenge page has links to the data, but in case you just want to read about it, I briefly describe the data files that were supplied.
train.csv (user, event, invited, timestamp, interested, not_interested)
test.csv (user, event, invited, timestamp)
users.csv (user_id, locale, birthyear, gender, joinedAt, location,
hometown, timezone)
events.csv (event_id, user_id, start_time, city, state, zip, country,
lat, lng, c_1, c_2, ..., c_100, c_other)
user_friends.csv (user, friends)
event_attendees.csv (event, yes, maybe, invited, no)
Some columns referenced above need additional explanation. The user/user_id and event/event_id columns refer to numeric ids, the c_1, ..., c_100 columns are the frequencies of the top 100 words in the event titles, and c_other is the frequency of everything else. In user_friends.csv and event_attendees.csv, the friends, yes, maybe, invited and no fields are space-delimited collections of user_ids.
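As a concrete illustration, here is a minimal sketch (not part of my pipeline) of parsing one of the space-delimited id lists, assuming the column names shown above:

# minimal sketch: parse the space-delimited friend lists from user_friends.csv
import csv

with open("../Data/user_friends.csv") as f:
    reader = csv.DictReader(f)
    for row in reader:
        user = row["user"]
        friends = row["friends"].split(" ")  # list of friend user_ids as strings
        print user, len(friends)
        break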
Given this data, the output that needs to be submitted looks like this:
User,Events
1776192,"[2877501688, 3025444328, 4078218285, 1024025121, 2972428928]"
...
where the Events column is a list of events in descending order of attractiveness to the user.
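For example, a hypothetical helper (not part of my code) that produces a row in this format might look like:

# hypothetical helper: format one submission row, most attractive event first
def formatRow(userId, rankedEventIds):
    return '%s,"[%s]"' % (userId, ", ".join(str(e) for e in rankedEventIds))

print formatRow(1776192, [2877501688, 3025444328])
# prints: 1776192,"[2877501688, 3025444328]"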
Method Overview
As you can see, there is a wide variety of data available, suitable for building both user-based and item-based recommenders. In addition, there are a number of other recommendation factors that can come from the user's social graph. I decided to build the following recommenders:
- User based recommender - uses the preferences of other users who have a preference for the same event and their similarity to this user to compute a user_reco score.
- Item based recommender - there are actually two of these. Both use the preference the user has expressed for other events and the similarity between that event and this one. There are two scores generated for the different measures of event similarity, one based on event metadata and one based on event content. These two recommenders return the evt_p_score and evt_c_score respectively.
- User popularity - measured by the number of friends a user has. The idea here is that people with more friends are more outgoing, and hence are more likely to attend events. This generates the user_pop score.
- Friend Influence - the idea here is that if your friends are going to an event, you are too. The score measures the number of your friends that are going to this event, and generates the frnd_infl score.
- Event Popularity - the more popular an event is, measured by the people that are going to it, the more likely the user will go to it. This produces the event_pop score.
Each of the recommendation scores described above becomes a new feature in my dataset. Because computing the scores benefits from having some of the data as matrices, I first generate the matrices and a couple of dictionaries and serialize them to disk for the next stage. Two of these data structures are just pickled {user_id: index} and {event_id: index} maps; the others are sparse matrices serialized into MatrixMarket files. The user similarity matrix and the first event similarity matrix use correlation as the similarity measure, and the second event similarity matrix (built off the c_* values) uses cosine similarity.
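The serialization pattern is essentially the following (a minimal sketch with toy data and illustrative names; the real code in BaseData.py below uses the same cPickle and mmwrite calls):

# minimal sketch of the serialization pattern (toy data, illustrative names)
import cPickle
import scipy.io as sio
import scipy.sparse as ss

userIndex = {"user_123": 0, "user_456": 1}   # {user_id: row index}
scores = ss.dok_matrix((2, 3))               # users x events
scores[0, 1] = 1.0

cPickle.dump(userIndex, open("userIndex.pkl", "wb"))
sio.mmwrite("userEventScores", scores)       # written as a MatrixMarket file

# ...and in the next stage, read them back:
userIndex = cPickle.load(open("userIndex.pkl", "rb"))
scores = sio.mmread("userEventScores").todense()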
# Source: BaseData.py
from __future__ import division
import itertools
import cPickle
import datetime
import hashlib
import locale
import numpy as np
import pycountry
import scipy.io as sio
import scipy.sparse as ss
import scipy.spatial.distance as ssd
from collections import defaultdict
from sklearn.preprocessing import normalize
class DataCleaner:
"""
Common utilities for converting strings to equivalent numbers
or number buckets.
"""
def __init__(self):
# load locales
self.localeIdMap = defaultdict(int)
for i, l in enumerate(locale.locale_alias.keys()):
self.localeIdMap[l] = i + 1
# load countries
self.countryIdMap = defaultdict(int)
ctryIdx = defaultdict(int)
for i, c in enumerate(pycountry.countries):
self.countryIdMap[c.name.lower()] = i + 1
if c.name.lower() == "usa":
ctryIdx["US"] = i
if c.name.lower() == "canada":
ctryIdx["CA"] = i
for cc in ctryIdx.keys():
for s in pycountry.subdivisions.get(country_code=cc):
self.countryIdMap[s.name.lower()] = ctryIdx[cc] + 1
# load gender id map
self.genderIdMap = defaultdict(int, {"male":1, "female":2})
def getLocaleId(self, locstr):
return self.localeIdMap[locstr.lower()]
def getGenderId(self, genderStr):
return self.genderIdMap[genderStr]
def getJoinedYearMonth(self, dateString):
dttm = datetime.datetime.strptime(dateString, "%Y-%m-%dT%H:%M:%S.%fZ")
return "".join([str(dttm.year), str(dttm.month)])
def getCountryId(self, location):
if (isinstance(location, str)
and len(location.strip()) > 0
and location.rfind(" ") > -1):
return self.countryIdMap[location[location.rindex(" ") + 2:].lower()]
else:
return 0
def getBirthYearInt(self, birthYear):
try:
return 0 if birthYear == "None" else int(birthYear)
except:
return 0
def getTimezoneInt(self, timezone):
try:
return int(timezone)
except:
return 0
def getFeatureHash(self, value):
if len(value.strip()) == 0:
return -1
else:
return int(hashlib.sha224(value).hexdigest()[0:4], 16)
def getFloatValue(self, value):
if len(value.strip()) == 0:
return 0.0
else:
return float(value)
class ProgramEntities:
"""
Creates reference sets for the entity instances we care about
for this exercise. The train and test files contain a small
subset of the data provided in the auxiliary files.
"""
def __init__(self):
# count how many unique users and events appear in the train and test files
uniqueUsers = set()
uniqueEvents = set()
eventsForUser = defaultdict(set)
usersForEvent = defaultdict(set)
for filename in ["../Data/train.csv", "../Data/test.csv"]:
f = open(filename, 'rb')
f.readline().strip().split(",")
for line in f:
cols = line.strip().split(",")
uniqueUsers.add(cols[0])
uniqueEvents.add(cols[1])
eventsForUser[cols[0]].add(cols[1])
usersForEvent[cols[1]].add(cols[0])
f.close()
self.userEventScores = ss.dok_matrix((len(uniqueUsers), len(uniqueEvents)))
self.userIndex = dict()
self.eventIndex = dict()
for i, u in enumerate(uniqueUsers):
self.userIndex[u] = i
for i, e in enumerate(uniqueEvents):
self.eventIndex[e] = i
ftrain = open("../Data/train.csv", 'rb')
ftrain.readline()
for line in ftrain:
cols = line.strip().split(",")
i = self.userIndex[cols[0]]
j = self.eventIndex[cols[1]]
self.userEventScores[i, j] = int(cols[4]) - int(cols[5])
ftrain.close()
sio.mmwrite("../Models/PE_userEventScores", self.userEventScores)
# find all unique user pairs and event pairs that we should
# look at. These should be users who are linked via an event
# or events that are linked via a user in either the training
# or test sets. This is to avoid useless calculations
self.uniqueUserPairs = set()
self.uniqueEventPairs = set()
for event in uniqueEvents:
users = usersForEvent[event]
if len(users) > 2:
self.uniqueUserPairs.update(itertools.combinations(users, 2))
for user in uniqueUsers:
events = eventsForUser[user]
if len(events) > 2:
self.uniqueEventPairs.update(itertools.combinations(events, 2))
cPickle.dump(self.userIndex, open("../Models/PE_userIndex.pkl", 'wb'))
cPickle.dump(self.eventIndex, open("../Models/PE_eventIndex.pkl", 'wb'))
class Users:
"""
Build the user/user similarity matrix for program users
"""
def __init__(self, programEntities, sim=ssd.correlation):
cleaner = DataCleaner()
nusers = len(programEntities.userIndex.keys())
fin = open("../Data/users.csv", 'rb')
colnames = fin.readline().strip().split(",")
self.userMatrix = ss.dok_matrix((nusers, len(colnames) - 2))
for line in fin:
cols = line.strip().split(",")
# consider the user only if he exists in train.csv
if programEntities.userIndex.has_key(cols[0]):
i = programEntities.userIndex[cols[0]]
self.userMatrix[i, 0] = cleaner.getLocaleId(cols[1])
self.userMatrix[i, 1] = cleaner.getBirthYearInt(cols[2])
self.userMatrix[i, 2] = cleaner.getGenderId(cols[3])
self.userMatrix[i, 3] = cleaner.getJoinedYearMonth(cols[4])
self.userMatrix[i, 4] = cleaner.getCountryId(cols[5])
self.userMatrix[i, 5] = cleaner.getTimezoneInt(cols[7])
fin.close()
# normalize the user matrix
self.userMatrix = normalize(self.userMatrix, norm="l1", axis=0, copy=False)
sio.mmwrite("../Models/US_userMatrix", self.userMatrix)
# calculate the user similarity matrix and save it for later
self.userSimMatrix = ss.dok_matrix((nusers, nusers))
for i in range(0, nusers):
self.userSimMatrix[i, i] = 1.0
for u1, u2 in programEntities.uniqueUserPairs:
i = programEntities.userIndex[u1]
j = programEntities.userIndex[u2]
if not self.userSimMatrix.has_key((i, j)):
usim = sim(self.userMatrix.getrow(i).todense(),
self.userMatrix.getrow(j).todense())
self.userSimMatrix[i, j] = usim
self.userSimMatrix[j, i] = usim
sio.mmwrite("../Models/US_userSimMatrix", self.userSimMatrix)
class UserFriends:
"""
Returns the friends of the specified user. The idea is
that (a) people with more friends are more likely to attend
events and (b) if your friend is going, it's more likely for
you to go as well
"""
def __init__(self, programEntities):
nusers = len(programEntities.userIndex.keys())
self.numFriends = np.zeros((nusers))
self.userFriends = ss.dok_matrix((nusers, nusers))
fin = open("../Data/user_friends.csv", 'rb')
fin.readline() # skip header
ln = 0
for line in fin:
# if ln % 100 == 0:
# print "Loading line: ", ln
cols = line.strip().split(",")
user = cols[0]
if programEntities.userIndex.has_key(user):
friends = cols[1].split(" ")
i = programEntities.userIndex[user]
self.numFriends[i] = len(friends)
for friend in friends:
if programEntities.userIndex.has_key(friend):
j = programEntities.userIndex[friend]
# the objective of this score is to infer the degree to
# and direction in which this friend will influence the
# user's decision, so we sum the user/event score for
# this user across all training events.
eventsForUser = programEntities.userEventScores.getrow(j).todense()
score = eventsForUser.sum() / np.shape(eventsForUser)[1]
self.userFriends[i, j] += score
self.userFriends[j, i] += score
ln += 1
fin.close()
# normalize the arrays
sumNumFriends = self.numFriends.sum(axis=0)
self.numFriends = self.numFriends / sumNumFriends
sio.mmwrite("../Models/UF_numFriends", np.matrix(self.numFriends))
self.userFriends = normalize(self.userFriends, norm="l1", axis=0, copy=False)
sio.mmwrite("../Models/UF_userFriends", self.userFriends)
class Events:
"""
Builds the event-event similarity matrix and event content-content
similarity matrix for program events.
"""
def __init__(self, programEntities, psim=ssd.correlation, csim=ssd.cosine):
cleaner = DataCleaner()
fin = open("../Data/events.csv", 'rb')
fin.readline() # skip header
nevents = len(programEntities.eventIndex.keys())
self.eventPropMatrix = ss.dok_matrix((nevents, 7))
self.eventContMatrix = ss.dok_matrix((nevents, 100))
ln = 0
for line in fin.readlines():
# if ln > 10:
# break
cols = line.strip().split(",")
eventId = cols[0]
if programEntities.eventIndex.has_key(eventId):
i = programEntities.eventIndex[eventId]
self.eventPropMatrix[i, 0] = cleaner.getJoinedYearMonth(cols[2]) # start_time
self.eventPropMatrix[i, 1] = cleaner.getFeatureHash(cols[3]) # city
self.eventPropMatrix[i, 2] = cleaner.getFeatureHash(cols[4]) # state
self.eventPropMatrix[i, 3] = cleaner.getFeatureHash(cols[5]) # zip
self.eventPropMatrix[i, 4] = cleaner.getFeatureHash(cols[6]) # country
self.eventPropMatrix[i, 5] = cleaner.getFloatValue(cols[7]) # lat
self.eventPropMatrix[i, 6] = cleaner.getFloatValue(cols[8]) # lon
for j in range(9, 109):
self.eventContMatrix[i, j-9] = cols[j]
ln += 1
fin.close()
self.eventPropMatrix = normalize(self.eventPropMatrix,
norm="l1", axis=0, copy=False)
sio.mmwrite("../Models/EV_eventPropMatrix", self.eventPropMatrix)
self.eventContMatrix = normalize(self.eventContMatrix,
norm="l1", axis=0, copy=False)
sio.mmwrite("../Models/EV_eventContMatrix", self.eventContMatrix)
# calculate similarity between event pairs based on the two matrices
self.eventPropSim = ss.dok_matrix((nevents, nevents))
self.eventContSim = ss.dok_matrix((nevents, nevents))
for e1, e2 in programEntities.uniqueEventPairs:
i = programEntities.eventIndex[e1]
j = programEntities.eventIndex[e2]
if not self.eventPropSim.has_key((i,j)):
epsim = psim(self.eventPropMatrix.getrow(i).todense(),
self.eventPropMatrix.getrow(j).todense())
self.eventPropSim[i, j] = epsim
self.eventPropSim[j, i] = epsim
if not self.eventContSim.has_key((i,j)):
ecsim = csim(self.eventContMatrix.getrow(i).todense(),
self.eventContMatrix.getrow(j).todense())
self.eventContSim[i, j] = ecsim
self.eventContSim[j, i] = ecsim
sio.mmwrite("../Models/EV_eventPropSim", self.eventPropSim)
sio.mmwrite("../Models/EV_eventContSim", self.eventContSim)
class EventAttendees():
"""
Measures event popularity by the number of people attended vs not.
"""
def __init__(self, programEvents):
nevents = len(programEvents.eventIndex.keys())
self.eventPopularity = ss.dok_matrix((nevents, 1))
f = open("../Data/event_attendees.csv", 'rb')
f.readline() # skip header
for line in f:
cols = line.strip().split(",")
eventId = cols[0]
if programEvents.eventIndex.has_key(eventId):
i = programEvents.eventIndex[eventId]
self.eventPopularity[i, 0] = \
len(cols[1].split(" ")) - len(cols[4].split(" "))
f.close()
self.eventPopularity = normalize(self.eventPopularity, norm="l1",
axis=0, copy=False)
sio.mmwrite("../Models/EA_eventPopularity", self.eventPopularity)
def main():
"""
Generate all the matrices and data structures required for further
calculations.
"""
print "calculating program entities..."
pe = ProgramEntities()
print "calculating user metrics..."
Users(pe)
print "calculating user friend metrics..."
UserFriends(pe)
print "calculating event metrics..."
Events(pe)
print "calculating event popularity metrics..."
EventAttendees(pe)
if __name__ == "__main__":
main()
The next step deserializes these matrices and data structures, and uses them to compute the features described. The code below has functions that take one or more columns from each row and produce one or more values representing the new features.
# Source: RegressionData.py
from __future__ import division
import cPickle
import numpy as np
import scipy.io as sio
class DataRewriter:
def __init__(self):
self.userIndex = cPickle.load(open("../Models/PE_userIndex.pkl", 'rb'))
self.eventIndex = cPickle.load(open("../Models/PE_eventIndex.pkl", 'rb'))
self.userEventScores = sio.mmread("../Models/PE_userEventScores").todense()
self.userSimMatrix = sio.mmread("../Models/US_userSimMatrix").todense()
self.eventPropSim = sio.mmread("../Models/EV_eventPropSim").todense()
self.eventContSim = sio.mmread("../Models/EV_eventContSim").todense()
self.numFriends = sio.mmread("../Models/UF_numFriends")
self.userFriends = sio.mmread("../Models/UF_userFriends").todense()
self.eventPopularity = sio.mmread("../Models/EA_eventPopularity").todense()
def userReco(self, userId, eventId):
"""
for item i
for every other user v that has a preference for i
compute similarity s between u and v
incorporate v's preference for i, weighted by s, into a running average
return top items ranked by weighted average
"""
i = self.userIndex[userId]
j = self.eventIndex[eventId]
vs = self.userEventScores[:, j]
sims = self.userSimMatrix[i, :]
prod = sims * vs
try:
return prod[0, 0] - self.userEventScores[i, j]
except IndexError:
return 0
def eventReco(self, userId, eventId):
"""
for item i
for every item j that u has a preference for
compute similarity s between i and j
add u's preference for j weighted by s to a running average
return top items, ranked by weighted average
"""
i = self.userIndex[userId]
j = self.eventIndex[eventId]
js = self.userEventScores[i, :]
psim = self.eventPropSim[:, j]
csim = self.eventContSim[:, j]
pprod = js * psim
cprod = js * csim
pscore = 0
cscore = 0
try:
pscore = pprod[0, 0] - self.userEventScores[i, j]
except IndexError:
pass
try:
cscore = cprod[0, 0] - self.userEventScores[i, j]
except IndexError:
pass
return pscore, cscore
def userPop(self, userId):
"""
Measures user popularity by number of friends a user has. People
with more friends tend to be outgoing and are more likely to go
to events
"""
if self.userIndex.has_key(userId):
i = self.userIndex[userId]
try:
return self.numFriends[0, i]
except IndexError:
return 0
else:
return 0
def friendInfluence(self, userId):
"""
Measures friends influence by the friends who are known (from the
training set) to go or not go to an event. The average of scores across
all friends of the user is the influence score.
"""
nusers = np.shape(self.userFriends)[1]
i = self.userIndex[userId]
return (self.userFriends[i, :].sum(axis=1) / nusers)[0,0]
def eventPop(self, eventId):
"""
Measures event popularity by the number attending and not attending.
"""
i = self.eventIndex[eventId]
return self.eventPopularity[i, 0]
def rewriteData(self, start=1, train=True, header=True):
"""
Create new features based on various recommender scores. This
is so we can figure out what weights to use for each recommender's
scores.
"""
fn = "train.csv" if train else "test.csv"
fin = open("../Data/" + fn, 'rb')
fout = open("../NewData/" + fn, 'wb')
# write output header
if header:
ocolnames = ["invited", "user_reco", "evt_p_reco",
"evt_c_reco", "user_pop", "frnd_infl", "evt_pop"]
if train:
ocolnames.append("interested")
ocolnames.append("not_interested")
fout.write(",".join(ocolnames) + "\n")
ln = 0
for line in fin:
ln += 1
if ln < start:
continue
cols = line.strip().split(",")
userId = cols[0]
eventId = cols[1]
invited = cols[2]
print "%s:%d (userId, eventId)=(%s, %s)" % (fn, ln, userId, eventId)
user_reco = self.userReco(userId, eventId)
evt_p_reco, evt_c_reco = self.eventReco(userId, eventId)
user_pop = self.userPop(userId)
frnd_infl = self.friendInfluence(userId)
evt_pop = self.eventPop(eventId)
ocols = [invited, user_reco, evt_p_reco,
evt_c_reco, user_pop, frnd_infl, evt_pop]
if train:
ocols.append(cols[4]) # interested
ocols.append(cols[5]) # not_interested
fout.write(",".join(map(lambda x: str(x), ocols)) + "\n")
fin.close()
fout.close()
def rewriteTrainingSet(self):
self.rewriteData(train=True)
def rewriteTestSet(self):
self.rewriteData(train=False)
# When running with cython, the actual class will be converted to a .so
# file, and the following code (along with the commented out import below)
# will need to be put into another .py and this should be run.
#import CRegressionData as rd
def main():
dr = DataRewriter()
print "rewriting training data..."
dr.rewriteData(train=True, start=2, header=False)
print "rewriting test data..."
dr.rewriteData(train=False, start=2, header=True)
if __name__ == "__main__":
main()
These features replace the original train.csv and test.csv files, so now they look like this:
train.csv(invited, user_reco, evt_p_reco, evt_c_reco, user_pop,
frnd_infl, evt_pop, interested, not_interested)
test.csv(invited, user_reco, evt_p_reco, evt_c_reco, user_pop,
frnd_infl, evt_pop)
Incidentally, this was the part that was dog-slow. After suffering through the two days it took to write out the new train.csv, I finally figured out the cause: I was using sparse matrices for my reference matrices, and building up the row and column needed for each matrix multiplication was taking too much time. The first fix was to replace them with (regular) dense NumPy matrices during deserialization. I also ended up converting the class to a shared object using Cython. As a speed comparison, the resulting executable ran through the 10K-record test set in about a minute, compared to two days for processing the 15K-record training set. I describe the Cython conversion in a separate section below.
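For reference, the build itself is the standard Cython recipe - roughly the setup.py sketched below (CRegressionData is the module name from the commented-out import in RegressionData.py above; the other details are illustrative, not my exact build script):

# setup.py -- minimal sketch of a standard Cython build for a pure-Python module
from distutils.core import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("CRegressionData.py"))

# build with: python setup.py build_ext --inplace
# the driver script can then do: import CRegressionData as rd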
Now I train a Stochastic Gradient Descent classifier from Scikit-Learn with my modified training set and build a one-vs-all classifier model to predict the value of the "interested" outcome.
# Source: RecoWeights.py
from __future__ import division
import math
import numpy as np
import pandas as pd
from sklearn.cross_validation import KFold
from sklearn.linear_model import SGDClassifier
def train():
"""
Trains a classifier on the entire (modified) training dataset.
Since our objective is to predict only interested users, we
only consider the outcome 1=interested and 0=not.
"""
trainDf = pd.read_csv("../NewData/train.csv")
X = np.matrix(pd.DataFrame(trainDf, index=None,
columns=["invited", "user_reco", "evt_p_reco", "evt_c_reco",
"user_pop", "frnd_infl", "evt_pop"]))
y = np.array(trainDf.interested)
clf = SGDClassifier(loss="log", penalty="l2")
clf.fit(X, y)
return clf
def validate():
"""
Runs a 10-fold cross validation on the classifier, reporting
accuracy.
"""
trainDf = pd.read_csv("../NewData/train.csv")
X = np.matrix(pd.DataFrame(trainDf, index=None,
columns=["invited", "user_reco", "evt_p_reco", "evt_c_reco",
"user_pop", "frnd_infl", "evt_pop"]))
y = np.array(trainDf.interested)
nrows = len(trainDf)
kfold = KFold(nrows, 10)
avgAccuracy = 0
run = 0
for train, test in kfold:
Xtrain, Xtest, ytrain, ytest = X[train], X[test], y[train], y[test]
clf = SGDClassifier(loss="log", penalty="l2")
clf.fit(Xtrain, ytrain)
accuracy = 0
ntest = len(ytest)
for i in range(0, ntest):
yt = clf.predict(Xtest[i, :])
if yt == ytest[i]:
accuracy += 1
accuracy = accuracy / ntest
print "accuracy (run %d): %f" % (run, accuracy)
avgAccuracy += accuracy
run += 1
print "Average accuracy", (avgAccuracy / run)
def test(clf):
"""
Reads the X values from the dataframe provided, then uses the
trained classifier to write an array of outcomes.
"""
origTestDf = pd.read_csv("../Data/test.csv")
users = origTestDf.user
events = origTestDf.event
testDf = pd.read_csv("../NewData/test.csv")
fout = open("../NewData/result.csv", 'wb')
fout.write(",".join(["user", "event", "outcome", "dist"]) + "\n")
nrows = len(testDf)
Xp = np.matrix(testDf)
yp = np.zeros((nrows, 2))
for i in range(0, nrows):
xp = Xp[i, :]
yp[i, 0] = clf.predict(xp)
yp[i, 1] = clf.decision_function(xp)
fout.write(",".join(map(lambda x: str(x),
[users[i], events[i], yp[i, 0], yp[i, 1]])) + "\n")
fout.close()
def main():
# validate()
clf = train()
test(clf)
if __name__ == "__main__":
main()
Running a 10-fold cross validation yields an accuracy of 0.676043972845. Running the classifier against the modified test set yields another temporary file. The user and event columns come from the original test.csv, while outcome and dist are the classifier's predicted outcome and the distance of the point from the separating hyperplane. So for (user, event) pairs with a predicted outcome of 1, higher values of dist imply a better match.
user,event,outcome,dist
1776192,2877501688,0.0,-1.10395723304
...
The final step is to convert this file to the required output format. This is done using the following code:
# Source: ResultFormat.py
from __future__ import division
import pandas as pd
def byDist(x, y):
return cmp(y[1], x[1])
def main():
# output file
fout = open("../NewData/final_result.csv", 'wb')
fout.write(",".join(["User", "Events"]) + "\n")
resultDf = pd.read_csv("../NewData/result.csv")
# group remaining user/events
grouped = resultDf.groupby("user")
for name, group in grouped:
user = str(name)
tuples = zip(list(group.event), list(group.dist), list(group.outcome))
# tuples = filter(lambda x: x[2]==1, tuples)
tuples = sorted(tuples, cmp=byDist)
events = "\"" + str(map(lambda x: x[0], tuples)) + "\""
fout.write(",".join([user, events]) + "\n")
fout.close()
if __name__ == "__main__":
main()
My first attempt was to keep both the predicted interested and not-interested events for each user and rank them all. This is probably what is expected, because when I put the filter in to remove the interested==0 rows (the commented-out line above), my score dropped.
All the code in this post is available on my GitHub repository.
Comments
Can you share the files which are created in the Models folder?
Nice post!
Two things that I would try on top of what you have tried:
1. Gender -- there could be events that are more interesting to men and others that are more interesting to women.
2. Location -- perhaps the closer the event city is to the user's location, the more likely they are to go.
Not sure how easy it is to embed them as features, just suggestions. Good to see you participating in Kaggle. I had once participated too, in the StackOverflow closed question prediction contest (ranked 24th out of 46):
http://www.kaggle.com/c/predict-closed-questions-on-stack-overflow
@Anonymous: sorry I missed your comment - for some reason I missed the comment email and only saw it recently in the Blogger web GUI. Unfortunately I don't have the files anymore. I tried to regenerate them from re-downloaded input files, but it appears that some formats may have changed midway through the challenge, which I didn't pick up. Not sure if you still need them; let me know if you do, and I will fix the code and regenerate them.
@Dmitry: thanks for the suggestions. I did not think to match up user location to event location, which would have been a useful indicator. Gender is already part of the user features and contributes (perhaps too little at the moment) to the user similarity. Thanks for the link to the SO contest, I'll check it out - you did much better than I did :-).
Thanks so much for posting this - a really clear way of doing it.
I notice you're using the pycountry library - what is this for, and is it absolutely necessary? I'm just having issues getting it to install.
Thanks Paul. I used pycountry to extract country names (for non US and Canada) and state/province names (for US and Canada) from the location field in the data. You can obviously do without it, and if I built a solution today, I probably would.
Hi Rachana, first off, my apologies for the late reply - I wasn't getting notified about the comments. Regarding your first question, it's been a while since I did this, so if you could provide more details such as what your inputs, outputs and processing code were w.r.t. the described pipeline, that would be very helpful. Your second point is a good one - obviously the fact that someone hasn't responded is not sufficient indication that the user is uninterested, but I figured it would be a useful approximation.