
Recommendation


Learn To Recommend

Binary classification recommendation system

Many recommendation algorithms in wide use today are, at their core, binary classifiers, much like click-through-rate (CTR) estimation. Although they perform reasonably well online, they still fall short of what a recommender system actually needs:

  • The "ranking" task in a recommender system cares about whether the relative order is correct, not about how precisely the click probability of each individual item is predicted.

  • The loss a binary classifier optimizes depends only on the gap between the predicted score and the ground truth; it is differentiable w.r.t. the predicted score, so Gradient Descent applies. But many business metrics in recommendation, such as NDCG and MAP, are based on rank positions. A small change in a predicted score either fails to change the ordering, leaving the metric constant with a derivative that is identically 0, or flips the ordering and makes the metric jump discontinuously, which is non-differentiable. Either way, Gradient Descent cannot optimize the metric directly (see the sketch below).
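A quick way to see this non-differentiability (a minimal sketch; the helper names and toy numbers are mine, not from any library):

import numpy as np

def dcg_at_k(rels, k):
    # DCG with exponential gain: sum of (2^rel - 1) / log2(rank + 1), rank starting at 1
    rels = np.asarray(rels, dtype=float)[:k]
    return np.sum((2.0 ** rels - 1) / np.log2(np.arange(2, rels.size + 2)))

def ndcg_at_k(scores, rels, k):
    order = np.argsort(scores)[::-1]                 # rank items by predicted score, descending
    ideal = dcg_at_k(sorted(rels, reverse=True), k)  # DCG of the perfect ordering
    return dcg_at_k(np.asarray(rels)[order], k) / ideal

rels = [3, 0, 1]                            # ground-truth relevance of three items
print(ndcg_at_k([2.0, 1.0, 0.5], rels, 3))  # ~0.983
print(ndcg_at_k([2.0, 1.1, 0.5], rels, 3))  # score nudged, order unchanged -> identical NDCG (zero gradient)
print(ndcg_at_k([1.0, 2.0, 0.5], rels, 3))  # order flips -> NDCG jumps to ~0.644 (non-differentiable)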

Regression recommendation for User-item

# Two-tower embedding regression for user-item ratings (Keras functional API).
from tensorflow.keras.layers import Input, Embedding, Flatten, Concatenate, Dense
from tensorflow.keras.models import Model

movie_count = 17771
user_count = 2649430

# Each tower embeds an integer id into a dense vector.
movie_in = Input(shape=(1,), name='movie_id')
movie_emb = Embedding(movie_count, 60)(movie_in)
user_in = Input(shape=(1,), name='user_id')
user_emb = Embedding(user_count, 20)(user_in)

# Concatenate the two embeddings and regress the rating with a small MLP.
x = Concatenate()([movie_emb, user_emb])
x = Flatten()(x)
for _ in range(3):
    x = Dense(64, activation='sigmoid')(x)
out = Dense(1)(x)

model = Model(inputs=[movie_in, user_in], outputs=out)
model.compile(loss='mean_squared_error', optimizer='adadelta')
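A hypothetical training call, with toy arrays just to show the expected input shapes (the ids and ratings here are made up):

import numpy as np
movie_ids = np.array([[10], [42]])   # shape (batch, 1), integer movie ids
user_ids = np.array([[1], [2]])      # shape (batch, 1), integer user ids
ratings = np.array([4.0, 3.5])       # regression targets
model.fit([movie_ids, user_ids], ratings, epochs=1)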

Learn to recommend

Two points to note:

  • A plain pairwise binary loss treats ordering correctness at the head of the list exactly the same as at the tail; what it actually optimizes is AUC.

The LambdaRank/LambdaMART line of work resolves this simply, and the idea has proven effective: since we cannot define a loss that meets the requirements, we stop defining one altogether and instead directly define a gradient that meets the requirements. This gradient, written down by hand rather than derived from any loss, is the so-called Lambda gradient.

So what Lambda gradient does optimizing NDCG require? It is simple:

  • After the previous iteration ends, sort the candidate documents/items in descending order of the scores predicted in that iteration.

  • For a pair <Ui, Uj>, if we still use the binary cross-entropy loss to predict whether the ordering is correct, its gradient is defined exactly as in Figure 3 below.

Learn to recommend: four essentials

To summarize, implementing a Learning-To-Rank algorithm comes down to four key points (the first two are listed here; the other two, scoring and the Lambda weight, are covered at the end of the NDCG section below):

  • How the samples are organized. Obviously, ranking is only meaningful among "candidate documents under the same query" or "candidate products within the same recommendation session". So, unlike ordinary binary classification, an LTR training set has a notion of groups: only candidates within the same group can be paired up and compared against each other (see the pairing sketch after this list).

  • How the loss is defined. Learning-To-Rank has three ways to define the loss: pointwise, pairwise, and listwise. Pointwise is no different from ordinary CTR estimation; the example above is pairwise; listwise will be covered some other time. Different loss definitions, combined with the different lambda weights defined for different target metrics, spawn many more algorithm variants.
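A minimal sketch of the grouping point above (the field names are illustrative): pairs are generated only within a query/session group, never across groups.

from itertools import combinations

def make_pairs(samples):
    # samples: iterable of (group_id, item_id, relevance) triples
    by_group = {}
    for group, item, rel in samples:
        by_group.setdefault(group, []).append((item, rel))
    for items in by_group.values():
        for (i, ri), (j, rj) in combinations(items, 2):
            if ri != rj:
                # yield (higher-relevance item, lower-relevance item)
                yield (i, j) if ri > rj else (j, i)

samples = [("q1", "a", 3), ("q1", "b", 1), ("q2", "c", 2), ("q2", "d", 0)]
print(list(make_pairs(samples)))   # [('a', 'b'), ('c', 'd')] -- no cross-group pairs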

Binary classification vs. score-based ranking system (LTR)

  • LTR matches the real needs of a recommender system better than the binary click/conversion-prediction classifiers in common use today.

  • I usually run LTR with lightgbm or xgboost, but GBDTs are inherently good only at dense features, while sparse features are the first-class citizens of recommendation and search. TF-Ranking is built on TensorFlow, so it can fully exploit embedding, crossing, and hashing to handle sparse features.

  • TF-Ranking is a framework: it can plug in any algorithm to compute query-doc or user-item relevance, and it makes it convenient to switch among the three families of loss functions, pointwise/pairwise/listwise. This lets us easily try many combinations, e.g., pairwise LTR on top of wide & deep, or listwise LTR on top of DeepFM.

  • TF-Ranking is implemented on the TensorFlow Estimator framework, so it automatically gains distributed training and serving and can cope with massive datasets.

Metric for recommendation system

MAP (Mean Average Precision)

OK, that was Average Precision, which applies to a single data point (like a single user). What about MAP@N? All that remains is to average the AP@N metric over all |U| users. Yes, an average of an average.

MAP@N = \frac{1}{|U|}\sum_{u=1}^{|U|}(AP@N)_u = \frac{1}{|U|}\sum_{u=1}^{|U|}\frac{1}{m}\sum_{k=1}^{N}P_u(k)\cdot rel_u(k)

where $m$ is the number of relevant items for user $u$, $P_u(k)$ is the precision at cutoff $k$, and $rel_u(k)\in\{0,1\}$ indicates whether the item at rank $k$ is relevant.
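A minimal sketch of AP@N and MAP@N following the formula above (the data and helper names are illustrative; here m is capped at N, a common convention):

def ap_at_n(recommended, relevant, n):
    hits, total = 0, 0.0
    for k, item in enumerate(recommended[:n], start=1):
        if item in relevant:
            hits += 1
            total += hits / k            # P_u(k) * rel_u(k)
    m = min(len(relevant), n)
    return total / m if m else 0.0

def map_at_n(recs, rels, n):
    # recs/rels: dicts keyed by user -> ranked list / set of relevant items
    return sum(ap_at_n(recs[u], rels[u], n) for u in recs) / len(recs)

recs = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"]}
rels = {"u1": {"a", "c"}, "u2": {"e"}}
print(map_at_n(recs, rels, 3))           # (0.833 + 0.5) / 2 ≈ 0.667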

NDCG series

CG (Cumulative Gain)

DCG (Discounted Cumulative Gain)

NDCG (Normalized Discounted Cumulative Gain)
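For reference, the standard definitions, with $rel_i$ the graded relevance of the item at rank $i$ and IDCG@k the DCG@k of the ideal ordering:

CG@k = \sum_{i=1}^{k} rel_i

DCG@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}

NDCG@k = \frac{DCG@k}{IDCG@k}

(The linear-gain variant $DCG@k = \sum_{i=1}^{k} rel_i / \log_2(i+1)$ is also common; the exponential gain emphasizes highly relevant items.)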

A user issues a query q. Ui and Uj are two documents to be ranked; $s_i$ is the score the model gives Ui, predicting the relevance between q and Ui, and $s_j$ is the score for Uj, predicting the relevance between q and Uj. We can predict the probability that Ui should rank before Uj, which turns ranking into a binary classification problem. The probability of "Ui ranks before Uj" is defined as follows (note that $\sigma$ is a hyperparameter here, not the sigmoid function):

P_{ij} = \frac{1}{1 + e^{-\sigma(s_i - s_j)}}

Figure 1. The probability that Ui ranks before Uj is a sigmoid of $s_i - s_j$.

Define $\bar{P}_{ij}$ as the ground truth of the statement "Ui ranks before Uj": 1 means the statement is true, 0 means it is false, and 0.5 means Ui and Uj should occupy the same position. With a simple change of variables, use $S_{ij}\in\{-1, 0, 1\}$ to encode the truth of the statement, i.e. $\bar{P}_{ij} = \frac{1}{2}(1 + S_{ij})$. We can then define the binary cross-entropy loss (Figure 2):

C = -\bar{P}_{ij}\log P_{ij} - (1 - \bar{P}_{ij})\log(1 - P_{ij}) = \frac{1}{2}(1 - S_{ij})\sigma(s_i - s_j) + \log\left(1 + e^{-\sigma(s_i - s_j)}\right)

We can further simplify the training set: if "Ui ranks before Uj" is true for a pair, then "Uj ranks before Ui" must be false, which is redundant. So we keep only the training samples with $S_{ij} = 1$. Differentiating the binary cross-entropy loss above w.r.t. the parameters to be optimized $w_k$ gives (Figure 3):

\frac{\partial C}{\partial w_k} = \frac{\partial C}{\partial s_i}\frac{\partial s_i}{\partial w_k} + \frac{\partial C}{\partial s_j}\frac{\partial s_j}{\partial w_k}, \qquad \frac{\partial C}{\partial s_i} = \frac{-\sigma}{1 + e^{\sigma(s_i - s_j)}} = -\frac{\partial C}{\partial s_j}

Note that in the expression above, the derivatives of the predicted scores w.r.t. the parameters, $\frac{\partial s_i}{\partial w_k}$ and $\frac{\partial s_j}{\partial w_k}$, can be realized by an NN or by a GBDT; that is mature, off-the-shelf technology and needs no elaboration. What we care about here is $\frac{\partial C}{\partial s_i}$, the derivative of the loss w.r.t. the predicted score of the i-th document. We denote it $\lambda_{ij}$, which is exactly where the names of the LambdaRank/LambdaMART family of algorithms come from.

At this point $\lambda_{ij}$ is still a true gradient, in contrast to the hand-designed gradient, introduced below, for optimizing a non-differentiable metric.

But if we want to optimize a metric like NDCG that puts extra weight on head positions, then, as discussed earlier, such a metric either ignores a small change in a single document's predicted score entirely or reacts to it violently; that is, we cannot define a loss that both optimizes NDCG and is continuously differentiable w.r.t. $s_i$. What to do?

Figure 4. The Lambda gradient when only ordering correctness is considered, ignoring the target metric:

\lambda_{ij} \equiv \frac{\partial C}{\partial s_i} = \frac{-\sigma}{1 + e^{\sigma(s_i - s_j)}}

Swap the positions of Ui and Uj, compute the resulting change in NDCG, $\left|\Delta NDCG\right|$, and multiply it onto the Lambda gradient from the previous step; that product is the Lambda gradient needed to optimize NDCG:

\lambda_{ij} = \frac{-\sigma}{1 + e^{\sigma(s_i - s_j)}}\cdot\left|\Delta NDCG\right|

In the TF-Ranking code, the $\left|\Delta NDCG\right|$ in this formula is called the Lambda weight. Optimizing a different metric means defining a different Lambda weight.
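A minimal sketch of this recipe (not TF-Ranking's actual implementation; the names are mine): the pairwise RankNet gradient, scaled by the $\left|\Delta NDCG\right|$ obtained from swapping the two items' positions.

import numpy as np

def swap_delta_ndcg(rels, ranks, i, j, ideal_dcg):
    # |change in NDCG| if items i and j swap rank positions (0-based ranks)
    gain = lambda r: 2.0 ** r - 1
    disc = lambda pos: 1.0 / np.log2(pos + 2)
    before = gain(rels[i]) * disc(ranks[i]) + gain(rels[j]) * disc(ranks[j])
    after = gain(rels[i]) * disc(ranks[j]) + gain(rels[j]) * disc(ranks[i])
    return abs(after - before) / ideal_dcg

def lambda_ij(s_i, s_j, delta_ndcg, sigma=1.0):
    # Lambda gradient for a pair with S_ij = 1 (Ui should rank above Uj)
    return -sigma / (1.0 + np.exp(sigma * (s_i - s_j))) * delta_ndcg

rels, ranks = [3.0, 1.0], [1, 0]   # Ui more relevant but currently ranked below Uj
ideal = (2**3 - 1) / np.log2(2) + (2**1 - 1) / np.log2(3)
d = swap_delta_ndcg(rels, ranks, 0, 1, ideal)
print(lambda_ij(s_i=0.5, s_j=1.0, delta_ndcg=d))   # negative: pushes s_i up relative to s_j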

Scoring. This ought to be a simple step: feed the query-doc or user-item features into the model, and the model's output is the score we need, predicting query-doc or user-item relevance. But TF-Ranking uses a so-called Groupwise Scoring Function (GSF): to score a candidate $item_i$, it puts the item into groups with other candidates, considers it at different positions within each group, and scores the whole group at once. In my view this is a flashy feature whose cost outweighs its benefit.

Lambda Weight. As described above, depending on which metric is to be optimized (NDCG/MAP/MRR), a different lambda weight is defined and multiplied in front of $\frac{\partial C}{\partial s_i}$.

The "Mean" in MAP

¶
Logo走马观花Google TF-Ranking的源代码知乎专栏
LogoDeep learning solution for netflix prizekarthkk
Mean Average Precision (MAP) For Recommender SystemsEvening Session
http://sofasofa.io/forum_main_post.php?postid=1002561
w_k
\frac{\partial{s_i}}{\partial{w_k}}
\sigma
\bar{P_{ij}}
S_{ij}\in[-1,0,1]
S_{ij}
\frac{\partial{C}}{\partial{s_i}}=-\frac{\partial{C}}{\partial{s_j}}
\lambda_{ij}
s_i
\lambda_{ij}
\frac{\partial{s_j}}{\partial{w_k}}
\left| \Delta NDCG \right|
\left| \Delta NDCG \right|
\left| \Delta NDCG \right|
item_i
item_i
\frac{\partial{C}}{\partial{s_i}}