推荐系统实验
我正在阅读《集体智能编程》一书,并尝试将我所学到的知识应用到 Northwind 数据库中。尽管我对所提出的算法的理解还没有信心,但我开始了解一般概念。
使用 Northwind 数据库,我尝试使用以下伪逻辑显示“购买此产品的客户也购买了 XYZ”的列表:
- 查找也购买了我的商品的其他客户
- 查找这些客户购买的所有其他商品
- 根据以下条件对商品进行排名购买计数
- 返回上一步中的前 N 件商品
我正在处理以下查询:
declare
@customerid nchar(5),
@productid int;
set @customerid = 'ALFKI';
set @productid = 59;
-- find other products from customers who
-- also purchased my productid
select top 10
od.productid, c.categoryname, p.productname, p.unitsonorder, count(od.productid)
from
[order details] od
inner join orders o on o.orderid = od.orderid
inner join products p on p.productid = od.productid
inner join categories c on c.categoryid = p.categoryid
where
o.customerid <> @customerid and
od.productid <> @productid and
p.discontinued = 0
group by
od.productid, c.categoryname, p.productname, p.unitsonorder
order by 5 desc,4 desc
我认为下一步是分解查询,以便我可以根据最近的购买(而不是所有历史购买)进行过滤并限制客户匹配 N 个客户,而不是所有购买我产品的客户。任何人都可以提供任何指示吗?我的方向正确吗?我应该采取完全不同的方向吗?
在这一点上,我的目标是性能而不是准确性,因为我知道我还没有经验来应用算法来充分发挥其作用。我只是想应用这个概念。一旦我对自己的理解足够满意,我打算使用更真实的客户数据针对更大的数据库测试此查询。
I'm reading the book Programming Collective Intelligence and trying to apply what I've learned against the Northwind database. Though I'm not yet confident in my understanding of the algorithms presented, I'm starting to get an idea of the general concepts.
Using the Northwind database, I'm attempting to show a list of "customers who bought this also bought XYZ" using the following pseudo-logic:
- Find other customers who also purchased my item
- Find all other items purchased by those customers
- Rank items based on purchase count
- Return top N items from previous step
I'm working with the query below:
declare
@customerid nchar(5),
@productid int;
set @customerid = 'ALFKI';
set @productid = 59;
-- find other products from customers who
-- also purchased my productid
select top 10
od.productid, c.categoryname, p.productname, p.unitsonorder, count(od.productid)
from
[order details] od
inner join orders o on o.orderid = od.orderid
inner join products p on p.productid = od.productid
inner join categories c on c.categoryid = p.categoryid
where
o.customerid <> @customerid and
od.productid <> @productid and
p.discontinued = 0
group by
od.productid, c.categoryname, p.productname, p.unitsonorder
order by 5 desc,4 desc
I think my next step would be to break the query up so that I can filter based on recent purchases (as opposed to all historical purchases) and limit customer matching to N customers, rather than ALL customers who purchased my product. Can anyone provide any pointers? Am I headed in the right direction? Should I take a different direction entirely?
At this point, my goal is performance over accuracy, as I know I do not yet have the experience to apply the algorithms to their fullest benefit. I'm merely trying to apply the concept. Once I'm satisfied I understand it well enough, I intend to test this query against a larger database with more realistic customer data.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
这通常被称为“市场篮子”分析或“亲和力分组”,它并不像听起来那么简单,主要是由于大量的组合。首先要考虑的是能够导致阳性检测的最少病例数是多少。例如,假设我们有 100 万个购物篮,购买产品 A 的 10 个人也在同一个购物篮中购买了产品 B。 10个案例足以说明“买了A的人也买了B”吗?因此,使用“最近”购买的商品有点棘手。这个想法是创建一个产品对表,但如果我们有 N 个产品,组合数量为 N*(N-1),因此对于 1000 个产品的商店,我们将有 999000 个组合,因此修剪的算法是不简单。
另一件需要考虑的事情是物品的顺序和每件物品的价值。例如,购买自行车的顾客可能经常购买(添加)“LED 灯”。但是,如果将 LED 灯放入篮子中,系统是否应该在列表中提供自行车?
考虑到您正在使用 SQL Server,我建议您使用分析服务数据挖掘为此目的使用“MS关联算法”。通过使用相同的数据,您可以将您的结果与“商业解决方案”进行比较。
This is usually called "market basket" analysis or "affinity grouping" and it is not as simple as it sounds, mostly due to large number of combinations. First thing to consider is what is the minimum number of cases that would result in a positive detection. For example let's say we have 1M baskets and 10 people who bought product A also bought product B in the same basket. Is 10 cases enough to say that "people who bought A also bought B"? Due to this, using "recent" purchases is a bit tricky. The idea is to create a table of product pairs, but if we have N products, number of combinations is N*(N-1), so for a store of 1000 product, we would have 999000 combinations, so algorithm which prunes this is not simple.
One more thing to consider is the order of items and value of each one. For example customers buying bicycles may often buy (add) a "LED light". But, if one puts a LED light in a basket, should the system offer a bicycle in the list?
Considering that you are using SQL server, I would point you towards Analysis Services Data Mining which uses "MS Association Algorithm" for this purpose. By using same data, you can compare your results against a "commercial solution".