Suggestions for using R with SimpleDB or BigQuery, or PHP with SimpleDB
I am currently working on a system that generates product recommendations like those on Amazon: "People who bought this also bought this.."
Current Scenario:
Extract the client's Google Analytics data and insert it into a database.
On the client's website, when a product page loads, an API call is made to get recommendations for the product being viewed.
When the API receives a product ID in a request, it looks in the database, retrieves the recommended product IDs (using association rules), and sends them back as the response.
The list of product IDs is then processed at the client end to fetch the product details (image, price..), which are displayed on the website.
Currently I am using PHP and MySQL with the gapi package and a REST API, with storage on Amazon EC2.
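The lookup step described above can be illustrated with a minimal sketch of pairwise association rules. This is not the asker's PHP/MySQL implementation: the function names (`mine_pair_rules`, `recommend`), the thresholds, and the sample orders are all assumptions made for illustration, and only single-item rules of the form "A -> B" are considered.

```python
from collections import Counter
from itertools import combinations


def mine_pair_rules(transactions, min_support=2, min_confidence=0.5):
    """Return {antecedent: [(consequent, confidence), ...]} for single-item rules.

    min_support is an absolute count of orders containing both items;
    both thresholds are illustrative defaults, not tuned values.
    """
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # ignore duplicate items within one order
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = {}
    for (a, b), n_ab in pair_counts.items():
        if n_ab < min_support:
            continue
        # Each frequent pair yields two candidate rules: a -> b and b -> a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = n_ab / item_counts[lhs]
            if confidence >= min_confidence:
                rules.setdefault(lhs, []).append((rhs, confidence))

    # Highest confidence first; ties broken by product ID for stable output.
    for lhs in rules:
        rules[lhs].sort(key=lambda rule: (-rule[1], rule[0]))
    return rules


def recommend(rules, product_id, limit=3):
    """Return up to `limit` recommended product IDs for the viewed product."""
    return [rhs for rhs, _ in rules.get(product_id, [])[:limit]]


# Hypothetical order data: each inner list is one purchase.
orders = [
    ["P1", "P2", "P3"],
    ["P1", "P2"],
    ["P2", "P3"],
    ["P1", "P2", "P4"],
]
rules = mine_pair_rules(orders)
print(recommend(rules, "P2"))
```

In a setup like the one described, the mining step would presumably run offline and store its output in the database, so that the API call only performs the equivalent of `recommend`.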
My question is:
If I have to choose among the following, which would be the best choice for implementing the concept described above?
PHP with SimpleDB or BigQuery.
R with BigQuery.
RHIPE (R and Hadoop) with SimpleDB.
Apache Mahout.
Please help!
Comments (2)
This isn't so easy to answer, because the constraints are fairly specialized.
The following considerations can be made, though:
As a result, this eliminates the 1st, 2nd, and 4th options.
What I don't quite get is the need for a real-time server to utilize Hadoop and RHIPE. That should be done in your batch processing for developing the recommendation models, not in real-time. I suppose you could use RHIPE as a simple one-stop front end for firing off queries.
I'd recommend using RApache instead of RHIPE, because you can get your packages and models pre-loaded. I see no advantage to using Hadoop in the front end, but it would be a very natural back end system for the model fitting.
(Update 1) Other interface options include Rserve (http://www.rforge.net/Rserve/) and possibly RStudio in server mode. There are R/PHP interfaces (see comments below), but I suspect it would be better to access R through HTTP or TCP/IP.
(Update 2) Addressing the whole process, the basic idea I see is that you could query the data from PHP and pass it to R. If you would rather query from within R, look at the link in the comments (to the OmegaHat tools) or post a new question about R and SimpleDB; I'm sure someone else on SO would be able to give better insight on this particular connection.
RApache would let you instantiate many R processes that are already prepared, with packages loaded and data in RAM, so you would only need to pass whatever data is used for prediction. If your new data is a small vector then RApache should be fine, and that seems to be the case for the data being processed in real time.
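The split suggested above, fitting the model in batch and keeping it loaded in memory for real-time requests, can be sketched roughly as follows. This is a Python illustration of the pattern only, not RApache or Rserve code; the class name `RecommendationService`, the JSON request shape, and the `BATCH_OUTPUT` contents are hypothetical.

```python
import json


class RecommendationService:
    """Holds a precomputed model in memory and answers lookups by product ID."""

    def __init__(self, model_json):
        # Parsed once at startup; each request afterwards is a dictionary lookup.
        self._model = json.loads(model_json)

    def handle(self, request_body):
        """Take a JSON request like {"product_id": "P1"} and return a JSON response."""
        request = json.loads(request_body)
        product_id = request.get("product_id")
        if product_id is None:
            return json.dumps({"error": "product_id is required"})
        recommended = self._model.get(product_id, [])
        return json.dumps({"product_id": product_id, "recommendations": recommended})


# Hypothetical output of an offline batch job (for example, one run on Hadoop).
BATCH_OUTPUT = '{"P1": ["P2"], "P2": ["P1", "P3"]}'

service = RecommendationService(BATCH_OUTPUT)
print(service.handle('{"product_id": "P2"}'))
```

The point of the design is that only the small per-request input (here a single product ID) crosses the wire, while the expensive work stays in the batch step.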
If you want a real-time API for recommendations based on data in a database, Apache Mahout does this directly. You want to use ReloadFromJDBCDataModel, put a GenericItemBasedRecommender on top of it, and use the servlet-based wrapper in the examples module. It's probably a day or two of work to get familiar with the code and customize it to your needs, but it's pretty simple.
When you get past about 100M data points, you would need to look at distributing the computation with Hadoop. That's a fair bit more complex. Mahout has a distributed recommender too, which you can customize.
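To give a feel for what an item-based recommender computes, here is a small self-contained sketch in Python. It is not Mahout's API (Mahout is a Java library) and does not reproduce its internals; it only illustrates item-to-item similarity using the Tanimoto coefficient over sets of buyers. The function names and the sample purchase data are assumptions.

```python
from collections import defaultdict


def tanimoto(users_a, users_b):
    """Tanimoto (Jaccard) coefficient between two sets of user IDs."""
    union = len(users_a | users_b)
    return len(users_a & users_b) / union if union else 0.0


def most_similar_items(purchases, item_id, how_many=2):
    """Rank other items by how much their buyers overlap with item_id's buyers."""
    users_by_item = defaultdict(set)
    for user_id, item in purchases:
        users_by_item[item].add(user_id)

    target_users = users_by_item.get(item_id, set())
    scored = [
        (other, tanimoto(target_users, users))
        for other, users in users_by_item.items()
        if other != item_id
    ]
    # Drop items with no shared buyers, then sort by similarity (ties by ID).
    scored = [(item, score) for item, score in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:how_many]


# Hypothetical (user, product) purchase records.
purchases = [
    ("u1", "P1"), ("u1", "P2"),
    ("u2", "P1"), ("u2", "P2"), ("u2", "P3"),
    ("u3", "P2"), ("u3", "P3"),
    ("u4", "P4"),
]
print(most_similar_items(purchases, "P1"))
```

This in-memory approach only suits data that fits on one machine, which matches the answer's remark that distribution becomes necessary at larger scale.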