Suggestions for using R with SimpleDB or BigQuery, or PHP with SimpleDB

Posted on 2024-11-30 13:25:16

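I am currently developing a system that generates product recommendations like the ones on Amazon: "People who bought this also bought this..."

Current scenario:

The client's Google Analytics data is extracted and inserted into a database.
On the client's website, when a product page loads, an API call is made to get recommendations for the product being viewed.
When the API receives a product ID as a request, it looks the ID up in the database, retrieves the recommended product IDs (using association rules), and sends them back as the response.
The list of product IDs is then processed on the client side to fetch product details (image, price..) and display them on the website.
Currently I am using PHP with the gapi package and a REST API, with MySQL as storage on Amazon EC2.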
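The lookup step described above can be sketched in a few lines. This is only an illustration, assuming that "association rules" here means simple pairwise rules of the form "bought A, so also bought B" filtered by support and confidence. The function names `pair_rules` and `recommend`, the thresholds, and the sample orders are all hypothetical and not taken from the system in the question.

```python
from collections import Counter, defaultdict
from itertools import permutations


def pair_rules(orders, min_support=2, min_confidence=0.5):
    """Derive simple 'A -> B' rules from a list of orders (sets of product IDs)."""
    item_counts = Counter()
    pair_counts = Counter()
    for order in orders:
        items = set(order)
        item_counts.update(items)
        # Count every ordered pair (a, b) that appears together in one order.
        pair_counts.update(permutations(sorted(items), 2))

    rules = defaultdict(list)
    for (a, b), together in pair_counts.items():
        # Confidence of A -> B: share of orders with A that also contain B.
        confidence = together / item_counts[a]
        if together >= min_support and confidence >= min_confidence:
            rules[a].append((b, confidence))
    # Strongest rules first; ties are broken by product ID for stable output.
    for a in rules:
        rules[a].sort(key=lambda rc: (-rc[1], rc[0]))
    return dict(rules)


def recommend(rules, product_id, limit=5):
    """What the API might return for a single product page view."""
    return [b for b, _ in rules.get(product_id, [])[:limit]]


# Hypothetical order data standing in for the imported analytics records.
orders = [
    {"p1", "p2"},
    {"p1", "p2", "p3"},
    {"p1", "p3"},
    {"p2", "p4"},
]
rules = pair_rules(orders)
print(recommend(rules, "p1"))
```

Because the rules are computed once and then only read, the per-request work is a dictionary lookup, whichever storage backend ends up holding the table.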

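My question is: if I had to choose among the following options, which would be the best choice for implementing the concept above?

PHP with SimpleDB or BigQuery.
R with BigQuery.
RHIPE (R and Hadoop) with SimpleDB.
Apache Mahout.

Please help!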

影子的影子 2024-12-07 13:25:16

This isn't so easy to answer, because the constraints are fairly specialized.

The following considerations can be made, though:

BigQuery is not yet public. Thus, with a small usage base, even if you are in the preview population, it will be harder to get advice on improvement.
Each of your options asks about a modeling system and a storage system. Apache Mahout is not a storage mechanism, so it won't necessarily work on its own. I used to believe that its machine learning implementations were a pastiche of a few Google Summer of Code projects, but I've updated that view on the suggestion of a commenter. Its coverage of different algorithms still looks rather uneven and spotty, and it's not particularly clear how the components are supported or maintained. I encourage an evangelist for Mahout to address this.

This eliminates the 1st, 2nd, and 4th options.

What I don't quite get is the need for a real-time server to utilize Hadoop and RHIPE. That should be done in your batch processing for developing the recommendation models, not in real-time. I suppose you could use RHIPE as a simple one-stop front end for firing off queries.

I'd recommend using RApache instead of RHIPE, because you can get your packages and models pre-loaded. I see no advantage to using Hadoop in the front end, but it would be a very natural back end system for the model fitting.

(Update 1) Other interface options include Rserve (http://www.rforge.net/Rserve/) and possibly RStudio in server mode. There are R/PHP interfaces (see comments below), but I suspect it would be better to access R through HTTP or TCP/IP.

(Update 2) Addressing the whole process, the basic idea I see is that you could query the data from PHP and pass it to R. If you would rather query from within R, look at the link in the comments (to the OmegaHat tools) or post a new question about R & SimpleDB; I'm sure someone else on SO can give better insight into that particular connection. RApache would let you instantiate many R processes that are already prepared, with packages loaded and data in RAM, so you would only need to pass whatever data is needed for prediction. If your new data is a small vector then RApache should be fine, and that seems to be the case for the data being processed in real time.
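The split between batch model fitting and a pre-loaded serving process can be sketched as follows. This is a Python stand-in for the pattern, not RApache or Rserve code. The names `fit_and_save` and `PreloadedRecommender`, the JSON file format, and the sample table are assumptions made for illustration only.

```python
import json
import os
import tempfile


def fit_and_save(table, path):
    """Batch step: persist the fitted model (here just a lookup table) to disk."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(table, fh)


class PreloadedRecommender:
    """Serving step: load the model once, then answer many small requests."""

    def __init__(self, path):
        with open(path, encoding="utf-8") as fh:
            # Loaded a single time; stays in RAM for the life of the process.
            self._table = json.load(fh)

    def handle(self, request_body):
        """Take a small JSON request and return a JSON response string."""
        product_id = json.loads(request_body)["product_id"]
        return json.dumps({"recommended": self._table.get(product_id, [])})


# Hypothetical output of an offline batch job.
model_path = os.path.join(tempfile.mkdtemp(), "model.json")
fit_and_save({"p1": ["p2", "p3"], "p3": ["p1"]}, model_path)

server = PreloadedRecommender(model_path)
response = server.handle('{"product_id": "p1"}')
print(response)
```

Only the product ID crosses the wire on each request; the expensive part (fitting and loading) happens outside the request path, which is the property the answer attributes to a pre-loaded R process.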


蓝眸 2024-12-07 13:25:16

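If you want a real-time API that makes recommendations based on data in a database, Apache Mahout can do this directly. You would want to use ReloadFromJDBCDataModel, put a GenericItemBasedRecommender on top of it, and use the servlet-based wrapper in the examples module. It would probably take a day or two to get familiar with the code and customize it to your needs, but it is quite simple.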
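For readers unfamiliar with the item-based approach, here is a minimal plain-Python sketch of the general idea. It is not Mahout code and does not reproduce Mahout's implementation: it assumes boolean "bought or not" data, uses Tanimoto similarity between items, and scores candidates by summed similarity. The function names and the sample purchases are hypothetical.

```python
from collections import defaultdict


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of user IDs."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def recommend_for_user(purchases, user, limit=3):
    """Score unseen items by summed similarity to the items the user already has."""
    # Invert user -> items into item -> users so items can be compared.
    users_by_item = defaultdict(set)
    for u, items in purchases.items():
        for item in items:
            users_by_item[item].add(u)

    owned = purchases.get(user, set())
    scores = {}
    for candidate, candidate_users in users_by_item.items():
        if candidate in owned:
            continue
        score = sum(tanimoto(candidate_users, users_by_item[i]) for i in owned)
        if score > 0:
            scores[candidate] = score
    # Highest score first; ties are broken by item ID for stable output.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:limit]


# Hypothetical purchase history: user ID -> set of product IDs.
purchases = {
    "u1": {"p1", "p2"},
    "u2": {"p1", "p2", "p3"},
    "u3": {"p1", "p3"},
    "u4": {"p4"},
}
print(recommend_for_user(purchases, "u1"))
```

In a setup like the one the answer describes, the data model would be refreshed from the database rather than held in a literal dictionary, and the similarity measure would be a configuration choice.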

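Once you get past roughly 100M data points, you will need to consider distributed computing with Hadoop. That is somewhat more complex. Mahout also has a distributed recommender that you can customize.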
