Implementing a complex algorithm on information stored in a database

Posted 2024-09-08 20:43:06

I'm trying to figure out the best practice for implementing a complex algorithm on stored information in a relational DB.

Specifically: I want to implement a variation of the k-means algorithm (a document clustering algorithm) on a large MS SQL Server database containing TFxIDF vectors of many documents (these vectors are used as input for the algorithm).

My first thought was doing the entire thing in SQL using stored procedures, functions, views and all the other basic SQL Server tools, but then I thought maybe I should write managed code (I'm fluent in C#) that will be executed on the SQL Server.

Performance is an issue here, so I need to take that into account as well.

I would appreciate any advice on the path I should take.

Thank you!

Comments (1)

梦一生花开无言 2024-09-15 20:43:06

Performance is an issue here

It always is. When looking at this kind of code, there are two opposing trends that you have to consider:

  • Thanks to indexing, caching, and other optimization techniques, the database server is often best positioned to perform these calculations quickly. You seem to understand this.

On the other hand:

  • These calculations seldom happen in isolation. You have to take the performance of the whole server into account, and your database is typically the most heavily loaded server in your data center. It's also the hardest to scale, from both a technical and a business perspective. Technically, because you have to balance several different components, including disk, RAM, and CPU, and it's not always easy to know where your bottlenecks are. Also, these tend to be "big" machines that not many people in your organization will have experience tuning. Finally, they often don't scale out very well: you can't add another database server to share the load as easily as you could add an application server. From a business standpoint, all that technical mumbo jumbo adds up to cost. More than that, the database license itself often costs several thousand per CPU.

Take these two points together, and the best course for performance is typically to use the querying capabilities of the database to pull down just the subset of records that you really need, and maybe do some of the easier pre-processing there (the low-hanging fruit, if you will). Then finish the heavy lifting on an application server, in parallel if possible.
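
As a rough illustration of that split, here is a minimal Python sketch: one query pulls only the rows that are needed, and the clustering itself runs in application code. Everything specific in it is an assumption made for the example, not something from the question: the `doc_term(doc_id, term_id, weight)` table layout is hypothetical, the standard-library `sqlite3` module stands in for a SQL Server connection, the seeds are passed in by the caller, and the loop is plain k-means with cosine similarity rather than the asker's variation.

```python
import math
import sqlite3
from collections import defaultdict


def load_vectors(conn, min_weight=0.0):
    """Fetch sparse TF-IDF vectors as {doc_id: {term_id: weight}}.

    The WHERE clause is the "easy pre-processing" left to the database.
    Table and column names are hypothetical.
    """
    rows = conn.execute(
        "SELECT doc_id, term_id, weight FROM doc_term WHERE weight > ?",
        (min_weight,),
    )
    vectors = defaultdict(dict)
    for doc_id, term_id, weight in rows:
        vectors[doc_id][term_id] = weight
    return dict(vectors)


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vecs):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = defaultdict(float)
    for vec in vecs:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vecs) for term, weight in total.items()}


def kmeans(vectors, initial_centroids, max_iter=20):
    """Plain k-means; returns {doc_id: cluster_index}."""
    centroids = [dict(c) for c in initial_centroids]
    assignment = {}
    for _ in range(max_iter):
        # Assignment step: each document goes to its most similar centroid.
        new_assignment = {
            doc_id: max(range(len(centroids)),
                        key=lambda i: cosine(vec, centroids[i]))
            for doc_id, vec in vectors.items()
        }
        if new_assignment == assignment:
            break  # converged: no document changed cluster
        assignment = new_assignment
        # Update step: recompute each centroid from its members.
        for i in range(len(centroids)):
            members = [vectors[d] for d, c in assignment.items() if c == i]
            if members:  # an empty cluster keeps its old centroid
                centroids[i] = mean_vector(members)
    return assignment


def demo():
    """Build a tiny in-memory table and cluster it (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE doc_term (doc_id INT, term_id INT, weight REAL)")
    conn.executemany(
        "INSERT INTO doc_term VALUES (?, ?, ?)",
        [(1, 10, 0.9), (1, 11, 0.1), (2, 10, 0.8), (2, 11, 0.3),
         (3, 20, 0.7), (3, 21, 0.2), (4, 20, 0.6), (4, 21, 0.5)],
    )
    vectors = load_vectors(conn)
    conn.close()
    # Seed with documents 1 and 3 so the run is deterministic.
    return kmeans(vectors, [vectors[1], vectors[3]])
```

The assignment step treats every document independently, so it is the natural part to split across workers (for example with `concurrent.futures.ProcessPoolExecutor`) once the vectors are on the application server. Whether that beats an in-database implementation depends on how much data has to cross the wire, so it is worth measuring both.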
