Handling duplicate data when parsing a large XML feed

Posted 2024-10-29 21:20:43

I'm writing a component which parses an XML feed with stock quotes and saves the result in a database. The problem is fairly straightforward, except that the feed cannot be read incrementally. That is, there is no way to specify that you only want the last X quote changes, or only changes newer than X minutes, say. I know the real problem is that the feed is stupid and that the provider should fix their stuff, but that is not an option at the moment.

The feed is a huge XML file which contains the provider's 100,000 most recent stock quotes. The feed is polled once every minute, and about 50-100 quotes change between polls. The rest are duplicate quotes which are read again and again and again.

During each poll of the feed, I parse all quotes (using lxml) into objects. Then, for each quote object, I check whether the quote already exists in the database. If it does, I discard it, and if it doesn't, I save it. This procedure is extremely wasteful, since only about 0.1% is new data and the rest is duplicates. To optimize it a bit, I create a lookup table by querying the database once for quotes updated in the last X hours. The quotes are unique in the database on the (last_update, stock_id) key, so this optimization reduces the number of queries by about 50%.

But there are still 50k database queries, because each remaining quote has to be checked individually for existence, which is very taxing on the database.

So what I'm looking for is ideas on how to make my feed parser faster. Maybe there is a way to diff the last fetched xml file with the new one?

Comments (2)

请止步禁区 2024-11-05 21:20:43

Are the most recent items at the top or the bottom of the feed? If they are at the top then you could stop parsing when you have seen the first item which is already present in the database.
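A minimal sketch of the early-stop idea, assuming the feed really is ordered newest first. The element name `quote`, the attributes `stock_id` and `last_update`, and the helper `new_quotes_newest_first` are made up for illustration, since the question does not show the feed format. It uses the standard library's `xml.etree.ElementTree.iterparse`; lxml offers a similar `iterparse`.

```python
import io
import xml.etree.ElementTree as ET


def new_quotes_newest_first(source, known_keys):
    """Yield (key, price_text) until the first quote whose key is already known."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "quote":  # hypothetical element name
            continue
        key = (elem.get("last_update"), elem.get("stock_id"))
        if key in known_keys:
            break  # assumption: everything after this was seen in an earlier poll
        yield key, elem.text
        elem.clear()  # release the element's content once it has been consumed


# Tiny stand-in for the real feed, newest quote first.
feed = io.BytesIO(b"""<quotes>
<quote stock_id="A" last_update="2024-01-01T10:02">10.5</quote>
<quote stock_id="B" last_update="2024-01-01T10:01">20.1</quote>
<quote stock_id="C" last_update="2024-01-01T10:00">30.0</quote>
</quotes>""")

known = {("2024-01-01T10:01", "B"), ("2024-01-01T10:00", "C")}
fresh = list(new_quotes_newest_first(feed, known))
```

With 50-100 changes per poll this would touch only the first hundred or so elements, but it is only safe if the ordering assumption actually holds for the provider's feed.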

If the most recent items come last, you could cache the quote keys, look them up in memory, and only start hitting the database once you come to a non-cached one. Or you could remember the last quote you put in the database, look for it while parsing, and only hit the database for the items after it.
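The key cache could look roughly like the sketch below. It assumes quotes arrive as `(last_update, stock_id, price)` tuples and uses an in-memory SQLite database purely as a stand-in for the real one. The table layout, the helper `save_new_quotes`, and the `INSERT OR IGNORE` safety net (SQLite syntax) are illustrative assumptions, not something taken from the question.

```python
import sqlite3


def save_new_quotes(conn, seen, quotes):
    """Insert quotes whose (last_update, stock_id) key is not cached yet."""
    fresh = [q for q in quotes if (q[0], q[1]) not in seen]
    with conn:  # one transaction for the whole batch
        conn.executemany(
            "INSERT OR IGNORE INTO quotes (last_update, stock_id, price) "
            "VALUES (?, ?, ?)",
            fresh,
        )
    seen.update((q[0], q[1]) for q in fresh)
    return len(fresh)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE quotes (last_update TEXT, stock_id TEXT, price REAL, "
    "PRIMARY KEY (last_update, stock_id))"
)
seen = set()  # in a real service this would be seeded from the database at startup

poll_1 = [("2024-01-01T10:00", "A", 10.5), ("2024-01-01T10:00", "B", 20.1)]
poll_2 = poll_1 + [("2024-01-01T10:01", "A", 10.6)]  # one changed quote

first = save_new_quotes(conn, seen, poll_1)
second = save_new_quotes(conn, seen, poll_2)
stored = conn.execute("SELECT COUNT(*) FROM quotes").fetchone()[0]
```

The second poll only sends the one changed quote to the database, in a single batched statement, instead of one existence check per quote. The set would need pruning over time if it is not allowed to grow without bound.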

迷爱 2024-11-05 21:20:43

Your problem divides into two areas: 1) how to avoid parsing what you don't need to parse, and 2) how to avoid database operations that you don't need either.

If the quotes themselves are very small, you probably won't gain much from trying to solve (1). Otherwise, you could create a filter (using XSLT or SAX, for example) that would discard the quotes that you don't care about, and then do your full DOM parse on the rest.
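As a rough illustration of the SAX-style filter, the handler below keeps only quotes newer than a cutoff and never builds a tree for the rest. The element and attribute names (`quote`, `stock_id`, `last_update`) and the class `RecentQuoteHandler` are hypothetical, and the comparison assumes ISO-style timestamps that sort correctly as strings.

```python
import xml.sax


class RecentQuoteHandler(xml.sax.ContentHandler):
    """Collect only quotes whose last_update is later than the cutoff."""

    def __init__(self, cutoff):
        super().__init__()
        self.cutoff = cutoff
        self.quotes = []
        self._current = None  # key of the quote currently being collected
        self._text = []

    def startElement(self, name, attrs):
        # Assumption: ISO timestamps, so plain string comparison orders them.
        if name == "quote" and attrs.get("last_update", "") > self.cutoff:
            self._current = (attrs["last_update"], attrs["stock_id"])
            self._text = []

    def characters(self, content):
        if self._current is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "quote" and self._current is not None:
            self.quotes.append((self._current, "".join(self._text).strip()))
            self._current = None


feed = b"""<quotes>
<quote stock_id="A" last_update="2024-01-01T10:02">10.5</quote>
<quote stock_id="B" last_update="2024-01-01T10:01">20.1</quote>
<quote stock_id="C" last_update="2024-01-01T10:00">30.0</quote>
</quotes>"""

handler = RecentQuoteHandler(cutoff="2024-01-01T10:00")
xml.sax.parseString(feed, handler)
recent = handler.quotes
```

Only the surviving quotes would then go on to the full object construction and the database check.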

To solve (2), diffing XML files can in general be tricky, because changes in whitespace in your XML document, all too common from some providers, can cause false positives, and you generally need something that analyzes the actual XML structure, not a simple textual line-by-line diff. If you don't think this will be a problem for you, there are several Stack Overflow topics you can explore, but I think they will also demonstrate that XML diffs are still a bit of a woolly area, particularly in the open source arena.
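One way to sidestep a textual diff, sketched under the assumption that each quote can be reduced to a small tuple of its fields: parse both polls, build a set of those tuples, and take the set difference. The element and attribute names and the helper `quote_set` are again hypothetical. Because the comparison works on parsed values, reformatted whitespace between polls does not show up as a change.

```python
import xml.etree.ElementTree as ET


def quote_set(xml_bytes):
    """Reduce a feed to a set of (last_update, stock_id, price_text) tuples."""
    root = ET.fromstring(xml_bytes)
    return {
        (q.get("last_update"), q.get("stock_id"), (q.text or "").strip())
        for q in root.iter("quote")  # hypothetical element name
    }


previous = b"""<quotes>
<quote stock_id="A" last_update="2024-01-01T10:00">10.5</quote>
<quote stock_id="B" last_update="2024-01-01T10:00">20.1</quote>
</quotes>"""

# Same two quotes with different whitespace and attribute order, plus one new quote.
current = b"""<quotes><quote last_update="2024-01-01T10:03" stock_id="A">
    10.7
</quote><quote last_update="2024-01-01T10:00"   stock_id="A">10.5</quote>
<quote stock_id="B" last_update="2024-01-01T10:00"> 20.1 </quote></quotes>"""

added = quote_set(current) - quote_set(previous)
```

In a polling loop the previous set could simply be kept in memory between polls rather than re-parsed from the old file.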

Another approach that could work would be to use local or distributed memory caching for speedy lookups of stuff that's already updated. You'd get the benefit of avoiding having to try and filter or diff your content, and you may readily be able to adapt your caching infrastructure for other use cases if you're building a long-term infrastructure. OTOH, creating a scalable distributed caching infrastructure is not a particularly cheap solution.
