Regex or LINQ: which is better for parsing?

Posted 2024-12-09 06:05:41

I am parsing a web page in a Windows Phone 7 app and need to know the better way to do it. Performance matters most. I saw an IMDb example in which the author uses regex, but I am not sure whether it would be better to use Html Agility Pack and LINQ instead.

P.S.: I have to parse the website, and it is not my website.

Comments (2)

人生戏 2024-12-16 06:05:41

You'll be best served by using Html Agility Pack and LINQ.

Parsing HTML with regex is quite unreliable.
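To make the reliability point concrete, here is a minimal sketch in Python (standard library only, so it is not the Html Agility Pack API). The sample markup, the naive pattern, and the names `LinkCollector`, `links_with_regex`, and `links_with_parser` are all assumptions made up for illustration. The naive pattern misses a link whose attributes are ordered or quoted differently and picks up one inside a comment, while a real parser handles both cases.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup: attribute order and quoting vary between links,
# and one link is commented out.
SAMPLE = """
<a href="/title/tt0000001/">First</a>
<a class="x" href='/title/tt0000002/'>Second</a>
<!-- <a href="/title/old/">Commented out</a> -->
"""

# A naive pattern that assumes href is the only attribute and is double-quoted.
NAIVE_LINK = re.compile(r'<a href="([^"]*)">')


class LinkCollector(HTMLParser):
    """Collects href values from real <a> start tags only."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def links_with_regex(html):
    """Return every href the naive pattern finds, comments included."""
    return NAIVE_LINK.findall(html)


def links_with_parser(html):
    """Return hrefs of <a> tags as seen by the tokenizing parser."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

On `SAMPLE`, the regex version returns the first link plus the commented-out one, and the parser version returns the two live links. A more elaborate pattern could cover these cases, but each new variation in the markup tends to need another patch.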

你穿错了嫁妆 2024-12-16 06:05:41

By chance I am working on a similar subject. I can't give you an authoritative answer because it is too early. To start with, I took 3 engines (referred to below as Majestic, Agility, and Regex).

Of course, there are a lot more options (I even once wrote a simple HTML viewer for Palm OS), but this seemed to be a good start.

Majestic did not offer HTML-to-text conversion, just sample code showing how to walk over the HTML string. To start with, I implemented a trivial conversion:

  • Write all text nodes
  • Convert <p> to "\n\n" and <br> to "\n"
  • Ignore everything else
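The three rules above can be sketched roughly as follows. This is Python with the standard-library `html.parser`, not the Majestic sample code, and the names `TrivialTextConverter` and `html_to_text` are hypothetical. It is only an illustration of the rules, so it shares their limitations (for example, the contents of script and style elements are written out as text).

```python
from html.parser import HTMLParser


class TrivialTextConverter(HTMLParser):
    """Applies the three rules: keep text, map <p> and <br>, ignore the rest."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; only <p> and <br> produce output.
        if tag == "p":
            self.parts.append("\n\n")
        elif tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Every text node is written unchanged.
        self.parts.append(data)


def html_to_text(html):
    """Convert an HTML string to plain text using the trivial rules."""
    converter = TrivialTextConverter()
    converter.feed(html)
    converter.close()
    return "".join(converter.parts)
```

For example, `html_to_text("<h1>Title</h1><p>One<br>two</p>")` gives `"Title\n\nOne\ntwo"`: the heading tag is ignored and only its text survives, which matches the kind of gaps reported further down.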

Then I collected a sample of 50+ HTML files and converted them using all 3 methods. I have to say that I wasn't happy with any of the methods. Two general observations:

  • Results from Majestic and Agility were remarkably similar.
  • The Regex method was an order of magnitude slower.

So I looked into the Regex code and found a nonsense loop at the bottom. After an easy optimization, the Regex method was only ~25% slower. Given that it makes more than 30 complex Regex replacements, I considered this a good result.
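For readers who have not seen this style of converter, here is a rough sketch of a replacement-chain approach in Python. The three rules and the name `regex_html_to_text` are assumptions for illustration only; they are not the 30+ rules of the code discussed above, and this is not a reconstruction of the optimization that was made. The point is just that each pattern is compiled once and applied once per document.

```python
import re

# Hypothetical rule set, applied in order. Each pattern is compiled once
# at import time rather than on every call.
RULES = [
    (re.compile(r"<\s*br\s*/?\s*>", re.IGNORECASE), "\n"),     # <br>, <br/>
    (re.compile(r"<\s*p(\s[^>]*)?>", re.IGNORECASE), "\n\n"),  # <p>, <p ...>
    (re.compile(r"<[^>]+>"), ""),                              # any other tag
]


def regex_html_to_text(html):
    """Run each substitution exactly once over the whole document."""
    text = html
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```

With this shape the cost grows with the number of rules times the document length, so any extra loop that re-runs substitutions multiplies the total work.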

Then I wrote a test HTML file containing all common HTML tags and a bit more. As before, Majestic and Agility performed similarly.

  • All engines ok: h1, p, tags written as text
  • All engines failed: h2+, hr, b
  • br: Regex failed, Majestic ok
  • Lists: Regex ok, Majestic failed
  • Simple 2x2 table: Regex ok, Majestic failed

There's a lot more to test, for example encoding.

At this moment I would only say that Regex seems to be the better alternative. However, none of the mentioned engines performs satisfactorily. On a positive note, tweaking these engines (particularly Majestic and Regex) is easy. Maybe the same holds true for Agility as well, but I did not look into the package deeply enough to say that.
