Which is better suited for parsing: regular expressions or LINQ?
I am parsing a web page in a Windows Phone 7 app and need to know the better way to do it. Performance matters most. I saw an IMDb example in which the author uses regex, but I am not sure whether it wouldn't be better to use Html Agility Pack and LINQ.
P.S.: I have to parse a website that is not mine.
Comments (2)
You'll be best served by the Html Agility Pack and LINQ.
Parsing HTML with regex is quite unreliable.
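To illustrate why regex is fragile here, the sketch below compares a naive link pattern with a real HTML parser. It is written in Python with only the standard library rather than C# and Html Agility Pack, purely so that it runs without dependencies; the markup, the `/title/...` paths, and all function names are hypothetical.

```python
import re
from html.parser import HTMLParser

# Naive pattern: assumes href is the first attribute and uses double quotes.
NAIVE_LINK = re.compile(r'<a href="([^"]*)">')


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, whatever the attribute order or quoting."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def links_with_regex(source):
    return NAIVE_LINK.findall(source)


def links_with_parser(source):
    collector = LinkCollector()
    collector.feed(source)
    collector.close()
    return collector.links


# Hypothetical markup: same link, two equally valid spellings.
SIMPLE = '<a href="/title/one/">One</a>'
REORDERED = "<a class=\"title\" href='/title/two/'>Two</a>"
```

On `SIMPLE` both functions find the link. On `REORDERED` the regex finds nothing, while the parser still returns the path. Since the site is not under your control, its markup can change in exactly this way at any time.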
By chance, I am working on a similar subject. I can't give you an authoritative answer, as it is too early. To start with, I took 3 engines: Majestic, Html Agility Pack, and a Regex-based converter.
Of course, there are many more options (I even wrote a simple HTML viewer for Palm OS once), but this seemed like a good start.
Majestic did not offer an HTML-to-text conversion, just sample code showing how to walk over the HTML string. To start with, I implemented a trivial conversion in which one tag becomes "\n\n" and another becomes "\n".
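A trivial conversion of this kind can be sketched as follows. This is not the Majestic sample code: it uses Python's standard-library `html.parser` as a stand-in tokenizer, and the tag mapping (`p` to a blank line, `br` to a line break) is an assumption, since the tags actually used are not given.

```python
from html.parser import HTMLParser

# Assumed mapping; the tags the author actually converted are not stated.
BREAKS = {"p": "\n\n", "br": "\n"}


class TextExtractor(HTMLParser):
    """Walks the HTML and keeps text, turning selected tags into line breaks."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; in text nodes.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BREAKS:
            self.parts.append(BREAKS[tag])

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(source):
    extractor = TextExtractor()
    extractor.feed(source)
    extractor.close()
    return "".join(extractor.parts).strip()
```

For example, `html_to_text("<p>Hello<br>world</p><p>Bye &amp; thanks</p>")` yields two paragraphs separated by a blank line. Scripts, styles, tables, and lists are ignored here, which is where a real converter needs most of its work.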
Then I collected a sample of 50+ HTML files and converted them using all 3 methods. I have to say that I wasn't happy with any of them. Two general observations:
So I looked into the Regex code and found a pointless loop at the bottom. After an easy optimization, the Regex method was only ~25% slower. Given that it performs more than 30 complex Regex replacements, I considered this a good result.
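The Regex approach described above amounts to a chain of replacements applied one after another. Below is a minimal sketch of that idea in Python, with a small timing helper for comparing converters. The six rules are illustrative assumptions, not the 30+ replacements from the code I tested, and `time_converter` is a hypothetical helper.

```python
import html
import re
import time

# Patterns are compiled once, outside the conversion loop.
RULES = [
    (re.compile(r"(?is)<(script|style)\b.*?</\1\s*>"), ""),  # drop script/style blocks
    (re.compile(r"(?i)<br\s*/?>"), "\n"),                    # line break
    (re.compile(r"(?i)</p\s*>"), "\n\n"),                    # paragraph end
    (re.compile(r"(?s)<[^>]+>"), ""),                        # strip remaining tags
    (re.compile(r"[ \t]+"), " "),                            # collapse spaces and tabs
    (re.compile(r"\n{3,}"), "\n\n"),                         # collapse blank lines
]


def regex_to_text(source):
    """Applies each replacement in order, then decodes entities."""
    for pattern, replacement in RULES:
        source = pattern.sub(replacement, source)
    return html.unescape(source).strip()


def time_converter(convert, documents, repeats=5):
    """Returns the best wall-clock time in seconds for converting all documents."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for doc in documents:
            convert(doc)
        best = min(best, time.perf_counter() - start)
    return best
```

Calling `time_converter(regex_to_text, samples)` on the same list of documents for each engine gives comparable numbers. Timings from a desktop Python run say nothing about .NET on the phone, so the measurement has to be repeated on the target device.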
Then I wrote a test HTML file containing all the common HTML tags and a bit more. As before, Majestic and Agility performed similarly.
There's a lot more to test, encoding for example.
At this moment I would only say that Regex seems to be the better alternative. However, none of the engines mentioned performs satisfactorily. On the positive side, tweaking these engines (particularly Majestic and Regex) is easy. The same may hold true for Agility, but I have not looked into the package deeply enough to say.