Load time: is it quicker to parse HTML with PHP's DOMDocument or with regular expressions?
I'm pulling images from my Flickr account to my website, and I used about nine lines of code built around preg_match_all to pull the images.
I've read several times that it is better to parse HTML through DOM.
Personally, I've found it more complicated to parse HTML through DOM. I wrote a similar function to pull the images with PHP's DOMDocument, and it's about 22 lines of code. It took a while to write, and I'm not sure what the benefit was.
The page load time is about the same with either version, so I'm not sure why I would use DOMDocument.
Does DOMDocument work faster than preg_match_all?
I'll show you my code, if you're interested (you can see how lengthy the DOMDocument code is):
// Here's the URL
$flickrGallery = 'http://www.flickr.com/photos/***/collections/***/';

// Below is the DOMDocument method
$doc = new DOMDocument();
$doc->validateOnParse = true;
$doc->loadHTMLFile($flickrGallery);
$elements = $doc->getElementById('ViewCollection')->getElementsByTagName('div');
$flickr = array();
for ($i = 0; $i < $elements->length; $i++) {
    // Only the divs that wrap a set link are of interest
    if ($elements->item($i)->hasAttribute('class') && $elements->item($i)->getAttribute('class') == 'setLinkDiv') {
        $flickr[] = array(
            'href'  => $elements->item($i)->getElementsByTagName('a')->item(0)->getAttribute('href'),
            'src'   => $elements->item($i)->getElementsByTagName('img')->item(0)->getAttribute('src'),
            'title' => $elements->item($i)->getElementsByTagName('img')->item(0)->getAttribute('alt')
        );
    }
}
$elements = NULL;
foreach ($flickr as $k => $v) {
    // The set ID is the fifth path segment of the link
    $setQuery = explode("/", $flickr[$k]['href']);
    $setQuery = $setQuery[4];
    echo '<a href="?set='.$setQuery.'"><img src="'.$flickr[$k]['src'].'" title="'.$flickr[$k]['title'].'" width=75 height=75 /></a>';
}
$flickr = NULL;

// The preg_match_all code is below
$sets = file_get_contents($flickrGallery);
preg_match_all('/(class="setLink" href="(.*?)".*?class="setThumb" src="(.*?)".*?alt="(.*?)")+/s', $sets, $sets, PREG_SET_ORDER);
foreach ($sets as $k => $v) {
    $setQuery = explode("/", $sets[$k][2]);
    $setQuery = $setQuery[4];
    echo '<a href="?set='.$setQuery.'"><img src="'.$sets[$k][3].'" title="'.$sets[$k][4].'" width=75 height=75 /></a>';
}
$sets = NULL;
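Since both versions fetch the page over the network, the download probably dominates the load time, which may be why the two feel the same. A rough way to time only the parsing step is sketched below. It assumes a made-up sample page (200 repeated blocks modeled on the class names above) rather than the real Flickr markup, the helper name timeIt is hypothetical, and the regex is a simplified form of the one above without the outer repeated group. It needs PHP 7.3 or later for hrtime.

```php
<?php
// Hypothetical sample markup modeled on the class names in the question;
// the real Flickr page may look different.
$block = '<div class="setLinkDiv"><a class="setLink" href="/photos/me/sets/1/">'
       . '<img class="setThumb" src="t.jpg" alt="Set"></a></div>';
$html = '<html><body><div id="ViewCollection">'
      . str_repeat($block, 200)
      . '</div></body></html>';

// Hypothetical helper: average wall-clock time of $fn in milliseconds.
function timeIt(callable $fn, int $runs = 50): float
{
    $start = hrtime(true); // nanoseconds
    for ($i = 0; $i < $runs; $i++) {
        $fn();
    }
    return (hrtime(true) - $start) / $runs / 1e6;
}

$regexCount = 0;
$domCount = 0;

// Regex approach: count the matched sets.
$regexMs = timeIt(function () use ($html, &$regexCount) {
    $regexCount = preg_match_all(
        '/class="setLink" href="(.*?)".*?class="setThumb" src="(.*?)".*?alt="(.*?)"/s',
        $html,
        $matches,
        PREG_SET_ORDER
    );
});

// DOM approach: parse the document and count the set wrappers with XPath.
$domMs = timeIt(function () use ($html, &$domCount) {
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    $xpath = new DOMXPath($doc);
    $domCount = $xpath->query('//div[@class="setLinkDiv"]')->length;
});

printf("regex: %d sets, %.3f ms per run\n", $regexCount, $regexMs);
printf("DOM:   %d sets, %.3f ms per run\n", $domCount, $domMs);
```

The numbers will vary by machine and PHP build, so they are only meaningful when compared against the time the HTTP request itself takes.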
Comments (2)
If you're willing to sacrifice speed for correctness, then go ahead and try to roll your own parser with regular expressions.
You say "Personally, I've found it more complicated to parse HTML through DOM." Are you optimizing for correctness of results, or how easy it is for you to write the code?
If all you want is speed and code that's not complicated, why not just use this:
or maybe just
Those run in constant time, and they're easy to understand. No problem, right?
What's that? You want accurate results? Then don't parse HTML with regular expressions.
Finally, when you're working with a parser like DOM, you're working with a piece of code that has been well-tested and debugged for years. When you're writing your own regular expressions to do the parsing, you're working with code that you're going to have to write, test and debug yourself. Why would you not want to work with the tools that many people have been using for many years? Do you think you can do a better job yourself on the fly?
I would use DOM as this is less likely to break if any small changes are made to the page.
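To illustrate the kind of small change meant here, the sketch below runs both approaches against two made-up snippets that differ only in attribute order. The markup, the function names extractWithRegex and extractWithDom, and the use of DOMXPath instead of the getElementsByTagName loop are assumptions for the example, and the regex is a simplified form of the one in the question.

```php
<?php
// Hypothetical markup modeled on the class names in the question.
$original = '<div id="ViewCollection"><div class="setLinkDiv">'
    . '<a class="setLink" href="/photos/me/sets/123/">'
    . '<img class="setThumb" src="a.jpg" alt="Holiday"></a></div></div>';

// Same content, but with the attributes written in a different order.
$changed = '<div id="ViewCollection"><div class="setLinkDiv">'
    . '<a href="/photos/me/sets/123/" class="setLink">'
    . '<img src="a.jpg" class="setThumb" alt="Holiday"></a></div></div>';

// Regex approach: depends on the exact textual order of the attributes.
function extractWithRegex(string $html): array
{
    preg_match_all(
        '/class="setLink" href="(.*?)".*?class="setThumb" src="(.*?)".*?alt="(.*?)"/s',
        $html,
        $matches,
        PREG_SET_ORDER
    );
    $sets = [];
    foreach ($matches as $m) {
        $sets[] = ['href' => $m[1], 'src' => $m[2], 'title' => $m[3]];
    }
    return $sets;
}

// DOM approach: looks at the parsed tree, so attribute order does not matter.
function extractWithDom(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $sets = [];
    foreach ($xpath->query('//div[@class="setLinkDiv"]') as $div) {
        $a = $xpath->query('.//a', $div)->item(0);
        $img = $xpath->query('.//img', $div)->item(0);
        if ($a === null || $img === null) {
            continue; // skip wrappers without a link or thumbnail
        }
        $sets[] = [
            'href'  => $a->getAttribute('href'),
            'src'   => $img->getAttribute('src'),
            'title' => $img->getAttribute('alt'),
        ];
    }
    return $sets;
}

printf("regex, original markup: %d set(s)\n", count(extractWithRegex($original)));
printf("regex, changed markup:  %d set(s)\n", count(extractWithRegex($changed)));
printf("DOM,   original markup: %d set(s)\n", count(extractWithDom($original)));
printf("DOM,   changed markup:  %d set(s)\n", count(extractWithDom($changed)));
```

With the reordered attributes the regex no longer finds the set, while the DOM version still returns it. The DOM version would of course still break if the class names or the nesting changed.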