Parsing an HTML table into a PHP array using file_get_contents
I am trying to parse the table shown here into a multi-dimensional PHP array. I am using the following code, but for some reason it's returning an empty array. After searching around on the web, I found this site, which is where I got the parseTable() function from. From reading the comments on that website, I see that the function works perfectly, so I'm assuming there is something wrong with the way I'm getting the HTML code from file_get_contents(). Any thoughts on what I'm doing wrong?
<?php
$data = file_get_contents('http://flow935.com/playlist/flowhis.HTM');

function parseTable($html)
{
    // Find the table
    preg_match("/<table.*?>.*?<\/[\s]*table>/s", $html, $table_html);

    // Get title for each row
    preg_match_all("/<th.*?>(.*?)<\/[\s]*th>/", $table_html[0], $matches);
    $row_headers = $matches[1];

    // Iterate each row
    preg_match_all("/<tr.*?>(.*?)<\/[\s]*tr>/s", $table_html[0], $matches);

    $table = array();

    foreach($matches[1] as $row_html)
    {
        preg_match_all("/<td.*?>(.*?)<\/[\s]*td>/", $row_html, $td_matches);
        $row = array();
        for($i=0; $i<count($td_matches[1]); $i++)
        {
            $td = strip_tags(html_entity_decode($td_matches[1][$i]));
            $row[$row_headers[$i]] = $td;
        }

        if(count($row) > 0)
            $table[] = $row;
    }
    return $table;
}

$output = parseTable($data);
print_r($output);
?>
I want my output array to look something like this:
1 --> 11:33AM --> DEV --> IN THE DARK
2 --> 11:29AM --> LIL' WAYNE --> SHE WILL
3 --> 11:26AM --> KARDINAL OFFISHALL --> NUMBA 1 (TIDE IS HIGH)
Comments (2)
Don't cripple yourself parsing HTML with regexps! Instead, let an HTML parser library worry about the structure of the markup for you.
I suggest you check out Simple HTML DOM (http://simplehtmldom.sourceforge.net/). It is a library written specifically to help with this kind of web scraping problem in PHP. By using such a library, you can write your scraper in far fewer lines of code without worrying about crafting working regexps.
In principle, with Simple HTML DOM you just write something like:
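A minimal sketch of that idea, not necessarily the exact code this answer had in mind. It assumes simple_html_dom.php from the Simple HTML DOM project has been downloaded next to the script; the helper name playlistRows() is hypothetical.

```php
<?php
// Assumption: simple_html_dom.php sits in the same directory as this script.
require_once 'simple_html_dom.php';

/**
 * Hypothetical helper: return every table row as a list of trimmed cell texts.
 */
function playlistRows($markup)
{
    // str_get_html() parses a string; file_get_html($url) would fetch and parse a URL.
    $html = str_get_html($markup);
    if (!$html) {
        return array(); // empty or unparsable input
    }

    $rows = array();
    foreach ($html->find('tr') as $tr) {
        $cells = array();
        foreach ($tr->find('td') as $td) {
            $cells[] = trim($td->plaintext);
        }
        if (count($cells) > 0) {
            $rows[] = $cells;
        }
    }

    $html->clear(); // release the parser's internal references
    return $rows;
}
```

It could be called as print_r(playlistRows(file_get_contents('http://flow935.com/playlist/flowhis.HTM')));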
This can then be extended to capture your data in some format, for instance to create an array of artists and corresponding titles:
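One way this might look, as a sketch under assumptions rather than the answer's own code: simple_html_dom.php is assumed to be available locally, each data row is assumed to hold position, time, artist, and title in that order (as in the desired output in the question), and artistsAndTitles() is a hypothetical name.

```php
<?php
// Assumption: simple_html_dom.php sits in the same directory as this script.
require_once 'simple_html_dom.php';

/**
 * Hypothetical helper: collect artist/title pairs from the playlist table.
 * Assumed column order: position, time, artist, title.
 */
function artistsAndTitles($markup)
{
    $html = str_get_html($markup);
    if (!$html) {
        return array();
    }

    $tracks = array();
    foreach ($html->find('tr') as $tr) {
        $cells = $tr->find('td');
        if (count($cells) < 4) {
            continue; // skip header or spacer rows
        }
        $tracks[] = array(
            'artist' => trim($cells[2]->plaintext),
            'title'  => trim($cells[3]->plaintext),
        );
    }

    $html->clear();
    return $tracks;
}
```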
We can see that this code can be (trivially) changed to reformat the data in any other way as well.
I tried simple_html_dom, but on larger files and on repeated calls to the function I get zend_mm_heap_corrupted on PHP 5.3 (GAH). I have also tried preg_match_all, but this has been failing on a larger file (5000 lines of HTML), which was only about 400 rows of my HTML table.
I am using this, and it's working fast and not spitting out errors.
This code worked well for me.
An example of the original code is here:
http://techgossipz.blogspot.co.nz/2010/02/how-to-parse-html-using-dom-with-php.html
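The title of the linked post suggests PHP's built-in DOM extension. A minimal sketch of that approach follows; it is not the code from this answer or from the linked post, the helper name tableToArray() is hypothetical, and it assumes the DOM extension is enabled.

```php
<?php
/**
 * Hypothetical helper: parse every <tr> of an HTML string into a list of
 * trimmed cell texts, using the built-in DOM extension.
 */
function tableToArray($markup)
{
    $rows = array();
    if (trim($markup) === '') {
        return $rows; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();

    // Real-world pages are rarely valid HTML; collect parser warnings
    // internally instead of printing them, then discard them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('tr') as $tr) {
        $cells = array();
        foreach ($tr->childNodes as $node) {
            // Keep only direct <td>/<th> children; skip whitespace text nodes.
            if ($node instanceof DOMElement
                && in_array($node->nodeName, array('td', 'th'), true)) {
                $cells[] = trim($node->textContent);
            }
        }
        if (count($cells) > 0) {
            $rows[] = $cells;
        }
    }
    return $rows;
}
```

It could be called as print_r(tableToArray(file_get_contents('http://flow935.com/playlist/flowhis.HTM'))); The HTML parser normalizes tag names to lower case, so markup written as <TABLE>, <TR>, <TD> is handled the same way.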