I am trying to write a script that will read a remote sitemap.xml and parse the URLs within it, then load each one in turn to pre-cache them for faster browsing.
The reason behind this: the system we are developing writes DITA XML to the browser on the fly, and the first time a page is loaded the wait can be between 8 and 10 seconds. Subsequent loads can take as little as 1 second. Obviously, for a better UX, pre-cached pages are a bonus.
Every time we prepare a new publication on this server or perform any testing/patching, we have to clear the cache, so the idea is to write a script that will parse the sitemap and load each URL.
After doing a bit of reading I have decided that the best route is to use PHP and cURL. Whether this is a good idea or not I don't know. I'm more familiar with Perl, but neither PHP nor Perl is installed on the system at present, so I thought it might be nice to dip my toes in the PHP pool.
The code I have grabbed off "teh internets" so far reads the sitemap.xml and writes it to an XML file on our server, as well as displaying it in the browser. As far as I can tell, this just dumps the entire file in one go?
<?php
$ver = "Sitemap Parser version 0.2";
echo "<p><strong>" . $ver . "</strong></p>";

// Fetch the remote sitemap as a string instead of printing it directly
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://ourdomain.com/sitemap.xml;jsessionid=1j1agloz5ke7l?id=1j1agloz5ke7l');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$xml = curl_exec($ch);
curl_close($ch);

// Only save and echo the response if it parses as XML
if (@simplexml_load_string($xml)) {
    $fp = fopen('feed.xml', 'w');
    fwrite($fp, $xml);
    echo $xml;
    fclose($fp);
}
?>
Rather than dumping the entire document into a file or to the screen, it would be better to traverse the XML structure and just grab the URLs I require.
The XML is in this format:
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url>
<loc>http://ourdomain.com:80/content/en/FAMILY-201103311115/Family_FLJONLINE_FLJ_2009_07_4</loc>
<lastmod>2011-03-31T11:25:01.984+01:00</lastmod>
<changefreq>monthly</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>http://ourdomain.com:80/content/en/FAMILY-201103311115/Family_FLJONLINE_FLJ_2009_07_9</loc>
<lastmod>2011-03-31T11:25:04.734+01:00</lastmod>
<changefreq>monthly</changefreq>
<priority>1.0</priority>
</url>
I have tried using SimpleXML:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://onlineservices.letterpart.com/sitemap.xml;jsessionid=1j1agloz5ke7l?id=1j1agloz5ke7l');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$data = curl_exec($ch);
curl_close($ch);

// Parse the fetched string and pull out the first <loc>
$xml = new SimpleXMLElement($data);
$url = $xml->url->loc;
echo $url;
and this printed the first URL to the screen, which was great news!
http://ourdomain.com:80/content/en/FAMILY-201103311115/Family_FLJONLINE_FLJ_2009_07_4
My next step was to try to read all of the <loc> elements in the document, so I tried:
foreach ($xml->url) {
    $url = $xml->url->loc;
    echo $url;
}
hoping this would grab the <loc> within each <url>, but it produced nothing, and here I am stuck.
Please could someone guide me towards grabbing the children of multiple parent elements, and then the best way to load each page and cache it, which I am assuming is a simple GET?
I hope I have provided enough info. If I'm missing anything (apart from the ability to actually write PHP), please say ;-)
Thanks.
Comments (3)
You don't appear to have any variable to hold the result of the foreach:
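A minimal sketch of the fix, reusing the $xml SimpleXMLElement built in the question (the loop variable name $url_node is just illustrative):

foreach ($xml->url as $url_node) {
    // $url_node is each <url> element; its <loc> child holds the page address
    $url = (string) $url_node->loc;
    echo $url . "<br />\n";
}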
You don't need to use cURL; just use
simplexml_load_file($sitemap_URL)
... or, for anything more complex than a plain GET, use simplexml_load_string() with file_get_contents() and stream_context_create(). There is no need for DOM traversal.
Parse it into an array in one line!
As the XML description at http://www.sitemaps.org/protocol.html shows, a sitemap is a simple tree with a good array representation. You can use a JSON-based XML-to-array conversion and then traverse it with, for example,
foreach ($array['image:image'] as $r)
(inspect the structure with var_dump($array)). See also oop5.iterations. PS: you can also do a prior node selection with XPath in SimpleXML.
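A minimal sketch along these lines, combined with the simple GET the question asks about for warming the cache. The sitemap URL and the use of file_get_contents() for the warming request are assumptions, and both require allow_url_fopen to be enabled:

<?php
// Load the remote sitemap straight into a SimpleXMLElement (no cURL needed)
$sitemap_URL = 'http://ourdomain.com/sitemap.xml';   // assumed URL
$xml = simplexml_load_file($sitemap_URL);
if ($xml === false) {
    die('Could not load or parse the sitemap');
}

// Optional: the one-line conversion to a plain array mentioned above
$array = json_decode(json_encode($xml), true);
// var_dump($array);   // inspect the structure

// Walk every <url>/<loc> and issue a simple GET to warm the server-side cache
foreach ($xml->url as $url) {
    $loc = (string) $url->loc;
    echo "Priming: " . $loc . "<br />\n";
    @file_get_contents($loc);   // response body is discarded; we only want the page rendered and cached
}
?>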
You can also use the PHP Simple Large XML Parser (http://www.phpclasses.org/package/5667-PHP-Parse-XML-documents-and-return-arrays-of-elements.html), mainly for cases where the sitemap is too large.
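That package's API isn't shown here, but the same idea can be sketched with PHP's built-in XMLReader, which streams a very large sitemap instead of holding it all in memory (the sitemap URL and the warming GET are assumptions):

<?php
// Stream the sitemap with XMLReader so it never sits fully in memory
$reader = new XMLReader();
if (!$reader->open('http://ourdomain.com/sitemap.xml')) {   // assumed URL
    die('Could not open the sitemap');
}

while ($reader->read()) {
    // Each time we reach the start of a <loc> element, read its text content
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === 'loc') {
        $loc = $reader->readString();
        echo "Priming: " . $loc . "<br />\n";
        @file_get_contents($loc);   // simple GET to warm the cache
    }
}
$reader->close();
?>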