Get title of website via link
Notice how Google News has sources on the bottom of each article excerpt.
The Guardian - ABC News - Reuters - Bloomberg
I'm trying to imitate that.
For example, upon submitting the URL http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/
I want to return The Washington Times
How is this possible with PHP?
My answer is expanding on @AI W's answer of using the title of the page. Below is the code to accomplish what he said.
OUTPUT
As you can see, it is not exactly what Google is using, so this leads me to believe that they get a URL's hostname and match it to their own list.
http://www.washingtontimes.com/ => The Washington Times
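A minimal sketch of that hostname-to-name lookup, assuming you maintain the list yourself. The `SOURCE_NAMES` table and the `source_name_for_url()` function are hypothetical names made up for illustration, and the entries are only examples:

```php
<?php
// Hypothetical lookup table: bare hostname => display name.
// The entries are illustrative; a real list would be maintained by hand.
const SOURCE_NAMES = [
    'washingtontimes.com' => 'The Washington Times',
    'reuters.com'         => 'Reuters',
];

// Return the display name for a URL's host, or null if the host is unknown.
function source_name_for_url(string $url): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null; // not an absolute URL
    }

    $host = strtolower($host);
    // Treat "www.example.com" and "example.com" as the same source.
    if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
    }

    return SOURCE_NAMES[$host] ?? null;
}
```

Calling `source_name_for_url()` with the article URL from the question would look up `washingtontimes.com` in the table. Hosts that are not in the list yield `null`, so the caller can fall back to the page title.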
Output:
Obviously you should also implement basic error handling.
Using get_meta_tags() on the domain's home page brings back, for NYT, something which might need truncating but could be useful.
For The Washington Times, the result includes the description 'The Washington Times delivers breaking news and commentary on the issues that affect the future of our nation.'
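A rough sketch of the get_meta_tags() approach. The helper names `home_page_url()` and `page_description()` and the 80-byte default limit are assumptions made for illustration. Reading a remote URL this way also assumes `allow_url_fopen` is enabled:

```php
<?php
// Reduce any article URL to its site's home page (scheme + host).
function home_page_url(string $url): ?string
{
    $parts = parse_url($url);
    if (empty($parts['scheme']) || empty($parts['host'])) {
        return null;
    }
    return $parts['scheme'] . '://' . $parts['host'] . '/';
}

// Read the "description" meta tag from a URL or local file and truncate it.
// Note: substr() counts bytes, so a multibyte character may be cut in half.
function page_description(string $location, int $maxLength = 80): ?string
{
    $tags = @get_meta_tags($location); // false on failure
    if ($tags === false || empty($tags['description'])) {
        return null;
    }
    $description = trim($tags['description']);
    return rtrim(substr($description, 0, $maxLength));
}
```

Something like `page_description(home_page_url($articleUrl))` would combine the two, after checking that `home_page_url()` did not return `null`.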
You could fetch the contents of the URL and do a regular expression search for the content of the title element.
Or, if you don't want to use a regular expression (to match something very near the top of the document), you could use a DOMDocument object:
I leave it up to you to decide which method you like best.
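To make the comparison concrete, here is a minimal sketch of both approaches, working on an HTML string that has already been fetched. The function names `title_via_regex()` and `title_via_dom()` are made up for this example:

```php
<?php
// Regex approach: grab whatever sits between <title> and </title>.
function title_via_regex(string $html): ?string
{
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        // Decode entities such as &amp; so the result is plain text.
        return trim(html_entity_decode($m[1], ENT_QUOTES | ENT_HTML5, 'UTF-8'));
    }
    return null;
}

// DOMDocument approach: let the parser find the <title> element.
function title_via_dom(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects empty input
    }
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }
    $nodes = $doc->getElementsByTagName('title');
    return $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
}
```

The regex version is shorter and has no extension dependency beyond PCRE. The DOMDocument version is more tolerant of unusual markup but needs the DOM extension.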
PHP manual on cURL
PHP manual on Perl regex matching
And putting those two together:
I can't promise this example will work since I don't have PHP here, but it should help you get started.
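One way the cURL and preg_match() pieces might fit together, as a sketch only. The function names, timeouts, and redirect limit are assumptions chosen for illustration:

```php
<?php
// Download a page with cURL; returns null on any transfer failure.
function fetch_html(string $url): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_CONNECTTIMEOUT => 5,    // seconds
        CURLOPT_TIMEOUT        => 10,   // seconds
    ]);
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}

// Pull the text of the first <title> element out of an HTML string.
function extract_title(string $html): ?string
{
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return trim($m[1]);
    }
    return null;
}

// Fetch a URL and return its title, or null if either step fails.
function title_for_url(string $url): ?string
{
    $html = fetch_html($url);
    return $html === null ? null : extract_title($html);
}
```

Keeping the download and the parsing in separate functions means the regex can be checked against a fixed string without touching the network.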
I try to avoid regular expressions when they aren't necessary, so I have made a function, below, that gets the website title with cURL and DOMDocument.
The above returns the following: Welcome to Facebook - Log In, Sign Up or Learn More
Alternatively, you can use Simple HTML DOM Parser:
I wrote a function to handle it:
It fetches the webpage content and tries to determine the document's charset encoding, checking the possible sources from highest priority to lowest (see http://www.w3.org/TR/html4/charset.html), and then uses iconv to convert the title to UTF-8 encoding.
Get title of website via link and convert title to utf-8 character encoding:
https://gist.github.com/kisexu/b64bc6ab787f302ae838
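A minimal sketch of title extraction with conversion to UTF-8. This is not the code from the gist. It assumes the HTML and the HTTP `Content-Type` header value have already been fetched, and the function names are hypothetical. It checks the two most common sources named in the HTML 4 specification, the HTTP header first and then a META declaration, and leaves out the `charset` attribute on linking elements:

```php
<?php
// Find the declared charset: HTTP header first, then a <meta> declaration.
function detect_charset(string $contentTypeHeader, string $html): ?string
{
    // 1. e.g. "text/html; charset=ISO-8859-1" from the HTTP response
    if (preg_match('/charset\s*=\s*["\']?([\w.:-]+)/i', $contentTypeHeader, $m)) {
        return $m[1];
    }
    // 2. <meta http-equiv="Content-Type" content="...; charset=..."> or <meta charset="...">
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)/i', $html, $m)) {
        return $m[1];
    }
    return null;
}

// Extract the <title> text and convert it to UTF-8 with iconv.
function title_as_utf8(string $html, string $contentTypeHeader = ''): ?string
{
    if (!preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return null;
    }
    $title = trim($m[1]);
    // Assumption: treat pages with no declared charset as UTF-8 already.
    $charset = detect_charset($contentTypeHeader, $html) ?? 'UTF-8';
    $converted = @iconv($charset, 'UTF-8', $title);
    // If iconv does not know the charset, fall back to the raw bytes.
    return $converted === false ? $title : $converted;
}
```

The header is checked before the markup because the specification gives the HTTP `charset` parameter the highest priority.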
Simple but it takes some time:
I haven't tried the answers proposed by others here to compare performance, but you should.