Finding a company name from a URL
Given the URL of a well-known company (e.g. http://mcdonalds.com/), how would you automatically and reliably find the company name (in this case "McDonald's")?
Thanks
Edit: someone voted to close this question, so maybe I need to explain the motivation. I have a large list of company URLs, and I want to find data about each company using Google Maps. Searching Google Maps with the company name works much better than searching with the URL.
Removing 'http' and 'com' works in a lot of cases, particularly for well-known companies, but not all. I found the whois records were not very helpful.
I was hoping there was some kind of public database matching companies to URLs, but I haven't come across one so far.
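For reference, the "remove 'http' and 'com'" approach described above might look roughly like the sketch below. The function name `naive_name_from_url` is made up for illustration, and the rule it applies (drop a leading "www", drop the last label, keep what is left) is only an assumption about what such stripping would do.

```python
from urllib.parse import urlparse


def naive_name_from_url(url: str) -> str:
    """Guess a company name from the host part of a URL (hypothetical helper)."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # Drop a leading "www" if present.
    if labels and labels[0] == "www":
        labels = labels[1:]
    # Drop the final label (e.g. "com"), assuming it is the top-level domain.
    if len(labels) > 1:
        labels = labels[:-1]
    # Use the last remaining label as the name candidate.
    return labels[-1].capitalize() if labels else ""


if __name__ == "__main__":
    print(naive_name_from_url("http://mcdonalds.com/"))
    print(naive_name_from_url("http://www.bbc.co.uk/"))
```

This gives "Mcdonalds" for the first URL but "Co" for the second, which shows why the approach works in a lot of cases but not all: multi-part suffixes, spacing, and punctuation in the real name are all lost.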
Comments (7)
You would need to create your own lookup table. For the most accurate data, you would have to try to parse this information from the HTML at the URL, e.g. get the HTML page title, or look for the copyright message.
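As a rough illustration of the copyright-message idea, here is a minimal sketch that searches already-downloaded HTML for a copyright marker and takes the text after it. The regular expression and the name `copyright_holder` are assumptions for this example, not a standard; real footers vary a lot.

```python
import re

# Marker, optional year or year range, then the holder up to a delimiter.
COPYRIGHT_RE = re.compile(
    r"(?:copyright\s*(?:©|&copy;|\(c\))?|©|&copy;|\(c\))\s*"
    r"(?:\d{4}(?:\s*-\s*\d{4})?,?\s*)?"
    r"([^<.|\n]+)",
    re.IGNORECASE,
)


def copyright_holder(html: str):
    """Return the text following the first copyright marker, or None."""
    match = COPYRIGHT_RE.search(html)
    if match is None:
        return None
    return match.group(1).strip() or None


if __name__ == "__main__":
    footer = "<footer>© 2017 McDonald's. All Rights Reserved.</footer>"
    print(copyright_holder(footer))
```

Because the holder is cut at the first period, a name such as "Acme Inc." would come back as "Acme Inc", so results stored in the lookup table would still need a manual check.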
Quite probably they will have it in the <title> element. Parse this and compare it to the website's domain. If there is significant overlap, that is your match. If not, try some heuristics on the title (for example, the name is everything before >>, or similar).
If it is a larger company, you could also get lucky by looking at the NIC entry (aka Whois) for their domain.
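The title comparison could be sketched as follows, using only the Python standard library and working on HTML that has already been fetched. The separator list, the 0.6 threshold, and the names `page_title` and `name_from_title` are all assumptions chosen for the example; "significant overlap" is approximated here with `difflib.SequenceMatcher`.

```python
import re
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Common title separators: |, >>, », or a spaced dash/colon (assumed list).
SEPARATORS = re.compile(r"\s*(?:\||>>|»)\s*|\s+[-:\u2013\u2014]\s+")


class TitleParser(HTMLParser):
    """Collects the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def page_title(html: str) -> str:
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def normalize(text: str) -> str:
    """Lowercase and keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def name_from_title(html: str, domain_label: str, threshold: float = 0.6):
    """Pick the title segment that best overlaps the domain label.

    Falls back to the first segment (everything before the first
    separator) when no segment reaches the threshold.
    """
    segments = [s.strip() for s in SEPARATORS.split(page_title(html)) if s.strip()]
    if not segments:
        return None
    target = normalize(domain_label)

    def score(segment: str) -> float:
        return SequenceMatcher(None, normalize(segment), target).ratio()

    best = max(segments, key=score)
    return best if score(best) >= threshold else segments[0]


if __name__ == "__main__":
    html = "<html><head><title>McDonald's | Burgers and Fries</title></head></html>"
    print(name_from_title(html, "mcdonalds"))
```

A nice side effect of matching against the title is that the original spelling and punctuation ("McDonald's") are preserved, which the bare domain cannot give you.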
The Whois database may be of some help, though there are always edge cases that you will have to handle with more effort.
If you want to be accurate, I would say Amazon Mechanical Turk.
Another option would be to use an API, for example https://developer.tuxx.co.uk/api-overview/company-name-api. Here, you can enter a URL and it extracts the most probable company name.
Try using cURL and DOMDocument.
Take a look at the meta tag:
<meta name="author" content="McDonald's Corporation" >
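The answer refers to PHP's cURL and DOMDocument; the same idea can be sketched in Python with the standard library's `html.parser`, assuming the page has already been downloaded into a string. The class and function names below are invented for the example, and many sites do not set an author meta tag at all, so a `None` result is to be expected often.

```python
from html.parser import HTMLParser


class MetaAuthorParser(HTMLParser):
    """Records the content of the first <meta name="author"> tag."""

    def __init__(self):
        super().__init__()
        self.author = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.author is not None:
            return
        attr = dict(attrs)
        # Attribute values may be None when written without a value.
        if (attr.get("name") or "").lower() == "author":
            self.author = (attr.get("content") or "").strip() or None


def meta_author(html: str):
    """Return the author meta tag content, or None if it is absent."""
    parser = MetaAuthorParser()
    parser.feed(html)
    return parser.author


if __name__ == "__main__":
    page = '<head><meta name="author" content="McDonald\'s Corporation" ></head>'
    print(meta_author(page))
```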
You could use the whois information. There should be libraries that let you do that in a clean way. You didn't mention what type of technology you'll be using...