String search algorithms
I am trying to extract contact information from the content pages of a set of websites (thousands of them). I wanted to ask experts like you before scratching my head over it. All I need is the address, email IDs, phone numbers, and contact-person information, if available.
I think you already see the problem: yes, it is the formatting. Since there is no standard format that websites follow, it is really hard to pinpoint the exact information I need. Some websites are built with Flash "contact us" pages, and others render the contact information as images with custom fonts.
Any hints/ideas/suggestions are most welcome...
Thank you....
3 Answers
This is, as you might expect, by no means a trivial task. Here is one way of approaching it:
Use an inverted-index system such as Lucene/Solr or Sphinx to index the pages. You may need to write your own crawler/spider; Apache Nutch and other crawlers offer spidering out of the box. If the content is fairly static, download it to your local system first.
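For the crawling step, Nutch's classic one-shot crawl can be driven entirely from the shell. A minimal sketch, assuming a Nutch 1.x installation and a hypothetical seed file (the CLI has changed across Nutch releases, so check the tutorial for your version):

    # seed list: one start URL per line (file name is illustrative)
    mkdir urls
    echo "http://example.com/" > urls/seed.txt

    # Nutch 1.x one-shot crawl; -depth bounds link-following,
    # -topN caps the pages fetched per round
    bin/nutch crawl urls -dir crawl -depth 3 -topN 1000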
Once the content is indexed, you can query it for email addresses, telephone numbers, etc. by building a Boolean query such as:
    Contents:@ AND (Contents:.COM OR Contents:.NET)    // for email
    Contents:"(" OR Contents:")"                       // for telephone (parentheses)

Important: the foregoing queries should not be taken literally. You can get even fancier by using Lucene's Regex Query and Span Query, which let you build pretty sophisticated queries.
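For concreteness, here is roughly what the email query above looks like when composed programmatically with Lucene's BooleanQuery (Lucene 5+ API). The "contents" field name is an assumption, and with a typical analyzer tokens like "@" or ".com" may not survive indexing as-is, so treat this purely as a sketch of Boolean composition:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class ContactQueries {
        // Builds the email query sketched above against a hypothetical
        // "contents" field: must contain "@", and must contain ".com" or ".net".
        static Query emailQuery() {
            BooleanQuery.Builder tld = new BooleanQuery.Builder();
            tld.add(new TermQuery(new Term("contents", ".com")), Occur.SHOULD);
            tld.add(new TermQuery(new Term("contents", ".net")), Occur.SHOULD);

            BooleanQuery.Builder email = new BooleanQuery.Builder();
            email.add(new TermQuery(new Term("contents", "@")), Occur.MUST);
            email.add(tld.build(), Occur.MUST);
            return email.build();
        }
    }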
Finally, on the result pages, (a) run a result highlighter to get the snippet(s) around the query terms, and (b) run a regex over the snippets to extract the fields of interest.
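A minimal sketch of step (b), with deliberately loose patterns; real-world email and (especially) phone formats vary far more than this:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class SnippetExtractor {
        // Illustrative patterns only: a simple email shape and a
        // North American 10-digit phone shape with optional punctuation.
        private static final Pattern EMAIL =
                Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
        private static final Pattern US_PHONE =
                Pattern.compile("\\(?\\d{3}\\)?[-. ]?\\d{3}[-. ]?\\d{4}");

        public static void main(String[] args) {
            String snippet = "Contact us at info@example.com or (555) 123-4567.";
            Matcher e = EMAIL.matcher(snippet);
            while (e.find()) {
                System.out.println("email: " + e.group());
            }
            Matcher p = US_PHONE.matcher(snippet);
            while (p.find()) {
                System.out.println("phone: " + p.group());
            }
        }
    }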
If you have a North American address data set, you could run multiple validation passes: i) check extracted addresses against a mapping provider such as Bing Maps or Google Maps; as far as I know, USPS and others also offer address look-up services for a fee to validate US ZIP codes and Canadian postal codes. Or, ii) run a reverse DNS look-up for email addresses, and so on...
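On the email side, a related sanity check (a DNS MX lookup on the address's domain, rather than a reverse lookup) can be done in plain Java via JNDI's DNS provider. A sketch, assuming the JDK's built-in com.sun.jndi.dns.DnsContextFactory; it only filters out nonsense domains and says nothing about whether the mailbox itself exists:

    import java.util.Hashtable;
    import javax.naming.NamingException;
    import javax.naming.directory.Attributes;
    import javax.naming.directory.InitialDirContext;

    public class DomainCheck {
        // True if the email's domain has an MX record, i.e. some mail
        // server claims to accept mail for it.
        static boolean hasMxRecord(String email) {
            String domain = email.substring(email.indexOf('@') + 1);
            Hashtable<String, String> env = new Hashtable<>();
            env.put("java.naming.factory.initial",
                    "com.sun.jndi.dns.DnsContextFactory");
            try {
                Attributes attrs = new InitialDirContext(env)
                        .getAttributes(domain, new String[] {"MX"});
                return attrs.get("MX") != null;
            } catch (NamingException e) {
                return false; // lookup failed: treat as unverifiable
            }
        }
    }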
That should get you started... Like I said, there is no single best solution here; you will need to try multiple approaches and iterate to reach the accuracy level you desire.
Hope this helps.
Conditional Random Fields have been used for precisely this kind of task, and have been fairly successful. You can use CRF++ or the Stanford Named Entity Recognizer. Both can be invoked from the command line without you having to write any explicit code.
In short, you first need to train these algorithms by giving them examples of names, e-mail IDs, etc. from the webpages, so that they learn to recognize these things. Once the algorithms have become smart (thanks to the examples you gave them), you can run them on your data and see what you get.
Don't get scared looking at the Wikipedia page. The packages come with a lot of examples, and you should be up and running within a few hours.
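For instance, tagging a page with one of the pre-trained Stanford NER models is a one-liner; the jar and classifier file names below follow the layout of the stanford-ner download, and contact_page.txt is a hypothetical input file:

    java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
         -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz \
         -textFile contact_page.txt

The 3-class model tags PERSON, ORGANIZATION, and LOCATION out of the box; for email IDs and phone numbers you would train your own model on labeled examples, as described above.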
@Mikos is right, you will definitely need multiple approaches. Another possible tool to consider is Web-Harvest. It is a tool for harvesting Web data: it lets you collect websites and extract the data you are interested in, all driven by XML configuration files. The software has both a GUI and a command-line interface.
It lets you use text/XML manipulation techniques such as XSLT, XQuery, and regular expressions, and you can also build your own plugins. It does, however, focus mainly on HTML/XML-based websites.
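To give a flavor of those configuration files, here is a sketch along the lines of the examples that ship with Web-Harvest; the URL and XPath expression are made up, and the element names should be checked against the version you install:

    <config charset="UTF-8">
        <!-- fetch a page and normalize it to well-formed XML -->
        <var-def name="page">
            <html-to-xml>
                <http url="http://example.com/contact"/>
            </html-to-xml>
        </var-def>
        <!-- pull mailto: links out of the normalized page -->
        <var-def name="emails">
            <xpath expression="//a[starts-with(@href, 'mailto:')]/@href">
                <var name="page"/>
            </xpath>
        </var-def>
    </config>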