Extracting specific fields from threads in a forum
I am working on a data-mining project in which I need to analyse how discussion progresses within a forum thread. I am interested in extracting information such as the time of each post, statistics about the post's author (number of posts, joining date, etc.), and the text of the post.
However, when using standard scraping tools (like Scrapy in Python), I need to write regular expressions to detect these fields in the page's HTML source. As the tags vary with the type of forum, maintaining regular expressions for every forum is becoming a major problem. Is there a standard bank of such regular expressions available, so that they can be used based on the type of forum?
Or is there any other technique for extracting these fields from a forum's page?
Comments (2)
I wrote some configuration files for some major forums. I hope you can decipher them and infer how to parse each forum.
For vBulletin:
enclosed_section is the div that contains the links to all the threads
thread is where you'll find the link to each thread
list_next_page is the link to the next page of the thread list
post is the div with the post text
thread_next_page is the link to the next page of the thread
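The configuration files themselves are not reproduced here, so the following is only a minimal sketch of how keys like these could drive a generic extractor. The key names come from the list above; every XPath expression, the sample markup, and the helper functions `thread_links`, `post_texts`, and `next_page` are hypothetical placeholders, not the real vBulletin values. It uses the standard library's `xml.etree.ElementTree`, which only accepts well-formed markup, so real forum pages would likely need a lenient HTML parser (for example Scrapy's selectors) in its place.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-forum configuration. The key names follow the answer,
# but every XPath expression is an invented placeholder.
FORUM_CONFIGS = {
    "vbulletin": {
        "enclosed_section": ".//div[@id='threads']",
        "thread": ".//a[@class='title']",
        "list_next_page": ".//a[@rel='next']",
        "post": ".//div[@class='postbody']",
        "thread_next_page": ".//a[@rel='next']",
    },
}


def thread_links(page, config):
    """Return the href of every thread link inside the enclosing section."""
    section = page.find(config["enclosed_section"])
    if section is None:
        return []
    return [a.get("href") for a in section.iterfind(config["thread"])]


def post_texts(page, config):
    """Return the stripped text of every post on a thread page."""
    return ["".join(div.itertext()).strip() for div in page.iterfind(config["post"])]


def next_page(page, key, config):
    """Return the href of the pagination link named by key, or None."""
    link = page.find(config[key])
    return link.get("href") if link is not None else None


# Invented, well-formed sample pages standing in for downloaded HTML.
LISTING = ET.fromstring("""
<html><body>
  <div id="threads">
    <a class="title" href="/t/1">First thread</a>
    <a class="title" href="/t/2">Second thread</a>
  </div>
  <a rel="next" href="/forum?page=2">Next</a>
</body></html>
""")

THREAD = ET.fromstring("""
<html><body>
  <div class="postbody">Hello <b>world</b></div>
  <div class="postbody">Second post</div>
</body></html>
""")

if __name__ == "__main__":
    cfg = FORUM_CONFIGS["vbulletin"]
    print(thread_links(LISTING, cfg))
    print(next_page(LISTING, "list_next_page", cfg))
    print(post_texts(THREAD, cfg))
```

With this layout, supporting another forum engine would mean adding one more entry to `FORUM_CONFIGS` rather than writing new extraction code.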
For Invision:
You'll still have to create several approaches per forum. But as Henley suggests, there are also a lot of forums that share their structure.
As for parsing the dates of forum posts, dateparser was born from this specific requirement and could be of great help.