Extracting specific fields from threads in a forum

Posted on 2024-10-29 19:16:45 | 269 words | 6 views | 0 comments

I am working on a data-mining project for which I need to analyse how the discussion in a forum thread progresses. I am interested in extracting information such as the time of each post, statistics about the post's author (number of posts, joining date, etc.), the text of the post, and so on.

However, when using standard scraping tools (like Scrapy in Python), I need to write regular expressions to detect these fields in the page's HTML source. As the tags vary with the type of forum, writing and maintaining regular expressions for every forum is becoming a major problem. Is there a standard bank of such regular expressions available, so that they can be selected based on the type of forum?

Or is there any other technique for extracting these fields from a forum's pages?

Comments (2)

Anonymous, 2024-11-05 19:16:45

I wrote configuration files for some of the major forum engines. Hopefully you can decipher them and infer how to parse each forum.

For vBulletin:

enclosed_section=tag:table,attributes:id;threadslist
thread=tag:a,attributes:id;REthread_title_
list_next_page=type:next_page,attributes:anchor_text;>
post=tag:div,attributes:id;REpost_message_
thread_next_page=type:next_page,attributes:anchor_text;>

enclosed_section is the element that contains the links to all the threads (a table here)
thread is where you'll find the link to each thread
list_next_page is the link to the next page of the thread list
post is the div with the post text
thread_next_page is the link to the next page of the thread
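
To make the rule format more concrete, here is a minimal Python sketch of how one of these lines could be read and applied. It is only one possible reading of the format, not the code the configuration files were written for. It assumes that each rule is `name=key:value,key:value`, that `attributes` holds `attribute;expected_value`, and that an `RE` prefix means "treat the rest as a regular expression matched from the start of the value". The helper names (`parse_rule`, `attribute_matcher`, `RuleCollector`) and the sample HTML are made up for illustration.

```python
import re
from html.parser import HTMLParser


def parse_rule(line):
    """Split 'name=key:value,key:value' into (name, {key: value}).

    Assumption: values never contain a comma.
    """
    name, _, spec = line.partition("=")
    fields = dict(part.split(":", 1) for part in spec.split(","))
    return name.strip(), fields


def attribute_matcher(spec):
    """Turn 'id;REthread_title_' into (attribute_name, predicate)."""
    attr, _, expected = spec.partition(";")
    if expected.startswith("RE"):
        # Assumption: the RE prefix marks a regex anchored at the start.
        pattern = re.compile(expected[2:])
        return attr, lambda value: pattern.match(value) is not None
    return attr, lambda value: value == expected


class RuleCollector(HTMLParser):
    """Collect the attributes of every start tag that satisfies one rule."""

    def __init__(self, fields):
        super().__init__()
        self.tag = fields["tag"]
        self.attr, self.accepts = attribute_matcher(fields["attributes"])
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        value = attr_map.get(self.attr)
        if tag == self.tag and value is not None and self.accepts(value):
            self.matches.append(attr_map)


# Made-up markup that mirrors the rule, not real forum output.
SAMPLE_HTML = """
<table id="threadslist">
  <tr><td><a id="thread_title_101" href="showthread.php?t=101">First topic</a></td></tr>
  <tr><td><a id="thread_title_102" href="showthread.php?t=102">Second topic</a></td></tr>
  <tr><td><a id="other" href="faq.php">FAQ</a></td></tr>
</table>
"""

name, fields = parse_rule("thread=tag:a,attributes:id;REthread_title_")
collector = RuleCollector(fields)
collector.feed(SAMPLE_HTML)
thread_links = [match["href"] for match in collector.matches]
print(name, thread_links)
```

With this approach the per-forum differences stay in the rule lines, and the matching code is shared. Rules that use `type:next_page` instead of `tag` would need their own handling, which this sketch leaves out.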

For Invision:

enclosed_section=tag:table,attributes:id;forum_table
thread=tag:a,attributes:class;topic_title
list_next_page=tag:a,attributes:rel;next,inside_tag_attribute:href
post=tag:div,attributes:class;post entry-content |
thread_next_page=tag:a,attributes:rel;next,inside_tag_attribute:href
post_count_section=tag:td,attributes:class;stats
post_count=tag:li,attributes:,reg_exp:(\d+) Repl

萧瑟寒风, 2024-11-05 19:16:45

You'll still have to create several approaches per forum. But, as Henley suggests, a lot of forums share their structure.

As for parsing the dates in forum threads easily, dateparser was born from this specific requirement and could be of great help.
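
As a rough illustration, the sketch below passes a timestamp string to `dateparser.parse`, which returns a `datetime` (or `None` when it cannot interpret the text). The wrapper `parse_post_time` and the fallback format are assumptions added for this example: when dateparser is not installed, it falls back to one fixed `strptime` format, which is far less flexible than the library.

```python
from datetime import datetime

try:
    import dateparser  # third-party package: pip install dateparser
except ImportError:
    dateparser = None


def parse_post_time(text):
    """Return a datetime for a post timestamp, or None if it cannot be parsed."""
    if dateparser is not None:
        # dateparser tries many formats and languages on its own.
        return dateparser.parse(text)
    # Fallback (assumption for this sketch): accept one fixed format only.
    try:
        return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None


parsed = parse_post_time("2024-10-29 19:16:45")
print(parsed)
```

The benefit over hand-written patterns is that one call can cover the different date styles forums use, so the date field usually needs no per-forum rule.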
