Parsing and splitting multiple HTML pages without a clue about their structure
I have about 50 raw HTML pages that are relevant to my project. I am not sure whether their contents share a unique pattern.
I need to parse the content of all the pages and classify it based on keywords.
The keywords look like this:
'REVIEWS', 'REPORTS', 'FEEDBACK', 'DESCRIPTION', 'COMMENTS', 'SUCCESS RATES', 'FAILURE RATES'
The crawled HTML content has to be classified and mapped to the relevant keyword.
I also need to split the content and its headers out of each page for comparison.
I am using Python.
Could you suggest a way to do this? Which approach would be suitable, and how should the idea be organised?
2 Answers
This is a typical classification problem. You could use a Bayesian classifier to identify which category a page belongs to. This would also let you easily scale up the set of sites you are following.
Check out http://www.python-course.eu/text_classification_introduction.php for a general introduction. What I'd really recommend is the O'Reilly book Programming Collective Intelligence: the examples are in Python, and there is a chapter dedicated to exactly what you are trying to do. It doesn't go into significant detail, but it's enough to get you up and running.
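As a concrete starting point, here is a minimal sketch of that idea using NLTK's NaiveBayesClassifier. It assumes you have already reduced each page to plain text, and labelled_pages is a hypothetical stand-in for a small training set you hand-label yourself from some of your 50 pages:

    import nltk
    # nltk.download('punkt')  # needed once for word_tokenize

    def features(text):
        # Bag-of-words features: which words occur in the page text.
        return {word: True for word in nltk.word_tokenize(text.lower())}

    # Hypothetical hand-labelled (page_text, category) training pairs.
    labelled_pages = [
        ("I think this product is good and works quickly", 'REVIEWS'),
        ("Quarterly report with totals and outcomes", 'REPORTS'),
        # ... more hand-labelled examples per category
    ]

    train_set = [(features(text), label) for text, label in labelled_pages]
    classifier = nltk.NaiveBayesClassifier.train(train_set)

    # Classify an unseen page's text.
    print(classifier.classify(features("in my opinion it was horrible")))

With only a couple of examples per category the results will be poor; the point is the shape of the pipeline (extract text, build features, train, classify), which stays the same as you label more pages.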
If you just want to explore how to identify the pages and so on, try Weka, which is a Java-based tool. Obviously that doesn't match your Python requirement, so I'd suggest it more as a learning tool if you are interested in the general area.
If you need to classify pages by their content, I would suggest you take a look at NLTK (http://www.nltk.org/), a natural language toolkit of open source Python modules.
Don't just look for occurrences of, say, "report" in the pages: a report may or may not have "report" in its title or content. You can use NLTK to find terms related to your keywords (e.g. success rates vs. approval rates) or terms from the same word family (e.g. description vs. described), as sketched below.
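A minimal sketch of that keyword expansion with NLTK's WordNet interface and a stemmer. The helper names are mine, and note that WordNet only helps for single words it knows, so multi-word keywords like 'SUCCESS RATES' fall through unchanged:

    # Expand a keyword into related terms via WordNet synonyms, and
    # compare stems so inflected forms ("reported") still match.
    # nltk.download('wordnet')  # needed once
    from nltk.corpus import wordnet
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    def related_terms(keyword):
        terms = {keyword.lower()}
        for synset in wordnet.synsets(keyword):
            for lemma in synset.lemmas():
                terms.add(lemma.name().replace('_', ' ').lower())
        return terms

    def matches_keyword(word, keyword):
        stems = {stemmer.stem(t) for t in related_terms(keyword)}
        return stemmer.stem(word.lower()) in stems

    print(related_terms('report'))                 # includes e.g. 'account', 'study'
    print(matches_keyword('reported', 'report'))   # True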
Take a look at the pages' contents and try to define what sets them apart from the others. For instance, a page with comments will probably have expressions such as "I think that", "in my opinion" and subjective terms, usually adjectives and adverbs, like "good", "quickly", "horrible", etc. A report is unlikely to have such words in it.
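A tiny sketch of that heuristic; the marker list is my own illustrative guess, not a vetted lexicon. Such counts could serve as extra features for the classifier suggested in the other answer:

    SUBJECTIVE_MARKERS = ['i think', 'in my opinion', 'good', 'quickly', 'horrible']

    def subjectivity_score(text):
        # Count how often subjective markers appear in the page text.
        text = text.lower()
        return sum(text.count(marker) for marker in SUBJECTIVE_MARKERS)

    print(subjectivity_score("I think that it is good, really good"))  # 3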
Apart from the content, the structure of the page may also vary from category to category. If you intend to analyse that, using Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/) for parsing is probably a good idea.
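For example, here is a minimal sketch of splitting a page into (header, content) sections with Beautiful Soup, which also answers the header/content split you asked about. It assumes headings and their content sit next to each other as siblings, which real pages may not satisfy:

    from bs4 import BeautifulSoup

    def split_by_headers(html):
        # Pair each heading with the text that follows it, up to the
        # next heading, so header and content can be compared separately.
        soup = BeautifulSoup(html, 'html.parser')
        sections = []
        for header in soup.find_all(['h1', 'h2', 'h3', 'h4']):
            parts = []
            for sibling in header.find_next_siblings():
                if sibling.name in ('h1', 'h2', 'h3', 'h4'):
                    break
                parts.append(sibling.get_text(' ', strip=True))
            sections.append((header.get_text(strip=True), ' '.join(parts)))
        return sections

    html = "<h2>Reviews</h2><p>I think it is good.</p><h2>Reports</h2><p>Totals below.</p>"
    for title, body in split_by_headers(html):
        print(title, '->', body)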