Automatic summarization is a pretty complex area. Get your Java skills in order first, as well as your understanding of statistical NLP, which relies on machine learning. You can then work towards building something of substance. Evaluate your solution, and make sure you have concretely defined your measurement variables and how you went about the evaluation. Otherwise, your project is doomed to failure. This is generally considered a high-risk project for final-year undergraduate students: they often fail to get the principles right, then implement them incorrectly, and then end up with ill-defined evaluation measures that don't clearly reflect their own work. My advice would be to focus on one area of summarization rather than many, since you can have single-document and multi-document summaries. The more varied you make your project, the less likely you are to receive a good mark. Keep it focused and in depth. Evaluate other people's work, then the process you decided to take and its outcomes.
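To make "concretely defined measurement variables" a bit more tangible, here is a minimal sketch of one possible measure: the fraction of words in a human-written reference summary that also appear in the system summary. This is my own illustration, not something prescribed above; it is similar in spirit to the ROUGE-1 recall used in summarization research, and the class and method names (`OverlapEval`, `unigramRecall`) and the naive tokenizer are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: unigram recall of a candidate summary against one reference. */
final class OverlapEval {

    /** Counts lower-cased word tokens; splitting on non-letters stands in for a real tokenizer. */
    static Map<String, Integer> counts(String text) {
        Map<String, Integer> c = new HashMap<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) c.merge(t, 1, Integer::sum);
        }
        return c;
    }

    /** Fraction of reference word occurrences that the candidate also contains (clipped counts). */
    static double unigramRecall(String candidate, String reference) {
        Map<String, Integer> cand = counts(candidate);
        Map<String, Integer> ref = counts(reference);
        int matched = 0;
        int total = 0;
        for (Map.Entry<String, Integer> e : ref.entrySet()) {
            total += e.getValue();
            // A word only counts as matched as often as it occurs in the candidate.
            matched += Math.min(e.getValue(), cand.getOrDefault(e.getKey(), 0));
        }
        return total == 0 ? 0.0 : (double) matched / total;
    }

    public static void main(String[] args) {
        System.out.println(unigramRecall("the cat sat", "the cat sat on the mat"));
    }
}
```

Whatever measure you settle on, write it down this precisely before you run any experiments, and say which reference summaries you compare against and where they came from.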
Readings:
- Jurafsky's book on NLP; there is a section near the back on summarization and QA.
- Advances in Automatic Text Summarization (Inderjeet Mani, ed.) is really good.
Understand what things like term weighting, centroid-based summarization, log-likelihood ratio, coherence relations, sentence simplification, maximum marginal relevance, and redundancy are, and what a focused summary actually is.
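As a rough illustration of how three of these ideas fit together (term weighting, redundancy, and maximum marginal relevance), here is a minimal extractive sketch in Java. It is a toy under stated assumptions, not a reference implementation: raw term frequency is the term weight, cosine similarity to the whole document stands in for relevance, and the names `MmrSketch`, `summarize` and `lambda` are mine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: term-frequency sentence vectors with greedy MMR selection. */
final class MmrSketch {

    /** Raw term-frequency vector; splitting on non-letters stands in for a real tokenizer. */
    static Map<String, Double> termFrequencies(String text) {
        Map<String, Double> tf = new HashMap<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) tf.merge(t, 1.0, Double::sum);
        }
        return tf;
    }

    /** Cosine similarity between two sparse vectors. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            na += e.getValue() * e.getValue();
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    /**
     * Greedy MMR: repeatedly pick the sentence that maximizes
     * lambda * sim(sentence, document) - (1 - lambda) * max sim(sentence, already chosen).
     */
    static List<String> summarize(List<String> sentences, int k, double lambda) {
        Map<String, Double> doc = termFrequencies(String.join(" ", sentences));
        List<Map<String, Double>> vecs = new ArrayList<>();
        for (String s : sentences) vecs.add(termFrequencies(s));

        List<Integer> chosen = new ArrayList<>();
        while (chosen.size() < Math.min(k, sentences.size())) {
            int best = -1;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < sentences.size(); i++) {
                if (chosen.contains(i)) continue;
                double redundancy = 0;
                for (int j : chosen) {
                    redundancy = Math.max(redundancy, cosine(vecs.get(i), vecs.get(j)));
                }
                double score = lambda * cosine(vecs.get(i), doc) - (1 - lambda) * redundancy;
                if (score > bestScore) {
                    bestScore = score;
                    best = i;
                }
            }
            chosen.add(best);
        }
        Collections.sort(chosen); // present the picked sentences in document order
        List<String> summary = new ArrayList<>();
        for (int i : chosen) summary.add(sentences.get(i));
        return summary;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "The cat sat on the mat.",
                "The cat sat on the mat.",
                "Dogs chase balls in the park.");
        System.out.println(summarize(sentences, 2, 0.5));
    }
}
```

With `lambda` at 0.5 the repeated sentence is penalized for redundancy and the second pick is the sentence about dogs; with `lambda` at 1.0 redundancy is ignored and the duplicate is picked instead. A real system would at least drop stop words and use a better weight (tf-idf or log-likelihood ratio) than raw counts.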
You can attempt it using a supervised or an unsupervised approach, or a hybrid of the two. The linguistic approach is the safer option, which is why you have been advised to take it. Try attempting it linguistically first, then build statistical methods on top to hybridize your solution. Use it as an exercise to learn the theory and practical implications of the algorithms and to build on your knowledge, as you will no doubt have to explain and defend your project to the judging panel.
If you really have read those research papers and books, you probably know what is already known. Now it is up to you to implement that knowledge in a Java application. Or you could expand human knowledge through some innovation or invention of your own. If you do expand human knowledge, you have become a true scientist.
Please make your question more specific, in these two main areas:
Project definition: What is the goal of your project? Is the input unit a single document? A list of documents? Do you intend your program to use machine learning? What is the output? How will you measure success?
Your background knowledge: You intend to use linguistic rather than statistical methods. Do you have background in parsing natural language? In semantic representation? I think some of these questions are tough. I am asking them because I spent too much time trying to answer similar questions in the course of my studies. Once you get these sorted out, I may be able to give you some pointers. Mani's "Automatic Summarization" looks like a good start, at least the introductory chapters.
Using Lexical Chains for Text Summarization (Microsoft Research)
An analysis of different algorithms: DasMartins.2007
Most important part in the doc:
Note there are 2 different nuances to the linguistic approach:
The University of Sheffield did some work on automatic email summarising as part of the EU FASiL project a few years back.