I don't know where to start on this one, so hopefully you can clear up my question. I have a project in which emails will be searched for specific words/patterns and the results stored in a structured manner, something like what TripIt does.
The article states that they developed a DataMapper:
The DataMapper is responsible for taking inbound email messages
addressed to plans [at] tripit.com and transforming them from the
semi-structured format you see in your mail reader into a highly
structured XML document.
There is also a comment that states:
If you're looking to build this yourself, reading a little bit about
Wrappers and Wrapper Induction might be helpful
I Googled and read about wrapper induction, but the definition was too broad and didn't help me understand how one would go about solving such a problem.
Is there some open source project out there that does similar things?
Comments (2)
There are several ways to approach this.
The first part, getting access to the email content, I won't cover here. I'll assume you already have the text of the emails; if you don't, there are libraries that let you connect Java to a mailbox, such as Apache Camel (http://camel.apache.org/mail.html).
So now you've got the email. Then what?
One handy tool is LingPipe (http://alias-i.com/lingpipe/), which has an entity recognizer that you can populate with your own terms. Specifically, look at their extraction tutorials, which cover the dictionary extractor (http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html). With the dictionary extractor (http://alias-i.com/lingpipe/docs/api/com/aliasi/dict/ExactDictionaryChunker.html), you simply import the terms you're interested in and use it to associate labels with an email.
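To make the dictionary idea concrete, here is a minimal sketch in plain Python. It does not use LingPipe and is not how ExactDictionaryChunker is implemented; it only illustrates the general technique of exact dictionary matching. The term list, the labels, and the function names are all made up for the example.

```python
import re

# Hypothetical term dictionary: lowercase phrase -> label.
# These entries are invented for illustration only.
TERMS = {
    "united airlines": "AIRLINE",
    "hilton": "HOTEL",
    "san francisco": "CITY",
    "san francisco international airport": "AIRPORT",
}


def build_matcher(terms):
    """Compile one case-insensitive regex that matches any dictionary phrase."""
    # Longest phrases first, so the alternation prefers the longest
    # phrase that starts at a given position.
    phrases = sorted(terms, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def tag_terms(text, terms=TERMS):
    """Return (start, end, matched_text, label) for every dictionary hit."""
    matcher = build_matcher(terms)
    return [
        (m.start(), m.end(), m.group(0), terms[m.group(0).lower()])
        for m in matcher.finditer(text)
    ]


if __name__ == "__main__":
    email_text = (
        "Your United Airlines flight departs from "
        "San Francisco International Airport. Hotel: Hilton San Francisco."
    )
    for hit in tag_terms(email_text):
        print(hit)
```

A real chunker would also deal with tokenization, overlapping matches, and much larger dictionaries, which is where a library earns its keep.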
You might also find the following question helpful: Dictionary-Based Named Entity Recognition with zero edit distance: LingPipe, Lucene or what?
This is really a very broad question, but I can try to give you some general ideas, which might be enough to get started. Basically, it sounds like you're describing an elaborate parsing problem: scanning through the text and assigning meaning to specific chunks. Depending on what exactly you're looking for, you might get good mileage out of a few regular expressions to start. Things like phone numbers, email addresses, and dates have fairly standard structures that should be matchable. Other data points might benefit from indicator words: the phrase "departing from" might indicate that what follows is an address. The natural language processing community also has a large tool set available for text processing. Check out things like part-of-speech taggers and semantic analyzers if they're appropriate to what you're trying to do.
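As a rough sketch of the regular-expression and indicator-word approach, something like the following could be a starting point. The patterns are deliberately simplistic assumptions (ISO-style dates and one US-style phone format only), and the field names are hypothetical; real emails will need more rules than this.

```python
import re

# Deliberately simple patterns, chosen for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")             # e.g. 2010-03-15
PHONE_RE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")    # e.g. 415-555-0134

# Indicator phrase: capture whatever follows "departing from" up to the
# end of the line or the next period.
DEPARTING_RE = re.compile(r"departing from[:\s]+([^\n.]+)", re.IGNORECASE)


def extract_fields(text):
    """Pull a few loosely structured fields out of free text."""
    departing = DEPARTING_RE.search(text)
    return {
        "emails": EMAIL_RE.findall(text),
        "dates": DATE_RE.findall(text),
        "phones": PHONE_RE.findall(text),
        "departing_from": departing.group(1).strip() if departing else None,
    }


if __name__ == "__main__":
    sample = (
        "Contact agent@example.com or call 415-555-0134.\n"
        "Departing from: 123 Main Street, Springfield\n"
        "Date: 2010-03-15\n"
    )
    print(extract_fields(sample))
```

The resulting dictionary can then be serialized to whatever structured format you need, such as XML or JSON.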
Armed with those techniques, you can follow a basic iterative development process: For each data point in your expected output structure, define some simple rules for how to capture it. Then, run the application over a batch of test data and see which samples didn't capture that datum. Look at the samples and revise your rules to catch those samples. Repeat until the extractor reaches an acceptable level of accuracy.
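The loop described above can be sketched as a tiny harness: run the current rules over a batch of samples, list the ones that produced nothing, then add rules and run again. The samples, rule lists, and function names here are invented for illustration, and the harness only checks whether a rule fired, not whether the extracted value is correct.

```python
import re


def extract_date(text, patterns):
    """Return the first match of any pattern in order, or None if none match."""
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(0)
    return None


def find_misses(samples, patterns):
    """Return the samples for which no rule produced a value."""
    return [s for s in samples if extract_date(s, patterns) is None]


# Hypothetical test batch.
SAMPLES = [
    "Flight on 2010-03-15 at 9am",
    "Check-in: 03/16/2010",
    "Arrives March 17, 2010",
]

# First pass: one simple rule for ISO-style dates.
rules_v1 = [r"\b\d{4}-\d{2}-\d{2}\b"]
misses_v1 = find_misses(SAMPLES, rules_v1)

# Second pass: after inspecting the misses, add rules that cover them.
rules_v2 = rules_v1 + [
    r"\b\d{2}/\d{2}/\d{4}\b",
    r"\b[A-Z][a-z]+ \d{1,2}, \d{4}\b",
]
misses_v2 = find_misses(SAMPLES, rules_v2)

if __name__ == "__main__":
    print("missed with v1 rules:", misses_v1)
    print("missed with v2 rules:", misses_v2)
```

In practice you would also keep the expected value for each sample, so the harness can flag wrong extractions as well as missing ones.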
Depending on the specifics of your problem, there may be machine learning techniques that can automate much of that process for you.