Using regular expressions to extract variables from a plain-text form letter?

Posted 2024-08-28 17:48:13

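I'm looking for a good example of using regular expressions in PHP to "reverse engineer" a form letter (with a known format, of course) that has been pasted into a multi-line textbox and sent to a script for processing.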

For example, let's assume this is the original plain-text input (taken from a USDA press release):

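WASHINGTON, April 5, 2010 - North American Bison Co-Op, a New Rockford, ND, establishment is recalling approximately 25,000 pounds of whole beef heads containing tongues that may not have had the tonsils completely removed, which is not compliant with regulations that require the removal of tonsils from cattle of all ages, the U.S. Department of Agriculture's Food Safety and Inspection Service (FSIS) announced today.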

For clarity, the fields that are variables are marked below:

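[pr_city=]WASHINGTON, [pr_date=]April 5, 2010 - [corp_name=]North American Bison Co-Op, a [corp_city=]New Rockford, [corp_state=]ND, establishment is recalling approximately [amount=]25,000 pounds of [product=]whole beef heads containing tongues that may not have had the tonsils completely removed, which is not compliant with regulations that require [reason=]the removal of tonsils from cattle of all ages, the U.S. Department of Agriculture's Food Safety and Inspection Service (FSIS) announced today.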

How can I efficiently extract the contents of the pr_city, pr_date, corp_name, corp_city, corp_state, amount, product, and reason fields from my example?

Any help would be appreciated, thanks.


娇女薄笑 2024-09-04 17:48:13

Well, a regex that works on your example could look like this (line breaks are added only to keep this beast legible and must be removed before use):

/^(?P<pr_city>[^,]+), (?P<pr_date>[^-]+) - (?P<corp_name>.*?), a 
(?P<corp_city>[^,]+), (?P<corp_state>[^,]+), establishment is 
recalling approximately (?P<amount>.*?) of (?P<product>.*?), 
which is not compliant with regulations that require (?P<reason>.*?), 
the U\.S\. Department of Agriculture\'s Food Safety and Inspection 
Service \(FSIS\) announced today\.$/

So, in PHP you could do

if (preg_match('/^(?P<pr_city>[^,]+), (?P<pr_date>[^-]+) - (?P<corp_name>.*?), a (?P<corp_city>[^,]+), (?P<corp_state>[^,]+), establishment is recalling approximately (?P<amount>.*?) of (?P<product>.*?), which is not compliant with regulations that require (?P<reason>.*?), the U\.S\. Department of Agriculture\'s Food Safety and Inspection Service \(FSIS\) announced today\.$/', $subject, $regs)) {
    // Each named group is available under its own key.
    $prcity = $regs['pr_city'];
    $prdate = $regs['pr_date'];
    // ... etc. for the remaining named groups
} else {
    // No match: the input does not follow the expected form letter.
    $result = "";
}

This assumes a couple of things, for instance that there are no line breaks, and that the input is the entire string (and not a larger string from which this part has to be extracted). I've tried to make assumptions about legal values that make some sense, but there is a very real chance that other inputs could break this, so some more test cases are probably needed.
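To illustrate the two caveats above (keeping the pattern legible, and coping with line breaks in the pasted text), here is a minimal sketch rather than a definitive implementation. It assumes the x modifier is acceptable, which lets the pattern span several lines as long as every literal space is written as "\ ", and it collapses whitespace in the input before matching. The function name parse_recall_notice() and the sample text are made up for illustration.

```php
<?php
// Hypothetical helper, not part of the answer above.
function parse_recall_notice(string $text): ?array
{
    // Collapse line breaks and runs of whitespace into single spaces.
    $text = trim((string) preg_replace('/\s+/', ' ', $text));

    // With the x modifier, whitespace in the pattern is ignored,
    // so each literal space is written as "\ ".
    $pattern = '/^
        (?P<pr_city>[^,]+),\ (?P<pr_date>[^-]+)\ -
        \ (?P<corp_name>.*?),\ a
        \ (?P<corp_city>[^,]+),\ (?P<corp_state>[^,]+),
        \ establishment\ is\ recalling\ approximately
        \ (?P<amount>.*?)\ of\ (?P<product>.*?),
        \ which\ is\ not\ compliant\ with\ regulations\ that\ require
        \ (?P<reason>.*?),
        \ the\ U\.S\.\ Department\ of\ Agriculture\'s
        \ Food\ Safety\ and\ Inspection\ Service\ \(FSIS\)\ announced\ today\.
    $/x';

    if (!preg_match($pattern, $text, $m)) {
        return null; // the input does not follow the form letter
    }

    // Keep only the named groups, dropping the numeric duplicates.
    return array_filter($m, 'is_string', ARRAY_FILTER_USE_KEY);
}

// Sample input with line breaks, as it might arrive from a textarea.
$sample = "WASHINGTON, April 5, 2010 - North American Bison Co-Op, a New Rockford, ND,\n"
    . "establishment is recalling approximately 25,000 pounds of whole beef heads\n"
    . "containing tongues that may not have had the tonsils completely removed,\n"
    . "which is not compliant with regulations that require the removal of tonsils\n"
    . "from cattle of all ages, the U.S. Department of Agriculture's Food Safety and\n"
    . "Inspection Service (FSIS) announced today.";

$fields = parse_recall_notice($sample);
```

Returning null on failure keeps "no match" distinct from an empty field, which is easier to handle than an empty string.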


甲如呢乙后呢 2024-09-04 17:48:13

If the surrounding text is constant, then something like this partial regex could do the trick:

preg_match('/^(.*?), (.*?) - (.*?), a (.*?), (.*?), establishment is recalling approximately (.*?), which is not compliant with regulations that require (.*?), the U\.S\. Department of Agriculture\'s Food Safety and Inspection Service \(FSIS\) announced today\./', $text, $matches);

// With the example input:
// $matches[1] is 'WASHINGTON'
// $matches[2] is 'April 5, 2010'
// $matches[3] is ... etc...

If the surrounding text changes, then you're going to end up with a ton of false matches, missed matches, and so on. Essentially you'd need an AI to parse and understand press releases.
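When the surrounding text really is constant, one way to avoid hand-escaping it is to generate the pattern from the form letter itself. The following is only a sketch under that assumption: the {name} placeholder syntax and the template_to_regex() helper are invented here, and each field is captured lazily, so it will misbehave if a field value happens to contain the text that follows it.

```php
<?php
// Hypothetical helper: turns a template with {name} placeholders into a regex.
function template_to_regex(string $template): string
{
    // Split on placeholders, keeping the placeholder names in the result.
    // Even indexes hold constant text, odd indexes hold placeholder names.
    $parts = preg_split('/\{(\w+)\}/', $template, -1, PREG_SPLIT_DELIM_CAPTURE);

    $regex = '';
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Constant text: escape it so it is matched literally.
            $regex .= preg_quote($part, '/');
        } else {
            // Placeholder: capture lazily under the placeholder's name.
            $regex .= '(?P<' . $part . '>.+?)';
        }
    }
    return '/^' . $regex . '$/s';
}

$template = '{pr_city}, {pr_date} - {corp_name}, a {corp_city}, {corp_state}, '
    . 'establishment is recalling approximately {amount} of {product}, '
    . 'which is not compliant with regulations that require {reason}, '
    . 'the U.S. Department of Agriculture\'s Food Safety and Inspection Service (FSIS) announced today.';

$sample = "WASHINGTON, April 5, 2010 - North American Bison Co-Op, a New Rockford, ND, "
    . "establishment is recalling approximately 25,000 pounds of whole beef heads "
    . "containing tongues that may not have had the tonsils completely removed, "
    . "which is not compliant with regulations that require the removal of tonsils "
    . "from cattle of all ages, the U.S. Department of Agriculture's Food Safety and "
    . "Inspection Service (FSIS) announced today.";

$regex  = template_to_regex($template);
$fields = preg_match($regex, $sample, $m)
    ? array_filter($m, 'is_string', ARRAY_FILTER_USE_KEY) // named groups only
    : null;
```

Because preg_quote() handles the dots, parentheses, and hyphen in the constant text, the template can be edited as plain prose when the form letter changes.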


烟凡古楼 2024-09-04 17:48:13

Edit: Please disregard this crazy answer, as the other two are better. I should probably delete it, but I'm keeping it up for reference.

I have a crazy idea that just might work: build an XML string from the input by adding markup, then parse it. The code might look something like this (completely untested):

// $text holds the pasted letter
$xml = preg_replace('/([^,]*), ([^-]*)- ...etc.../', '<pr_city>\1</pr_city><pr_date>\2</pr_date> ...etc...', $text);

Parsing the XML afterwards is a needlessly complicated process that is best left to the PHP documentation: http://www.php.net/manual/en/function.xml-parse.php .

You could also consider converting it to JSON with this method, then using json_decode() to parse it. In any case, you have to think about what happens when " marks and > symbols appear in the input.

It might be easier to just match and remove one piece of the text at a time.
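A rough sketch of that last idea, taking one piece off the front of the text at a time, might look like the code below. It uses plain strpos()/substr() instead of a regex, and the take_until() helper and the list of delimiters are assumptions made for illustration, not something from the original post.

```php
<?php
// Hypothetical helper: cuts everything before the first $delimiter off the
// front of $rest (delimiter included) and returns the piece that was cut,
// or null when the delimiter does not occur.
function take_until(string &$rest, string $delimiter): ?string
{
    $pos = strpos($rest, $delimiter);
    if ($pos === false) {
        return null;
    }
    $piece = substr($rest, 0, $pos);
    $rest  = substr($rest, $pos + strlen($delimiter));
    return $piece;
}

// Each field is followed by the constant text that ends it.
$steps = [
    'pr_city'    => ', ',
    'pr_date'    => ' - ',
    'corp_name'  => ', a ',
    'corp_city'  => ', ',
    'corp_state' => ', establishment is recalling approximately ',
    'amount'     => ' of ',
    'product'    => ', which is not compliant with regulations that require ',
    'reason'     => ", the U.S. Department of Agriculture's Food Safety and Inspection Service (FSIS) announced today.",
];

$rest = "WASHINGTON, April 5, 2010 - North American Bison Co-Op, a New Rockford, ND, "
    . "establishment is recalling approximately 25,000 pounds of whole beef heads "
    . "containing tongues that may not have had the tonsils completely removed, "
    . "which is not compliant with regulations that require the removal of tonsils "
    . "from cattle of all ages, the U.S. Department of Agriculture's Food Safety and "
    . "Inspection Service (FSIS) announced today.";

$fields = [];
$failedAt = null; // name of the first field whose delimiter was missing
foreach ($steps as $name => $delimiter) {
    $piece = take_until($rest, $delimiter);
    if ($piece === null) {
        $failedAt = $name;
        break;
    }
    $fields[$name] = $piece;
}
```

Compared with one large pattern, this reports which field failed, which can help when a letter deviates from the expected wording.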
