Using regular expressions to extract variables from a plain-text form letter?

Posted on 2024-08-28 17:48:13

I'm looking for a good example of using Regular Expressions in PHP to "reverse engineer" a form letter (with a known format, of course) that has been pasted into a multiline textbox and sent to a script for processing.

So, for example, let's assume this is the original plain-text input (taken from a USDA press release):

WASHINGTON, April 5, 2010 - North
American Bison Co-Op, a New Rockford,
N.D., establishment is recalling
approximately 25,000 pounds of whole
beef heads containing tongues that may
not have had the tonsils completely
removed, which is not compliant with
regulations that require the removal
of tonsils from cattle of all ages,
the U.S. Department of Agriculture's
Food Safety and Inspection Service
(FSIS) announced today.

For clarity, the fields that are variables are highlighted below:

[pr_city=]WASHINGTON, [pr_date=]April 5, 2010 - [corp_name=]North
American Bison Co-Op
, a [corp_city=]New Rockford,
[corp_state=]N.D., establishment is recalling
approximately [amount=]25,000 pounds of [product=]whole
beef heads containing tongues that may
not have had the tonsils completely
removed
, which is not compliant with
regulations that require [reason=]the removal
of tonsils from cattle of all ages
,
the U.S. Department of Agriculture's
Food Safety and Inspection Service
(FSIS) announced today.

How could I efficiently extract the contents of the

  • pr_city
  • pr_date
  • corp_name
  • corp_city
  • corp_state
  • amount
  • product
  • reason

fields from my example?

Any help would be appreciated, thanks.

Comments (3)

娇女薄笑 2024-09-04 17:48:13

Well, a regex that works on your example could look like this (line breaks introduced to keep this beast legible, need to be removed prior to use):

/^(?P<pr_city>[^,]+), (?P<pr_date>[^-]+) - (?P<corp_name>.*?), a 
(?P<corp_city>[^,]+), (?P<corp_state>[^,]+), establishment is 
recalling approximately (?P<amount>.*?) of (?P<product>.*?), 
which is not compliant with regulations that require (?P<reason>.*?), 
the U\.S\. Department of Agriculture\'s Food Safety and Inspection 
Service \(FSIS\) announced today\.$/

So, in PHP you could do

if (preg_match('/^(?P<pr_city>[^,]+), (?P<pr_date>[^-]+) - (?P<corp_name>.*?), a (?P<corp_city>[^,]+), (?P<corp_state>[^,]+), establishment is recalling approximately (?P<amount>.*?) of (?P<product>.*?), which is not compliant with regulations that require (?P<reason>.*?), the U\.S\. Department of Agriculture\'s Food Safety and Inspection Service \(FSIS\) announced today\.$/', $subject, $regs)) {
    // Each named group is available under its name in $regs.
    $prcity = $regs['pr_city'];
    $prdate = $regs['pr_date'];
    // ... etc. for the remaining named groups
} else {
    // The input did not match the expected form letter.
    $result = "";
}

This assumes a couple of things, for instance that there are no line breaks, and that the input is the entire string (and not a larger string from which this part has to be extracted). I've tried to make sensible assumptions about legal values, but there is a very real chance that other inputs could break this, so more test cases are probably needed.
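Since the question's input comes from a multiline textbox, the "no line breaks" assumption can be met by collapsing whitespace before matching. The following is a minimal sketch of that idea, not a definitive implementation: the function name `extract_recall_fields()` is hypothetical, and the pattern is the one from this answer, split across lines with string concatenation so it stays legible without manual editing.

```php
<?php
// Hypothetical helper (not from the original answer): returns the named
// fields as an array, or null when the text does not match the form letter.
function extract_recall_fields(string $raw): ?array
{
    // Collapse every run of whitespace, including the textbox's line breaks,
    // into one space so the pattern can assume a single-line input.
    $subject = trim(preg_replace('/\s+/', ' ', $raw) ?? '');

    // Same pattern as above; concatenation only affects the source layout.
    $pattern = '/^(?P<pr_city>[^,]+), (?P<pr_date>[^-]+) - (?P<corp_name>.*?), a '
        . '(?P<corp_city>[^,]+), (?P<corp_state>[^,]+), establishment is '
        . 'recalling approximately (?P<amount>.*?) of (?P<product>.*?), '
        . 'which is not compliant with regulations that require (?P<reason>.*?), '
        . 'the U\.S\. Department of Agriculture\'s Food Safety and Inspection '
        . 'Service \(FSIS\) announced today\.$/';

    if (preg_match($pattern, $subject, $regs) !== 1) {
        return null;
    }

    // preg_match stores each named group twice (by name and by index);
    // keep only the named entries.
    return array_filter($regs, 'is_string', ARRAY_FILTER_USE_KEY);
}

$raw = <<<'TXT'
WASHINGTON, April 5, 2010 - North
American Bison Co-Op, a New Rockford,
N.D., establishment is recalling
approximately 25,000 pounds of whole
beef heads containing tongues that may
not have had the tonsils completely
removed, which is not compliant with
regulations that require the removal
of tonsils from cattle of all ages,
the U.S. Department of Agriculture's
Food Safety and Inspection Service
(FSIS) announced today.
TXT;

$fields = extract_recall_fields($raw);
print_r($fields);
```

Normalizing first also keeps the captured values free of embedded newlines, which would otherwise end up inside multi-word fields such as `corp_name`.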

甲如呢乙后呢 2024-09-04 17:48:13

If the surrounding text is constant, then something like this partial regex could do the trick:

preg_match('/^(.*?), (.*?) - (.*?), a (.*?), (.*?), establishment is recalling approximately (.*?), which is not compliant with regulations that require (.*?), the U\.S\. Department of Agriculture\'s Food Safety and Inspection Service \(FSIS\) announced today\./', $text, $matches);

// For the sample input (with line breaks collapsed to spaces):
// $matches[1] is 'WASHINGTON'
// $matches[2] is 'April 5, 2010'
// $matches[3] ... etc...

If the surrounding text changes, then you're going to end up with a ton of false matches, no matches, etc. Essentially you'd need an AI to parse and understand press releases.
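One way to catch some false matches is to sanity-check the captured values after the regex has run. This is only a sketch under assumptions that are not in the answer: the function name `validate_fields()` is hypothetical, the date is assumed to look like "April 5, 2010", and the amount is assumed to be a number followed by a single unit word.

```php
<?php
// Hypothetical validator: returns a list of problems, empty when the
// captured values look plausible.
function validate_fields(array $fields): array
{
    $problems = [];

    // 'F j, Y' = full month name, day without leading zero, four-digit year.
    if (DateTime::createFromFormat('F j, Y', $fields['pr_date'] ?? '') === false) {
        $problems[] = 'pr_date does not look like "April 5, 2010"';
    }

    // Assumed shape: digits with optional commas, whitespace, one unit word.
    if (preg_match('/^\d[\d,]*\s+\w+$/', $fields['amount'] ?? '') !== 1) {
        $problems[] = 'amount is not a number followed by a unit';
    }

    return $problems;
}

$good = validate_fields(['pr_date' => 'April 5, 2010', 'amount' => '25,000 pounds']);
$bad  = validate_fields(['pr_date' => 'North American', 'amount' => 'whole beef heads']);
```

Checks like these do not make the pattern any more tolerant of changed wording; they only flag input where the pattern matched but captured the wrong text.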

烟凡古楼 2024-09-04 17:48:13

Edit: Please disregard this crazy answer, as the other two are better. I should probably delete it, but I'm keeping it up for reference.

I have a crazy idea that just might work: build an XML string from the input by adding markups, then parse it. It might look something like this (completely untested) code:

$xml = preg_replace('/([^,]*), ([^-]*)- ...etc.../', '<pr_city>\1</pr_city><pr_date>\2</pr_date> ...etc...', $text);

Parsing the XML afterwards is a needlessly complicated process that is best left to the PHP documentation: http://www.php.net/manual/en/function.xml-parse.php

You could also consider converting it to JSON with this method, then using json_decode() to parse it. In any case, you have to think about what happens when " marks and > symbols appear in the input.

It might be easier to just match and remove one piece of the text at a time.
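A rough sketch of that "one piece at a time" idea follows. It swaps the regex for plain string search with `strpos()`, which is my substitution and not something the answer specifies; the function name `consume_fields()` and the `$steps` table are hypothetical. Each step names a field and the fixed text expected right after it; the value is whatever precedes that fixed text, and both are then removed from the front of the remaining input.

```php
<?php
// Hypothetical helper: walks the text left to right, peeling off one
// field and its trailing fixed text per step.
function consume_fields(string $text, array $steps): ?array
{
    $fields = [];
    foreach ($steps as [$name, $delimiter]) {
        $pos = strpos($text, $delimiter);
        if ($pos === false) {
            return null; // the fixed text expected after this field is missing
        }
        $fields[$name] = substr($text, 0, $pos);
        // Drop the value and its delimiter before handling the next field.
        $text = substr($text, $pos + strlen($delimiter));
    }
    return $fields;
}

// Field names come from the question; the delimiters are the constant
// parts of the sample press release.
$steps = [
    ['pr_city', ', '],
    ['pr_date', ' - '],
    ['corp_name', ', a '],
    ['corp_city', ', '],
    ['corp_state', ', establishment is recalling approximately '],
    ['amount', ' of '],
    ['product', ', which is not compliant with regulations that require '],
    ['reason', ", the U.S. Department of Agriculture's Food Safety and Inspection Service (FSIS) announced today."],
];

// Sample input with its line breaks already collapsed to single spaces.
$text = "WASHINGTON, April 5, 2010 - North American Bison Co-Op, a New Rockford, "
    . "N.D., establishment is recalling approximately 25,000 pounds of whole beef "
    . "heads containing tongues that may not have had the tonsils completely removed, "
    . "which is not compliant with regulations that require the removal of tonsils "
    . "from cattle of all ages, the U.S. Department of Agriculture's Food Safety and "
    . "Inspection Service (FSIS) announced today.";

$result = consume_fields($text, $steps);
print_r($result);
```

Because each step fails on its own, this layout makes it easy to report which part of the letter deviated from the template, at the cost of being just as sensitive to wording changes as the single regex.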
