How to extract citations from a text (PHP)?

Posted on 2024-08-02 12:49:12

Hello!

I would like to extract all citations from a text. Additionally, the name of the cited person should be extracted. DayLife does this very well.

Example:

“They think it’s ‘game over,’ ” one senior administration official said.

The phrase They think it's 'game over' and the cited person one senior administration official should be extracted.

Do you think that's possible? You can only distinguish between citations and words in quotes if you check whether there's a cited person mentioned.

Example:

“I think it is serious and it is deteriorating,” Admiral Mullen said Sunday on CNN’s “State of the Union” program.

The passage State of the Union is not a quotation. But how do you detect this? a) You check if there's a cited person mentioned. b) You count the blank spaces in the supposed quotation. If there are less than 3 blank spaces it won't be a quotation, right? I would prefer b) since there's not always a cited person named.

How to start?

I would first replace all types of quotes by a single type so that you'll have to check for only one quote mark later.

<?php
$text = '';
$quote_marks = array('“', '”', '„', '»', '«');
$text = str_replace($quote_marks, '"', $text);
?>

Then I would extract all phrases between quotation marks which contain more than 3 blank spaces:

<?php
function extract_quotations($text) {
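   // preg_match_all() returns the number of matches (0 if none, false on error).
   $result = preg_match_all('/"([^"]+)"/', $text, $found_quotations);
   if ($result) {
      // TODO: check for count of blank spaces
      // $found_quotations[0] holds the matches with quote marks, [1] without.
      return $found_quotations;
   }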
   return array();
}
?>
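
The blank-space heuristic described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `normalize_quote_marks` and `extract_quotations_by_blanks` and the `$min_blanks` parameter are hypothetical, and the threshold is left configurable because the question mentions both "less than 3" and "more than 3" blank spaces.

```php
<?php
// Hypothetical helper: map the typographic double quotes from the question
// to a plain ASCII double quote.
function normalize_quote_marks($text) {
    $quote_marks = array('“', '”', '„', '»', '«');
    return str_replace($quote_marks, '"', $text);
}

// Hypothetical helper: keep only quoted phrases that contain at least
// $min_blanks blank spaces (heuristic b from the question).
function extract_quotations_by_blanks($text, $min_blanks = 3) {
    $text = normalize_quote_marks($text);
    if (!preg_match_all('/"([^"]+)"/', $text, $found)) {
        return array();
    }
    $quotations = array();
    foreach ($found[1] as $candidate) {
        $candidate = trim($candidate);
        if (substr_count($candidate, ' ') >= $min_blanks) {
            $quotations[] = $candidate;
        }
    }
    return $quotations;
}

$sample = '“I think it is serious and it is deteriorating,” Admiral Mullen said Sunday on CNN’s “State of the Union” program.';
print_r(extract_quotations_by_blanks($sample));
```

With a threshold of 3, both the real quotation and "State of the Union" (which contains exactly 3 blank spaces) are returned; only a threshold of 4 drops the programme title for this sample, so the count alone is a fragile signal.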

How could you improve this?

I hope you can help me. Thank you very much in advance!

Comments (3)

落日海湾 2024-08-09 12:49:12

As ceejayoz already pointed out, this won't fit into a single function. What you're describing in your question (detecting the grammatical function of a quoted part of a sentence, i.e. “I think it is serious and it is deteriorating,” vs. "State of the Union") would be best solved with a library that can break natural language down into tokens. I am not aware of any such library in PHP, but to get a sense of the size of such a project, have a look at what you would use in Python: http://www.nltk.org/

I think the best you can do is define a set of syntax rules that you verify manually. What about something like this:

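abstract class QuotationExtractor {

    // Every concrete extractor registers itself here when it is constructed.
    protected static $instances = array();

    // Runs $string through all registered extractors and merges their results.
    public static function getAllPossibleQuotations($string) {
        $possibleQuotations = array();
        foreach (self::$instances as $instance) {
            $possibleQuotations = array_merge(
                $possibleQuotations,
                $instance->extractQuotations($string)
            );
        }
        return $possibleQuotations;
    }

    public function __construct() {
        self::$instances[] = $this;
    }

    // Returns a list of array('quote' => ..., 'cited' => ...) candidates.
    abstract public function extractQuotations($string);

}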

class RegexExtractor extends QuotationExtractor {

    // Each rule is array(regex, index of the quote group, index of the speaker group).
    protected $rules = array();

    public function extractQuotations($string) {
        $quotes = array();
        foreach ($this->rules as $rule) {
            // PREG_SET_ORDER yields one array per match, holding all of its groups.
            preg_match_all($rule[0], $string, $matches, PREG_SET_ORDER);
            foreach ($matches as $match) {
                $quotes[] = array(
                    'quote' => trim($match[$rule[1]]),
                    'cited' => trim($match[$rule[2]])
                );
            }
        }
        return $quotes;
    }

    public function addRule($regex, $quoteIndex, $authorIndex) {
        $this->rules[] = array($regex, $quoteIndex, $authorIndex);
    }

}

$regexExtractor = new RegexExtractor();
$regexExtractor->addRule('/"(.*?)[,.]?\h*"\h*said\h*(.*?)\./', 1, 2);
$regexExtractor->addRule('/"(.*?)\h*"(.*)said/', 1, 2);
$regexExtractor->addRule('/\.\h*(.*)(once)?\h*said[\-]*"(.*?)"/', 3, 1);

class AnotherExtractor extends Quot...

If you have a structure like the above you can run the same text through any/all of them and list the possible quotations to select the correct ones. I've run the code with this thread as input for testing and the result was:

array(4) {
  [0]=>
  array(2) {
    ["quote"]=>
    string(15) "Not necessarily"
    ["cited"]=>
    string(8) "ceejayoz"
  }
  [1]=>
  array(2) {
    ["quote"]=>
    string(28) "They think it's `game over,'"
    ["cited"]=>
    string(34) "one senior administration official"
  }
  [2]=>
  array(2) {
    ["quote"]=>
    string(46) "I think it is serious and it is deteriorating,"
    ["cited"]=>
    string(14) "Admiral Mullen"
  }
  [3]=>
  array(2) {
    ["quote"]=>
    string(16) "Not necessarily,"
    ["cited"]=>
    string(0) ""
  }
}
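
To see how a single rule behaves in isolation, here is a minimal standalone sketch. It reuses the first regex from the code above verbatim; the helper name `apply_quote_rule` and the two sample sentences are assumptions made for illustration and are not part of the classes above.

```php
<?php
// Hypothetical helper: applies one rule and returns the captured
// quote/speaker pairs, mirroring what RegexExtractor does per rule.
function apply_quote_rule($regex, $quoteIndex, $citedIndex, $text) {
    $pairs = array();
    preg_match_all($regex, $text, $matches, PREG_SET_ORDER);
    foreach ($matches as $match) {
        $pairs[] = array(
            'quote' => trim($match[$quoteIndex]),
            'cited' => trim($match[$citedIndex]),
        );
    }
    return $pairs;
}

// Rule for the word order: "<quote>," said <speaker>.
$rule = '/"(.*?)[,.]?\h*"\h*said\h*(.*?)\./';

// Matches: the speaker follows "said".
$saidFirst = apply_quote_rule($rule, 1, 2, '"Not necessarily," said ceejayoz.');

// Does not match: the speaker precedes "said", so another rule is needed.
$saidLast = apply_quote_rule($rule, 1, 2, '"I think it is serious," Admiral Mullen said Sunday.');

var_dump($saidFirst, $saidLast);
```

The second call returns an empty array, which is why the approach relies on several rules, one per sentence pattern, rather than on a single regex.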

森林散布 2024-08-09 12:49:12

If there are less than 3 blank spaces it won't be a quotation, right?

"Not necessarily," said ceejayoz.

The passage State of the Union is not a quotation. But how do you detect this? a) You check if there's a cited person mentioned. b) You count the blank spaces in the supposed quotation. If there are less than 3 blank spaces it won't be a quotation, right? I would prefer b) since there's not always a cited person named.

b) doesn't even work for this very example - there are 3 blank spaces in "State of the Union".

林空鹿饮溪 2024-08-09 12:49:12

A quotation will always end with punctuation inside the quotation marks: either a comma, to signal that the speaker's name or title follows, or sentence-ending punctuation (. ! ?).
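
A rough sketch of this punctuation rule, assuming the punctuation sits inside the quotation marks (as in American usage) and may be followed by a closing single quote or whitespace. The function names `ends_like_quotation` and `extract_punctuated_quotations` are hypothetical.

```php
<?php
// Hypothetical helper: true if the quoted text ends with a comma or
// sentence-ending punctuation, optionally followed by a closing single
// quote (straight or curly) or whitespace.
function ends_like_quotation($candidate) {
    return preg_match('/[,.!?][\s\'’]*$/u', $candidate) === 1;
}

// Hypothetical helper: normalize double quotes, then keep only the quoted
// phrases that pass the punctuation check.
function extract_punctuated_quotations($text) {
    $text = str_replace(array('“', '”', '„', '»', '«'), '"', $text);
    preg_match_all('/"([^"]+)"/', $text, $found);
    return array_values(array_filter($found[1], 'ends_like_quotation'));
}

$sample = '“I think it is serious and it is deteriorating,” Admiral Mullen said Sunday on CNN’s “State of the Union” program.';
print_r(extract_punctuated_quotations($sample));
```

For this sample only the Admiral Mullen quotation is kept, because "State of the Union" has no trailing punctuation. Text that places the punctuation outside the quotation marks would not be caught by this check.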
