有没有可以清理内容的 PHP 类?

发布于 2024-12-11 16:20:01 字数 12713 浏览 0 评论 0原文

我一直在尝试使用一系列正则表达式和 PHP 函数 preg_replace 在 PHP 中编写。

我的主要目标是整理内容,例如确保句子的开头是大写字母;逗号后面有一个空格;等等。

我试图实现的整理的一些例子:

// Remove any spaces around slashes
$content_replacements_from[] = "/\s*\/\s*/";
$content_replacements_to[] = "/";

// Remove any new lines or tabs
$content_replacements_from[] = "/[\r\n\t]/";
$content_replacements_to[] = " ";

// Remove any extra spaces
$content_replacements_from[] = "/\s{2,}/";
$content_replacements_to[] = " ";

// Tidy up joined full stops
$content_replacements_from[] = "/([a-zA-Z]{1})\s*[\.]{1}\s*([^(jpeg|jpg|png|pdf|gif|doc|xls|docx|xlsx|ppt|pptx|html|php|htm)]{1})/";
$content_replacements_to[] = "$1. $2";

// Tidy up joined commas
$content_replacements_from[] = "/([a-zA-Z0-9]{1})\s*[\,]{1}\s*([a-zA-Z0-9]{1})/";
$content_replacements_to[] = "$1, $2";

// Tidy up joined exclamation marks
$content_replacements_from[] = "/([a-zA-Z0-9]{1})\s*[\!]{1}\s*([a-zA-Z0-9]{1})/";
$content_replacements_to[] = "$1! $2";

// Tidy up joined question marks
$content_replacements_from[] = "/([a-zA-Z0-9]{1})\s*[\?]{1}\s*([a-zA-Z0-9]{1})/";
$content_replacements_to[] = "$1? $2";

// Tidy up joined semi colons
$content_replacements_from[] = "/([a-zA-Z0-9]{1})\s*[\;]{1}\s*([a-zA-Z0-9]{1})/";
$content_replacements_to[] = "$1; $2";

// Tidy up joined colons
$content_replacements_from[] = "/([a-zA-Z0-9]{1})\s*[\:]{1}\s*([a-zA-Z0-9]{1})/";
$content_replacements_to[] = "$1: $2";

// Tidy up fluid ounces
$content_replacements_from[] = "/[Ff]{1}[Ll]{1}.?\s?[Oo]{1}[Zz]{1}/";
$content_replacements_to[] = "fl oz";

// Tidy up rpm
$content_replacements_from[] = "/[Rr]{1}[Pp]{1}[Mm]{1}/";
$content_replacements_to[] = "rpm";

// Tidy up UK
$content_replacements_from[] = "/[Uu]{1}[Kk]{1}/";
$content_replacements_to[] = "UK";

// Tidy up Maxi-sense
$content_replacements_from[] = "/[Mm]{1}axi[\s\-]?[Ss]{1}ense/";
$content_replacements_to[] = "maxi-sense";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Mm]{1}axi[\s\-]?[Ss]{1}ense/";
$content_replacements_to[] = ". Maxi-sense";
$content_replacements_from[] = "/^[Mm]{1}axi[\s\-]?[Ss]{1}ense/";
$content_replacements_to[] = "Maxi-sense";

// Tidy up Side-by-side
$content_replacements_from[] = "/[Ss]{1}ide[\s\-]?[Bb]{1}y[\s\-]?[Ss]{1}ide/";
$content_replacements_to[] = "side-by-side";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Ss]{1}ide[\s\-]?[Bb]{1}y[\s\-]?[Ss]{1}ide/";
$content_replacements_to[] = ". Side-by-side";
$content_replacements_from[] = "/^[Ss]{1}ide[\s\-]?[Bb]{1}y[\s\-]?[Ss]{1}ide/";
$content_replacements_to[] = "Side-by-side";

// Tidy up extra large
$content_replacements_from[] = "/[Xx]{1}[Ll]{l}/";
$content_replacements_to[] = "extra large";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Xx]{1}[Ll]{l}/";
$content_replacements_to[] = "Extra large";
$content_replacements_from[] = "/^[Xx]{1}[Ll]{l}/";
$content_replacements_to[] = "Extra large";

// Tidy up D-radius
$content_replacements_from[] = "/[Dd]{1}[\s\-]?[Rr]{1}adius/";
$content_replacements_to[] = "D-radius";

// Tidy up A-rate
$content_replacements_from[] = "/[Aa]{1}[\s\-]?[Rr]{1}ate/";
$content_replacements_to[] = "A-rate";

// Tidy up In-column
$content_replacements_from[] = "/[Ii]{1}n[\s\-]?[Cc]{1}olum[n]?/";
$content_replacements_to[] = "in-column";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Ii]{1}n[\s\-]?[Cc]{1}olum[n]?/";
$content_replacements_to[] = "In-column";
$content_replacements_from[] = "/^[Ii]{1}n[\s\-]?[Cc]{1}olum[n]?/";
$content_replacements_to[] = "In-column";

// Tidy up kW
$content_replacements_from[] = "/[Kk]{1}[Ww]{1}/";
$content_replacements_to[] = "kW";

// Tidy up Built-in
$content_replacements_from[] = "/[Bb]{1}uilt[\s\-]?[Ii]{1}n/";
$content_replacements_to[] = "built-in";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Bb]{1}uilt[\s\-]?[Ii]{1}n/";
$content_replacements_to[] = "Built-in";
$content_replacements_from[] = "/^[Bb]{1}uilt[\s\-]?[Ii]{1}n/";
$content_replacements_to[] = "Built-in";

// Tidy up Built-under
$content_replacements_from[] = "/[Bb]{1}uilt[\s\-]?[Uu]{1}nder/";
$content_replacements_to[] = "built-under";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Bb]{1}uilt[\s\-]?[Uu]{1}nder/";
$content_replacements_to[] = "Built-under";
$content_replacements_from[] = "/^[Bb]{1}uilt[\s\-]?[Uu]{1}nder/";
$content_replacements_to[] = "Built-under";

// Tidy up Under-counter
$content_replacements_from[] = "/[Uu]{1}nder[\s\-]?[Cc]{1}ounter/";
$content_replacements_to[] = "under-counter";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Uu]{1}nder[\s\-]?[Cc]{1}ounter/";
$content_replacements_to[] = "Under-counter";
$content_replacements_from[] = "/^[Uu]{1}nder[\s\-]?[Cc]{1}ounter/";
$content_replacements_to[] = "Under-counter";

// Tidy up Under-cabinet
$content_replacements_from[] = "/[Uu]{1}nder[\s\-]?[Cc]{1}abinet/";
$content_replacements_to[] = "under-cabinet";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Uu]{1}nder[\s\-]?[Cc]{1}abinet/";
$content_replacements_to[] = "Under-cabinet";
$content_replacements_from[] = "/^[Uu]{1}nder[\s\-]?[Cc]{1}abinet/";
$content_replacements_to[] = "Under-cabinet";

// Tidy up integrated
$content_replacements_from[] = "/([a-zA-Z0-9]{1})[\s]{1}[\-]{1}[Ii]{1}ntegrated/";
$content_replacements_to[] = "$1-integrated";

// Tidy up Semi-integrated
$content_replacements_from[] = "/[Ss]{1}emi[\s\-]?[Ii]{1}ntegrated/";
$content_replacements_to[] = "semi-integrated";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Ss]{1}emi[\s\-]?[Ii]{1}ntegrated/";
$content_replacements_to[] = "Semi-integrated";
$content_replacements_from[] = "/^[Ss]{1}emi[\s\-]?[Ii]{1}ntegrated/";
$content_replacements_to[] = "Semi-integrated";

// Tidy up Fully-integrated
$content_replacements_from[] = "/[Ff]{1}ully[\s\-]?[Ii]{1}ntegrated/";
$content_replacements_to[] = "fully-integrated";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Ff]{1}ully[\s\-]?[Ii]{1}ntegrated/";
$content_replacements_to[] = "Fully-integrated";
$content_replacements_from[] = "/^[Ff]{1}ully[\s\-]?[Ii]{1}ntegrated/";
$content_replacements_to[] = "Fully-integrated";

// Tidy up Semi-automatic
$content_replacements_from[] = "/[Ss]{1}emi[\s\-]?[Aa]{1}utomatic/";
$content_replacements_to[] = "semi-automatic";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Ss]{1}emi[\s\-]?[Aa]{1}utomatic/";
$content_replacements_to[] = "Semi-automatic";
$content_replacements_from[] = "/^[Ss]{1}emi[\s\-]?[Aa]{1}utomatic/";
$content_replacements_to[] = "Semi-automatic";

// Tidy up Fully-automatic
$content_replacements_from[] = "/[Ff]{1}ully[\s\-]?[Aa]{1}utomatic/";
$content_replacements_to[] = "fully-automatic";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Ff]{1}ully[\s\-]?[Aa]{1}utomatic/";
$content_replacements_to[] = "Fully-automatic";
$content_replacements_from[] = "/^[Ff]{1}ully[\s\-]?[Aa]{1}utomatic/";
$content_replacements_to[] = "Fully-automatic";

// Tidy up Pull-out
$content_replacements_from[] = "/[Pp]{1}ull[\s\-]?[Oo]{1}ut/";
$content_replacements_to[] = "pull-out";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Pp]{1}ull[\s\-]?[Oo]{1}ut/";
$content_replacements_to[] = "Pull-out";
$content_replacements_from[] = "/^[Pp]{1}ull[\s\-]?[Oo]{1}ut/";
$content_replacements_to[] = "Pull-out";

// Tidy up including
$content_replacements_from[] = "/\s[Ii]{1}nc[l]?[\.]?\s/";
$content_replacements_to[] = " including ";

// Tidy up use
$content_replacements_from[] = "/\s[Uu]{1}se\s/";
$content_replacements_to[] = " use ";

// Tidy up ?-piece
$content_replacements_from[] = "/([2345TtYy]{1})[\s\-]?[Pp]{1}iece/";
$content_replacements_to[] = "$1-piece";

// Tidy up ?-spout
$content_replacements_from[] = "/([Cc]{1})[\s\-]?[Ss]{1}pout/";
$content_replacements_to[] = "$1-spout";

// Tidy up ?-end
$content_replacements_from[] = "/([Cc]{1})[\s\-]?[Ee]{1}nd/";
$content_replacements_to[] = "$1-end";

// Tidy up Brushed Steel
$content_replacements_from[] = "/[Bb]{1}[\-\/]{1}[Ss]{1}teel/";
$content_replacements_to[] = "brushed steel";

// Tidy up Stainless Steel
$content_replacements_from[] = "/[Ss]{1}[\-\/]{1}[Ss]{1}teel/";
$content_replacements_to[] = "stainless steel";

// Tidy up Silk Steel
$content_replacements_from[] = "/[Ss]{1}ilk[\s]?[Ss]{1}teel/";
$content_replacements_to[] = "silk steel";

// Remove trade marks
$content_replacements_from[] = "/™/";
$content_replacements_to[] = "";

// Replace long dashes
$content_replacements_from[] = "/–/";
$content_replacements_to[] = "-";

// Replace single quotes
$content_replacements_from[] = "/’/";
$content_replacements_to[] = "'";
$content_replacements_from[] = "/`/";
$content_replacements_to[] = "'";

// Tidy up m
$content_replacements_from[] = "/[\s]?[Mm]{1}etre/";
$content_replacements_to[] = "m";

// Tidy up m3
$content_replacements_from[] = "/([0-9]{1})[\s]?[Mm]{1}3/";
$content_replacements_to[] = "$1m³";
$content_replacements_from[] = "/\&sup3\;/";
$content_replacements_to[] = html_entity_decode("³");

// Tidy up to in between numbers
$content_replacements_from[] = "/([0-9]{1})[\s]?to[\s]?([0-9]{1})/";
$content_replacements_to[] = "$1 - $2";

// Tidy up per hour
$content_replacements_from[] = "/\s[Aa]{1}nd\s[Hh]{1}[Rr]?$/";
$content_replacements_to[] = "ph";

// Tidy up l
$content_replacements_from[] = "/[\s]?[Ll]{1}itre/";
$content_replacements_to[] = "l";

// Tidy up -in
$content_replacements_from[] = "/\-[Ii]{1}n/";
$content_replacements_to[] = "-in";

// Tidy up plus
$content_replacements_from[] = "/\s[Pp]{1}lus\s/";
$content_replacements_to[] = " plus ";

// Tidy up including
$content_replacements_from[] = "/\s[Ii]{1}ncluding\s/";
$content_replacements_to[] = " including ";

// Tidy up including
$content_replacements_from[] = "/[Ii]{1}nc\s/";
$content_replacements_to[] = "Including "; 

// Tidy up Push/pull
$content_replacements_from[] = "/[Pp]{1}ush\/[Pp]{1}ull/";
$content_replacements_to[] = "push/pull";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Pp]{1}ush\/[Pp]{1}ull/";
$content_replacements_to[] = "Push/pull";
$content_replacements_from[] = "/^[Pp]{1}ush\/[Pp]{1}ull/";
$content_replacements_to[] = "Push/pull";

// Tidy up +
$content_replacements_from[] = "/\s\+\s/";
$content_replacements_to[] = " and ";

// Tidy up *
$content_replacements_from[] = "/\*/";
$content_replacements_to[] = "";

// Tidy up with
$content_replacements_from[] = "/\s[Ww]{1}ith\s/";
$content_replacements_to[] = " with ";

// Tidy up without
$content_replacements_from[] = "/\s[Ww]{1}ithout\s/";
$content_replacements_to[] = " without ";

// Tidy up in
$content_replacements_from[] = "/\s[Ii]{1}n\s/";
$content_replacements_to[] = " in ";

// Tidy up of
$content_replacements_from[] = "/\s[Oo]{1}f\s/";
$content_replacements_to[] = " of ";

// Tidy up for
$content_replacements_from[] = "/\s[Ff]{1}or\s/";
$content_replacements_to[] = " for ";

// Tidy up or
$content_replacements_from[] = "/\s[Oo]{1}r\s/";
$content_replacements_to[] = " or ";

// Tidy up and
$content_replacements_from[] = "/\s[Aa]{1}nd\s/";
$content_replacements_to[] = " and ";

// Tidy up to
$content_replacements_from[] = "/\s[Tt]{1}o\s/";
$content_replacements_to[] = " to ";

// Tidy up too
$content_replacements_from[] = "/\s[Tt]{1}oo\s/";
$content_replacements_to[] = " too ";

// Tidy up &
$content_replacements_from[] = "/\s&\s/";
$content_replacements_to[] = " and ";

// Tidy up &
$content_replacements_from[] = "/\s&\s/";
$content_replacements_to[] = " and ";

// Tidy up mm
$content_replacements_from[] = "/M[Mm]{1}/";
$content_replacements_to[] = "mm";

// Tidy up ize to ise
$content_replacements_from[] = "/([a-zA-Z]{2})ize{1}/";
$content_replacements_to[] = "$1ise";

// Tidy up izer to iser
$content_replacements_from[] = "/([a-zA-Z]{2})izer{1}/";
$content_replacements_to[] = "$1iser";

// Tidy up yze to yse
$content_replacements_from[] = "/([a-zA-Z]{2})yze{1}/";
$content_replacements_to[] = "$1yse";

// Tidy up ization to isation
$content_replacements_from[] = "/([a-zA-Z]{2})ization{1}/";
$content_replacements_to[] = "$1isation";

// Tidy up times symbol
$content_replacements_from[] = "/([0-9]{1})\s*[Xx]\s*([0-9A-Za-z]{1})/";
$content_replacements_to[] = "$1 × $2";

// Tidy up times symbol
$content_replacements_from[] = "/\&times\;/";
$content_replacements_to[] = html_entity_decode("×");

// Tidy up inches
$content_replacements_from[] = "/([0-9]{1})\s*[Ii]{1}nches/";
$content_replacements_to[] = "$1\"";

// Tidy up inch
$content_replacements_from[] = "/([0-9]{1})\s*[Ii]{1}nch/";
$content_replacements_to[] = "$1\"";

// Make the replacements
$content = preg_replace($content_replacements_from, $content_replacements_to, $content);

这显然是复杂和冗长的。

有谁知道更好的方法或知道可以做到这一点的课程吗?

如果可能的话,我也想将其应用于 HTML 中的内容。

I have been trying to write in PHP using a series of regular expressions and the PHP function preg_replace.

My main aim is to tidy up the content with things like making sure the beginning of a sentence has an uppercase letter; there is a space after a comma; etc.

Some examples of the tidying I am trying to achieve:

// Remove any spaces around slashes
$content_replacements_from[] = "/\s*\/\s*/";
$content_replacements_to[] = "/";

// Remove any new lines or tabs
$content_replacements_from[] = "/[\r\n\t]/";
$content_replacements_to[] = " ";

// Remove any extra spaces
$content_replacements_from[] = "/\s{2,}/";
$content_replacements_to[] = " ";

// Tidy up joined full stops
$content_replacements_from[] = "/([a-zA-Z]{1})\s*[\.]{1}\s*([^(jpeg|jpg|png|pdf|gif|doc|xls|docx|xlsx|ppt|pptx|html|php|htm)]{1})/";
$content_replacements_to[] = "$1. $2";

// Tidy up joined commas
$content_replacements_from[] = "/([a-zA-Z0-9]{1})\s*[\,]{1}\s*([a-zA-Z0-9]{1})/";
$content_replacements_to[] = "$1, $2";

// Tidy up joined exclamation marks
$content_replacements_from[] = "/([a-zA-Z0-9]{1})\s*[\!]{1}\s*([a-zA-Z0-9]{1})/";
$content_replacements_to[] = "$1! $2";

// Tidy up joined question marks
$content_replacements_from[] = "/([a-zA-Z0-9]{1})\s*[\?]{1}\s*([a-zA-Z0-9]{1})/";
$content_replacements_to[] = "$1? $2";

// Tidy up joined semi colons
$content_replacements_from[] = "/([a-zA-Z0-9]{1})\s*[\;]{1}\s*([a-zA-Z0-9]{1})/";
$content_replacements_to[] = "$1; $2";

// Tidy up joined colons
$content_replacements_from[] = "/([a-zA-Z0-9]{1})\s*[\:]{1}\s*([a-zA-Z0-9]{1})/";
$content_replacements_to[] = "$1: $2";

// Tidy up fluid ounces
$content_replacements_from[] = "/[Ff]{1}[Ll]{1}.?\s?[Oo]{1}[Zz]{1}/";
$content_replacements_to[] = "fl oz";

// Tidy up rpm
$content_replacements_from[] = "/[Rr]{1}[Pp]{1}[Mm]{1}/";
$content_replacements_to[] = "rpm";

// Tidy up UK
$content_replacements_from[] = "/[Uu]{1}[Kk]{1}/";
$content_replacements_to[] = "UK";

// Tidy up Maxi-sense
$content_replacements_from[] = "/[Mm]{1}axi[\s\-]?[Ss]{1}ense/";
$content_replacements_to[] = "maxi-sense";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Mm]{1}axi[\s\-]?[Ss]{1}ense/";
$content_replacements_to[] = ". Maxi-sense";
$content_replacements_from[] = "/^[Mm]{1}axi[\s\-]?[Ss]{1}ense/";
$content_replacements_to[] = "Maxi-sense";

// Tidy up Side-by-side
$content_replacements_from[] = "/[Ss]{1}ide[\s\-]?[Bb]{1}y[\s\-]?[Ss]{1}ide/";
$content_replacements_to[] = "side-by-side";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Ss]{1}ide[\s\-]?[Bb]{1}y[\s\-]?[Ss]{1}ide/";
$content_replacements_to[] = ". Side-by-side";
$content_replacements_from[] = "/^[Ss]{1}ide[\s\-]?[Bb]{1}y[\s\-]?[Ss]{1}ide/";
$content_replacements_to[] = "Side-by-side";

// Tidy up extra large
$content_replacements_from[] = "/[Xx]{1}[Ll]{l}/";
$content_replacements_to[] = "extra large";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Xx]{1}[Ll]{l}/";
$content_replacements_to[] = "Extra large";
$content_replacements_from[] = "/^[Xx]{1}[Ll]{l}/";
$content_replacements_to[] = "Extra large";

// Tidy up D-radius
$content_replacements_from[] = "/[Dd]{1}[\s\-]?[Rr]{1}adius/";
$content_replacements_to[] = "D-radius";

// Tidy up A-rate
$content_replacements_from[] = "/[Aa]{1}[\s\-]?[Rr]{1}ate/";
$content_replacements_to[] = "A-rate";

// Tidy up In-column
$content_replacements_from[] = "/[Ii]{1}n[\s\-]?[Cc]{1}olum[n]?/";
$content_replacements_to[] = "in-column";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Ii]{1}n[\s\-]?[Cc]{1}olum[n]?/";
$content_replacements_to[] = "In-column";
$content_replacements_from[] = "/^[Ii]{1}n[\s\-]?[Cc]{1}olum[n]?/";
$content_replacements_to[] = "In-column";

// Tidy up kW
$content_replacements_from[] = "/[Kk]{1}[Ww]{1}/";
$content_replacements_to[] = "kW";

// Tidy up Built-in
$content_replacements_from[] = "/[Bb]{1}uilt[\s\-]?[Ii]{1}n/";
$content_replacements_to[] = "built-in";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Bb]{1}uilt[\s\-]?[Ii]{1}n/";
$content_replacements_to[] = "Built-in";
$content_replacements_from[] = "/^[Bb]{1}uilt[\s\-]?[Ii]{1}n/";
$content_replacements_to[] = "Built-in";

// Tidy up Built-under
$content_replacements_from[] = "/[Bb]{1}uilt[\s\-]?[Uu]{1}nder/";
$content_replacements_to[] = "built-under";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Bb]{1}uilt[\s\-]?[Uu]{1}nder/";
$content_replacements_to[] = "Built-under";
$content_replacements_from[] = "/^[Bb]{1}uilt[\s\-]?[Uu]{1}nder/";
$content_replacements_to[] = "Built-under";

// Tidy up Under-counter
$content_replacements_from[] = "/[Uu]{1}nder[\s\-]?[Cc]{1}ounter/";
$content_replacements_to[] = "under-counter";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Uu]{1}nder[\s\-]?[Cc]{1}ounter/";
$content_replacements_to[] = "Under-counter";
$content_replacements_from[] = "/^[Uu]{1}nder[\s\-]?[Cc]{1}ounter/";
$content_replacements_to[] = "Under-counter";

// Tidy up Under-cabinet
$content_replacements_from[] = "/[Uu]{1}nder[\s\-]?[Cc]{1}abinet/";
$content_replacements_to[] = "under-cabinet";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Uu]{1}nder[\s\-]?[Cc]{1}abinet/";
$content_replacements_to[] = "Under-cabinet";
$content_replacements_from[] = "/^[Uu]{1}nder[\s\-]?[Cc]{1}abinet/";
$content_replacements_to[] = "Under-cabinet";

// Tidy up integrated
$content_replacements_from[] = "/([a-zA-Z0-9]{1})[\s]{1}[\-]{1}[Ii]{1}ntegrated/";
$content_replacements_to[] = "$1-integrated";

// Tidy up Semi-integrated
$content_replacements_from[] = "/[Ss]{1}emi[\s\-]?[Ii]{1}ntegrated/";
$content_replacements_to[] = "semi-integrated";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Ss]{1}emi[\s\-]?[Ii]{1}ntegrated/";
$content_replacements_to[] = "Semi-integrated";
$content_replacements_from[] = "/^[Ss]{1}emi[\s\-]?[Ii]{1}ntegrated/";
$content_replacements_to[] = "Semi-integrated";

// Tidy up Fully-integrated
$content_replacements_from[] = "/[Ff]{1}ully[\s\-]?[Ii]{1}ntegrated/";
$content_replacements_to[] = "fully-integrated";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Ff]{1}ully[\s\-]?[Ii]{1}ntegrated/";
$content_replacements_to[] = "Fully-integrated";
$content_replacements_from[] = "/^[Ff]{1}ully[\s\-]?[Ii]{1}ntegrated/";
$content_replacements_to[] = "Fully-integrated";

// Tidy up Semi-automatic
$content_replacements_from[] = "/[Ss]{1}emi[\s\-]?[Aa]{1}utomatic/";
$content_replacements_to[] = "semi-automatic";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Ss]{1}emi[\s\-]?[Aa]{1}utomatic/";
$content_replacements_to[] = "Semi-automatic";
$content_replacements_from[] = "/^[Ss]{1}emi[\s\-]?[Aa]{1}utomatic/";
$content_replacements_to[] = "Semi-automatic";

// Tidy up Fully-automatic
$content_replacements_from[] = "/[Ff]{1}ully[\s\-]?[Aa]{1}utomatic/";
$content_replacements_to[] = "fully-automatic";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Ff]{1}ully[\s\-]?[Aa]{1}utomatic/";
$content_replacements_to[] = "Fully-automatic";
$content_replacements_from[] = "/^[Ff]{1}ully[\s\-]?[Aa]{1}utomatic/";
$content_replacements_to[] = "Fully-automatic";

// Tidy up Pull-out
$content_replacements_from[] = "/[Pp]{1}ull[\s\-]?[Oo]{1}ut/";
$content_replacements_to[] = "pull-out";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Pp]{1}ull[\s\-]?[Oo]{1}ut/";
$content_replacements_to[] = "Pull-out";
$content_replacements_from[] = "/^[Pp]{1}ull[\s\-]?[Oo]{1}ut/";
$content_replacements_to[] = "Pull-out";

// Tidy up including
$content_replacements_from[] = "/\s[Ii]{1}nc[l]?[\.]?\s/";
$content_replacements_to[] = " including ";

// Tidy up use
$content_replacements_from[] = "/\s[Uu]{1}se\s/";
$content_replacements_to[] = " use ";

// Tidy up ?-piece
$content_replacements_from[] = "/([2345TtYy]{1})[\s\-]?[Pp]{1}iece/";
$content_replacements_to[] = "$1-piece";

// Tidy up ?-spout
$content_replacements_from[] = "/([Cc]{1})[\s\-]?[Ss]{1}pout/";
$content_replacements_to[] = "$1-spout";

// Tidy up ?-end
$content_replacements_from[] = "/([Cc]{1})[\s\-]?[Ee]{1}nd/";
$content_replacements_to[] = "$1-end";

// Tidy up Brushed Steel
$content_replacements_from[] = "/[Bb]{1}[\-\/]{1}[Ss]{1}teel/";
$content_replacements_to[] = "brushed steel";

// Tidy up Stainless Steel
$content_replacements_from[] = "/[Ss]{1}[\-\/]{1}[Ss]{1}teel/";
$content_replacements_to[] = "stainless steel";

// Tidy up Silk Steel
$content_replacements_from[] = "/[Ss]{1}ilk[\s]?[Ss]{1}teel/";
$content_replacements_to[] = "silk steel";

// Remove trade marks
$content_replacements_from[] = "/™/";
$content_replacements_to[] = "";

// Replace long dashes
$content_replacements_from[] = "/–/";
$content_replacements_to[] = "-";

// Replace single quotes
$content_replacements_from[] = "/’/";
$content_replacements_to[] = "'";
$content_replacements_from[] = "/`/";
$content_replacements_to[] = "'";

// Tidy up m
$content_replacements_from[] = "/[\s]?[Mm]{1}etre/";
$content_replacements_to[] = "m";

// Tidy up m3
$content_replacements_from[] = "/([0-9]{1})[\s]?[Mm]{1}3/";
$content_replacements_to[] = "$1m³";
$content_replacements_from[] = "/\³\;/";
$content_replacements_to[] = html_entity_decode("³");

// Tidy up to in between numbers
$content_replacements_from[] = "/([0-9]{1})[\s]?to[\s]?([0-9]{1})/";
$content_replacements_to[] = "$1 - $2";

// Tidy up per hour
$content_replacements_from[] = "/\s[Aa]{1}nd\s[Hh]{1}[Rr]?$/";
$content_replacements_to[] = "ph";

// Tidy up l
$content_replacements_from[] = "/[\s]?[Ll]{1}itre/";
$content_replacements_to[] = "l";

// Tidy up -in
$content_replacements_from[] = "/\-[Ii]{1}n/";
$content_replacements_to[] = "-in";

// Tidy up plus
$content_replacements_from[] = "/\s[Pp]{1}lus\s/";
$content_replacements_to[] = " plus ";

// Tidy up including
$content_replacements_from[] = "/\s[Ii]{1}ncluding\s/";
$content_replacements_to[] = " including ";

// Tidy up including
$content_replacements_from[] = "/[Ii]{1}nc\s/";
$content_replacements_to[] = "Including "; 

// Tidy up Push/pull
$content_replacements_from[] = "/[Pp]{1}ush\/[Pp]{1}ull/";
$content_replacements_to[] = "push/pull";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Pp]{1}ush\/[Pp]{1}ull/";
$content_replacements_to[] = "Push/pull";
$content_replacements_from[] = "/^[Pp]{1}ush\/[Pp]{1}ull/";
$content_replacements_to[] = "Push/pull";

// Tidy up +
$content_replacements_from[] = "/\s\+\s/";
$content_replacements_to[] = " and ";

// Tidy up *
$content_replacements_from[] = "/\*/";
$content_replacements_to[] = "";

// Tidy up with
$content_replacements_from[] = "/\s[Ww]{1}ith\s/";
$content_replacements_to[] = " with ";

// Tidy up without
$content_replacements_from[] = "/\s[Ww]{1}ithout\s/";
$content_replacements_to[] = " without ";

// Tidy up in
$content_replacements_from[] = "/\s[Ii]{1}n\s/";
$content_replacements_to[] = " in ";

// Tidy up of
$content_replacements_from[] = "/\s[Oo]{1}f\s/";
$content_replacements_to[] = " of ";

// Tidy up for
$content_replacements_from[] = "/\s[Ff]{1}or\s/";
$content_replacements_to[] = " for ";

// Tidy up or
$content_replacements_from[] = "/\s[Oo]{1}r\s/";
$content_replacements_to[] = " or ";

// Tidy up and
$content_replacements_from[] = "/\s[Aa]{1}nd\s/";
$content_replacements_to[] = " and ";

// Tidy up to
$content_replacements_from[] = "/\s[Tt]{1}o\s/";
$content_replacements_to[] = " to ";

// Tidy up too
$content_replacements_from[] = "/\s[Tt]{1}oo\s/";
$content_replacements_to[] = " too ";

// Tidy up &
$content_replacements_from[] = "/\s&\s/";
$content_replacements_to[] = " and ";

// Tidy up &
$content_replacements_from[] = "/\s&\s/";
$content_replacements_to[] = " and ";

// Tidy up mm
$content_replacements_from[] = "/M[Mm]{1}/";
$content_replacements_to[] = "mm";

// Tidy up ize to ise
$content_replacements_from[] = "/([a-zA-Z]{2})ize{1}/";
$content_replacements_to[] = "$1ise";

// Tidy up izer to iser
$content_replacements_from[] = "/([a-zA-Z]{2})izer{1}/";
$content_replacements_to[] = "$1iser";

// Tidy up yze to yse
$content_replacements_from[] = "/([a-zA-Z]{2})yze{1}/";
$content_replacements_to[] = "$1yse";

// Tidy up ization to isation
$content_replacements_from[] = "/([a-zA-Z]{2})ization{1}/";
$content_replacements_to[] = "$1isation";

// Tidy up times symbol
$content_replacements_from[] = "/([0-9]{1})\s*[Xx]\s*([0-9A-Za-z]{1})/";
$content_replacements_to[] = "$1 × $2";

// Tidy up times symbol
$content_replacements_from[] = "/\×\;/";
$content_replacements_to[] = html_entity_decode("×");

// Tidy up inches
$content_replacements_from[] = "/([0-9]{1})\s*[Ii]{1}nches/";
$content_replacements_to[] = "$1\"";

// Tidy up inch
$content_replacements_from[] = "/([0-9]{1})\s*[Ii]{1}nch/";
$content_replacements_to[] = "$1\"";

// Make the replacements
$content = preg_replace($content_replacements_from, $content_replacements_to, $content);

This is obviously complicated and lengthy.

Does anyone know a better way of doing it or know of a class that is out there that can do this?

I would then also want to apply this to content within HTML if possible.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

鲜血染红嫁衣 2024-12-18 16:20:01

正则表达式非常适合文本搜索和替换。您收到的报告表明还有改进的空间。但我的答案不是要优化这些,而是​​建议开始构建您自己的 StringCleaner 集,它可以执行不同的操作,但都具有相同的界面:

interface StringCleaner
{
    public function clean($string);
}

接下来,对于 HTML,一个想法我要做的就是创建一个 FilterIterator 来提供对所有文本节点的访问,这样它们就可以更轻松地使用任何标准清理器进行更改。

为了一次应用多个StringCleaner(并创建这些集合),我使用了复合模式 (通过从 SplObjectStore 扩展)它本身也是一个 StringCleaner

没有类定义的示例:

$cleanerTrim = new TrimCleaner();

$cleanerBasics = new RegexCleaner();

// Remove any spaces around slashes
$cleanerBasics->addRule('\s*\/\s*', '/');

// Remove any new lines or tabs
$cleanerBasics->addRule('[\r\n\t]', ' ');

// Tidy up joined full stops
$cleanerBasics->addRule('(\w+)\.(?!jpeg|jpg|png|pdf|gif|doc|xls|docx|xlsx|ppt|pptx|html|php|htm)(\w+)', '$1. $2');

// Remove any extra spaces
$cleanerBasics->addRule('\s{2,}', ' ');

// Remove single spaces
$cleanerBasics->addRule('^\s

如示例所示,当您将复杂性隐藏到具体的 StringCleaner 对象中时,您可以开始创建更多动态规则。这可以通过添加更多对不同于正则表达式的操作的 StringCleaner 类型进行扩展,TrimCleaner 中给出了一个非常简单的 trim 示例。

但可以肯定的是,正则表达式也非常强大。正如您在 RegexCleaner 中看到的,我已将每个正则表达式分隔符移至类本身中,因此当您定义规则时,无需一遍又一遍地键入它们。这只是另一个简单的示例,当您使用已定义的操作接口将替换封装到它自己的类中时,您可以简化事情。

完整示例

, ''); $cleanerInches = new RegexCleaner(); // Tidy up inches $cleanerInches->addRule('([0-9])\s*[Ii]nches', '$1"'); $cleaner = new CleanerComposite(); $cleaner->attach($cleanerBasics); $cleaner->attach($cleanerInches); $cleaner->attach($cleanerTrim); $htmlString = <<<HTML <html> <head> <title> hello world.hello earth. </title> </head> <body> <table><tr><td>test. </td></tr></table> <h1>Get it 1 more time.</h1> <p>When 12 inches were not enough; hickup.</p> </body> </html> HTML; // load HTML $dom = new DOMDocument(); $dom->preserveWhiteSpace = FALSE; $dom->loadHTML($htmlString); // create XPath $xpath = new DomXPath($dom); $it = new DOMTextWhiteSpaceFilter($xpath->query('//text()')); foreach($it as $node) { $node->data = $cleaner->clean($node->data); } // remove whitespace only nodes $it = new DOMTextWhiteSpaceFilter($xpath->query('//text()'), DOMTextWhiteSpaceFilter::WHITESPACE); foreach($it as $node) { $node->parentNode->removeChild($node); } $dom->formatOutput = true; echo $dom->saveHTML();

如示例所示,当您将复杂性隐藏到具体的 StringCleaner 对象中时,您可以开始创建更多动态规则。这可以通过添加更多对不同于正则表达式的操作的 StringCleaner 类型进行扩展,TrimCleaner 中给出了一个非常简单的 trim 示例。

但可以肯定的是,正则表达式也非常强大。正如您在 RegexCleaner 中看到的,我已将每个正则表达式分隔符移至类本身中,因此当您定义规则时,无需一遍又一遍地键入它们。这只是另一个简单的示例,当您使用已定义的操作接口将替换封装到它自己的类中时,您可以简化事情。

完整示例

Regular expressions are quite fine for text search and replace. The one's you've been given show that there is room for improvements. But my answer is not about optimizing those, instead I suggest to start building your own set of StringCleaner that can do different stuff, but all with the same interface:

interface StringCleaner
{
    public function clean($string);
}

Next to that, for the HTML, an idea I had is to create an FilterIterator that offers access to all text-nodes, so they can be changed with any standard cleaner more easily.

To apply multiple StringCleaner at once (and to create sets of those), I used the Composite Pattern (by extending from SplObjectStore) that is a StringCleaner on it's own as well.

The exmaple w/o the class defintions:

$cleanerTrim = new TrimCleaner();

$cleanerBasics = new RegexCleaner();

// Remove any spaces around slashes
$cleanerBasics->addRule('\s*\/\s*', '/');

// Remove any new lines or tabs
$cleanerBasics->addRule('[\r\n\t]', ' ');

// Tidy up joined full stops
$cleanerBasics->addRule('(\w+)\.(?!jpeg|jpg|png|pdf|gif|doc|xls|docx|xlsx|ppt|pptx|html|php|htm)(\w+)', '$1. $2');

// Remove any extra spaces
$cleanerBasics->addRule('\s{2,}', ' ');

// Remove single spaces
$cleanerBasics->addRule('^\s

As the example already shows, when you hide away the complexity into concrete StringCleaner objects, you can start to create more dynamical rules. This can be extended by adding more StringCleaner types that operate on something different than regular expression, a very simple example with trim is given in TrimCleaner.

But sure, the regular expressions are very powerful, too. As you can see with the RegexCleaner, I've moved each regular expressions delimiters into the class itself, so when you define the rules, you don't need to type them over and over again. That's just another simple example where you can simplify things when you encapsulate the replacement into a class of it's own with a defined interface for the action.

Full Example.

, ''); $cleanerInches = new RegexCleaner(); // Tidy up inches $cleanerInches->addRule('([0-9])\s*[Ii]nches', '$1"'); $cleaner = new CleanerComposite(); $cleaner->attach($cleanerBasics); $cleaner->attach($cleanerInches); $cleaner->attach($cleanerTrim); $htmlString = <<<HTML <html> <head> <title> hello world.hello earth. </title> </head> <body> <table><tr><td>test. </td></tr></table> <h1>Get it 1 more time.</h1> <p>When 12 inches were not enough; hickup.</p> </body> </html> HTML; // load HTML $dom = new DOMDocument(); $dom->preserveWhiteSpace = FALSE; $dom->loadHTML($htmlString); // create XPath $xpath = new DomXPath($dom); $it = new DOMTextWhiteSpaceFilter($xpath->query('//text()')); foreach($it as $node) { $node->data = $cleaner->clean($node->data); } // remove whitespace only nodes $it = new DOMTextWhiteSpaceFilter($xpath->query('//text()'), DOMTextWhiteSpaceFilter::WHITESPACE); foreach($it as $node) { $node->parentNode->removeChild($node); } $dom->formatOutput = true; echo $dom->saveHTML();

As the example already shows, when you hide away the complexity into concrete StringCleaner objects, you can start to create more dynamical rules. This can be extended by adding more StringCleaner types that operate on something different than regular expression, a very simple example with trim is given in TrimCleaner.

But sure, the regular expressions are very powerful, too. As you can see with the RegexCleaner, I've moved each regular expressions delimiters into the class itself, so when you define the rules, you don't need to type them over and over again. That's just another simple example where you can simplify things when you encapsulate the replacement into a class of it's own with a defined interface for the action.

Full Example.

送舟行 2024-12-18 16:20:01

可能有一种比许多正则表达式更好的方法来做到这一点,但如果没有其他人能想出更好的工具,那么我将如何使用 PHP 正则表达式来做到这一点。

可读性和易于维护几乎总是比速度更重要。 preg_replace 确实需要两个单独的字符串或数组来匹配和替换,但我们可以通过在使用时重新排列数据来解决这个问题。因此,我建议使用以下更具可读性的格式:

$content_replacements = array(array('From' => "/pattern 1/", 'To' => "$1 $2"),
                              array('From' => "/pattern 2/", 'To' => "$1,$2."));

它有一个很大的优点,如果您忘记“从”或“到”,您的模式和替换不会失去同步。

然后要运行所有这些,您可以使用循环:

   foreach ($content_replacements as $replacement)
   {
      $content = preg_replace($replacement['From'], $replacement['To'], $content);
   }

There is probably a better way to do this than lots of regexps, but if no-one else can come up with a better tool here is how i'd do it with PHP regexps.

Readability and ease of maintenance are almost always more important than speed. preg_replace does want two separate strings or arrays to match and replace from, but we can deal with that by rearranging our data at the point of use. So, I would recommend the following more readable format:

$content_replacements = array(array('From' => "/pattern 1/", 'To' => "$1 $2"),
                              array('From' => "/pattern 2/", 'To' => "$1,$2."));

It has a big advantage that if you forget one 'from' or 'to' you patterns and replacements don't get out of sync.

Then to run all of these you can use a loop:

   foreach ($content_replacements as $replacement)
   {
      $content = preg_replace($replacement['From'], $replacement['To'], $content);
   }
在梵高的星空下 2024-12-18 16:20:01

我认为您想要的第一件事是找到关于正则表达式的良好参考,并温习诸如哪些字符需要在字符类内部和外部转义(方括号:[])之类的事情。然后通过删除 {1} 并在有意义时使用不区分大小写的匹配('//i')来清理您的正则表达式。然后我提供了一些代码,您可以使用它们来合并一些大小写规则,希望对您有所帮助。

// capitalize first word of each sentence
function tidy_sentences($str){
    $tokens = array();
    foreach(preg_split('/([.?!]\s*)/', $str, 0, PREG_SPLIT_DELIM_CAPTURE) as $token){
        if(!preg_match('/([.?!]\s*)/', $token, $m)) $token[0] = strtoupper($token[0]);
        $tokens[] = $token;
    }
    return implode($tokens);
}

// apply capitalization rules to individual words
function tidy_words($str){
    $tokens = array();
    foreach(preg_split('/([^\w-])/', $str, 0, PREG_SPLIT_DELIM_CAPTURE) as $token){
        switch(true){
            // tokens you want to uppercase go here
            case preg_match('/^(uk|kw)$/i', $token, $m): $tokens[] = strtoupper($m[0]); break;
            // tokens you want to lowercase go here
            case preg_match('/^(rpm|fl|oz)$/i', $token, $m): $tokens[] = strtolower($m[0]); break;
            // tokens you want to capitalize first letter go here
            case preg_match('/^(maxi-sense|side-by-side|d-radius)$/i', $token, $m): $tokens[] = ucfirst($m[0]); break;
            default: $tokens[] = $token;
        }
    }
    return implode($tokens);
}

function tidy($str){
    return tidy_sentences(tidy_words($str));
}

echo tidy('foo bar rpm maxi-sense uk Fl OZ baz! pull out');
// => Foo bar rpm Maxi-sense UK fl oz baz! Pull out

I think the first thing you want is to find a good reference on regex and brush up on things like which chars need to be escaped inside and outside of character class (the square brackets: []). Then clean up your regex by removing {1}'s and using case-insensitive match ('//i') when it makes sense. Then I'm including some code that you can use to consolidate some of your capitalization rules, I hope it helps you.

// capitalize first word of each sentence
function tidy_sentences($str){
    $tokens = array();
    foreach(preg_split('/([.?!]\s*)/', $str, 0, PREG_SPLIT_DELIM_CAPTURE) as $token){
        if(!preg_match('/([.?!]\s*)/', $token, $m)) $token[0] = strtoupper($token[0]);
        $tokens[] = $token;
    }
    return implode($tokens);
}

// apply capitalization rules to individual words
function tidy_words($str){
    $tokens = array();
    foreach(preg_split('/([^\w-])/', $str, 0, PREG_SPLIT_DELIM_CAPTURE) as $token){
        switch(true){
            // tokens you want to uppercase go here
            case preg_match('/^(uk|kw)$/i', $token, $m): $tokens[] = strtoupper($m[0]); break;
            // tokens you want to lowercase go here
            case preg_match('/^(rpm|fl|oz)$/i', $token, $m): $tokens[] = strtolower($m[0]); break;
            // tokens you want to capitalize first letter go here
            case preg_match('/^(maxi-sense|side-by-side|d-radius)$/i', $token, $m): $tokens[] = ucfirst($m[0]); break;
            default: $tokens[] = $token;
        }
    }
    return implode($tokens);
}

function tidy($str){
    return tidy_sentences(tidy_words($str));
}

echo tidy('foo bar rpm maxi-sense uk Fl OZ baz! pull out');
// => Foo bar rpm Maxi-sense UK fl oz baz! Pull out
枕梦 2024-12-18 16:20:01

我认为最好的方法不一定是使用纯正则表达式。正则表达式旨在用作适合遵循特定模式的字符串的工具。您这里的用例似乎并不那么具体,所以我认为是时候探索不同的途径了。

我不确定您要清理的字符串有多复杂,或者内容通常是什么(也许它通常少于 100 个单词并且属于某个特定主题,因此常见词汇量很小)。但是,我认为更灵活(从长远来看可能更容易)的解决方案将涉及更简单的正则表达式(或正则表达式集)来识别字符串中的标记。识别出特定标记后,您可以采取非常具体的操作,并根据识别标记对字符串执行更高级的算法。

我相信 hakre 的 StringCleaner 类走在正确的道路上,其想法是从核心逻辑中抽象出不同类型的字符串,以帮助清理代码并使其更易于管理和灵活。

我知道这是非常通用和理论性的,但是你正在谈论的库的大小会很快膨胀 - 因为它涉及(基本上)自然语言处理,这是人工智能的一种形式。

I think the best way to do this isn't necessarily with pure regular expressions. Regular expressions are meant to be used as a tool to fit strings that follow a pretty specific pattern. Your use-case here doesn't seem to be that specific, so I think it's time to explore a different avenue.

I'm not sure how complicated the string you're trying to clean is, or what the content typically is (maybe it's typically less than 100 words and pertains to some specific topic, so the common vocabulary is small). However, I think a more flexible (and probably, easier in the long run) solution would involve a much simpler regular expression (or set of regular expressions) to identify tokens in your string. Once you've identified specific tokens, you can take very specific actions, and perform much more advanced algorithms on your strings, based on the identification tokens.

I believe hakre was on the right path with his StringCleaner class, with the idea of abstracting different types of strings away from your core logic to help clean up your code and make it more manageable and flexible.

I understand this is very generic and theoretical, but the size of the library you're talking about writing will blow up very fast - as it involves (basically) natural language processing, a form of artificial intelligence.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文