Removing characters from a PHP string

Posted 2024-08-06 07:54:49

I'm accepting a string from a feed for display on the screen that may or may not include some rubbish I want to filter out. I don't want to filter normal symbols at all.

The values I want to remove look like this: �

It is only this that I want removed. Relevant technology is PHP.

Suggestions appreciated.

Comments (8)

独自唱情﹋歌 2024-08-13 07:54:49

This is an encoding problem; you shouldn't try to clean out those bogus characters, but rather understand why you're receiving them scrambled.

Try to get your data as Unicode, or come to an agreement with your feed provider so that you both use the same encoding.
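As a rough illustration of getting the data into one known encoding, the sketch below converts a feed string to UTF-8 only when it is not already valid UTF-8. The function name `to_utf8` is hypothetical, and the default source encoding (Windows-1252) is purely an assumption; the real value has to come from the feed provider. It relies on the mbstring extension.

```php
<?php
// Hypothetical helper: return $raw as UTF-8, converting only when needed.
// $assumedSource is a guess about what the feed sends, not a known fact.
function to_utf8(string $raw, string $assumedSource = 'Windows-1252'): string
{
    // Leave the string untouched if it is already valid UTF-8.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    // Otherwise reinterpret the bytes as the assumed encoding.
    return mb_convert_encoding($raw, 'UTF-8', $assumedSource);
}
```

If the assumed source encoding is wrong, the output will still be valid UTF-8 but may show the wrong characters, so this is only useful once the feed's actual encoding is known.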

九歌凝 2024-08-13 07:54:49

Thanks for the responses, guys. Unfortunately, those submitted had the following problems:

This is wrong for obvious reasons:

ereg_replace("[^A-Za-z0-9]", "", $string);

This:

s/[\u00FF-\uFFFF]//

is not valid preg syntax as written, and it also didn't work when I converted it to preg, apparently because the range was too large for the regex to handle. Also, there are holes in that range that would allow rubbish to seep through.

This suggestion:

This is an encoding problem; you shouldn't try to clean out those bogus characters, but rather understand why you're receiving them scrambled.

while valid, is no good because I don't have any control over how the data I receive is encoded. It comes from an external source. Sometimes there's garbage in there and sometimes there is not.

So, the solution I came up with was relatively dirty, but in the absence of something more robust I'm just accepting all standard letters, numbers and symbols and discarding the rest.

This does seem to work for now. The solution is as follows:

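// Turn the pound and euro signs into HTML entities so the whitelist below keeps them.
// Assumption: the replacement targets were entities; the page rendering shows them as the literal signs.
$fixT = str_replace("£", "&pound;", $string);
$fixT = str_replace("€", "&euro;", $fixT);
// Keep only ASCII letters, digits, whitespace and the listed punctuation; drop everything else.
$fixT = preg_replace("/[^a-zA-Z0-9\s\.\/:!\[\]\*\+\-\|\<\>@#\$%\^&\(\)_=\';,'\?\\\{\}`~\"]/", "", $fixT);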

If anyone has any better ideas I'm still keen to hear them. Cheers.

挖个坑埋了你 2024-08-13 07:54:49

You are looking for characters that are outside of the range of glyphs that your font can display. You can find the maximum Unicode value that your font can display, and then create a regex that will replace anything above that value with an empty string. An example would be

s/[\u00FF-\uFFFF]//

This would strip character 255 (U+00FF) and everything above it, up to U+FFFF.
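In PHP's preg functions the same range would need to be written with `\x{...}` code points and the `u` modifier, since PCRE does not accept the `\uFFFF` notation. A minimal sketch, assuming the input is valid UTF-8; `strip_above_latin1` is a hypothetical name:

```php
<?php
// Hypothetical helper: remove every code point from U+00FF through U+FFFF.
function strip_above_latin1(string $text): string
{
    // The /u modifier makes PCRE treat the pattern and subject as UTF-8;
    // \x{...} is how PCRE spells a code point.
    $result = preg_replace('/[\x{00FF}-\x{FFFF}]/u', '', $text);
    // preg_replace returns null on error (for example, invalid UTF-8 input),
    // in which case the original string is returned unchanged.
    return $result ?? $text;
}
```

Note that this range also covers legitimate characters such as the euro sign (U+20AC), so it removes more than just the replacement character U+FFFD.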

榕城若虚 2024-08-13 07:54:49

That's going to be difficult for you to do, since you don't have a solid definition of what to filter and what to keep. Typically, characters that show up as empty squares are anything that the typeface you're using doesn't have a glyph for, so the definition of "stuff that shows up like this: �" is horribly inexact.

It would be much better for you to decide exactly what characters are valid (this is always a good approach anyway, with any kind of data cleanup) and discard everything that is not one of those. The PHP filter function is one possibility to do this, depending on the level of complexity and robustness you require.
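One way to express "decide exactly what is valid and discard the rest" is a whitelist pattern. The sketch below assumes, purely for illustration, that the valid set is printable ASCII plus tab, carriage return and line feed; `keep_printable_ascii` is a hypothetical name, and the valid set would need to be widened for any feed that carries accented or non-Latin text.

```php
<?php
// Hypothetical helper: keep printable ASCII plus tab, CR and LF; drop every other byte.
function keep_printable_ascii(string $text): string
{
    // No /u modifier: the pattern works byte by byte, so malformed
    // multi-byte sequences are removed rather than causing an error.
    return preg_replace('/[^\x20-\x7E\t\r\n]/', '', $text) ?? '';
}
```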

救赎№ 2024-08-13 07:54:49

If you can't resolve the issue with the data from the feed and need to filter the information, then this may help:

PHP5 filter_input is very good for filtering input strings and allows a fair amount of flexibility

filter_input(input_type, variable, filter, options) 

You can also filter all of your form data in one line if it requires the same filtering :)

There are some good examples and more information about it here:

http://www.w3schools.com/PHP/func_filter_input.asp

The PHP site has more information on the options here: Validation Filters
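`filter_input` reads from request sources such as `INPUT_GET`; for a string that is already in a variable, as with a feed, the sibling function `filter_var` takes the same filters and flags. A minimal sketch, with `strip_high_bytes` as a hypothetical name, that drops every byte above 127:

```php
<?php
// Hypothetical helper: remove all bytes with a value greater than 127.
function strip_high_bytes(string $text): string
{
    // FILTER_UNSAFE_RAW leaves the string alone apart from what the flags remove;
    // FILTER_FLAG_STRIP_HIGH removes bytes above 127.
    $filtered = filter_var($text, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_HIGH);
    // filter_var returns false on failure; fall back to the input in that case.
    return $filtered === false ? $text : $filtered;
}
```

Because this works on bytes, it strips every non-ASCII character, including valid ones such as £ or é, not only the garbage.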

无戏配角 2024-08-13 07:54:49

Take a look at this question to get the value of each byte in your string. (This assumes that multibyte overloading is turned off.)

Once you have the bytes, you can use them to determine what these "rubbish" characters actually are. It's possible that they're a result of misinterpreting the encoding of the string, or displaying it in the wrong font, or something else. Post them here and people can help you further.
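A small sketch of how the bytes could be dumped for inspection; `hex_bytes` is a hypothetical name, and the sample string is made up for illustration:

```php
<?php
// Hypothetical helper: render each byte of $text as two hex digits, separated by spaces.
function hex_bytes(string $text): string
{
    // bin2hex works on raw bytes; chunk_split adds a space after every two hex digits,
    // and rtrim removes the trailing one.
    return rtrim(chunk_split(bin2hex($text), 2, ' '));
}

// "a", then U+FFFD (the replacement character) encoded as UTF-8, then "b".
echo hex_bytes("a\u{FFFD}b"), "\n"; // prints: 61 ef bf bd 62
```

Seeing `ef bf bd` in the output would suggest the replacement character was already present in the data as received, rather than being produced at display time.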

单调的奢华 2024-08-13 07:54:49

Try this:

  • Download a sample from the feed manually.
  • Open it in Notepad++ or another advanced text editor (KATE on Linux is good for this).
  • Try changing the encoding and converting from one encoding to another.

If you find a setting that makes the characters display properly, then you'll need to either encode your site in that encoding, or convert it from that encoding to whatever you use on your site.
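The same trial-and-error can be sketched in PHP by decoding one sample under several candidate encodings and comparing the results by eye. The function name `decode_candidates`, the candidate list and the sample bytes are all assumptions for illustration; the mbstring extension is required.

```php
<?php
// Hypothetical helper: decode the same raw bytes under several candidate encodings.
// Returns an array mapping each encoding name to the UTF-8 result.
function decode_candidates(
    string $raw,
    array $candidates = ['Windows-1252', 'ISO-8859-15', 'UTF-8']
): array {
    $results = [];
    foreach ($candidates as $encoding) {
        // Reinterpret the raw bytes as $encoding and convert them to UTF-8.
        $results[$encoding] = mb_convert_encoding($raw, 'UTF-8', $encoding);
    }
    return $results;
}

// Byte 0xA4 is the currency sign in Windows-1252 but the euro sign in ISO-8859-15.
foreach (decode_candidates("\xA4 10") as $encoding => $decoded) {
    echo $encoding, ': ', $decoded, "\n";
}
```

Whichever candidate renders the sample correctly is the one to convert from before displaying the feed.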

长发绾君心 2024-08-13 07:54:49

Hello friends,

Try this regular expression to remove literal \uXXXX Unicode escape sequences from the string:

     /\\u[0-9a-fA-F]{4}/

Thanks,
Chintu
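Applied in PHP, stripping literal `\uXXXX` escape sequences (a backslash, the letter u and four hex digits appearing as plain text) might look like the sketch below. `strip_unicode_escapes` is a hypothetical name. This only helps if the feed contains such escapes as text; it does not touch actual non-ASCII characters.

```php
<?php
// Hypothetical helper: remove textual \uXXXX escape sequences from a string.
function strip_unicode_escapes(string $text): string
{
    // In a single-quoted PHP string, '\\\\u' becomes the regex \\u,
    // which matches a literal backslash followed by "u".
    return preg_replace('/\\\\u[0-9a-fA-F]{4}/', '', $text) ?? $text;
}
```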
