Remove characters from a PHP string
I'm accepting a string from a feed for display on the screen that may or may not include some rubbish I want to filter out. I don't want to filter normal symbols at all.
The values I want to remove look like this: �
It is only this that I want removed. Relevant technology is PHP.
Suggestions appreciated.
Comments (8)
This is an encoding problem; you shouldn't try to clean out those bogus characters, but rather understand why you're receiving them scrambled.
Try to get your data as Unicode, or come to an agreement with your feed provider so that you both use the same encoding.
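As a rough illustration of "getting the data as Unicode", here is a minimal sketch. It assumes the site displays UTF-8 and that feed data which is not valid UTF-8 is really ISO-8859-1; both are assumptions, and `to_utf8` is a hypothetical helper name, not something from the question. It needs the mbstring extension.

```php
<?php
// Hypothetical helper: the name and the ISO-8859-1 default are assumptions.
function to_utf8(string $raw, string $assumedSource = 'ISO-8859-1'): string
{
    // Already valid UTF-8: leave the bytes alone.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    // Otherwise reinterpret the bytes using the assumed source encoding.
    return mb_convert_encoding($raw, 'UTF-8', $assumedSource);
}
```

If the feed declares its encoding (HTTP header, XML prolog), passing that in as `$assumedSource` is better than guessing.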
Thanks for the responses, guys. Unfortunately, those submitted had the following problems:
wrong for obvious reasons:
This:
which uses the deprecated ereg form of regex, also didn't work when I converted it to preg, because the range was simply too large for the regex to handle. There are also holes in that range that would allow rubbish to seep through.
This suggestion:
while valid, is no good because I don't have any control over how the data I receive is encoded. It comes from an external source. Sometimes there's garbage in there and sometimes there is not.
So, the solution I came up with was relatively dirty, but in the absence of something more robust I'm just accepting all standard letters, numbers and symbols and discarding the rest.
This does seem to work for now. The solution is as follows:
If anyone has any better ideas I'm still keen to hear them. Cheers.
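The snippet itself isn't shown here, so what follows is not the poster's code. It is only a minimal sketch of the whitelist idea described above, assuming "standard letters, numbers and symbols" means printable ASCII (space through tilde). `keep_printable_ascii` is a hypothetical name.

```php
<?php
// Hypothetical helper: keeps printable ASCII only (an assumed whitelist).
function keep_printable_ascii(string $s): string
{
    // No /u modifier, so the pattern works byte by byte:
    // keep 0x20 (space) through 0x7E (tilde), drop every other byte.
    return preg_replace('/[^\x20-\x7E]/', '', $s) ?? '';
}
```

Note that this also drops tabs, newlines and any legitimate accented or non-Latin characters, which is why it counts as a dirty fix.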
You are looking for characters that are outside the range of glyphs that your font can display. You can find the maximum Unicode value that your font can display, and then create a regex that replaces anything above that value with an empty string. An example would be
This would strip anything above character 255.
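The example from this answer isn't preserved, so here is a sketch of what such a regex might look like with preg, assuming the input is UTF-8. The function name and the choice to return the input unchanged on failure are assumptions.

```php
<?php
// Hypothetical helper: removes every code point above U+00FF.
function strip_above_255(string $s): string
{
    // The /u modifier makes PCRE read the subject as UTF-8,
    // so \x{FF} refers to a code point rather than a byte.
    $out = preg_replace('/[^\x{00}-\x{FF}]/u', '', $s);
    // preg_replace returns null when the subject is not valid UTF-8;
    // in that case this sketch hands the input back untouched.
    return $out ?? $s;
}
```

The � shown in the question is U+FFFD, which is above 255, so it would be removed while é (U+00E9) is kept.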
That's going to be difficult for you to do, since you don't have a solid definition of what to filter and what to keep. Typically, characters that show up as empty squares are anything that the typeface you're using doesn't have a glyph for, so the definition of "stuff that shows up like this: �" is horribly inexact.
It would be much better for you to decide exactly what characters are valid (this is always a good approach anyway, with any kind of data cleanup) and discard everything that is not one of those. The PHP filter function is one possibility to do this, depending on the level of complexity and robustness you require.
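For the filter route, one possible sketch uses `filter_var` with `FILTER_UNSAFE_RAW` and the strip flags. This assumes that dropping every byte above 127 is acceptable, which it won't be if the feed contains legitimate non-ASCII text. `strip_high_bytes` is a hypothetical name.

```php
<?php
// Hypothetical helper: drops every byte with a value above 127.
function strip_high_bytes(string $s): string
{
    // FILTER_UNSAFE_RAW does nothing by itself; the flag does the stripping.
    // Add FILTER_FLAG_STRIP_LOW as well to drop control characters below 32.
    return filter_var($s, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_HIGH);
}
```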
If you can't resolve the issue with the data from the feed and need to filter the information, then this may help:
PHP5's filter_input is very good for filtering input strings and allows a fair amount of flexibility.
You can also filter all of your form data in one line if it requires the same filtering :)
There are some good examples and more information about it here:
http://www.w3schools.com/PHP/func_filter_input.asp
The PHP site has more information on the options here: Validation Filters
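Since the string in the question comes from a feed rather than from request input, `filter_var_array` may be the closer fit than `filter_input`. A sketch, assuming every field should get the same strip-high-bytes treatment and that all values are strings; the helper name is hypothetical.

```php
<?php
// Hypothetical helper: applies one filter spec to every field of an array.
function strip_high_in_all(array $fields): array
{
    if ($fields === []) {
        return [];
    }
    // Same spec for each key: raw filter plus the strip-high flag.
    $spec = ['filter' => FILTER_UNSAFE_RAW, 'flags' => FILTER_FLAG_STRIP_HIGH];
    $definition = array_fill_keys(array_keys($fields), $spec);
    return filter_var_array($fields, $definition);
}
```

For real form data, `filter_input_array(INPUT_POST, $definition)` takes the same kind of definition array.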
Take a look at this question to get the value of each byte in your string. (This assumes that multibyte overloading is turned off.)
Once you have the bytes, you can use them to determine what these "rubbish" characters actually are. It's possible that they're a result of misinterpreting the encoding of the string, or displaying it in the wrong font, or something else. Post them here and people can help you further.
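A quick way to see the bytes is to dump the string as hex. This is only a sketch and `hex_bytes` is a hypothetical name; `bin2hex` works on raw bytes regardless of the multibyte settings.

```php
<?php
// Hypothetical helper: shows each byte of the string as two hex digits.
function hex_bytes(string $s): string
{
    // bin2hex gives one long hex string; split it into byte-sized pairs.
    return implode(' ', str_split(bin2hex($s), 2));
}
```

If the dump shows `ef bf bd`, that is the UTF-8 encoding of U+FFFD, the Unicode replacement character. That would suggest the original bytes were already lost before the string reached your code.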
Try this:
If you find a setting that makes the characters display properly, then you'll need to either encode your site in that encoding, or convert it from that encoding to whatever you use on your site.
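For the "convert it from that encoding" part, one way to experiment is to decode the same raw bytes under several candidate encodings and look at which result reads correctly. A sketch only: the candidate list is an assumption, `decode_candidates` is a hypothetical name, and the mbstring extension is required.

```php
<?php
// Hypothetical helper: decodes the raw bytes once per candidate encoding.
function decode_candidates(string $raw, array $encodings): array
{
    $out = [];
    foreach ($encodings as $enc) {
        // Treat $raw as being in $enc and convert it to UTF-8.
        $out[$enc] = mb_convert_encoding($raw, 'UTF-8', $enc);
    }
    return $out;
}

// Usage: print each attempt (on a UTF-8 page or terminal) and compare by eye.
foreach (decode_candidates("caf\xE9", ['ISO-8859-1', 'ISO-8859-15']) as $enc => $text) {
    echo $enc, ': ', $text, "\n";
}
```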
Hello Friends,
Thanks,
Chintu