PHP regex to filter out garbage

Posted on 2024-08-11 05:57:07

So I have an interesting problem: I have a string, and for the most part I know what to expect:

http://www.someurl.com/st=????????

Except in this case, the ?'s are either upper case letters or numbers. The problem is, the string has garbage mixed in: the string is broken up into 5 or 6 pieces, and in between there's lots of junk: unprintable characters, foreign characters, as well as plain old normal characters. In short, stuff that's apt to look like this: Nyþ=mî;ëMÝ×nüqÏ

Usually the last 8 characters (the ?'s) are together right at the end, so at the moment I just have PHP grab the last 8 chars and hope for the best. Occasionally, that doesn't work, so I need a more robust solution.

The problem is technically unsolvable, but I think the best solution is to grab characters from the end of the string while they are upper case or numeric. If I get 8 or more, assume that is correct. Otherwise, find the st= and grab characters going forward as many as I need to fill up the 8 character quota. Is there a regex way to do this or will I need to roll up my sleeves and go nested-loop style?

update:

To clear up some confusion, I get an input string that's like this:

[garbage]http:/[garbage]/somewe[garbage]bsite.co[garbage]m/something=[garbage]????????

except the garbage is at unpredictable locations in the string (though the end is never garbage) and has unpredictable length (at least, I haven't been able to find a pattern in either). Usually the ?s are all together, hence my just grabbing the last 8 chars, but sometimes they aren't, which results in some missing data and returned garbage.
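
The fallback strategy described in the question can be sketched roughly as follows. This is only one possible reading of it: the function name `extractCode`, the `$marker` and `$length` parameters, and the assumption that the `st=` marker itself is not broken up by garbage are all illustrative and not from the original post. It also assumes PHP 7.1 or later for the nullable return type.

```php
<?php
// Hypothetical helper; names and defaults are assumptions for illustration.
function extractCode(string $input, string $marker = 'st=', int $length = 8): ?string
{
    // 1. Take the unbroken run of upper-case letters/digits at the very end.
    //    The pattern always matches (possibly the empty string).
    preg_match('/[A-Z0-9]*\z/', $input, $m);
    $tail = $m[0];
    if (strlen($tail) >= $length) {
        return substr($tail, -$length);
    }

    // 2. Not enough: look between the marker and the tail, keep only
    //    upper-case letters/digits, and fill the quota from the front.
    $head = substr($input, 0, strlen($input) - strlen($tail));
    $pos = strrpos($head, $marker);
    if ($pos === false) {
        return null; // marker not found (or itself split by garbage)
    }
    $afterMarker = (string) substr($head, $pos + strlen($marker));
    $candidates = preg_replace('/[^A-Z0-9]/', '', $afterMarker);
    $missing = $length - strlen($tail);
    if (strlen($candidates) < $missing) {
        return null; // cannot fill the quota
    }
    return substr($candidates, 0, $missing) . $tail;
}

// Clean ending: the last 8 characters are taken directly.
echo extractCode("http://www.someurl.com/st=AB12CD34"), PHP_EOL;
// Garbage bytes in the middle of the code: "AB1" is recovered from after the marker.
echo extractCode("http://www.someurl.com/st=AB1\xFE;\xEE2CD34"), PHP_EOL;
```

Both calls print `AB12CD34`. If the garbage between the marker and the tail happens to contain upper-case letters or digits, this sketch will pick them up as if they were real, which is the ambiguity the question already acknowledges.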

乙白 2024-08-18 05:57:07

$var = '†http://þ=www.ex;üßample-website.î;ëcomÝ×ü/joy_hÏere.html'; // test case

$clean = join(
    array_filter(
        str_split($var, 1),
        function ($char) {
            return (
                array_key_exists(
                    $char,
                    array_flip(array_merge(
                        range('A','Z'),
                        range('a','z'),
                        range((string)'0',(string)'9'),
                        array(':','.','/','-','_')
                    ))
                )
            );
        }
    )
);

Hah, that was a joke. Here's a regex for you:

$clean = preg_replace('/[^A-Za-z0-9:.\/_-]/','',$var);
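
As a quick check, here is the one-liner applied to the test string from this answer. Nothing is assumed beyond what the answer already shows; the comments are added for illustration.

```php
<?php
// Test string from the answer; every non-ASCII byte counts as garbage.
$var = '†http://þ=www.ex;üßample-website.î;ëcomÝ×ü/joy_hÏere.html';

// Delete every byte that is not a letter, a digit, or one of : . / _ -
$clean = preg_replace('/[^A-Za-z0-9:.\/_-]/', '', $var);

echo $clean, PHP_EOL; // http://www.example-website.com/joy_here.html
```

Note that this whitelist also deletes `=`, so a marker such as `st=` would lose its equals sign unless `=` is added to the character class. It also cannot remove garbage that consists of ordinary letters or digits.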


沙沙粒小 2024-08-18 05:57:07

As stated, the problem is unsolvable. If the garbage can contain "plain old normal characters", and the garbage can fall at the end of the string, then you cannot know whether the target string from this sample is "ABCDEFGH" or "BCDEFGHI":

__http:/____/somewe___bsite.co____m/something=__ABCDEFGHI__
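
A small sketch makes the ambiguity concrete by listing every 8-character window of the upper-case run in that sample. The variable names are illustrative, and `_` stands in for garbage as in the sample above.

```php
<?php
// Sample from the answer, with "_" standing in for garbage.
$sample = '__http:/____/somewe___bsite.co____m/something=__ABCDEFGHI__';

// Find the run of upper-case letters/digits after the last "=".
$afterEquals = substr($sample, strrpos($sample, '=') + 1);
preg_match('/[A-Z0-9]{8,}/', $afterEquals, $m);
$run = $m[0]; // "ABCDEFGHI"

// Every 8-character window of that run is an equally plausible target.
$candidates = [];
for ($i = 0; $i + 8 <= strlen($run); $i++) {
    $candidates[] = substr($run, $i, 8);
}

print_r($candidates); // ABCDEFGH and BCDEFGHI
```

With a 9-character run there are two candidates, and nothing in the string itself says which one is the real value.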


禾厶谷欠 2024-08-18 05:57:07

What do these values represent? If you want to keep everything without having to deal with garbage in the database, perhaps you should use bin2hex().
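
A minimal sketch of that idea, using a made-up 4-byte value: `bin2hex()` turns arbitrary bytes into plain ASCII hex that is safe to store, and `hex2bin()` (PHP 5.4 and later) restores the original bytes.

```php
<?php
// Illustrative value: "Ny" followed by two non-ASCII bytes.
$raw = "Ny\xFE\xEE";

// Encode to a plain ASCII hex string, two hex digits per byte.
$hex = bin2hex($raw);
echo $hex, PHP_EOL; // 4e79feee

// Decoding gives back exactly the original bytes.
$back = hex2bin($hex);
var_dump($back === $raw); // bool(true)
```

This preserves the garbage rather than filtering it, so it only helps if the goal is safe storage and not extraction of the 8-character value.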


沧笙踏歌 2024-08-18 05:57:07

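You can use the following regular expression:

if (preg_match('/[\'^£$%&*()}{@#~?><>,|=_+Ø-]/', $string) === 1) {
    // $string contains at least one of the listed special characters
}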

About the author

花开半夏魅人心
