Categorizing data in R using regular expressions

Posted 2024-10-18 06:36:01

I have a file with two columns: one holds the content type of HTTP objects, such as text/html or application/rar, and the other holds the size in bytes.

Content Type                                     Size
video/x-flv                                       100
image/jpeg                                        150
text/html                                         160
application/octet-stream                          200  
application/x-shockwave-flash                     ...
text/plain
application/x-javascript
text/xml
text/css
text/html; charset=utf-8
application/x-javascript; charset=utf-8           ...

As you can see, there are many variations of the same content type, such as application/x-javascript and application/x-javascript; charset=utf-8. I would therefore like to create another column that categorizes them more generically, so that these two would both become web/javascript, and so on.

 Content Type                                      Size      Category
    video/x-flv                                       100       web/video
    image/jpeg                                        150       web/image
    text/html                                         160       web/html
    application/octet-stream                          200       web/binary
    application/x-shockwave-flash                     ...       web/flash
    text/plain                                                  web/plaintext
    application/x-javascript                                    web/javascript
    video/x-msvideo                                             web/video
    text/xml                                                    web/xml
    text/css                                                    web/css
    text/html; charset=utf-8                                    web/html
    video/quicktime                                             web/video
    application/x-javascript; charset=utf-8                     web/javascript

How would I accomplish this in R? I presume I need to use regular expressions of some sort.

梦境 2024-10-25 06:36:01

There are several ways to simplify your variable. Here I will use the string manipulation functions from the stringr package:

R> library(stringr)

First, copy your content type variable into a new character variable:

R> d <- data.frame(type=c("video/x-flv", "image/jpeg","video/x-msvideo", "application/x-javascript; charset=utf-8", "application/x-javascript"))
R> d$type2 <- as.character(d$type)

This just gives you:

                                     type                                   type2
1                             video/x-flv                             video/x-flv
2                              image/jpeg                              image/jpeg
3                         video/x-msvideo                         video/x-msvideo
4 application/x-javascript; charset=utf-8 application/x-javascript; charset=utf-8
5                application/x-javascript                application/x-javascript

Then you can work on the new variable. You can manually replace a specific value with another:

R> d$type2[d$type2 == "video/x-flv"] <- "video"
R> d
                                     type                                   type2
1                             video/x-flv                                   video
2                              image/jpeg                              image/jpeg
3                         video/x-msvideo                         video/x-msvideo
4 application/x-javascript; charset=utf-8 application/x-javascript; charset=utf-8
5                application/x-javascript                application/x-javascript

You can use regexp matching to replace every value that matches a pattern, for example "video":

R> d$type2[str_detect(d$type2, ".*video.*")] <- "video"
R> d
                                     type                                   type2
1                             video/x-flv                                   video
2                              image/jpeg                              image/jpeg
3                         video/x-msvideo                                   video
4 application/x-javascript; charset=utf-8 application/x-javascript; charset=utf-8
5                application/x-javascript                application/x-javascript
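
As a side note, here is a minimal base R sketch of the same step. It swaps in grepl() for str_detect() (my substitution, not part of the answer) and uses a small made-up vector. It also illustrates that an unanchored pattern matches anywhere in the string, so the plain pattern "video" selects the same elements as ".*video.*":

```r
type2 <- c("video/x-flv", "image/jpeg", "video/x-msvideo")

# A pattern matches anywhere in the string unless it is anchored,
# so "video" and ".*video.*" flag the same elements.
same <- identical(grepl("video", type2), grepl(".*video.*", type2))
print(same)

# Replace every element that contains "video".
type2[grepl("video", type2)] <- "video"
print(type2)
```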

Or you can use regexp replacement to clean up certain values, for example by removing everything from the ";" onward in your content types:

R> d$type2 <- str_replace(d$type2, ";.*$", "")
R> d
                                     type                    type2
1                             video/x-flv                    video
2                              image/jpeg               image/jpeg
3                         video/x-msvideo                    video
4 application/x-javascript; charset=utf-8 application/x-javascript
5                application/x-javascript application/x-javascript

Be careful with the order of your instructions, though, as the result depends heavily on it.
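
To make the point about ordering concrete, here is a small sketch under my own assumptions: it uses base R's sub() instead of str_replace(), and an illustrative vector and target label ("javascript") that are not taken from the answer. The same two steps are applied in both orders:

```r
type <- c("application/x-javascript; charset=utf-8",
          "application/x-javascript",
          "text/html; charset=utf-8")

# Order A: exact-match replacement first, then strip the parameters.
# The "; charset=utf-8" variant is not an exact match, so it is missed.
a <- type
a[a == "application/x-javascript"] <- "javascript"
a <- sub(";.*$", "", a)

# Order B: strip the parameters first, then do the exact-match replacement.
# Both javascript variants are now identical and both get replaced.
b <- sub(";.*$", "", type)
b[b == "application/x-javascript"] <- "javascript"

print(a)
print(b)
```

With order A the first element ends up as "application/x-javascript", while with order B it becomes "javascript", so cleaning the values before matching them is usually the safer sequence.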

岁月苍老的讽刺 2024-10-25 06:36:01

If you had to do it by hand, you could assign your factor levels to the corresponding categories. In this example, I group the first 13 letters of the alphabet as "1" and the remaining 13 as "2".

> x <- as.factor(sample(letters, 100, replace = TRUE))
> x
  [1] d n p n k l a x c n v p l o u e z m y x t r q b l n y s s m d u l l a d k
 [38] t a p x s g w i p l b s o t b s h h v c b j o p h f j m v d r m x o d l e
 [75] l f y l u e w f e e o s w s m v a z q l a t f z x s
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
> levels(x)
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
> levels(x) <- c(rep(1, 13), rep(2, 13))
> x
  [1] 1 2 2 2 1 1 1 2 1 2 2 2 1 2 2 1 2 1 2 2 2 2 2 1 1 2 2 2 2 1 1 2 1 1 1 1 1
 [38] 2 1 2 2 2 1 2 1 2 1 1 2 2 2 1 2 1 1 2 1 1 1 2 2 1 1 1 1 2 1 2 1 2 2 1 1 1
 [75] 1 1 2 1 2 1 2 1 1 1 2 2 2 2 1 2 1 2 2 1 1 2 1 2 2 2
Levels: 1 2
> levels(x)
[1] "1" "2"

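If levels(obj) returns (only) these values, in this order:

"video/x-flv" "image/jpeg" "video/x-msvideo" "application/x-javascript; charset=utf-8"

... you would code your levels like so, keeping the same order (factor() sorts levels by default, so check levels(obj) first):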

levels(obj) <- c("web/video", "web/image", "web/video", "web/javascript")
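
If keeping track of the level order by hand feels error-prone, one possible alternative is to assign a named list to levels(), which maps old levels to new ones by name instead of by position. This is only a sketch; obj here is a small factor I built for illustration:

```r
obj <- factor(c("video/x-flv", "image/jpeg", "video/x-msvideo",
                "application/x-javascript; charset=utf-8"))

# Levels are sorted by default, so they need not follow the order of the data.
print(levels(obj))

# Each list name is a new level; its value lists the old levels it absorbs.
levels(obj) <- list(
  "web/video"      = c("video/x-flv", "video/x-msvideo"),
  "web/image"      = "image/jpeg",
  "web/javascript" = "application/x-javascript; charset=utf-8"
)

print(as.character(obj))
```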

躲猫猫 2024-10-25 06:36:01

Assume that DF is our data frame. Define a regular expression, re, to match the strings of interest, then use strapply from the gsubfn package to extract them, prefixing "web/" to each. In the strapply call, DF[[1]] is converted to character in case it is a factor rather than a character vector. Entries that did not match come back as NULL, so let's assume those are "web/binary". Finally, expand any occurrence of "plain" to "plaintext":

> library(gsubfn)
> re <- "(video|image|html|flash|plain|javascript|xml|css).*"
> short <- strapply(as.character(DF[[1]]), re, ~ paste("web", x, sep = "/"))
> DF$short <- sapply(short, function(x) if (is.null(x)) "web/binary" else x)
> DF$short <- sub("plain", "plaintext", DF$short)
> DF
                                   Content          short
1                              video/x-flv      web/video
2                               image/jpeg      web/image
3                                text/html       web/html
4                 application/octet-stream     web/binary
5            application/x-shockwave-flash      web/flash
6                               text/plain  web/plaintext
7                 application/x-javascript web/javascript
8                          video/x-msvideo      web/video
9                                 text/xml        web/xml
10                                text/css        web/css
11                text/html; charset=utf-8       web/html
12                         video/quicktime      web/video
13 application/x-javascript; charset=utf-8 web/javascript

There is more info on the gsubfn package at http://gsubfn.googlecode.com .
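
For readers who cannot install gsubfn, the same idea can be roughly sketched in base R with regexpr() and regmatches(). This is my own variant rather than the answer's method, and the four-row DF below is an illustrative subset of the data:

```r
DF <- data.frame(Content = c("video/x-flv", "text/plain",
                             "application/octet-stream",
                             "application/x-javascript; charset=utf-8"),
                 stringsAsFactors = FALSE)

re <- "video|image|html|flash|plain|javascript|xml|css"

# regexpr() gives the position of the first match, or -1 when there is none.
m <- regexpr(re, DF$Content)

# Default every row to "web/binary", then fill in the rows that matched.
# regmatches() returns the matched substrings for the matching rows only.
DF$short <- "web/binary"
DF$short[m != -1] <- paste("web", regmatches(DF$Content, m), sep = "/")

# Expand "plain" to "plaintext", as in the gsubfn version.
DF$short <- sub("plain$", "plaintext", DF$short)
print(DF)
```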
