Categorizing data using regular expressions in R
I have a file with two columns: one holds the content type of HTTP objects, such as text/html or application/rar, and the other holds the size in bytes.
Content Type                             Size
video/x-flv                              100
image/jpeg                               150
text/html                                160
application/octet-stream                 200
application/x-shockwave-flash            ...
text/plain
application/x-javascript
text/xml
text/css
text/html; charset=utf-8
application/x-javascript; charset=utf-8  ...
As you can see, there are many variations of the same content type, such as application/x-javascript and application/x-javascript; charset=utf-8. So I would like to create another column that categorizes them more generically, so that these two would both just be web/javascript, and so on.
Content Type                             Size  Category
video/x-flv                              100   web/video
image/jpeg                               150   web/image
text/html                                160   web/html
application/octet-stream                 200   web/binary
application/x-shockwave-flash            ...   web/flash
text/plain                                     web/plaintext
application/x-javascript                       web/javascript
video/x-msvideo                                web/video
text/xml                                       web/xml
text/css                                       web/css
text/html; charset=utf-8                       web/html
video/quicktime                                web/video
application/x-javascript; charset=utf-8        web/javascript
How would I accomplish this in R? I presume I need to use regular expressions of some sort.
Comments (3)
There are several ways you can simplify your variable. Here I will use the stringr package for its string manipulation functions. First, copy your content type variable into a new character variable. This just gives you a character vector holding the same values as the original column.
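The code for this step did not survive on this page, so here is a minimal sketch of what the copy could look like. The data frame d, its column names, and the size given for the charset row are hypothetical stand-ins for the question's file.

```r
# Hypothetical sample data standing in for the question's file.
d <- data.frame(
  type = factor(c("video/x-flv", "text/html", "text/html; charset=utf-8")),
  size = c(100, 160, 200)
)

# as.character() drops the factor encoding, so the values can be edited freely
# without running into "invalid factor level" warnings.
d$category <- as.character(d$type)
print(d$category)
```

Working on a copy keeps the original content types intact, which makes it easy to compare them with the new categories afterwards.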
Then you can work on your new variable. You can manually replace a given type value with another.
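A manual replacement might look like the following sketch. It uses plain logical indexing from base R, and the vector name category is an assumption made for illustration.

```r
# Hypothetical character vector of content types.
category <- c("video/x-flv", "image/jpeg", "text/html",
              "application/octet-stream")

# Replace one exact value with its generic category.
category[category == "application/octet-stream"] <- "web/binary"
print(category)
```

This only touches values that are exactly equal to the string on the left, so variants such as a trailing charset would be left alone.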
You can use regexp matching to replace all values that match a pattern, for example "video".
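As a sketch of the matching step, the base R function grepl is used below so the example has no dependencies; this is a substitution on my part, and str_detect from stringr should behave the same way here. The vector name category is hypothetical.

```r
# Hypothetical character vector of content types.
category <- c("video/x-flv", "video/quicktime", "image/jpeg",
              "video/x-msvideo")

# grepl() returns TRUE for every element that contains the pattern,
# so all video variants collapse into one category.
category[grepl("video", category)] <- "web/video"
print(category)
```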
Or you can use regexp replacement to clean up certain values, for example by removing everything after the ";" in your content types.
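The replacement step could be sketched as follows, again with base R (sub) standing in for the stringr equivalent str_replace. The pattern and the vector name are assumptions for illustration.

```r
# Hypothetical character vector of content types.
category <- c("text/html; charset=utf-8",
              "application/x-javascript; charset=utf-8",
              "text/css")

# ";.*$" matches the first semicolon and everything after it;
# replacing that with "" strips the charset suffix.
category <- sub(";.*$", "", category)
print(category)
```

Values without a semicolon, such as text/css, pass through unchanged.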
Be careful with the order of your instructions, though, as the result depends heavily on it.
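To illustrate why order matters, here is a small sketch that runs the same two steps in both orders. The names types, a, and b are hypothetical.

```r
types <- c("text/html", "text/html; charset=utf-8")

# Order A: exact replacement first, then strip the ";..." suffix.
a <- types
a[a == "text/html"] <- "web/html"
a <- sub(";.*$", "", a)

# Order B: strip the suffix first, then do the exact replacement.
b <- sub(";.*$", "", types)
b[b == "text/html"] <- "web/html"

print(a)  # the charset variant was cleaned too late to be categorized
print(b)  # both values end up as "web/html"
```

In order A the second value only becomes text/html after the exact match has already run, so it is never mapped to web/html.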
If you had to do it by hand, you could assign your factor levels to corresponding categories. In this example, I group the first 13 letters of the alphabet as "1" and the second half of the letters as "2". If your data contains (only) factors, you would recode the levels directly.
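The original code for this answer is missing from the page, so the following is only a sketch of the idea under my own assumptions: x is a hypothetical factor whose levels are the 26 lowercase letters, and the grouping is done by assigning duplicated level names, which R merges.

```r
# Hypothetical factor; levels = letters guarantees all 26 levels exist
# in alphabetical order, even for letters that do not occur in the data.
x <- factor(c("a", "m", "n", "z", "c", "q"), levels = letters)

# Assigning a levels vector with repeated names merges the levels:
# a..m become "1", n..z become "2".
grp <- x
levels(grp) <- rep(c("1", "2"), each = 13)

print(data.frame(x = x, grp = grp))
```

The same approach would apply to content types: write out one category per existing level, in the order given by levels(), and assign the whole vector at once.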
Assume that DF is our data frame. Define a regular expression, re, to match the strings of interest, and then use strapply in the gsubfn package to extract them, prefixing "web/" to each. In the strapply statement we have converted DF[[1]] to character just in case it is a factor rather than a character vector. NULL entries were not matched, so let's assume those are "web/binary". Finally, expand any occurrences of "plain" to "plaintext". There is more info on the gsubfn package at http://gsubfn.googlecode.com .
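The answer's own strapply code is not preserved here. As a hedged stand-in, the sketch below follows the same steps with base R (regexpr and regmatches) instead of gsubfn, so it is not the author's exact method. The contents of DF, the column name ContentType, and the alternatives listed in re are all assumptions.

```r
# Hypothetical data frame; the first column holds the content types.
DF <- data.frame(
  ContentType = factor(c("video/x-flv", "image/jpeg",
                         "text/html; charset=utf-8",
                         "application/octet-stream",
                         "application/x-shockwave-flash",
                         "text/plain",
                         "application/x-javascript; charset=utf-8"))
)

# Assumed list of keywords of interest.
re <- "video|image|html|flash|plain|javascript|xml|css"

# Convert to character in case the column is a factor.
x <- as.character(DF[[1]])
m <- regexpr(re, x)  # position of the first match, -1 where nothing matched

# Unmatched entries default to "web/binary"; matched ones get "web/" + keyword.
category <- rep("web/binary", length(x))
category[m > 0] <- paste0("web/", regmatches(x, m))

# Expand "plain" to "plaintext".
DF$Category <- sub("plain$", "plaintext", category)
print(DF)
```

Defaulting unmatched rows to web/binary mirrors the answer's treatment of NULL entries; a stricter variant could use NA instead so that unknown types stand out.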