R: read data from a space-delimited txt file with quoted text
I'm trying to load a dataset into RStudio. The dataset itself is space-delimited, but it also contains spaces inside quoted text, as in CSV files. Here is the head of the data:
DOC_ID LABEL RATING VERIFIED_PURCHASE PRODUCT_CATEGORY PRODUCT_ID PRODUCT_TITLE REVIEW_TITLE REVIEW_TEXT
1 __label1__ 4 N PC B00008NG7N "Targus PAUK10U Ultra Mini USB Keypad, Black" useful "When least you think so, this product will save the day. Just keep it around just in case you need it for something."
2 __label1__ 4 Y Wireless B00LH0Y3NM Note 3 Battery : Stalion Strength Replacement 3200mAh Li-Ion Battery for Samsung Galaxy Note 3 [24-Month Warranty] with NFC Chip + Google Wallet Capable New era for batteries Lithium batteries are something new introduced in the market there average developing cost is relatively high but Stallion doesn't compromise on quality and provides us with the best at a low cost.<br />There are so many in built technical assistants that act like a sensor in their particular forté. The battery keeps my phone charged up and it works at every voltage and a high voltage is never risked.
3 __label1__ 3 N Baby B000I5UZ1Q "Fisher-Price Papasan Cradle Swing, Starlight" doesn't swing very well. "I purchased this swing for my baby. She is 6 months now and has pretty much out grown it. It is very loud and doesn't swing very well. It is beautiful though. I love the colors and it has a lot of settings, but I don't think it was worth the money."
4 __label1__ 4 N Office Products B003822IRA Casio MS-80B Standard Function Desktop Calculator Great computing! I was looking for an inexpensive desk calcolatur and here it is. It works and does everything I need. Only issue is that it tilts slightly to one side so when I hit any keys it rocks a little bit. Not a big deal.
5 __label1__ 4 N Beauty B00PWSAXAM Shine Whitening - Zero Peroxide Teeth Whitening System - No Sensitivity Only use twice a week "I only use it twice a week and the results are great. I have used other teeth whitening solutions and most of them, for the same results I would have to use it at least three times a week. Will keep using this because of the potency of the solution and also the technique of the trays, it keeps everything in my teeth, in my mouth."
6 __label1__ 3 N Health & Personal Care B00686HNUK Tobacco Pipe Stand - Fold-away Portable - Light Weight - For Single Pipe not sure I'm not sure what this is supposed to be but I would recommend that you do a little more research into the culture of using pipes if you plan on giving this as a gift or using it yourself.
7 __label1__ 4 N Toys B00NUG865W ESPN 2-Piece Table Tennis PING PONG TABLE GREAT FOR YOUTHS AND FAMILY "Pleased with ping pong table. 11 year old and 13 year old having a blast, plus lots of family entertainment too. Plus better than kids sitting on video games all day. A friend put it together. I do believe that was a challenge, but nothing they could not handle"
8 __label1__ 4 Y Beauty B00QUL8VX6 "Abundant Health 25% Vitamin C Serum with Vitamin E and Hyaluronic Acid for Youthful Looking Skin, 1 fl. oz." Great vitamin C serum "Great vitamin C serum... I really like the oil feeling, not too sticky. I used it last week on some of my recent bug bites and it helps heal the skin faster than normal."
9 __label1__ 4 N Health & Personal Care B004YHKVCM PODS Spring Meadow HE Turbo Laundry Detergent Pacs 77-load Tub wonderful detergent. "I've used tide pods laundry detergent for many years,its such a great detergent to use having a nice scent and leaver the cloths smelling fresh."
The problem is that the file looks tab-delimited but is not. For example, in the row with DOC_ID = 1 there are only two spaces between useful and "When least...". Because of this, passing sep = "\t" to read.table throws an error saying that line 1 did not have 10 elements, which is itself puzzling, because the number of elements should be 9. Here are the parameters that I'm passing (without the original path):
read.table(file = "path", sep ="\t", header = TRUE, strip.white = TRUE)
Relying on quotes is not a good strategy either, because some lines do not have their text quoted. The delimiter should therefore be something like a double space, which combined with strip.white should work properly, but read.table only accepts single-byte delimiters.
So the question is: how would you parse such a corpus in R, or with any third-party software that could adequately convert it to a CSV or at least a tab-delimited file?
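The "double space as delimiter" idea can be sketched in Python with a regular expression. This is only an illustration under loud assumptions: fields are separated by either a tab or a run of two or more spaces, and no field contains two consecutive spaces itself (review text often does, for example after a full stop, so this may well not hold for the real file). The names FIELD_SEP and split_fields are made up for this example, and the sample line is an abbreviated fragment of the first data row.

```python
import re

# Assumption: a field boundary is a single tab or a run of 2+ spaces.
FIELD_SEP = re.compile(r"\t| {2,}")


def split_fields(line):
    """Split one line on the assumed separators and drop wrapping double quotes."""
    parts = FIELD_SEP.split(line.strip())
    # Strip the quotes only when a field is wrapped in them on both ends.
    return [p[1:-1] if len(p) >= 2 and p[0] == p[-1] == '"' else p for p in parts]


# Abbreviated fragment of the DOC_ID = 1 row: tabs between most fields,
# two spaces between the review title and the review text.
line = 'B00008NG7N\t"Targus PAUK10U Ultra Mini USB Keypad, Black"\tuseful  "When least you think so."'
fields = split_fields(line)
print(fields)
```

Single spaces inside a field are left alone, so a product title with spaces stays in one piece whether or not it is quoted.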
Comments (1)
Parsing the data with Python's pandas.read_csv(filename, sep='\t', header = 0, ...) seems to have worked, and from that point anything can be done with it. Closing this out.
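For readers without pandas, roughly the same parse can be sketched with Python's standard csv module, used here as a stand-in for pandas.read_csv rather than as the method from the answer. It assumes the real file is tab-separated, as the answer suggests. One plausible reason this works where read.table failed is that the csv module treats only the double quote as a quote character by default, whereas read.table's default quote set also includes the apostrophe, so text such as "doesn't" can throw off the field count. The sample rows are abbreviated from the question, and read_reviews is a hypothetical helper name.

```python
import csv
import io

# Abbreviated sample rebuilt from the question, joined with real tabs (assumption).
HEADER = ["DOC_ID", "LABEL", "RATING", "VERIFIED_PURCHASE", "PRODUCT_CATEGORY",
          "PRODUCT_ID", "PRODUCT_TITLE", "REVIEW_TITLE", "REVIEW_TEXT"]
ROW_1 = ["1", "__label1__", "4", "N", "PC", "B00008NG7N",
         '"Targus PAUK10U Ultra Mini USB Keypad, Black"', "useful",
         '"When least you think so, this product will save the day."']
ROW_3 = ["3", "__label1__", "3", "N", "Baby", "B000I5UZ1Q",
         '"Fisher-Price Papasan Cradle Swing, Starlight"',
         "doesn't swing very well.",
         '"It is very loud and doesn\'t swing very well."']
SAMPLE = "\n".join("\t".join(r) for r in (HEADER, ROW_1, ROW_3)) + "\n"


def read_reviews(handle):
    """Parse tab-separated text into a list of dicts keyed by the header row."""
    # Only the double quote is a quote character, so apostrophes are plain text.
    reader = csv.DictReader(handle, delimiter="\t", quotechar='"')
    return list(reader)


rows = read_reviews(io.StringIO(SAMPLE))
for row in rows:
    print(row["DOC_ID"], "|", row["PRODUCT_TITLE"], "|", row["REVIEW_TITLE"])
```

With a real file, the same function could be called on open(path, newline="", encoding="utf-8"), where path and the encoding are placeholders to adapt.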