R: reading data from a space-delimited txt file with quoted text

Posted on 2025-01-17 03:57:43

I'm trying to load a dataset into RStudio. The dataset looks space-delimited, but it also contains spaces inside quoted text, as in CSV files. Here is the head of the data:

DOC_ID  LABEL   RATING  VERIFIED_PURCHASE   PRODUCT_CATEGORY    PRODUCT_ID  PRODUCT_TITLE   REVIEW_TITLE    REVIEW_TEXT
1   __label1__  4   N   PC  B00008NG7N  "Targus PAUK10U Ultra Mini USB Keypad, Black"   useful  "When least you think so, this product will save the day. Just keep it around just in case you need it for something."
2   __label1__  4   Y   Wireless    B00LH0Y3NM  Note 3 Battery : Stalion Strength Replacement 3200mAh Li-Ion Battery for Samsung Galaxy Note 3 [24-Month Warranty] with NFC Chip + Google Wallet Capable    New era for batteries   Lithium batteries are something new introduced in the market there average developing cost is relatively high but Stallion doesn't compromise on quality and provides us with the best at a low cost.<br />There are so many in built technical assistants that act like a sensor in their particular forté. The battery keeps my phone charged up and it works at every voltage and a high voltage is never risked.
3   __label1__  3   N   Baby    B000I5UZ1Q  "Fisher-Price Papasan Cradle Swing, Starlight"  doesn't swing very well.    "I purchased this swing for my baby. She is 6 months now and has pretty much out grown it. It is very loud and doesn't swing very well. It is beautiful though. I love the colors and it has a lot of settings, but I don't think it was worth the money."
4   __label1__  4   N   Office Products B003822IRA  Casio MS-80B Standard Function Desktop Calculator   Great computing!    I was looking for an inexpensive desk calcolatur and here it is. It works and does everything I need. Only issue is that it tilts slightly to one side so when I hit any keys it rocks a little bit. Not a big deal.
5   __label1__  4   N   Beauty  B00PWSAXAM  Shine Whitening - Zero Peroxide Teeth Whitening System - No Sensitivity Only use twice a week   "I only use it twice a week and the results are great. I have used other teeth whitening solutions and most of them, for the same results I would have to use it at least three times a week. Will keep using this because of the potency of the solution and also the technique of the trays, it keeps everything in my teeth, in my mouth."
6   __label1__  3   N   Health & Personal Care  B00686HNUK  Tobacco Pipe Stand - Fold-away Portable - Light Weight - For Single Pipe    not sure    I'm not sure what this is supposed to be but I would recommend that you do a little more research into the culture of using pipes if you plan on giving this as a gift or using it yourself.
7   __label1__  4   N   Toys    B00NUG865W  ESPN 2-Piece Table Tennis   PING PONG TABLE GREAT FOR YOUTHS AND FAMILY "Pleased with ping pong table. 11 year old and 13 year old having a blast, plus lots of family entertainment too. Plus better than kids sitting on video games all day. A friend put it together. I do believe that was a challenge, but nothing they could not handle"
8   __label1__  4   Y   Beauty  B00QUL8VX6  "Abundant Health 25% Vitamin C Serum with Vitamin E and Hyaluronic Acid for Youthful Looking Skin, 1 fl. oz."   Great vitamin C serum   "Great vitamin C serum... I really like the oil feeling, not too sticky. I used it last week on some of my recent bug bites and it helps heal the skin faster than normal."
9   __label1__  4   N   Health & Personal Care  B004YHKVCM  PODS Spring Meadow HE Turbo Laundry Detergent Pacs 77-load Tub  wonderful detergent.    "I've used tide pods laundry detergent for many years,its such a great detergent to use having a nice scent and leaver the cloths smelling fresh."

The problem is that it looks tab-delimited but is not. For example, in the row with DOC_ID = 1 there are only two spaces between useful and "When least...". As a result, passing sep = "\t" to read.table throws an error saying that line 1 did not have 10 elements, which is puzzling because the number of elements should be 9. Here are the parameters that I'm passing (without the original path):

read.table(file = "path", sep = "\t", header = TRUE, strip.white = TRUE)

Relying on quotes is not a good strategy either, because some lines do not have their text quoted. The delimiter would therefore have to be something like a double space, which combined with strip.white should work, but read.table only accepts single-byte delimiters.

So the question is: how would you parse such a corpus in R, or with any other third-party software that could convert it adequately to a CSV or at least a tab-delimited file?

Comments (1)

标点 2025-01-24 03:57:44

Parsing the data with Python's pandas.read_csv(filename, sep='\t', header=0, ...) appears to have worked, and from that point anything can be done with it. Closing this out.
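For readers who want to try the same idea, here is a minimal sketch of a tab-separated parse followed by a conversion to CSV. It is not the exact call from the answer, whose remaining arguments are elided. It swaps pandas for the standard-library csv module so that it runs without third-party packages, and it assumes the real file contains actual tab characters (which the successful sep='\t' parse suggests) and uses CSV-style double quotes. The sample string, trimmed to five columns, and the helper names parse_tsv and to_csv are hypothetical.

```python
import csv
import io

# Hypothetical, trimmed sample: "\t" stands in for the tab characters
# assumed to be present in the real file.
sample = (
    'DOC_ID\tLABEL\tRATING\tPRODUCT_TITLE\tREVIEW_TITLE\n'
    '1\t__label1__\t4\t"Targus PAUK10U Ultra Mini USB Keypad, Black"\tuseful\n'
    "3\t__label1__\t3\tFisher-Price Swing\tdoesn't swing very well.\n"
)


def parse_tsv(text):
    """Parse tab-delimited text; only the double quote acts as a quote character."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quotechar='"')
    header = next(reader)  # the first line holds the column names
    return [dict(zip(header, row)) for row in reader]


def to_csv(rows, fieldnames):
    """Serialize the parsed rows as comma-separated text, quoting where needed."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


rows = parse_tsv(sample)
csv_text = to_csv(rows, list(rows[0].keys()))
print(csv_text)
```

Only the double quote is treated as a quote character here, so the apostrophe in "doesn't" stays ordinary text. read.table, by contrast, treats both the double quote and the apostrophe as quote characters by default, which may be part of why the R call miscounted the fields.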
