Split a sentence by case change where two words are "stuck" together

Posted on 2025-02-12 15:35:22

I am attempting to clean up the following data which has been extracted from HTML.

Some sentences have not been split correctly: the capitalised word at the start of one sentence is "stuck" to the last word of the preceding sentence.

The image below illustrates what I am trying to achieve:

[image: desired result]

So in essence, if there is a string like "The boy plays with the ballThe Girl plays with the Console", it should be split into:

The boy plays with the ball
The Girl plays with the Console

M code so far, with the actual data (it must be run in Power BI because it uses the Html.Table function, which is not available in Excel):

let
    Source = Table.FromColumns({Lines.FromBinary(Web.Contents("https://echa.europa.eu/registration-dossier/-/registered-dossier/14184/7/1"))}),
    #"Added Custom" = Table.AddColumn(Source, "Custom", each if Text.Contains([Column1], "General Population - Hazard via oral route") then [Column1] else null),
    #"Filtered Rows" = Table.SelectRows(#"Added Custom", each ([Custom] <> null)),
    #"Kept Last Rows" = Table.LastN(#"Filtered Rows", 1),
    #"Removed Other Columns" = Table.SelectColumns(#"Kept Last Rows",{"Custom"}),
    #"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Removed Other Columns", {{"Custom", Splitter.SplitTextByDelimiter("</dd><dt>", QuoteStyle.None), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Custom"),
    #"Added Custom1" = Table.AddColumn(#"Split Column by Delimiter", "Text", each Html.Table([Custom], {{"Custom",":root"}})),
    #"Expanded Text" = Table.ExpandTableColumn(#"Added Custom1", "Text", {"Custom"}, {"Custom.1"})
in
    #"Expanded Text"

Comments (1)

dawn曙光 2025-02-19 15:35:23

The image still looks incorrect ("informationOverall" is not split), but if you want to split by character transition, you can do so from the ribbon.
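
For reference, the ribbon option (Split Column > By Lowercase to Uppercase) generates a call to Splitter.SplitTextByCharacterTransition. Below is a minimal sketch of the same idea written by hand. It assumes the text lives in a column named Custom.1, as in the question's query, and uses an inline sample table rather than the real ECHA page. The step names SplitOnCase and Result are illustrative only.

```m
let
    // Sample input from the question: two sentences fused at "ballThe".
    Source = #table(
        type table [Custom.1 = text],
        {{"The boy plays with the ballThe Girl plays with the Console"}}
    ),
    // Split wherever a lowercase letter is immediately followed by an uppercase letter.
    SplitOnCase = Table.TransformColumns(
        Source,
        {{"Custom.1", Splitter.SplitTextByCharacterTransition({"a".."z"}, {"A".."Z"})}}
    ),
    // Each cell now holds a list of fragments; expand the lists into separate rows.
    Result = Table.ExpandListColumn(SplitOnCase, "Custom.1")
in
    Result
```

Note that this splits at every lowercase-to-uppercase boundary, so any word that legitimately contains an internal capital would also be broken apart. To apply it to the question's query, the sample Source step would be replaced by the existing #"Expanded Text" step.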

