使用电源查询/BI从PDF提取句子的更好方法

发布于 2025-02-12 22:02:38 字数 5717 浏览 0 评论 0 原文

如果给出了URL,我一直在探索使用Power BI动态尝试从任何PDF中提取信息的想法。

在这种情况下,URL为 https://hpvchemicals.oecd.org/ui/handler.axd?id=fae8d1B1-406B-4287-8A05-F81AA1B16D3F 这是对构型甲醛的安全评估。我希望能够从PDF文档中提取尽可能多的句子。

假设“”。“识别句子的结尾,这实际上效果很好,在段落以及其他一些欺骗段的段落中,可以将整个pdfinto句子分开,然后我可以搜索这些句子。

M代码(更新以改进提取):

    let
    
    Source = Pdf.Tables(Web.Contents("https://hpvchemicals.oecd.org/ui/handler.axd?id=fae8d1b1-406b-4287-8a05-f81aa1b16d3f"), [Implementation="1.3"]),
    #"Expanded Data" = Table.ExpandTableColumn(Source, "Data", {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29", "Column30", "Column31", "Column32", "Column33", "Column34", "Column35", "Column36", "Column37", "Column38", "Column39", "Column40", "Column41", "Column42"}, {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29", "Column30", "Column31", "Column32", "Column33", "Column34", "Column35", "Column36", "Column37", "Column38", "Column39", "Column40", "Column41", "Column42"}),
    #"Filtered Rows" = Table.SelectRows(#"Expanded Data", each ([Kind] = "Page")),
    #"Removed Columns1" = Table.RemoveColumns(#"Filtered Rows",{"Name", "Kind"}),
    #"Added Index" = Table.AddIndexColumn(#"Removed Columns1", "Index", 0, 1, Int64.Type),
    Exclude={"Id"},
    List=List.Difference(Table.ColumnNames(#"Removed Columns1"),Exclude),
    MergeAllColumns= Table.AddColumn(#"Added Index","Custom", each Text.Combine(Record.ToList( Table.SelectColumns(#"Added Index",List){[Index]}), "  ")),
    #"Removed Other Columns" = Table.SelectColumns(MergeAllColumns,{"Id", "Custom"}),
    #"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Removed Other Columns", {{"Custom", Splitter.SplitTextByDelimiter("#(lf)", QuoteStyle.None), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Custom"),
    #"Replaced Value" = Table.ReplaceValue(#"Split Column by Delimiter",". ",".. #",Replacer.ReplaceText,{"Custom"}),
    #"Split Column by Delimiter1" = Table.ExpandListColumn(Table.TransformColumns(#"Replaced Value", {{"Custom", Splitter.SplitTextByDelimiter(". ", QuoteStyle.None), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Custom"),
    #"Cleaned Text" = Table.TransformColumns(#"Split Column by Delimiter1",{{"Custom", Text.Clean, type text}}),
    #"Added Index1" = Table.AddIndexColumn(#"Cleaned Text", "Index", 0, 1, Int64.Type),
    #"Added Custom" = Table.AddColumn(#"Added Index1", "Custom.1", each if Text.Contains([Custom], "#") then [Index] else null),
    #"Filled Down" = Table.FillDown(#"Added Custom",{"Custom.1"}),
    #"Grouped Rows" = Table.Group(#"Filled Down", {"Custom.1"}, {{"Custom", each Text.Combine([Custom], " "), type text}}),
    #"Split Column by Delimiter2" = Table.ExpandListColumn(Table.TransformColumns(#"Grouped Rows", {{"Custom", Splitter.SplitTextByDelimiter(". ", QuoteStyle.None), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Custom"),
    #"Added Custom1" = Table.AddColumn(#"Split Column by Delimiter2", "Custom.2", each if Text.Contains([Custom], "NOAEL") then [Custom] else null)
in
    #"Added Custom1"

因此:

”在此处输入图像描述”

对此:

尽管使用此句子分裂的运作良好”。更好的方法。

在某些情况下,句子无法正确拆分可以改进。例如,如果将句子的结尾连接到下一个EG Hello。您如何不会分开。 ...例如Apple 本身会被识别为分为。 ...例如 Apple

似乎建议使用正则表达式执行此操作href =“ https://stackoverflow.com/questions/57547297/regexp-in-power-query-using-using-javascript”> excel Regex替换功能这似乎不起作用。

fnregexextr3(不需要 \\ 只有 \ ):

    // regexReplace
let   regexReplace=(text as nullable text,pattern as nullable text,replace as nullable text, optional flags as nullable text) as text =>
    let
        f=if flags = null or flags ="" then "" else flags,
        l1 = List.Transform({text, pattern, replace}, each Text.Replace(_, "\", "\\")),
        l2 = List.Transform(l1, each Text.Replace(_, "'", "\'")),
        t = Text.Format("<script>var txt='#{0}';document.write(txt.replace(new RegExp('#{1}','#{3}'),'#{2}'));</script>", List.Combine({l2,{f}})),
        r=Web.Page(t)[Data]{0}[Children]{0}[Children],
        Output=if List.Count(r)>1 then r{1}[Text]{0} else ""
    in Output
in regexReplace

请随时查看PDF链接并提出任何建议以改善数据的捕获。

我将继续在这里进行任何进展。

I have been exploring the idea of using Power BI to dynamically try to extract information from any PDF if given a URL.

In this case the URL is https://hpvchemicals.oecd.org/ui/handler.axd?id=fae8d1b1-406b-4287-8a05-f81aa1b16d3f which is a safety assessment profile for formaldehyde. I wish to be able to extract as many sentences from a PDF document as possible.

Assuming ". " identifies the end of a sentence this actually works really well where paragraphs are concerned, along with some other trickery, to split the entire PDFinto sentences which I can then search.

M Code (updated for improved Extraction):

    let
    
    Source = Pdf.Tables(Web.Contents("https://hpvchemicals.oecd.org/ui/handler.axd?id=fae8d1b1-406b-4287-8a05-f81aa1b16d3f"), [Implementation="1.3"]),
    #"Expanded Data" = Table.ExpandTableColumn(Source, "Data", {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29", "Column30", "Column31", "Column32", "Column33", "Column34", "Column35", "Column36", "Column37", "Column38", "Column39", "Column40", "Column41", "Column42"}, {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29", "Column30", "Column31", "Column32", "Column33", "Column34", "Column35", "Column36", "Column37", "Column38", "Column39", "Column40", "Column41", "Column42"}),
    #"Filtered Rows" = Table.SelectRows(#"Expanded Data", each ([Kind] = "Page")),
    #"Removed Columns1" = Table.RemoveColumns(#"Filtered Rows",{"Name", "Kind"}),
    #"Added Index" = Table.AddIndexColumn(#"Removed Columns1", "Index", 0, 1, Int64.Type),
    Exclude={"Id"},
    List=List.Difference(Table.ColumnNames(#"Removed Columns1"),Exclude),
    MergeAllColumns= Table.AddColumn(#"Added Index","Custom", each Text.Combine(Record.ToList( Table.SelectColumns(#"Added Index",List){[Index]}), "  ")),
    #"Removed Other Columns" = Table.SelectColumns(MergeAllColumns,{"Id", "Custom"}),
    #"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Removed Other Columns", {{"Custom", Splitter.SplitTextByDelimiter("#(lf)", QuoteStyle.None), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Custom"),
    #"Replaced Value" = Table.ReplaceValue(#"Split Column by Delimiter",". ",".. #",Replacer.ReplaceText,{"Custom"}),
    #"Split Column by Delimiter1" = Table.ExpandListColumn(Table.TransformColumns(#"Replaced Value", {{"Custom", Splitter.SplitTextByDelimiter(". ", QuoteStyle.None), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Custom"),
    #"Cleaned Text" = Table.TransformColumns(#"Split Column by Delimiter1",{{"Custom", Text.Clean, type text}}),
    #"Added Index1" = Table.AddIndexColumn(#"Cleaned Text", "Index", 0, 1, Int64.Type),
    #"Added Custom" = Table.AddColumn(#"Added Index1", "Custom.1", each if Text.Contains([Custom], "#") then [Index] else null),
    #"Filled Down" = Table.FillDown(#"Added Custom",{"Custom.1"}),
    #"Grouped Rows" = Table.Group(#"Filled Down", {"Custom.1"}, {{"Custom", each Text.Combine([Custom], " "), type text}}),
    #"Split Column by Delimiter2" = Table.ExpandListColumn(Table.TransformColumns(#"Grouped Rows", {{"Custom", Splitter.SplitTextByDelimiter(". ", QuoteStyle.None), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Custom"),
    #"Added Custom1" = Table.AddColumn(#"Split Column by Delimiter2", "Custom.2", each if Text.Contains([Custom], "NOAEL") then [Custom] else null)
in
    #"Added Custom1"

So from this:

enter image description here

to this:

enter image description here

Although this sentence splitting works pretty well using ". " I'm now being greedy and wondering if this can be done in an even better way.

There are some instances where the Sentence doesn't split correctly which could be improved. For example, if the end of a sentence is joined to the next e.g. Hello.How are you would not split. Whilst ...e.g. Apple itself is recognised would split into. ...e.g. and Apple.

Python - RegEx for splitting text into sentences (sentence-tokenizing)

Appears to propose doing this using regex however with this excel regex replace function this doesn't appear to work.

fnRegexExtr3 (doesn't require \\ just \):

    // regexReplace
let   regexReplace=(text as nullable text,pattern as nullable text,replace as nullable text, optional flags as nullable text) as text =>
    let
        f=if flags = null or flags ="" then "" else flags,
        l1 = List.Transform({text, pattern, replace}, each Text.Replace(_, "\", "\\")),
        l2 = List.Transform(l1, each Text.Replace(_, "'", "\'")),
        t = Text.Format("<script>var txt='#{0}';document.write(txt.replace(new RegExp('#{1}','#{3}'),'#{2}'));</script>", List.Combine({l2,{f}})),
        r=Web.Page(t)[Data]{0}[Children]{0}[Children],
        Output=if List.Count(r)>1 then r{1}[Text]{0} else ""
    in Output
in regexReplace

please feel free to look at the PDF link and propose any suggestions for improving the capturing of data.

I will continue to update here with any progress.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。
列表为空,暂无数据
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文