How do I set the request headers for an AutoHotkey UrlDownloadToFile request to the SEC?

Posted 2025-02-09 11:21:37 | 2853 characters | 2 views | 0 comments

I am attempting to use AutoHotkey UrlDownloadToFile to access data on SEC.gov:

UrlDownloadToFile, %sURL%, %sSaveAs%

where sURL looks like:

https://www.SEC.gov/Archives/edgar/data/<document ID>/xslFormDX01/primary_doc.xml

Instead of the desired webpage content, I'm getting a response with the title "Your Request Originates from an Undeclared Automated Tool", which says "Please declare your traffic by updating your user agent to include company specific information." and directs me to sec.gov/developer for instructions. That page says, "The [SEC]'s HTTPS file system allows comprehensive access to the SEC's EDGAR ... filings by corporations, funds, and individuals. For full documentation, please see Accessing EDGAR Data." That page says:

Please declare your user agent in request headers:
Sample Declared Bot Request Headers:
- User-Agent: Sample Company Name AdminContact@<sample company domain>.com
- Accept-Encoding: gzip, deflate
- Host: www.sec.gov
We do not offer technical support for developing or debugging scripted processes.

So it appears that the SEC does not object to my doing this automated download, as long as I set those headers (and keep my requests to under ten per second). But how do I do that? How can I set those headers for my call to UrlDownloadToFile?

I found an open source tool for downloading SEC data, edgarWebR, whose documentation says, "Users of this library are advised to set a custom user-agent by setting the environment variable EDGARWEBR_USER_AGENT." Would doing that solve my problem? If so, where and how do I set that environment variable?

I found GitHub code for a function that appears to be a replacement for AHK's UrlDownloadToFile and provides the ability to set the user agent. However, I don't understand how to implement that function and it doesn't come with any documentation that explains that. It doesn't say what language it's written in, but it looks like it might be C. I don't have a C compiler on this computer, but if I did, I don't know where I would call this function.

I found a Microsoft forum Q&A that seems relevant, suggesting using UrlMkSetSessionOption or WebBrowser::Navigate. For documentation, the link it gives is dead, but is preserved at Archive.org. Unfortunately, although the Q&A are specific to using URLDownloadToFile, I don't understand how to implement either suggested technique for my AHK script.

All of the above three things appear to cover setting the user agent, but the SEC instructions quoted above say to also set Accept-Encoding and Host, so do I also need to do that separately? If so, how?

Obviously, I'm over my head here in the technicalities of what I'm trying to do. I hope someone can explain what I need to do in order to run my AHK script (or perhaps an alternative technique for automating the download of those files) in a way that satisfies the SEC's requirements.

Comments (2)

悟红尘 2025-02-16 11:21:37

Since I didn't get any answers, I used a workaround to get AutoHotkey to open each URL in the browser (run, %sURL%) and then save the resulting webpage (send Alt-F-A, send filename) and close it (send Alt-F4). This worked mostly, although I had to fix a number of problems manually.

That solved my problem of how to get the files, but it didn't answer my question of how to set request headers in order to do it the right way. So I won't accept this answer in case someone comes along later and properly answers the question.

北座城市 2025-02-16 11:21:37

Perhaps something like below is what you need. I have NO familiarity with SEC.gov, so you will
have to provide input and edits based on your experience with that website (like document ID). I added a method for getting a cookie that might work based on the cookie that I saw returned by sec.gov in the response headers in debugger console in Chrome. This is the cookie I got back that expires in about 8 hours (ak_bmsc=5EF62DAF005C6A95904F7ED7CCED9D78~000000000000000000000000000000~YAAQFC0tF+cmqpSBAQAAGJavoxCRBM3XBiKPn8e24CgXNLo+q6fI5C4VOJAkqdQdlHpsf2p08VHMcp10NI9rhpcoiI1i8CFQEvrR9b1SyCA9wN6sCAhoioLMvuqsYCwvVr21+AXb+XTvs65Kc70n5P7bEDYsvvWAFC5L+jdAQ+mLlcDTWF7twH2Lg1vYQAbxcje8OnbTDSldM2IGsWltH4ii1h73ZYPqlcTMWf/uh9AbKAlk+x50YQ1bUkTGdl+HZgulE6crPxHOqy4Q2PbXi2QwgylnYG9VTkmMzCnLi+clfWFQVuTz1i85x+YO9B4/eV3VHTYgFFk/SEmerp0kZbmzhT/RdFneLSaOOjg0KfHFR0tsKrlY1Ys=;), which is why you would want to get a new cookie each time you do a batch of these requests. I tested it on an online regex tester, so it should work.

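; GET COOKIE IDEA (EDIT)
Global reqCookie
; WinHttpRequest.Open needs an absolute URL, including the scheme.
; The declared user agent is sent here too, since SEC.gov asks for it on every request.
secReq := getReq("https://www.sec.gov", false, 60000, {"User-Agent": "Sample Company Name AdminContact@<sample company domain>.com"})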
If (foundCookie := regexMatch(secReq.getAllResponseHeaders(), "U)Set\-?Cookie:[\s\r\n]+\K(\S+\;)", cookie)) {
    reqCookie := cookie1
} Else {
    Msgbox, Didn't find a cookie!
    Return
}

itWorked := urlDownloadToFileSEC(A_Desktop . "\SEC Download.xml", "1548527/000154852712000001")

Msgbox % (itWorked ? "It Worked!" : "Something went wrong")
Return

urlDownloadToFileSEC(FilePath, docID, timeoutMS=60000, bAsync=false) {
    r := comObjCreate("WinHttp.WinHttpRequest.5.1")
    r.Open("GET", "https://www.SEC.gov/Archives/edgar/data/" . docID . "/xslFormDX01/primary_doc.xml", bAsync)
    r.SetTimeouts(timeoutMS, timeoutMS, timeoutMS, timeoutMS)
    r.SetRequestHeader("User-Agent", "Sample Company Name AdminContact@<sample company domain>.com")
     ; DO NOT KNOW WHAT YOU WOULD ACTUALLY WANT TO PUT IN THE ABOVE USER AGENT.
    r.SetRequestHeader("Accept-Encoding", "gzip, deflate")
    r.SetRequestHeader("Host", "www.sec.gov")
    r.SetRequestHeader("Cookie", reqCookie)
    r.Send()
    if (bAsync) {
        ; The caller has to call r.WaitForResponse() before reading the result.
        Return r
    }
    if (r.Status = 200) {
        fileText := r.ResponseText
        ; FileAppend adds to an existing file; ErrorLevel is 0 if the write succeeded.
        FileAppend, %fileText%, %FilePath%
    }
    Return (r.Status = 200)
}

; NEW FUNCTION!
getReq(URL, bAsync=false, timeouts=60000, reqHeadersMap="") {
    Req := comObjCreate("WinHttp.WinHttpRequest.5.1")
    Req.Open("GET", URL, bAsync)
    Req.SetTimeouts(timeouts, timeouts, timeouts, timeouts)
    if (isObject(reqHeadersMap)) {
        For key, value in reqHeadersMap {
            Req.SetRequestHeader(Key, Value)
        }
    }
    Req.Send()
    if (bAsync) {
        Return Req
    }
    ErrorLevel := (Req.Status==200 ? true : false)
    Return Req
}
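
If an alternative outside AutoHotkey is acceptable, which the question leaves open, the same headers can be sent from Python's standard library. The sketch below is not from the original answer and has not been run against sec.gov: the `USER_AGENT` value, the `MIN_INTERVAL` delay, and all function names are hypothetical placeholders. It sets User-Agent and Accept-Encoding explicitly, relies on urllib to fill in Host from the URL, and pauses between requests to stay under ten per second.

```python
import gzip
import time
import urllib.request
import zlib

# Hypothetical contact string; replace with your own name and e-mail address.
USER_AGENT = "Sample Company Name AdminContact@example.com"
# Assumed pause between requests, chosen to stay well under 10 requests/second.
MIN_INTERVAL = 0.2


def build_request(url, user_agent=USER_AGENT):
    """Build a GET request carrying the headers from the SEC sample."""
    return urllib.request.Request(url, headers={
        "User-Agent": user_agent,
        "Accept-Encoding": "gzip, deflate",
        # urllib derives the Host header from the URL, so it is not set here.
    })


def decode_body(raw, content_encoding):
    """Undo the compression named in the Content-Encoding response header."""
    encoding = (content_encoding or "").lower()
    if encoding == "gzip":
        return gzip.decompress(raw)
    if encoding == "deflate":
        # Assumes the zlib-wrapped form of deflate.
        return zlib.decompress(raw)
    return raw


def download(url, save_as):
    """Fetch one URL and write the decoded bytes to save_as."""
    with urllib.request.urlopen(build_request(url), timeout=60) as resp:
        body = decode_body(resp.read(), resp.headers.get("Content-Encoding"))
    with open(save_as, "wb") as f:
        f.write(body)


def download_all(jobs):
    """Fetch (url, save_as) pairs one at a time, pausing between requests."""
    for url, save_as in jobs:
        download(url, save_as)
        time.sleep(MIN_INTERVAL)
```

Calling `download_all([(url, path), ...])` with the same URL and file name pairs the AHK script uses would fetch them sequentially. Because the request advertises gzip and deflate, the body has to be decompressed by hand before it is saved, which is what `decode_body` is for.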