Setting the user agent for Import from the web in Mathematica

Posted 2024-11-08 19:57:17 · 449 words · 4 views · 0 comments


When I connect to my site with Mathematica (Import["mysite","Data"]) and look at my Apache log, I see:
99.XXX.XXX.XXX - - [22/May/2011:19:36:28 +0200] "GET / HTTP/1.1" 200 6268 "-" "Mathematica/8.0.1.0.0 PM/1.3.1"
Could I set it to something like this (what I get when I connect with a real browser):
99.XXX.XXX.XXX - - [22/May/2011:19:46:17 +0200] "GET /favicon.ico HTTP/1.1" 404 183 "-" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.68 Safari/534.24"

Comments (6)

孤独患者 2024-11-15 19:57:17

As far as I know you can't change the user agent string in Mathematica. I once used a proxy server (CNTLM) to get Mathematica to talk with a firewall which used NTLM authentication (which Mathematica doesn't support). CNTLM also allows you to set the user agent string.

You can find it at http://cntlm.sourceforge.net/. Basically, you set up this proxy server to run on your own machine and enter its port number and IP address in the Mathematica network settings. The proxy adds the user agent and handles the NTLM authentication. I'm not sure how it behaves if you don't have an NTLM firewall. There are other free proxies around that might work for you.

EDIT: The Squid HTTP proxy seems to do what you want. It has the request_header_replace configuration directive, which allows you to change the contents of request headers.
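
A sketch of what such a Squid setup might look like, assuming Squid 3.x squid.conf syntax (older releases name the directives differently) and using the browser string from the question as the replacement value. As far as I understand the Squid documentation, request_header_replace only supplies a value for a header that request_header_access has denied, so the two directives go together:

```
# squid.conf fragment (assumed Squid 3.x syntax)
# Strip the User-Agent header that the client sends ...
request_header_access User-Agent deny all
# ... and substitute this value for it
request_header_replace User-Agent Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.68 Safari/534.24
```

Mathematica would then be pointed at this proxy in its network settings, in the same way as for CNTLM.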

给妤﹃绝世温柔 2024-11-15 19:57:17

Here is a way to use the Apache HTTP client through JLink:

Needs["JLink`"]

ClearAll@urlString
urlString[userAgent_String, url_String] :=
  JavaBlock@Module[{http, get}
  , http = JavaNew["org.apache.commons.httpclient.HttpClient"]
  ; http@getParams[]@setParameter["http.useragent", MakeJavaObject@userAgent]
  ; get = JavaNew["org.apache.commons.httpclient.methods.GetMethod", url]
  ; http@executeMethod[get]
  ; get@getResponseBodyAsString[]
  ]

You can use this function as follows:

$userAgent =
  "Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.68 Safari/534.24";

urlString[$userAgent, "http://www.htttools.com:8080/"]

You can feed the result to ImportString if desired:

ImportString[urlString[$userAgent, "mysite"], "Data"]

A streaming approach would be possible using more elaborate code, but the string-based approach taken above is probably good enough unless the target web resource is very large.

I tried this code in Mathematica 7 and 8, and I expect that it works in v6 as well. Beware that there is no guarantee that Mathematica will always include the Apache HTTP client in future releases.

How It Works

Despite being expressed in Mathematica, the solution is essentially implemented in Java. Mathematica ships with a built-in Java runtime environment, and the bridge between Mathematica and Java is a component called JLink.

As is typical of such cross-technology solutions, there is a fair amount of complexity even when there is not much code. It is beyond the scope of this answer to discuss how the code works in detail, but a few items will be emphasized as suggestions for further reading.

The code uses the Apache HTTP client. This Java library was chosen because it ships as an unadvertised part of the standard Mathematica distribution -- and it also happens to be the one that Import appears to use internally.

The whole body of urlString is wrapped in JavaBlock. This ensures that any Java objects that are created over the course of operation are properly released by co-ordinating the activities of the Java and Mathematica memory managers.

JavaNew is used to create the relevant Apache HTTP client objects, HttpClient and GetMethod. Java expressions like http.getParams() are expressed in JLink as http@getParams[]. The Java classes and methods are documented in the Apache HTTP client documentation.

The use of MakeJavaObject is somewhat unusual. It is required in this case because a Mathematica string is being passed as an argument where a Java Object is expected. If a Java String were expected, JLink would automatically create one. But JLink is unable to make this inference when Object is expected, so MakeJavaObject is used to give JLink a hint.
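
As a rough sketch of a variation on the same idea, the user agent can also be set as a header on the request itself, where the method expects a Java String and no MakeJavaObject hint is needed. The name urlStringChecked is hypothetical, and the sketch assumes the same Commons HttpClient 3.x classes are on the JLink class path. It also checks the status code returned by executeMethod and releases the connection afterwards:

```mathematica
Needs["JLink`"]

(* Hypothetical variant of urlString: sets the User-Agent header on the
   request itself and returns $Failed unless the server answers 200 *)
ClearAll@urlStringChecked
urlStringChecked[userAgent_String, url_String] :=
  JavaBlock@Module[{http, get, status, body},
    http = JavaNew["org.apache.commons.httpclient.HttpClient"];
    get = JavaNew["org.apache.commons.httpclient.methods.GetMethod", url];
    (* setRequestHeader takes two Java Strings, so plain Mathematica strings work *)
    get@setRequestHeader["User-Agent", userAgent];
    (* executeMethod returns the HTTP status code as an integer *)
    status = http@executeMethod[get];
    body = If[status == 200, get@getResponseBodyAsString[], $Failed];
    get@releaseConnection[];
    body
  ]
```

Whether this reads better than setting the "http.useragent" parameter is a matter of taste; the parameter applies to every request made through that client, while the header applies to a single request.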

What about URLTools?

Incidentally, the first thing I tried when answering this question was Utilities`URLTools`FetchURL. It looked very promising since it takes an option called "RequestHeaderFields". Alas, this did not work because the present implementation of that function uses that option only for HTTP POST requests, not GET. Perhaps some future version of Mathematica will support the option for GET.

他夏了夏天 2024-11-15 19:57:17

I'm extremely lazy, and curl does more in less code than J/Link, without the object management issues. This is an example of posting data (userPass) to a URL and retrieving the result in JSON format.

Import["!curl -A Mozilla/4.0 --data " <> userPass <> " " <> url, "JSON"]

I isolate this kind of thing in an impure function (unless it is pure) so I know it's tainted, but any web access is that way.

Because I use a pipe, Mathematica cannot deduce the file type. ref/Import mentions that "Import["!prog","format"] imports data from a pipe" and that "The format of a file is by default deduced from the file extension in its name, or by FileFormat from its contents." As a result, it is necessary to specify "CSV", "JSON", etc. as the format parameter; otherwise you'll see some strange results.
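
For the GET request in the question, a minimal sketch along the same lines might look like this. The helper names curlCommand and curlImport and the URL are made up for illustration, and the sketch assumes curl is on the system path. A browser-style user agent contains spaces, so it is wrapped in double quotes here; a string containing quotes, $ or backticks would need more careful escaping for the shell:

```mathematica
(* Hypothetical helper: builds the pipe command that Import will run.
   Double quotes keep a user agent containing spaces as one argument. *)
ClearAll@curlCommand
curlCommand[userAgent_String, url_String] :=
  "!curl -s -A \"" <> userAgent <> "\" \"" <> url <> "\""

(* Hypothetical wrapper: the format must be given explicitly for a pipe *)
ClearAll@curlImport
curlImport[userAgent_String, url_String, format_String] :=
  Import[curlCommand[userAgent, url], format]

curlImport[
  "Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.68 Safari/534.24",
  "http://example.com/", "Text"]
```

Keeping the command construction in its own function makes it easy to inspect the exact command line before anything is run.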

curl is a command line tool for transferring data with URL syntax, supporting DICT, FILE, FTP, FTPS, GOPHER, HTTP, HTTPS, IMAP, IMAPS, LDAP, LDAPS, POP3, POP3S, RTMP, RTSP, SCP, SFTP, SMTP, SMTPS, TELNET and TFTP. curl supports SSL certificates, HTTP POST, HTTP PUT, FTP uploading, HTTP form based upload, proxies, cookies, user+password authentication (Basic, Digest, NTLM, Negotiate, kerberos...), file transfer resume, proxy tunneling and a busload of other useful tricks.

From the curl and libcurl welcome page.

濫情▎り 2024-11-15 19:57:17

Mathematica does all of its internet connectivity through a user-specified proxy server. If, as Sjoerd suggested, setting one up is too much work, you might want to consider writing the call in C/C++ and then calling that from Mathematica. I don't doubt there are plenty of C libraries that do what you want in a few lines of code.

For calling C code within Mathematica, see the C Language Interface documentation.

萌逼全场 2024-11-15 19:57:17

Mathematica 9 adds the URLFetch function, which has a "UserAgent" option.
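
A minimal sketch of how that might be used, assuming Mathematica 9 or later. The URL is a placeholder, and the user agent is the browser string from the question:

```mathematica
(* Assumes Mathematica 9 or later; the URL is a placeholder *)
userAgent = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.68 Safari/534.24";

(* URLFetch returns the response body as a string by default *)
page = URLFetch["http://example.com/", "UserAgent" -> userAgent];

(* Parse the fetched HTML much as Import[..., "Data"] would *)
data = ImportString[page, {"HTML", "Data"}]
```

The Apache log entry for this request should then show the supplied string in place of the default Mathematica user agent.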

一腔孤↑勇 2024-11-15 19:57:17

You can also use J/Link to make your web requests or call curl or wget on the command line.
