How can I embed source code into a pdb and have a debugger use it?
NOTE: my target concern is C# targeting the CLR with regular MSIL in case there's something that works for that but not in the more general case(s).
Some existing source debugging support examples
There was recently a release of the Sourcepack project which allows a user to rewrite the source paths in a pdb file to point at different locations. This is very useful when you have the source for the assembly, but don't want to try and get it into the exact same filesystem location(s) as when it was built.
http://lowleveldesign.wordpress.com/2011/08/26/sourcepack-released/
For open-source projects, using http://www.symbolsource.org/ as a way of making it simple for users of your project to get symbols and source is an excellent idea.
Problem
However, very often there are projects where either for legal or convenience reasons, using such an approach isn't very feasible. Also, the set of people that might be debugging the project may be relatively small or contained.
By default, the pdb's for a project include pointers to the files on disk (IIRC) and then source indexing can add the ability to embed pointers to the source locations (for instance, in a version control system), with a source server then using the pointers to actually fetch the source.
Goal
It seems like things could be simpler (for certain builds, like debug and/or internal-only) if the actual source were just put into the pdb (effectively just dereferencing the pointer currently written in the PDB). It seems like you could then skip the entire source server part (at least in theory) and eliminate a few dependencies in the debug-time story. Whether to store the source compressed or not is largely orthogonal, but a first pass would probably not compress it, to make it simpler to implement for existing debuggers.
Since the PDB-matching-binary story is already very good, putting the source into the PDB would be even better than a source server pointer, since the pointer can break over time (source control system moves, or changes to a different system, or whatever), but the actual source sitting in the PDB is good 'forever'.
How is this different from 'source server' support?
(this was added via edit after Tigran's comment asking what the benefits would be)
The 'baseline' scenario that this should be compared against is that of a 'normal' debugging experience using a 'normal' source server instance today. In that scenario, (AFAIK) the debugging engine gets a pointer from the PDB (via an alternate stream) then uses the registered source server(s) to attempt to get the source via that pointer. Since a given assembly is typically going to include multiple source files, there's either a single pointer that includes a base location or there are multiple pointers in the PDB (or something else), but that should be orthogonal to this discussion.
For a project where keeping the source hidden/inaccessible is desirable (most Microsoft products, for instance, including Windows, Office, Visual Studio, etc.), then having the PDB contain pointers is FAR superior to including actual source (even if it were encrypted). Such pointers are meaningless without the necessary network access and permissions, so such an approach means you can ship the PDB to anyone on the planet without worrying about them being able to access your source (worst-case, they get a glimpse into how your source tree is arranged, I would think).
However, there are 2 large sets of projects (and specifically, builds) where this 'hide the source' benefit doesn't exist.
The first are builds that are only used by people that have access to the source anyway. Builds done on your own machine that won't ever leave that machine are a great example, as an attacker would need to read files from your filesystem anyway to get the source, so reading from one file (.cs) vs. another (.pdb) is a relatively small difference in terms of attack difficulty/vector. The same goes for builds that are pushed to a test/staging environment where the people that can access the pdb on the machine are the same as, or a subset of, the people that can access the source 'normally'.
The second are (somewhat obviously) open-source projects, where the source for the project is already open for everyone anyway, so there's no benefit to hiding the source from anyone.
Note that this could be relatively easily extended to include the source in an encrypted form instead (since we're already talking about having to store format/encoding data as well), but the added complexity of that would make such a scenario likely less useful than just using a 'normal' source server.
Benefits?
With the above descriptions out of the way, the list of potential benefits to allowing this include (but are not limited to :) these that pop into my head at the moment:
- No need to deal with setting up source server support. It Just Works (IJW), at least when/if debuggers know to look in the pdb.
  - In the meantime, you could still do a 'fixed' source server which is just a dummy that extracts the source and feeds it back to the caller. Such a configuration could be the same for everyone (using localhost, for instance), still eliminating the current need to actually configure a source server.
- No need for the build to include 'source indexing'.
  - Since a build reads the source files and writes the pdb files anyway, we're just modifying what's written in the pdb and not taking any build-time perf hit for doing network calls or reading data we don't already have in memory.
  - Until there is 'native' build support for putting the source in, it could be a simple post-build step, likely implemented at first via a small fork of the Sourcepack project since it already does the work of reading/modifying PDB files :)
- No dependency on the team/project having a source control system.
- No dependency on the particular version of each file being checked into the source control system (most people don't check in for every single build they do in their IDE).
- No need to have access to the particular source control system that has the file.
  - In the DVCS case, for instance, the PDB pointer may be to some 'random' instance of git or mercurial or whatever, not necessarily one you have access to.
  - The source server tooling to track that version back to the source control server instance(s) you do have access to (if it even exists there) doesn't yet exist AFAIK.
- No problem if the project dies (gets deleted) or moves.
  - For instance, if the project moves from one to another of: self-hosted, sourceforge, github, bitbucket, codeplex, code.google.com, etc.
- No problem if the machine you're debugging on has no (or insufficient) network access.
  - For instance, if you're doing a 'network KVM' into a box for debugging an issue but it either has no network or it can only talk to disconnected networks such that it can't access your source control server.
- In the extreme case, the ability to recover some of the project source from a build. ;)
NOTE: another approach would be including the source in the actual assembly (for instance, as a resource), but the pdb is a better choice (easy to ship a build without pdb's, no normal runtime perf hit if the source is in the pdb since the assembly is the same code and same size, etc)
How to implement?
On the surface of it, this kind of support doesn't seem like it would be too difficult to add, but I get the feeling this is because I don't really know enough about the mechanics involved instead of it actually being a simple thing to implement. :)
My guess would be something along the lines of:
- Add a post-build step that would do something similar to Sourcepack, but instead of changing the pointer, it would replace it with the actual source.
  - Depending on what the source server needs to do, the source might need to get prefixed, or the actual source would be in a different alternate data stream and the 'pointer' gets updated to something like 'source-in-pdb:ads-foo.cs' or whatever. The prefix or pointer could also include how the source file was stored (uncompressed, gzip, bzip2, etc., along with the encoding of the file).
- Implement a 'source server' that actually extracts the source from the pdb in question and returns it.
  - No idea if the source server 'API' has enough info to get the location of the PDB, let alone whether it would have permission to actually read the contents.
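To make the pointer-plus-prefix idea above a bit more concrete, here is a rough sketch in Python. It is only an illustration: a plain dict stands in for the PDB's named streams, and the `source-in-pdb:` pointer layout, the `src/` stream naming, and the helper names `embed_source` and `extract_source` are all made up for this example. None of it reflects a real PDB or source server API.

```python
import bz2
import gzip

# Hypothetical pointer layout (an assumption, not a real format):
#   source-in-pdb:<compression>:<encoding>:<stream name>
POINTER_PREFIX = "source-in-pdb"

# Supported storage formats, each a (compress, decompress) pair.
CODECS = {
    "none": (lambda data: data, lambda data: data),
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
}


def embed_source(streams, name, text, compression="none", encoding="utf-8"):
    """Store `text` in the stand-in stream table and return the pointer
    that would replace the original on-disk path."""
    compress, _ = CODECS[compression]
    stream_name = "src/" + name
    streams[stream_name] = compress(text.encode(encoding))
    return ":".join([POINTER_PREFIX, compression, encoding, stream_name])


def extract_source(streams, pointer):
    """Resolve a pointer produced by embed_source back into source text."""
    # maxsplit=3 keeps any ':' in the stream name (e.g. a drive letter) intact.
    prefix, compression, encoding, stream_name = pointer.split(":", 3)
    if prefix != POINTER_PREFIX:
        raise ValueError("not an embedded-source pointer: " + pointer)
    _, decompress = CODECS[compression]
    return decompress(streams[stream_name]).decode(encoding)


# Usage: a dict plays the role of the PDB here.
fake_pdb = {}
source = "class Foo { static void Main() { } }\n"
pointer = embed_source(fake_pdb, "foo.cs", source, compression="gzip")
assert extract_source(fake_pdb, pointer) == source
```

Carrying the compression and encoding in the pointer itself means whatever reads it (a debugger or a dummy source server) needs no extra metadata lookup, which is the point of the prefix idea. Whether a real debugger would ever hand such a pointer to a source server is exactly the open question above.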
Sanity check?
With the babble above out of the way, the questions are really:
- Does this kind of thing already exist? (and if so, please provide pointers!)
- Assuming it doesn't exist yet, does the above make sense as a first-pass implementation? Are there pitfalls or complexities the above skips over?
- Assuming "no" and "yes" for the above, is there an existing project that makes sense in terms of taking this on (it's close or in their existing scope)?
Comments (1)
I've read over this and wanted to summarize my understanding for clarity.
As someone who's done their fair share of digging for source code in this manner I like the idea of having one package for all your debugging needs. There are a couple of facets to consider about this proposal though.
The first is the actual embedding of the source code into the PDB. This is very doable. The PDB is essentially a lightweight file database. There is structure to what it encodes, but AFAIK you can put whatever you want into certain slots (local variable values / types, for example). There may be size limitations for certain slots, but I'm sure you could invent an encoding scheme to break large files up into chunks.
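As a rough illustration of the chunking idea, here is a minimal Python sketch. It assumes a size-limited key/value "slot" table, modelled as a plain dict; the `name#index` key naming, the `#count` entry, and the function names are hypothetical and have nothing to do with the actual PDB layout.

```python
def split_into_chunks(data, max_chunk_size):
    """Split `data` into pieces no larger than `max_chunk_size` bytes."""
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    return [data[i:i + max_chunk_size]
            for i in range(0, len(data), max_chunk_size)]


def store_chunked(slots, name, data, max_chunk_size):
    """Write the chunks under numbered keys, plus a count entry so the
    reader knows how many pieces to reassemble."""
    chunks = split_into_chunks(data, max_chunk_size)
    slots[name + "#count"] = str(len(chunks)).encode("ascii")
    for index, chunk in enumerate(chunks):
        slots["%s#%d" % (name, index)] = chunk


def load_chunked(slots, name):
    """Reassemble the original bytes from the numbered chunk entries."""
    count = int(slots[name + "#count"].decode("ascii"))
    return b"".join(slots["%s#%d" % (name, i)] for i in range(count))


# Usage: a 10-byte file stored in slots that hold at most 4 bytes each.
slots = {}
store_chunked(slots, "foo.cs", b"0123456789", 4)
assert load_chunked(slots, "foo.cs") == b"0123456789"
```

Storing the count separately keeps the reader simple; a real scheme would also want to record the file's checksum and encoding alongside it.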
The second facet is having the debugger actually load the file from the PDB vs. searching for it on disk. I'm not as familiar with that part of the debugger but from what I understand it only uses 2 pieces of information to locate the file
I'm fairly certain this is the only information it passes on to a symbol server. That makes it infeasible to implement a symbol server for this, because it won't have access to the PDB (assuming of course I'm right).
I dug around hoping there was a VS COM component you could override which would allow you to intercept the loading of the file for a given path but I couldn't find one.
One approach I think would be feasible though would be
This wouldn't be quite what you want though.