State Definition in Reinforcement Learning

Posted 2025-02-05 04:01:10


When defining the state for a specific problem in reinforcement learning, how do you decide what to include in the definition and what to leave out, and how do you distinguish between an observation and a state?
For example, suppose the agent operates in a human-resources planning context where it needs to hire workers based on job demand, taking into account the cost of hiring them (assuming the budget is limited). Is a state of the form (# workers, cost) a good state definition?
In short, I don't know what information needs to be in the state and what should be left out because it is rather an observation.
Thank you


Comments (1)

病毒体 2025-02-12 04:01:10


I am assuming you are formulating this as an RL problem because the demand is an unknown quantity. And maybe [this is an optional criterion] because the cost of hiring workers takes into account each worker's contribution to the job, which is unknown initially. If, however, both of these quantities are known or can be approximated beforehand, then you can just run a planning algorithm [or simply some sort of optimization] to solve the problem.
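
For contrast, here is a minimal sketch of what that "known quantities" case could look like as plain optimization. The demand figure and cost parameters below are invented purely for illustration:

```python
# If demand and per-worker cost were known beforehand, this reduces to plain
# optimization rather than RL. All numbers here are invented for illustration.
known_demand = 7          # assumed known number of workers needed
wage, penalty = 1.0, 5.0  # assumed wage and unmet-demand penalty

def total_cost(num_workers: int) -> float:
    unmet = max(0, known_demand - num_workers)
    return wage * num_workers + penalty * unmet

best = min(range(20), key=total_cost)  # brute-force "planning" over headcounts
print(best, total_cost(best))          # -> 7 7.0
```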

Having said this, the state in this problem could be something as simple as (#workers). Note that I am not including the cost, because the cost must be experienced by the agent and is therefore unknown to it until it reaches a specific state. Depending on the problem, you might need to add another factor such as "time" or "jobs remaining".
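
As one hedged illustration, such a minimal state could be encoded as a small named tuple. The field names (num_workers, time_step, jobs_remaining) are assumptions mirroring the factors mentioned above, not part of the original question:

```python
from typing import NamedTuple

class State(NamedTuple):
    """Minimal state sketch for the hiring problem (field names are invented)."""
    num_workers: int     # workers currently employed
    time_step: int       # optional "time" factor
    jobs_remaining: int  # optional "jobs remaining" factor

s = State(num_workers=5, time_step=6, jobs_remaining=10)
```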

Most of the theoretical results in RL hinge on a key assumption, made in several setups, that the environment is Markovian. There are several works where you can get by without this assumption, but if you can formulate your environment in a way that exhibits this property, you will have many more tools to work with. The key idea is that the agent can decide which action to take based on the current state, say (#workers = 5, time = 6); in your case, an action could be "hire 1 more person", and another could be "fire a person". Note that we are not distinguishing between workers yet, so the action fires "a" person rather than "a specific" person x. If the workers have differing capabilities, you may need to add several other factors, each representing which workers are currently hired and which are still in the pool, yet to be hired, for example as a fixed-length boolean array. (I hope you get the idea of how to form a state representation; this can vary based on the specifics of the problem, which are missing in your question.)
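
To make the fixed-length boolean-array idea concrete, here is one possible sketch; the pool size and helper names are assumptions chosen only for illustration:

```python
from dataclasses import dataclass

POOL_SIZE = 10  # assumed size of the fixed candidate pool

@dataclass(frozen=True)
class WorkerState:
    hired: tuple    # fixed-length booleans: hired[i] is True if worker i is employed
    time_step: int

    @property
    def num_workers(self) -> int:
        return sum(self.hired)

def hire(state: WorkerState, i: int) -> WorkerState:
    """Hire a *specific* worker i from the pool (no-op if already hired)."""
    hired = list(state.hired)
    hired[i] = True
    return WorkerState(hired=tuple(hired), time_step=state.time_step + 1)

s0 = WorkerState(hired=(False,) * POOL_SIZE, time_step=0)
s1 = hire(s0, 3)  # now s1.num_workers == 1
```

Making the state immutable and hashable, as above, also lets it serve directly as a key in tabular value functions.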

Now, once we have the state definition S and the action definition A (hire/fire), we have the "known" quantities for an MDP setup in an RL framework. We also need an environment that can supply the cost when we query it (the reward/cost function) and tell us the outcome of taking a certain action in a certain state (the transition function). Note that we don't necessarily need to know these reward/transition functions beforehand, but we should have a means of obtaining these values when we query a specific (state, action) pair.
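
A minimal sketch of such an environment interface might look like the following; the demand model, cost parameters, and class name are all invented, and this is not any particular library's API:

```python
import random

class HiringEnv:
    """Toy environment: reward and transition are revealed only through step()."""
    HIRE, FIRE, NOOP = 0, 1, 2

    def __init__(self, wage=1.0, penalty=5.0, horizon=100):
        self.wage, self.penalty, self.horizon = wage, penalty, horizon
        self.reset()

    def reset(self):
        self.num_workers, self.t = 0, 0
        return (self.num_workers, self.t)

    def step(self, action):
        if action == self.HIRE:
            self.num_workers += 1
        elif action == self.FIRE and self.num_workers > 0:
            self.num_workers -= 1
        demand = random.randint(0, 10)  # unknown demand, sampled inside the env
        unmet = max(0, demand - self.num_workers)
        reward = -(self.wage * self.num_workers + self.penalty * unmet)
        self.t += 1
        done = self.t >= self.horizon
        return (self.num_workers, self.t), reward, done
```

The agent only ever sees (state, reward, done) tuples coming back from step(); the reward and transition logic stay hidden inside the environment, exactly as described above.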

Coming to your final part, the difference between an observation and a state: there are much better resources to dig into this in depth, but in a crude sense, an observation is an agent's (any agent: AI, human, etc.) sensory data. For example, in your case the agent can count the number of workers currently employed (but it has no ability to distinguish between workers).

A state, or more formally a true MDP state, must be something that is Markovian and captures the environment at its fundamental level. So, perhaps in order to determine the true cost to the company, the agent needs to be able to differentiate between workers: each worker's working hours, the jobs they are working on, interactions between workers, and so on. Note that many of these factors may not be relevant to your task, for example a worker's gender. Typically one would like to form a good hypothesis beforehand about which factors are relevant.

Now, even though we can agree that a worker's assignment (to a specific job) may be a relevant feature when deciding to hire or fire them, your observation does not contain this information. So you have two options: either ignore the fact that this information is important and work with what you have available, or try to infer these features. If your observations are incomplete for the decision making in your formulation, we typically classify the environment as partially observable (and use the POMDP framework for it).
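
To illustrate the gap between the two, the sketch below pairs a hypothetical full state with the limited observation the agent actually receives; which fields are hidden is an assumption made for illustration:

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class FullState:
    """True (hypothetical) MDP state; the agent never sees all of it."""
    hired: Tuple[bool, ...]      # which specific workers are employed
    assignments: Dict[int, str]  # worker index -> job they are assigned to
    time_step: int

def observe(state: FullState) -> Tuple[int, int]:
    """Partial observation: only a head count and the time are visible."""
    return (sum(state.hired), state.time_step)
```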

I hope I have clarified a few points. However, there is a huge body of theory behind all of this, and the question you asked about "coming up with a state definition" is a matter of research (much like feature engineering and feature selection in machine learning).
