**Reinforcement learning (RL)** can do wonderful things. Most recently, ChatGPT was fine-tuned on human feedback with PPO, a variant of a class of reinforcement learning algorithms called **Policy Gradient (PG)**. Understanding RL, and policy gradient in particular, can be non-trivial, especially if you like grasping intuitions like I do. In this post, I'll walk through a thread of thoughts that really helped me understand PG, starting from the more familiar supervised learning setting.

- We will start by designing a simple supervised training procedure for a binary classification robot, rewarding it +1 for correct answers
- We will formulate the objective for the procedure
- We will derive the gradient ascent formula for the procedure (which will turn out to be the same procedure as gradient descent with cross entropy)
- We will compare our procedure to RL settings and relate our gradient ascent to policy gradient

- My goal is to provide a friendly, intuitive aid to understanding PG. It helps if you have a general understanding of the RL problem setting and know at a high level what PG is.
- I hope to help you better understand the connection between RL with PG and supervised ML. So it helps if you know how to train a supervised ML algorithm with a cross entropy loss function.

## Policy Gradient

In an RL problem, an **agent** interacts with an **environment** to learn a **policy**. The policy tells the agent what to do in different **states** to maximize **reward**.

The idea of PG seems simple.

- The **policy** that guides the agent's behavior at time *t* is *π_θ(a_t|s_t)*.
- This is **some kind of function** (often a neural network) with parameter *θ*.
- It **takes in information about the state** *s_t* and spits out a **probability distribution over the action** *a_t* **to take**.
- Then it **receives a reward** *r(s_t, a_t)*.
- When we have a history of many such cycles of action and reward, we can update the parameter *θ* in order to maximize the expected reward produced by actions generated from *π_θ*.

How do we do the update? By… **gradient**! We update the model producing *π_θ* with the following gradient:
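The equation here was an image that did not survive formatting; reconstructed from context, it is presumably the standard policy gradient formula together with its update rule:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\left(\sum_{t=1}^{T}\nabla_\theta \log \pi_\theta(a_t|s_t)\right)\left(\sum_{t=1}^{T} r(s_t,a_t)\right)\right]$$

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$$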

## Something Feels Off

This looks very familiar. When we train a neural network model in good old supervised learning, we also update the model parameters by doing the operation in the second line, aka gradient descent (technically in the PG case, since we're maximizing an objective, it's gradient ascent).

But this also feels very different. If you look at the derivation, you can see it **takes a bit of effort** to arrive at this equation. That's very different from the more intuitive way we do supervised learning: feed an input into a neural network, get an output, compare it with the target and calculate a loss function, hit the backprop button, and we're done!

Also, for me, the **log term** always seems to come out of nowhere. Although the same online course in the link above walks through how we get to the log term, the process seems to just be a bunch of math that's correct but lacking motivation.

What exactly is the difference from supervised learning? It turns out that diving into this question provides a great way of understanding PG. Moreover, it's a good reminder of the **nature of some familiar things in supervised learning** that we do every day.

If we're shown the loss functions commonly used in supervised ML, they "make sense" immediately. But it takes some extra effort to understand where they come from. For example, the good old mean square loss intuitively makes sense: it just minimizes the distance between the prediction and the target. But there are so many distance metrics; why squared distance? You have to look deeper to see that mean square error is a byproduct of doing maximum likelihood while assuming the underlying population distribution is normal.

Same with another good old loss function that we use every day: cross entropy. While there are many good interpretations of what cross entropy does, **let's try to build it from the most rudimentary approach**.

## Let's Train a Classification Bot!

Say you want to train a robot to classify dog and cat images. It's intuitive to train it by rewarding right answers and punishing it (or not rewarding it) for wrong answers. Here's how it's done:

- You **give the robot an image**. Let's call it *s*. This image is sampled from a population distribution *D_s*.
- The robot will **give you an answer**: whether it thinks it's a dog image (action *a_dog*) or a cat image (action *a_cat*).
- The robot has **its own prediction**, given the image, of the probability of the image being dog or cat: *π_θ(a|s) = (a_dog, a_cat)*. For example, *π_θ(a|s) = (0.9, 0.1)* means it thinks there's a 0.9 probability it's a dog, and a 0.1 probability it's a cat.
- But each time, the robot will only give you a solid answer. It either says "it's a dog" (*a_dog*) or "it's a cat" (*a_cat*). Each time **it gives you a response**, the responses (actions) are **randomly sampled from the distribution** produced by *π_θ(a|s)*: *a = (a_dog, a_cat) ~ π_θ(a|s)*.
- Then you **reward** the robot (maybe give it a treat?) with a reward of 1 when it answers correctly (*r(s,a) = 1*). There's **no reward** (0 reward) when the answer is wrong (*r(s,a) = 0*).

This process is what I had in mind when first learning about supervised ML. Reward it when it's correct. Punish it (or simply give no reward, in this training procedure we designed) when it's wrong. Probably the most intuitive way to train something.
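The training loop above can be sketched in a few lines of Python. This is a hypothetical toy (the helper names, logits, and learning rate are mine, not from the post): a two-logit "robot" samples an answer from its predicted distribution, gets reward 1 only when correct, and ascends the gradient of *r · log π_θ(a|s)*, the update derived later in this post.

```python
import math
import random

# Hypothetical toy sketch: a "robot" with two logits classifies every image
# as dog (0) or cat (1). It samples an answer from pi_theta(a|s), receives
# reward 1 only when the answer is correct, and takes a gradient ascent
# step on r * log pi_theta(a|s).

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(logits, true_label, lr=0.5):
    probs = softmax(logits)
    a = 0 if random.random() < probs[0] else 1  # sample action a ~ pi_theta(a|s)
    r = 1.0 if a == true_label else 0.0         # reward 1 iff the answer is correct
    # gradient of log softmax w.r.t. the logits is one_hot(a) - probs
    grad = [((1.0 if i == a else 0.0) - p) * r for i, p in enumerate(probs)]
    return [x + lr * g for x, g in zip(logits, grad)]  # gradient *ascent*

random.seed(0)
logits = [0.0, 0.0]
for _ in range(2000):
    logits = reinforce_step(logits, true_label=0)  # every image is a dog

# The robot ends up answering "dog" with high probability.
assert softmax(logits)[0] > 0.9
```

Note that wrong answers contribute a zero gradient here (reward 0), so only rewarded answers push the policy around, exactly the reward-only training scheme described above.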

## Maximizing the Objective

What's our objective for the robot? We want its response to be correct as often as possible. More precisely, we want to find the optimal parameter *θ**, which produces a *π_θ(a|s)*, such that over all possible occurrences of *s* (sampled from the image population distribution *D_s*) and *a* (sampled from the distribution produced by the model *π_θ(a|s)* given *s*), we get the **maximum average reward, weighted by the probability** of each occurrence of *(s,a)*:

In other words, we're **maximizing** the objective function *J(θ)* defined as
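Reconstructing the missing equation, the objective is presumably:

$$J(\theta) = \mathbb{E}_{s \sim D_s,\; a \sim \pi_\theta(a|s)}\big[\, r(s,a)\,\big] \qquad (1)$$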

## Gradient of the Objective

Now that we have an objective function, we can try to maximize it by… gradient ascent! Specifically, we can optimize the function by iteratively doing
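That is, presumably the plain gradient ascent update:

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$$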

But how should we calculate the gradient, i.e., the derivative of *J* with respect to *θ*? It's a bit tricky in this case, since:

- The function we want to take the derivative of is an **expectation**.
- If the expectation were not over a distribution that depends on *θ*, then by linearity of expectation we could just take the derivative of what's inside the expectation and leave the expectation in place. However, in this case the expectation is over *(s,a) ~ (D_s, π_θ(a|s))*, which depends on *θ*. So the derivative is not obvious.
- Another way of thinking about this is that *J(θ)* changes value as the frequency of sampling each *(s,a)* changes, from a distribution **partially determined by** *θ*. We want more frequent occurrences of *s = dog image* with *a = a_dog* (and the similar pair for cat). How can we capture **changes of** *θ* **in this direction** when we do gradient ascent?

In addition, **ideally, we want the gradient to be of the form**
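Reconstructed, this is presumably the form labeled (2):

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim D_s,\; a \sim \pi_\theta(a|s)}\big[\, f(\theta, s, a, r)\,\big] \qquad (2)$$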

This is because you train the bot from samples of interactions between the robot and you. Every sample consists of a triplet *(s,a,r)*. We can therefore approximate this gradient by taking the average of *f(θ,s,a,r)* over the *N* samples we collect (by the law of large numbers, i.e., doing stochastic gradient ascent):
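Presumably:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} f(\theta, s_i, a_i, r_i)$$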

We can then do gradient ascent by doing
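That is, presumably:

$$\theta \leftarrow \theta + \alpha \cdot \frac{1}{N}\sum_{i=1}^{N} f(\theta, s_i, a_i, r_i)$$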

Now let's find *f*.

## Finding the Gradient

To recap, we want to start from (1) and get to (2), for some *f(θ,s,a,r)*.

Let's first rewrite (1) using the **definition of expectation**:
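The missing equation is presumably:

$$\nabla_\theta J(\theta) = \nabla_\theta \iint P(s,a)\, r(s,a)\, da\, ds$$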

This is basically the integral of reward weighted by probability, over all possible *(s,a)* pairs.

Now, what exactly is the joint probability *P(s,a)* of one pair *(s,a)* appearing? We can decompose it into the probability of the image sample (*s*) appearing and the probability of the robot randomly picking action *a*.

Since the robot randomly picks action *a* from its internal prediction model *π_θ(a|s)*, we have
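Presumably:

$$P(s,a) = P(s)\, \pi_\theta(a|s)$$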

Out of all the terms inside the parentheses, only *π_θ(a|s)* depends on *θ*. The other terms are all constants. So we can **move the gradient operation inside the integral sign**, next to this term, and get
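Presumably:

$$\nabla_\theta J(\theta) = \iint P(s)\, \nabla_\theta \pi_\theta(a|s)\, r(s,a)\, da\, ds$$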

Note that we can also write the following. Nothing big here: just multiplying the original left-hand side by 1, written as a fraction, and rearranging the terms.
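Presumably:

$$\nabla_\theta \pi_\theta(a|s) = \pi_\theta(a|s)\, \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)}$$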

Substituting this back and rearranging a bit, we get
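Presumably:

$$\nabla_\theta J(\theta) = \iint P(s)\, \pi_\theta(a|s) \left[\frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)}\, r(s,a)\right] da\, ds$$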

*P(s)π_θ(a|s)* looks familiar. It's the *P(s,a)* we decomposed earlier! Putting it back, we have
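Presumably:

$$\nabla_\theta J(\theta) = \iint P(s,a) \left[\frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)}\, r(s,a)\right] da\, ds$$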

Now that we have an integral and *P(s,a)*, we can… fit it back into **the definition of expectation**!
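Presumably:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim D_s,\; a \sim \pi_\theta(a|s)} \left[\frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)}\, r(s,a)\right]$$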

This is exactly the form we want in (2), where *f* is the term inside the brackets!

You may have wondered why we rewrote the gradient of *π_θ(a|s)* as that clunky fraction earlier. The idea was to create a *π_θ(a|s)* term (which we lost by taking the derivative), so that we could **produce a P(s,a) term again** and turn the integral back into an expectation!

## Building Cross Entropy

Now it's magic time.
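The "magic" here is presumably the log-derivative trick:

$$\nabla_\theta \log \pi_\theta(a|s) = \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)}$$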

Don't believe me? Work from the right-hand side to the left-hand side with the chain rule. ([Optional] Side thought: if you are also confused about the motivation for the log term in the policy gradient formula, it's a byproduct of simplifying a clunky equation, with the intention of extracting a *π_θ(a|s)* term to transform things back into an expectation.)

So we can simplify the gradient of *J(θ)* a bit:
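Presumably:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim D_s,\; a \sim \pi_\theta(a|s)} \big[\nabla_\theta \log \pi_\theta(a|s)\, r(s,a)\big]$$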

So every time we have a batch of *(s,a)* samples, we can do gradient ascent via
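Presumably:

$$\theta \leftarrow \theta + \alpha \cdot \frac{1}{N}\sum_{i=1}^{N} r_i\, \nabla_\theta \log \pi_\theta(a_i|s_i)$$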

To bring it to a more familiar form, moving the gradient sign outside the summation, we have
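Presumably:

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta \left[\frac{1}{N}\sum_{i=1}^{N} r_i \log \pi_\theta(a_i|s_i)\right]$$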

We can even flip the sign by doing
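Presumably:

$$\theta \leftarrow \theta - \alpha\, \nabla_\theta \left[-\frac{1}{N}\sum_{i=1}^{N} r_i \log \pi_\theta(a_i|s_i)\right]$$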

Does it ring a bell? Let's compare it with what we do when doing **gradient descent on a cross entropy loss**.

Recall that the cross entropy loss is
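In the two-class notation used below, presumably:

$$L(\theta) = -\frac{1}{N}\sum_{i=1}^{N} \big(y_{i,1} \log \hat{y}_{i,1} + y_{i,2} \log \hat{y}_{i,2}\big)$$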

where **y_i** is the true label, a one-hot vector (*y_i_1, y_i_2*) that describes whether an image is cat or dog (either (0,1) or (1,0)), and **y_hat_i** is the prediction from a model, a vector (*y_hat_i_1, y_hat_i_2*) whose two entries sum to one.

When we do **gradient descent** on this loss function, we calculate the cross entropy loss for the batch and hit the backprop button:
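Presumably:

$$\theta \leftarrow \theta - \alpha\, \nabla_\theta \left[-\frac{1}{N}\sum_{i=1}^{N} \big(y_{i,1} \log \hat{y}_{i,1} + y_{i,2} \log \hat{y}_{i,2}\big)\right]$$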

The differences between this expression and the gradient ascent expression we derived earlier are:

To put the connection into words, it basically means: on sample *(x_i, y_i)*,

- The model **makes a prediction** (*y_hat_i_1, y_hat_i_2*) given *x_i*.
- The model **randomly samples a response** from the predicted distribution.
- We **reward** response 1 with *y_i_1* and response 2 with *y_i_2*.
- Since, when the label is class 1, *y_i_1 = 1, y_i_2 = 0*, we **reward the model with 1** when it correctly responds 1, and there is **0 reward** when it incorrectly responds 2. Similarly for class 2.

And that's exactly what we've been doing!

So to recap,

- We **designed a simple training** setting where we **reward the robot** with 1 point when it answers correctly and 0 points when it answers incorrectly
- We summarized what we want to achieve in an **objective function** that describes the reward the robot gets, weighted by the probabilities of its responses
- We found the gradient ascent procedure to **maximize this objective function**
- And we got… the exact procedure we use when **training a model by calculating a cross entropy loss first and backpropagating through it**!

Now let's turn our focus back to the reinforcement learning setting. What are the differences between RL and the supervised ML setting?

## Multiple Timesteps

The first difference is that RL usually involves **multiple states and multiple timesteps**. In our setting, the robot starts with the input image, i.e., state *s*. After the robot gives you back an answer based on its prediction and collects the reward, the interaction between the robot and you is done.

In contrast, in RL problems agents usually interact with the environment over **multiple timesteps**, and they might transition to other states after the initial state.

The objective function then becomes
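Presumably:

$$J(\theta) = \mathbb{E}_{(s_1,a_1,\dots,s_T,a_T) \sim p_\theta} \left[\sum_{t=1}^{T} r(s_t,a_t)\right]$$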

Putting it in words: we maximize the **average sum of rewards over all timesteps**, **over all possible sequences (trajectories) of states and actions**, **weighted by the probability of each trajectory** occurring when actions are decided by parameter *θ*.

Note that *p_θ* is the joint distribution of a sequence of states and actions when actions are decided by the agent's model parameter *θ*. At each timestep, the agent's action is determined by *π_θ(a_t|s_t)*, where *π_θ* is a model parameterized by *θ*. *p_θ* is a **high-level abstraction** of how likely a sequence of states and actions is to occur when the agent makes decisions based on *π_θ* (i.e., *p_θ* is a placeholder for how often, theoretically, the agent takes on the trajectory; *π_θ(a|s)*, on the other hand, is the probability that the agent takes an action at a particular timestep. We don't actually know the value of *p_θ*, so later we'll rewrite it in terms of the model output *π_θ(a|s)*, whose value we do know).

Let's compare it with the objective we had earlier:

The **main differences** are

- We calculate the expectation over a sequence of *s* and *a* instead of just one pair.
- We maximize the sum of rewards from all timesteps in the trajectory instead of only the one-timestep reward from the image and answer.

## Deriving the Gradient Formula

We can apply the same manipulations to this objective to derive the gradient that we can use to update *θ*.

To recap, our goal is to find the gradient of *J(θ)* in the following form, for some *f*:
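Presumably:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\, f(\theta, \tau)\,\big]$$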

When we collect a batch of sampled sequences *s_1, a_1, r_1, …, s_T, a_T, r_T*, we can then update *θ* via stochastic gradient ascent:
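Presumably:

$$\theta \leftarrow \theta + \alpha \cdot \frac{1}{N}\sum_{i=1}^{N} f(\theta, \tau_i)$$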

To simplify things, let's denote the sequence of states and actions by a single variable *τ*.

So we hope to maximize the following objective function
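Presumably:

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\, r(\tau)\,\big], \qquad r(\tau) = \sum_{t=1}^{T} r(s_t, a_t)$$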

We can apply the same manipulations that we did before:

- Write the **expectation in terms of an integral**
- **Take the derivative** with respect to *θ* of *p_θ(τ)*
- Rewrite the gradient of *p_θ(τ)* as the **product of** *p_θ(τ)* **and something else**, to recover the form that **defines an expectation**

So we obtain
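Presumably:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\; r(\tau)\big]$$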

Voila! It's exactly what we wanted to find. To put it in words: we update *θ* **in the direction of the gradient of the log probability** of the sampled *τ* under the actions decided by *θ*, **weighted by the total reward** along the sampled *τ*. **This is exactly the formulation of policy gradients.**

If we extend the cross entropy analogy from earlier, the sums of rewards are basically labels for the trajectories, and *p_θ(τ)* is how likely *τ* is to happen under the model's prediction. The training process **encourages the model to predict a distribution similar to the distribution of rewards over different trajectories *τ***. (I believe this is a mathematically accurate statement, but correct me if I'm wrong. If you know about KL divergence, compare the quantity whose gradient is being taken with the KL divergence.)

We can do some more manipulations with conditional probabilities and the definition of *p_θ(τ)*. This process is well explained in this video (around 9:27). We finally obtain the following, which rewrites *p_θ(τ)* in terms of *π_θ(a_t|s_t)*, whose values we actually know:
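Presumably:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\right)\left(\sum_{t=1}^{T} r(s_t, a_t)\right)\right]$$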

Note that **when T = 1** (a single timestep), **this is the same as the gradient we obtained in our setting before**. In other words, supervised ML is a special case of RL where there is only one timestep and the reward is non-stochastic (see the next section).

## Another Difference: Estimating Rewards

Another difference between RL and supervised ML is how much we can trust the rewards. In supervised ML, the rewards are ground truth labels that come with the image samples. We're usually 100% sure the rewards are correct, and our robot can adjust its behavior toward these labels.

However, in RL problems, the rewards can be **more stochastic** (imagine that when you play a game, you can be in the same position twice but get different scores). So we have to **estimate the reward** for a particular state-action pair from historical rewards as we interact with the environment.

[Optional] Side thought: I was also wondering whether there's middle territory between supervised ML (where labels/rewards are 100% trustworthy) and RL (where rewards are more stochastic). It seems like when labels are noisy (contain some wrong labels), we're kind of in the middle? So would pseudo-labeling methods share some flavor with RL problems? Let me know your thoughts.

Technically, in the long run we should have enough historical rewards to understand the average reward behavior, but in the short run, a small sample size might produce **unstable**, biased estimates of it.

Worse, since agent behavior is updated from the rewards collected, if we collect low-quality rewards, we might fall into and get stuck at a bad policy. And it can take a long time to get out of there and back on the right track.

This is one of the challenges in RL that is still an ongoing research area. Manipulations of the rewards, and variants of policy gradient such as TRPO and PPO, are designed to handle this issue better; they have become more commonly used than vanilla PG.

## [Optional] Another Side Thought: Comparison to Sequential Supervised ML

One difference between our supervised ML setting and RL is that RL often involves multiple timesteps. I immediately had the question: then how does RL differ from training a sequential model like a Transformer or LSTM?

The answer definitely depends on the exact loss design used to train your favorite sequential model.

For now, let's say you train a sequence model *f(***x_1, x_2, …, x_T***)* to predict ***y_1, y_2, …, y_T***. For example, in a machine translation task, the **x**'s could be words of the input English sentence, and the **y**'s words of the output French sentence (each of **x_t, y_t** is a one-hot vector representation of a word).

We calculate the loss function on each sample by taking the sum of cross entropies between each output word prediction and the truth label. We then average it over a batch of samples and do backprop as follows:
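Reconstructed (with **y**_{i,t} the one-hot label vector and **ŷ**_{i,t} the predicted distribution at timestep *t* of sample *i*), presumably:

$$\theta \leftarrow \theta - \alpha\, \nabla_\theta \left[-\frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \mathbf{y}_{i,t} \cdot \log \hat{\mathbf{y}}_{i,t}\right]$$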

Putting it back into the policy gradient formulation, this to me is the same as calculating the gradient of an objective function as
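Presumably:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\, r(s_t, a_t)\right]$$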

The difference between this formulation and PG's formulation is that we are not multiplying the sum of log probabilities of all timesteps' predictions by the sum of rewards from all timesteps. Instead, we take the pairwise product of each timestep's log probability and reward, and sum them up.

This removes a lot of terms and thus drastically reduces the variance of the gradients, which might be part of what makes training a Transformer/LSTM in a supervised setting easier than an RL algorithm (in addition to the non-stochastic rewards in the supervised setting).

A method of reducing the variance of PG is introduced in this video: change the sum of rewards over all timesteps in PG to rewards-to-go (i.e., sum from *t' = t* to *t' = T*). This shares a similar flavor with the difference between PG and training a Transformer/LSTM in a supervised setting. While the rewards-to-go method makes the agent evaluate each state by potential future rewards, could we say that supervised sequential training makes the model focus only on the current timestep's correctness?
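As a small illustration of rewards-to-go (a hypothetical helper, not from the video): each timestep's weight becomes the suffix sum of rewards from *t* to *T* instead of the full-trajectory return.

```python
def rewards_to_go(rewards):
    """Suffix sums: R_t = r_t + r_{t+1} + ... + r_T."""
    out = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]

# With rewards [1, 0, 2] over three timesteps, the full-return weight
# would be 3 at every step; rewards-to-go gives [3.0, 2.0, 2.0] instead.
assert rewards_to_go([1, 0, 2]) == [3.0, 2.0, 2.0]
```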

Also, I tried to work backwards from this gradient expression to find the original *J(θ)* that yields it, so we could more directly interpret the objective of supervised sequential training. But I got stuck halfway. Let me know if you have any thoughts.

## Acknowledgments

The connection between policy gradient and cross entropy is not my own original idea. Thanks to this post for giving me the idea of expanding it to more fundamentally understand what cross entropy and policy gradient are doing.