Policy-based reinforcement learning methods directly optimize a parameterized policy.
- True
- False
What is the key difference between value-based and policy-based reinforcement learning methods?
- Value-based methods use convolutional networks, while policy-based methods do not
- Policy-based methods learn a Q-function before training a policy
- Value-based methods optimize value functions, policy-based methods optimize policies directly
- Policy-based methods only work for discrete action spaces
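A compact way to see the contrast (a sketch using standard textbook update rules, not formulas taken verbatim from the lecture): value-based methods improve a value estimate and derive the policy from it, while policy-based methods adjust the policy parameters \(\theta\) directly.

```latex
% Value-based (e.g., Q-learning): update a value estimate, act greedily w.r.t. it.
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \big[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big]

% Policy-based (e.g., REINFORCE): update the policy parameters directly,
% where G_t is the sampled return.
\theta \leftarrow \theta + \alpha \, G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)
```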
Which of the following are components of the REINFORCE algorithm?
- Sampling trajectories using the current policy
- Computing gradients with respect to a value function
- Applying the log-derivative trick to compute the gradient
- Using a fixed reward function for supervised updates
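For reference, a minimal REINFORCE loop might look like the sketch below. The two-armed bandit environment, seed, learning rate, and reward values are assumptions made for illustration, not details from the lecture. It exercises the two correct components above: sampling from the current policy and applying the log-derivative trick (for a softmax policy, \(\nabla_\theta \log \pi_\theta(a) = \text{onehot}(a) - \text{probs}\)).

```python
# Minimal REINFORCE sketch on a two-armed bandit (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                 # policy parameters: one logit per action
alpha = 0.1                         # learning rate (assumed)
true_reward = np.array([1.0, 0.0])  # action 0 is better (assumed)

def policy(theta):
    """Softmax over action logits."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

for episode in range(500):
    probs = policy(theta)
    # 1) Sample a trajectory (here a single action) from the current policy.
    a = rng.choice(2, p=probs)
    r = true_reward[a] + rng.normal(scale=0.1)
    # 2) Log-derivative trick: grad of log softmax is onehot(a) - probs.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    # 3) Ascend the gradient, weighting by the sampled return.
    theta += alpha * r * grad_log_pi

print("Learned action probabilities:", policy(theta))  # converges toward [1, 0]
```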
Why is the log-derivative trick used in the derivation of the policy gradient?
- To allow closed-form maximization of the value function
- To avoid computing derivatives of transition dynamics
- To simplify the integral of the gradient of a probability density
- To ensure the Q-function remains differentiable
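Concretely, the trick rewrites the gradient of an expectation as an expectation of a gradient, so it can be estimated by sampling. The standard derivation below matches the correct options above (simplifying the integral, and avoiding derivatives of the transition dynamics):

```latex
\nabla_\theta \, \mathbb{E}_{\tau \sim p_\theta}\big[R(\tau)\big]
  = \nabla_\theta \!\int p_\theta(\tau)\, R(\tau)\, d\tau
  = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau
  = \mathbb{E}_{\tau \sim p_\theta}\big[R(\tau)\, \nabla_\theta \log p_\theta(\tau)\big]
```

Because \(\log p_\theta(\tau) = \log p(s_1) + \sum_t \big[\log \pi_\theta(a_t \mid s_t) + \log p(s_{t+1} \mid s_t, a_t)\big]\) and the dynamics terms do not depend on \(\theta\), they vanish under the gradient:

```latex
\nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)
```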
In REINFORCE, the gradient update increases the likelihood of action sequences that result in higher rewards.
- True
- False
What are some limitations of REINFORCE as discussed in the lecture?
- It cannot compute gradients for deterministic policies
- It suffers from high variance in gradient estimates
- It requires knowledge of transition probabilities
- It does not assign credit to individual actions within a trajectory
What is the role of a baseline in policy gradient algorithms?
- To normalize rewards across batches
- To reduce variance in gradient estimation
- To ensure deterministic transitions
- To maximize log probability of actions
Subtracting a baseline that does not depend on actions changes the mean of the policy gradient estimate.
- True
- False
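The standard argument, sketched below, shows why such a baseline leaves the mean of the estimate unchanged: for any baseline \(b(s)\) that does not depend on the action, the extra term has zero expectation, so only the variance is affected.

```latex
\mathbb{E}_{a \sim \pi_\theta}\big[b(s)\, \nabla_\theta \log \pi_\theta(a \mid s)\big]
  = b(s) \int \pi_\theta(a \mid s)\,
      \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}\, da
  = b(s)\, \nabla_\theta \!\int \pi_\theta(a \mid s)\, da
  = b(s)\, \nabla_\theta 1
  = 0
```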
In the Advantage Actor-Critic method, the gradient is scaled by:
- The average reward over all trajectories
- The probability of selecting each action
- The advantage function (Q minus V)
- The entropy of the action distribution
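In symbols, the standard form of this gradient is the sketch below, consistent with the option naming Q minus V:

```latex
\nabla_\theta J(\theta) \approx
  \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)\big],
\qquad
A(s_t, a_t) = Q(s_t, a_t) - V(s_t)
```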
Which elements are used in the actor-critic algorithm?
- A learned Q-function
- Policy gradient updates
- Advantage estimates
- A reward baseline derived from the policy’s V-function
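To tie these elements together, here is a minimal one-step actor-critic sketch. The two-state chain environment, learning rates, and step count are assumptions made for illustration; this variant estimates the advantage with the TD error, using the critic's V-function as the baseline, while variants that learn Q explicitly follow the same pattern.

```python
# One-step actor-critic sketch on a tiny 2-state chain MDP (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 2, 2, 0.9
theta = np.zeros((n_states, n_actions))   # actor: logits per state
V = np.zeros(n_states)                    # critic: state-value table
alpha_actor, alpha_critic = 0.1, 0.2      # learning rates (assumed)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def step(s, a):
    """Assumed dynamics: action 0 stays (reward 0); action 1 moves to the
    other state, paying reward 1 only when leaving state 0."""
    if a == 0:
        return s, 0.0
    return 1 - s, 1.0 if s == 0 else 0.0

s = 0
for t in range(5000):
    probs = softmax(theta[s])
    a = rng.choice(n_actions, p=probs)
    s_next, r = step(s, a)
    # TD error doubles as a one-sample advantage estimate (baseline = V(s)).
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha_critic * td_error                    # critic update
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0                              # grad of log softmax
    theta[s] += alpha_actor * td_error * grad_log_pi   # actor (policy gradient) update
    s = s_next

print("pi(state 0):", softmax(theta[0]))  # should come to favor action 1
```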