
Quiz: Policy Gradients and Actor-Critic Methods


Question 1 (True/False)

Policy-based reinforcement learning methods directly optimize a parameterized policy.

  • True
  • False

Question 2 (Multiple Choice)

What is the key difference between value-based and policy-based reinforcement learning methods?

  • Value-based methods use convolutional networks; policy-based methods do not
  • Policy-based methods learn a Q-function before training a policy
  • Value-based methods optimize value functions; policy-based methods optimize the policy directly
  • Policy-based methods only work for discrete action spaces

Question 3 (Multiple Select)

Which of the following are components of the REINFORCE algorithm?

  • Sampling trajectories using the current policy
  • Computing gradients with respect to a value function
  • Applying the log-derivative trick to compute the gradient
  • Using a fixed reward function for supervised updates
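
For reference, here is a minimal REINFORCE sketch on a toy two-armed bandit, assuming a softmax policy and one-step trajectories; the names `theta` and `true_means` and the hyperparameters are illustrative, not from the lecture:

```python
# Minimal REINFORCE sketch (illustrative). A softmax policy over logits
# `theta` is updated with the score-function gradient: R * d/d(theta) log pi(a).
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                 # policy parameters: one logit per action
true_means = np.array([0.2, 0.8])   # assumed reward means for the two arms
lr = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)             # sample a (one-step) trajectory
    r = rng.normal(true_means[a], 0.1)  # observe reward
    grad_log_pi = np.eye(2)[a] - pi     # d/d(theta) of log softmax(theta)[a]
    theta += lr * r * grad_log_pi       # REINFORCE update

print("final action probabilities:", softmax(theta))  # should favor arm 1
```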

Question 4 (Multiple Choice)

Why is the log-derivative trick used in the derivation of the policy gradient?

  • To allow closed-form maximization of the value function
  • To avoid computing derivatives of transition dynamics
  • To simplify the integral of the gradient of a probability density
  • To ensure the Q-function remains differentiable
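
For reference, the trick uses the identity $\nabla_\theta p_\theta = p_\theta \nabla_\theta \log p_\theta$ to turn the gradient of an expectation into an expectation of a gradient, which can then be estimated from sampled trajectories:

```math
\nabla_\theta J(\theta)
= \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau
= \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau
= \mathbb{E}_{\tau \sim p_\theta}\!\big[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\big]
```

Since $\log p_\theta(\tau) = \log p(s_0) + \sum_t \log \pi_\theta(a_t \mid s_t) + \sum_t \log p(s_{t+1} \mid s_t, a_t)$, the transition-dynamics terms carry no $\theta$-dependence and drop out of the gradient.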

Question 5 (True/False)

In REINFORCE, the gradient update increases the likelihood of action sequences that result in higher rewards.

  • True
  • False

Question 6 (Multiple Select)

What are some limitations of REINFORCE as discussed in the lecture?

  • It cannot compute gradients for deterministic policies
  • It suffers from high variance in gradient estimates
  • It requires knowledge of transition probabilities
  • It does not assign credit to individual actions within a trajectory

Question 7 (Multiple Choice)

What is the role of a baseline in policy gradient algorithms?

  • To normalize rewards across batches
  • To reduce variance in gradient estimation
  • To ensure deterministic transitions
  • To maximize log probability of actions

Question 8 (True/False)

Subtracting a baseline that does not depend on actions changes the mean of the policy gradient estimate.

  • True
  • False
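
For reference, for any baseline $b(s)$ that does not depend on the action, the subtracted term has zero mean, so the gradient estimate stays unbiased:

```math
\mathbb{E}_{a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big]
= b(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)
= b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s)
= b(s)\, \nabla_\theta 1 = 0
```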

Question 9 (Multiple Choice)

In the Advantage Actor-Critic method, the gradient is scaled by:

  • The average reward over all trajectories
  • The probability of selecting each action
  • The advantage function (Q minus V)
  • The entropy of the action distribution
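
For reference, the advantage function and the resulting gradient are:

```math
A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s),
\qquad
\nabla_\theta J(\theta) = \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, A^\pi(s, a)\big]
```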

Question 10 (Multiple Select)

Which elements are used in the actor-critic algorithm?

  • A learned Q-function
  • Policy gradient updates
  • Advantage estimates
  • A reward baseline derived from the policy’s V-function
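
For reference, a minimal tabular one-step actor-critic sketch on a toy two-state MDP, assuming the TD error is used as the advantage estimate; the dynamics in `step` and all hyperparameters are illustrative, not from the lecture:

```python
# Minimal tabular actor-critic sketch (illustrative). The critic learns V(s);
# the TD error r + gamma*V(s') - V(s) serves as an advantage estimate that
# scales the actor's policy-gradient update.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 2, 2, 0.9
theta = np.zeros((n_states, n_actions))  # actor: softmax logits per state
V = np.zeros(n_states)                   # critic: state-value table
lr_actor, lr_critic = 0.1, 0.2

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def step(s, a):
    # toy dynamics: action 1 moves to the other state; reward 1 only for
    # taking action 1 in state 1
    s_next = (s + a) % n_states
    r = 1.0 if (s == 1 and a == 1) else 0.0
    return s_next, r

s = 0
for t in range(5000):
    pi = softmax(theta[s])
    a = rng.choice(n_actions, p=pi)
    s_next, r = step(s, a)
    td_error = r + gamma * V[s_next] - V[s]   # advantage estimate
    V[s] += lr_critic * td_error              # critic update
    theta[s] += lr_actor * td_error * (np.eye(n_actions)[a] - pi)  # actor update
    s = s_next

print("policy at state 1:", softmax(theta[1]))  # should favor action 1
```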