Policy-based reinforcement learning methods directly optimize a parameterized policy.
- True
- False
What is the key difference between value-based and policy-based reinforcement learning methods?
- Value-based methods use convolutional networks, while policy-based methods do not
- Policy-based methods learn a Q-function before training a policy
- Value-based methods optimize value functions, policy-based methods optimize policies directly
- Policy-based methods only work for discrete action spaces
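A compact way to see the contrast (a sketch using standard textbook update rules, not formulas taken verbatim from the lecture): value-based methods improve a value estimate and derive the policy from it, while policy-based methods adjust the policy parameters \(\theta\) directly.

```latex
% Value-based (e.g., Q-learning): update a value estimate, act greedily w.r.t. it.
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \big[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big]

% Policy-based (e.g., REINFORCE): update the policy parameters directly,
% where G_t is the sampled return.
\theta \leftarrow \theta + \alpha \, G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)
```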
Which of the following are components of the REINFORCE algorithm?
- Sampling trajectories using the current policy
- Computing gradients with respect to a value function
- Applying the log-derivative trick to compute the gradient
- Using a fixed reward function for supervised updates
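For reference, a minimal REINFORCE loop might look like the sketch below. The two-armed bandit environment, seed, learning rate, and reward values are assumptions made for illustration, not details from the lecture. It exercises the two correct components above: sampling from the current policy and applying the log-derivative trick (for a softmax policy, \(\nabla_\theta \log \pi_\theta(a) = \text{onehot}(a) - \text{probs}\)).

```python
# Minimal REINFORCE sketch on a two-armed bandit (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                 # policy parameters: one logit per action
alpha = 0.1                         # learning rate (assumed)
true_reward = np.array([1.0, 0.0])  # action 0 is better (assumed)

def policy(theta):
    """Softmax over action logits."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

for episode in range(500):
    probs = policy(theta)
    # 1) Sample a trajectory (here a single action) from the current policy.
    a = rng.choice(2, p=probs)
    r = true_reward[a] + rng.normal(scale=0.1)
    # 2) Log-derivative trick: grad of log softmax is onehot(a) - probs.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    # 3) Ascend the gradient, weighting by the sampled return.
    theta += alpha * r * grad_log_pi

print("Learned action probabilities:", policy(theta))  # converges toward [1, 0]
```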
Why is the log-derivative trick used in the derivation of the policy gradient?
- To allow closed-form maximization of the value function
- To avoid computing derivatives of transition dynamics
- To simplify the integral of the gradient of a probability density
- To ensure the Q-function remains differentiable
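Concretely, the trick rewrites the gradient of an expectation as an expectation of a gradient, so it can be estimated by sampling. The standard derivation below matches the correct options above (simplifying the integral, and avoiding derivatives of the transition dynamics):

```latex
\nabla_\theta \, \mathbb{E}_{\tau \sim p_\theta}\big[R(\tau)\big]
  = \nabla_\theta \!\int p_\theta(\tau)\, R(\tau)\, d\tau
  = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau
  = \mathbb{E}_{\tau \sim p_\theta}\big[R(\tau)\, \nabla_\theta \log p_\theta(\tau)\big]
```

Because \(\log p_\theta(\tau) = \log p(s_1) + \sum_t \big[\log \pi_\theta(a_t \mid s_t) + \log p(s_{t+1} \mid s_t, a_t)\big]\) and the dynamics terms do not depend on \(\theta\), they vanish under the gradient:

```latex
\nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)
```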
In REINFORCE, the gradient update increases the likelihood of action sequences that result in higher rewards.
- True
- False
What are some limitations of REINFORCE as discussed in the lecture?
- It cannot compute gradients for deterministic policies
- It suffers from high variance in gradient estimates
- It requires knowledge of transition probabilities
- It does not assign credit to individual actions within a trajectory
What is the role of a baseline in policy gradient algorithms?
- To normalize rewards across batches
- To reduce variance in gradient estimation
- To ensure deterministic transitions
- To maximize log probability of actions
Subtracting a baseline that does not depend on actions changes the mean of the policy gradient estimate.
- True
- False
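The standard argument, sketched below, shows why such a baseline leaves the mean of the estimate unchanged: for any baseline \(b(s)\) that does not depend on the action, the extra term has zero expectation, so only the variance is affected.

```latex
\mathbb{E}_{a \sim \pi_\theta}\big[b(s)\, \nabla_\theta \log \pi_\theta(a \mid s)\big]
  = b(s) \int \pi_\theta(a \mid s)\,
      \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}\, da
  = b(s)\, \nabla_\theta \!\int \pi_\theta(a \mid s)\, da
  = b(s)\, \nabla_\theta 1
  = 0
```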
In the Advantage Actor-Critic method, the gradient is scaled by:
- The average reward over all trajectories
- The probability of selecting each action
- The advantage function (Q minus V)
- The entropy of the action distribution
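In symbols, the standard form of this gradient is the sketch below, consistent with the option naming Q minus V:

```latex
\nabla_\theta J(\theta) \approx
  \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)\big],
\qquad
A(s_t, a_t) = Q(s_t, a_t) - V(s_t)
```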
Which elements are used in the actor-critic algorithm?
- A learned Q-function
- Policy gradient updates
- Advantage estimates
- A reward baseline derived from the policy’s V-function
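To tie these elements together, here is a minimal one-step actor-critic sketch. The two-state chain environment, learning rates, and step count are assumptions made for illustration; this variant estimates the advantage with the TD error, using the critic's V-function as the baseline, while variants that learn Q explicitly follow the same pattern.

```python
# One-step actor-critic sketch on a tiny 2-state chain MDP (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 2, 2, 0.9
theta = np.zeros((n_states, n_actions))   # actor: logits per state
V = np.zeros(n_states)                    # critic: state-value table
alpha_actor, alpha_critic = 0.1, 0.2      # learning rates (assumed)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def step(s, a):
    """Assumed dynamics: action 0 stays (reward 0); action 1 moves to the
    other state, paying reward 1 only when leaving state 0."""
    if a == 0:
        return s, 0.0
    return 1 - s, 1.0 if s == 0 else 0.0

s = 0
for t in range(5000):
    probs = softmax(theta[s])
    a = rng.choice(n_actions, p=probs)
    s_next, r = step(s, a)
    # TD error doubles as a one-sample advantage estimate (baseline = V(s)).
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha_critic * td_error                    # critic update
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0                              # grad of log softmax
    theta[s] += alpha_actor * td_error * grad_log_pi   # actor (policy gradient) update
    s = s_next

print("pi(state 0):", softmax(theta[0]))  # should come to favor action 1
```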