Project Submitted By - Rishit Aggarwal (23BAI10329), Lakshya Mangla (23BAI10814), Tanmay Singh (23BAI10328) #110
Rishitagg1 wants to merge 1 commit into
Pull request overview
Adds a beginner-friendly, single-file DDPG example for continuous-space robot navigation and documents how to run it.
Changes:
- Added `ddpg_robot.py`, implementing a minimal continuous 2D environment and a DDPG agent training loop.
- Added `README.md`, describing the environment, prerequisites, and usage.
Reviewed changes
Copilot reviewed 1 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| README.md | Introduces project overview, environment details, and run instructions. |
| ddpg_robot.py | Implements RobotEnv, replay buffer, actor/critic networks, DDPG training loop. |

    self.target_actor = Actor(state_dim, action_dim)
    self.target_critic = Critic(state_dim, action_dim)

Target networks are instantiated but never initialized with the online actor/critic weights. This makes the TD target start from unrelated random networks and can destabilize learning. After creating target_actor/target_critic, copy the actor/critic state into them (hard update) before training begins.
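
A minimal sketch of the suggested hard update, assuming PyTorch modules. The `hard_update` helper and the `nn.Linear` stand-ins are hypothetical; the PR's `Actor` and `Critic` classes are not shown here, so they are replaced with small placeholder layers.

```python
import torch
import torch.nn as nn


def hard_update(target: nn.Module, source: nn.Module) -> None:
    """Copy all parameters and buffers from source into target (hypothetical helper)."""
    target.load_state_dict(source.state_dict())


# Stand-ins for the PR's Actor networks (assumed shapes, for illustration only).
actor = nn.Linear(4, 2)
target_actor = nn.Linear(4, 2)

# Run once right after the target networks are created, before training starts.
hard_update(target_actor, actor)
```

The same call would apply to the critic pair, for example `hard_update(self.target_critic, self.critic)` inside the agent's constructor.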

    # Actor loss
    actor_loss = -self.critic(s, self.actor(s)).mean()

    self.actor_opt.zero_grad()
    actor_loss.backward()
    self.actor_opt.step()

Actor update backpropagates through the critic parameters as well, even though the critic optimizer step isn’t run here. This adds unnecessary compute/memory and leaves critic grads populated until the next critic update. Freeze critic parameters (or temporarily disable requires_grad) during the actor loss backward pass so gradients flow only through the action path.
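
One way to do this, sketched under assumptions: the critic below is a placeholder `nn.Linear` that takes the concatenated state and action, whereas the PR's critic takes `(s, a)` as two arguments. The freeze/unfreeze pattern with `requires_grad_` is the part being illustrated.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 4, 2  # assumed sizes for illustration

# Stand-in networks; the real Actor/Critic definitions live in ddpg_robot.py.
actor = nn.Linear(state_dim, action_dim)
critic = nn.Linear(state_dim + action_dim, 1)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

s = torch.randn(8, state_dim)  # dummy batch of states

# Freeze the critic so backward() only accumulates gradients in the actor.
for p in critic.parameters():
    p.requires_grad_(False)

actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()

# Unfreeze the critic so its own update can run on the next training step.
for p in critic.parameters():
    p.requires_grad_(True)
```

Gradients still flow through the critic's computation to the action input, so the actor is updated as before; only the critic's parameter gradients are skipped.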

    episodes = 200

    for ep in range(episodes):
        state = env.reset()
        total_reward = 0

        for _ in range(100):
            action = agent.select_action(state)

The training loop hardcodes 100 steps per episode, duplicating env.max_steps. This can drift if max_steps is changed and makes the environment config less authoritative. Use env.max_steps in the episode loop to keep these consistent.

    def select_action(self, state):
        state = torch.FloatTensor(state).unsqueeze(0)
        action = self.actor(state).detach().numpy()[0]
        return action

select_action performs inference without a torch.no_grad() context, which incurs autograd overhead even though the result is detached. Wrap the forward pass in torch.no_grad() (and optionally set the actor to eval mode during action selection) to reduce overhead.
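
A possible shape for this change, written as a free function so it runs on its own. The `nn.Sequential` actor is a stand-in and the 4-dimensional state is an assumption; in the PR this logic would sit inside the agent's `select_action` method.

```python
import numpy as np
import torch
import torch.nn as nn

# Stand-in actor with tanh-bounded outputs (assumed architecture).
actor = nn.Sequential(nn.Linear(4, 2), nn.Tanh())


def select_action(actor: nn.Module, state) -> np.ndarray:
    """Run the actor on a single state without building an autograd graph."""
    state_t = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        action = actor(state_t)
    # No .detach() needed: tensors created under no_grad do not require grad.
    return action.squeeze(0).numpy()


action = select_action(actor, np.zeros(4, dtype=np.float32))
```

If the actor ever gains dropout or batch-norm layers, calling `actor.eval()` before selection and `actor.train()` afterwards would also matter; for plain linear layers it makes no difference.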

    def __init__(self, capacity=10000):
        self.buffer = []
        self.capacity = capacity

    def add(self, s, a, r, s2, d):
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
        self.buffer.append((s, a, r, s2, d))

ReplayBuffer evicts items with list.pop(0), which is O(n) per eviction and will slow down once capacity is reached. Consider using collections.deque with a maxlen (or a ring buffer) to make appends/evictions O(1).
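
A sketch of the `deque`-based variant. The `add` signature matches the snippet above; the `sample` and `__len__` methods are assumptions, since the PR's versions are not shown in this thread.

```python
import random
from collections import deque


class ReplayBuffer:
    def __init__(self, capacity=10000):
        # deque with maxlen drops the oldest item automatically on append.
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s2, d):
        self.buffer.append((s, a, r, s2, d))

    def sample(self, batch_size):
        # Hypothetical sampler: uniform sampling without replacement.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)


# Usage: fill past capacity and confirm the oldest transitions were evicted.
buf = ReplayBuffer(capacity=3)
for i in range(5):
    buf.add(i, 0.0, 0.0, i + 1, False)
batch = buf.sample(2)
```

Indexing into the middle of a `deque` is not constant time, so for very large buffers a preallocated NumPy ring buffer may sample faster; for a capacity of 10000 the difference is unlikely to matter.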

    # Critic loss
    target_a = self.target_actor(s2)
    target_q = self.target_critic(s2, target_a)
    y = r + self.gamma * target_q * (1 - d)

    critic_loss = nn.MSELoss()(self.critic(s, a), y.detach())

Target Q computation is part of the target value and doesn’t need gradients, but it currently builds a graph and then detaches later. Wrap target_actor/target_critic forward passes used for TD targets in torch.no_grad() to avoid unnecessary graph construction and reduce memory use.
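
The change could look roughly like the following. All networks are `nn.Linear` placeholders that take concatenated inputs (the PR's critic takes `(s, a)` separately), and the batch tensors are random dummies with assumed shapes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, action_dim, gamma = 4, 2, 0.99  # assumed values

# Stand-in networks for illustration only.
target_actor = nn.Linear(state_dim, action_dim)
target_critic = nn.Linear(state_dim + action_dim, 1)
critic = nn.Linear(state_dim + action_dim, 1)

# Dummy batch: states, actions, rewards, next states, done flags.
s = torch.randn(8, state_dim)
a = torch.randn(8, action_dim)
r = torch.randn(8, 1)
s2 = torch.randn(8, state_dim)
d = torch.zeros(8, 1)

# Build the TD target without recording an autograd graph.
with torch.no_grad():
    target_a = target_actor(s2)
    target_q = target_critic(torch.cat([s2, target_a], dim=1))
    y = r + gamma * target_q * (1 - d)

# y already carries no gradient history, so no .detach() is needed here.
critic_loss = F.mse_loss(critic(torch.cat([s, a], dim=1)), y)
critic_loss.backward()
```

Shaping `r` and `d` as `(batch, 1)` keeps the target the same shape as the critic output and avoids silent broadcasting in the loss.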

    for _ in range(100):
        action = agent.select_action(state)
        next_state, reward, done = env.step(action)

        agent.buffer.add(state, action, reward, next_state, done)
        agent.train()

DDPG typically requires explicit exploration noise (e.g., Gaussian noise or OU noise) added to the selected action during training; without it, behavior can be nearly deterministic and learning may stall. Consider adding configurable action noise in the training loop while keeping evaluation noise-free.
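
A minimal sketch of Gaussian action noise. The helper name, the default `sigma`, and the `[-1, 1]` action bounds are all assumptions; the environment's real action range should be used instead.

```python
import numpy as np


def add_exploration_noise(action, sigma=0.1, low=-1.0, high=1.0, rng=None):
    """Add zero-mean Gaussian noise to an action and clip it to the bounds.

    Hypothetical helper: sigma, low, and high are illustrative defaults.
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = np.asarray(action) + rng.normal(0.0, sigma, size=np.shape(action))
    return np.clip(noisy, low, high)


# Training: perturb the deterministic action. Evaluation: pass sigma=0.0.
rng = np.random.default_rng(0)
train_action = add_exploration_noise(np.array([0.95, -0.2]), sigma=0.3, rng=rng)
eval_action = add_exploration_noise(np.array([0.95, -0.2]), sigma=0.0, rng=rng)
```

In the training loop this would wrap the result of `agent.select_action(state)` before it is passed to `env.step`, with the noisy action also being the one stored in the replay buffer.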
DDPG Robot path planning code
This project focuses on designing an intelligent robot navigation system using Reinforcement Learning (RL) in a continuous environment. The objective is to enable a robot to learn the optimal path from a starting position to a target location while avoiding obstacles.
The problem is modeled as a Markov Decision Process (MDP), where the robot acts as an agent that interacts with its environment by taking actions and receiving rewards. Unlike traditional path-planning algorithms, this approach allows the robot to learn through experience rather than relying on predefined rules.
To handle continuous state and action spaces, the project uses the Deep Deterministic Policy Gradient (DDPG) algorithm. DDPG is an off-policy actor-critic method: a deterministic actor network maps each state to a continuous action, and a critic network estimates the Q-value of that state-action pair to guide the actor's updates.