Project Submitted By - Rishit Aggarwal (23BAI10329) Lakshya Mangla (23BAI10814) Tanmay Singh (23BAI10328) #110

Open
Rishitagg1 wants to merge 1 commit into MK-25source:main from Rishitagg1:main

Rishitagg1 commented Apr 8, 2026

DDPG Robot path planning code

This project focuses on designing an intelligent robot navigation system using Reinforcement Learning (RL) in a continuous environment. The objective is to enable a robot to learn the optimal path from a starting position to a target location while avoiding obstacles.

The problem is modeled as a Markov Decision Process (MDP), where the robot acts as an agent that interacts with its environment by taking actions and receiving rewards. Unlike traditional path-planning algorithms, this approach allows the robot to learn through experience rather than relying on predefined rules.

To handle continuous state and action spaces, the project uses the Deep Deterministic Policy Gradient (DDPG) algorithm. DDPG is an off-policy actor-critic method: a deterministic actor network maps each state to a continuous action, and a critic network estimates the value of that state-action pair, which in turn drives the actor update.

Copilot AI review requested due to automatic review settings April 8, 2026 17:48
Rishitagg1 changed the title from "DDPG Robot path planning code" to "Project Submitted By - Rishit Aggarwal (23BAI10329) Lakshya Mangla (23BAI10814) Tanmay Singh (23BAI10328)" Apr 8, 2026

Copilot AI left a comment

Pull request overview

Adds a beginner-friendly, single-file DDPG example for continuous-space robot navigation and documents how to run it.

Changes:

  • Added ddpg_robot.py implementing a minimal continuous 2D environment + DDPG agent training loop.
  • Added README.md describing the environment, prerequisites, and usage.

Reviewed changes

Copilot reviewed 1 out of 2 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| README.md | Introduces project overview, environment details, and run instructions. |
| ddpg_robot.py | Implements RobotEnv, replay buffer, actor/critic networks, DDPG training loop. |

Comment thread ddpg_robot.py
Comment on lines +91 to +93
self.target_actor = Actor(state_dim, action_dim)
self.target_critic = Critic(state_dim, action_dim)

Copilot AI Apr 8, 2026

Target networks are instantiated but never initialized with the online actor/critic weights. This makes the TD target start from unrelated random networks and can destabilize learning. After creating target_actor/target_critic, copy the actor/critic state into them (hard update) before training begins.
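
The hard update suggested above could look roughly like the following. This is a minimal sketch, not the PR's code: `make_net`, the layer sizes, and the `hard_update` helper are hypothetical stand-ins, since the actual `Actor`/`Critic` definitions are not shown in this review.

```python
import torch
import torch.nn as nn


def make_net(in_dim, out_dim):
    # Hypothetical stand-in for the PR's Actor/Critic; real layer sizes are unknown.
    return nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, out_dim))


def hard_update(target, source):
    # Copy every parameter and buffer from the online network into the target network.
    target.load_state_dict(source.state_dict())


def weights_match(a, b):
    # True when both networks hold identical parameter values.
    return all(torch.equal(p, q) for p, q in zip(a.parameters(), b.parameters()))


actor, target_actor = make_net(4, 2), make_net(4, 2)
critic, target_critic = make_net(6, 1), make_net(6, 1)

# Run once, right after the target networks are created and before training starts.
hard_update(target_actor, actor)
hard_update(target_critic, critic)
```

`load_state_dict` copies values into the target's own tensors, so later updates to the online networks do not leak into the targets.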

Copilot uses AI. Check for mistakes.
Comment thread ddpg_robot.py
Comment on lines +129 to +134
# Actor loss
actor_loss = -self.critic(s, self.actor(s)).mean()

self.actor_opt.zero_grad()
actor_loss.backward()
self.actor_opt.step()

Copilot AI Apr 8, 2026

Actor update backpropagates through the critic parameters as well, even though the critic optimizer step isn’t run here. This adds unnecessary compute/memory and leaves critic grads populated until the next critic update. Freeze critic parameters (or temporarily disable requires_grad) during the actor loss backward pass so gradients flow only through the action path.
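
One way to apply this is to toggle `requires_grad` on the critic parameters around the actor step. The sketch below assumes small stand-in networks (the `Critic` class, dimensions, and learning rate here are illustrative, not taken from `ddpg_robot.py`).

```python
import torch
import torch.nn as nn


class Critic(nn.Module):
    # Hypothetical stand-in: scores a (state, action) pair with one linear layer.
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.q = nn.Linear(state_dim + action_dim, 1)

    def forward(self, s, a):
        return self.q(torch.cat([s, a], dim=1))


actor = nn.Sequential(nn.Linear(4, 2), nn.Tanh())  # stand-in actor
critic = Critic(4, 2)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
s = torch.randn(8, 4)  # dummy batch of states

# Freeze the critic so backward() only accumulates gradients for the actor.
for p in critic.parameters():
    p.requires_grad_(False)

actor_loss = -critic(s, actor(s)).mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()

# Unfreeze so the next critic update can train it again.
for p in critic.parameters():
    p.requires_grad_(True)
```

Gradients still flow through the critic's computation to the action input, so the actor receives the same update as before; only the critic's own `.grad` fields stay untouched.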

Comment thread ddpg_robot.py
Comment on lines +148 to +155
episodes = 200

for ep in range(episodes):
    state = env.reset()
    total_reward = 0

    for _ in range(100):
        action = agent.select_action(state)

Copilot AI Apr 8, 2026

The training loop hardcodes 100 steps per episode, duplicating env.max_steps. This can drift if max_steps is changed and makes the environment config less authoritative. Use env.max_steps in the episode loop to keep these consistent.
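
A small illustration of reading the step limit from the environment. `StubEnv` and `run_episode` are hypothetical; the only things assumed about the real `RobotEnv` are the `max_steps` attribute named above and the `(next_state, reward, done)` return shape of `step` seen elsewhere in this PR.

```python
class StubEnv:
    """Hypothetical stand-in for RobotEnv; models only max_steps, reset and step."""

    max_steps = 100

    def reset(self):
        self.t = 0
        return [0.0, 0.0]

    def step(self, action):
        self.t += 1
        done = self.t >= self.max_steps
        return [0.0, 0.0], -1.0, done


def run_episode(env, policy):
    # The loop bound comes from the environment, so changing max_steps
    # in one place changes the episode length everywhere.
    state = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(env.max_steps):
        next_state, reward, done = env.step(policy(state))
        total_reward += reward
        steps += 1
        state = next_state
        if done:
            break
    return total_reward, steps
```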

Comment thread ddpg_robot.py
Comment on lines +101 to +104
def select_action(self, state):
    state = torch.FloatTensor(state).unsqueeze(0)
    action = self.actor(state).detach().numpy()[0]
    return action

Copilot AI Apr 8, 2026

select_action performs inference without a torch.no_grad() context, which incurs autograd overhead even though the result is detached. Wrap the forward pass in torch.no_grad() (and optionally set the actor to eval mode during action selection) to reduce overhead.
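
As a sketch of that change, written as a free function rather than the PR's method so it runs on its own. The `actor` below is a hypothetical stand-in with assumed dimensions.

```python
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(4, 2), nn.Tanh())  # hypothetical stand-in actor


def select_action(actor, state):
    # Convert the raw state to a batch of size 1.
    state_t = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
    # No graph is recorded inside this block, so no detach() is needed afterwards.
    with torch.no_grad():
        action = actor(state_t)
    return action.squeeze(0).numpy()


action = select_action(actor, [0.1, 0.2, 0.3, 0.4])
```

Calling `actor.eval()` before action selection only matters if the network contains layers such as dropout or batch norm; it would need a matching `actor.train()` before the next update.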

Comment thread ddpg_robot.py
Comment on lines +45 to +53
def __init__(self, capacity=10000):
    self.buffer = []
    self.capacity = capacity

def add(self, s, a, r, s2, d):
    if len(self.buffer) >= self.capacity:
        self.buffer.pop(0)
    self.buffer.append((s, a, r, s2, d))

Copilot AI Apr 8, 2026

ReplayBuffer evicts items with list.pop(0), which is O(n) per eviction and will slow down once capacity is reached. Consider using collections.deque with a maxlen (or a ring buffer) to make appends/evictions O(1).
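
A deque-based version might look like this. The `add` signature and default capacity follow the snippet under review; the `sample` and `__len__` methods are assumptions, since the PR's sampling code is not shown here.

```python
import random
from collections import deque


class ReplayBuffer:
    def __init__(self, capacity=10000):
        # A bounded deque drops the oldest transition automatically once full.
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s2, d):
        # O(1) append; no manual eviction needed.
        self.buffer.append((s, a, r, s2, d))

    def sample(self, batch_size):
        # Hypothetical helper: uniform sampling without replacement.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Random indexing into a deque is slower than into a list, so for very large buffers a preallocated ring buffer may be the better trade-off.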

Comment thread ddpg_robot.py
Comment on lines +118 to +123
# Critic loss
target_a = self.target_actor(s2)
target_q = self.target_critic(s2, target_a)
y = r + self.gamma * target_q * (1 - d)

critic_loss = nn.MSELoss()(self.critic(s, a), y.detach())

Copilot AI Apr 8, 2026

Target Q computation is part of the target value and doesn’t need gradients, but it currently builds a graph and then detaches later. Wrap target_actor/target_critic forward passes used for TD targets in torch.no_grad() to avoid unnecessary graph construction and reduce memory use.
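
A sketch of the TD target computed under `torch.no_grad()`. The networks, batch shapes, and the `gamma=0.99` default are assumptions for illustration; the formula itself mirrors the lines under review.

```python
import torch
import torch.nn as nn


class Critic(nn.Module):
    # Hypothetical stand-in critic taking (state, action).
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.q = nn.Linear(state_dim + action_dim, 1)

    def forward(self, s, a):
        return self.q(torch.cat([s, a], dim=1))


def td_target(target_actor, target_critic, r, s2, d, gamma=0.99):
    # Nothing in this block is tracked by autograd, so the result needs no detach().
    with torch.no_grad():
        target_a = target_actor(s2)
        target_q = target_critic(s2, target_a)
        return r + gamma * target_q * (1 - d)


target_actor = nn.Sequential(nn.Linear(4, 2), nn.Tanh())
target_critic = Critic(4, 2)
r, s2, d = torch.randn(8, 1), torch.randn(8, 4), torch.zeros(8, 1)
y = td_target(target_actor, target_critic, r, s2, d)
```

With this in place, the critic loss can use `y` directly in place of `y.detach()`.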

Comment thread ddpg_robot.py
Comment on lines +154 to +159
for _ in range(100):
    action = agent.select_action(state)
    next_state, reward, done = env.step(action)

    agent.buffer.add(state, action, reward, next_state, done)
    agent.train()

Copilot AI Apr 8, 2026

DDPG typically requires explicit exploration noise (e.g., Gaussian noise or OU noise) added to the selected action during training; without it, behavior can be nearly deterministic and learning may stall. Consider adding configurable action noise in the training loop while keeping evaluation noise-free.
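
A minimal sketch of Gaussian action noise. The `noisy_action` helper, the `sigma` default, and the `[-1, 1]` action bounds are assumptions; the environment's real action range is not stated in this review.

```python
import numpy as np


def noisy_action(action, rng, sigma=0.1, low=-1.0, high=1.0):
    # Add zero-mean Gaussian noise, then clip back into the assumed action range.
    noise = rng.normal(0.0, sigma, size=np.shape(action))
    return np.clip(np.asarray(action) + noise, low, high)


rng = np.random.default_rng(0)
greedy = np.array([0.95, -0.95])  # e.g. the output of agent.select_action(state)
explore = noisy_action(greedy, rng, sigma=0.5)   # training-time action
evaluate = noisy_action(greedy, rng, sigma=0.0)  # sigma=0 leaves the action unchanged
```

During training the loop would pass the noisy action to `env.step` and store that same action in the replay buffer; evaluation runs would use the actor output directly.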
