This research focuses on solving the Capacitated Vehicle Routing Problem (CVRP), an NP-hard optimization problem critical for logistics companies aiming to minimize delivery costs while adhering to vehicle capacity constraints. Traditional approaches struggle with this trade-off: exact algorithms do not scale to large instances, while hand-crafted heuristics sacrifice solution quality. Leveraging reinforcement learning (RL), we implemented two strategies: a value-based Deep Q-learning approach and a policy-based REINFORCE method. The Deep Q-network approximates Q-values for state-action pairs, using experience replay and the Huber loss for stability and sample efficiency. REINFORCE optimizes the policy directly, bypassing temporal-difference targets and weighting policy updates by cumulative rewards.
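To make the two update rules concrete, the following is a minimal PyTorch sketch of one value-based step and one policy-gradient step. The network sizes, state encoding, discount factor, and toy data are illustrative assumptions, not the architecture or hyperparameters used in our experiments.

```python
import random
from collections import deque

import torch
import torch.nn as nn

# Illustrative sizes only (assumptions, not the paper's settings).
STATE_DIM, N_ACTIONS, GAMMA = 8, 20, 0.99

# --- Value-based: Deep Q-network with experience replay and Huber loss ---
q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
q_opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)  # experience replay buffer of (s, a, r, s', done)

def dqn_update(batch_size=32):
    """One Q-learning step: sample past transitions and regress Q(s, a)
    toward the bootstrapped target, using the Huber (smooth L1) loss."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s, a, r, s2, done = map(torch.stack, zip(*batch))
    q_sa = q_net(s).gather(1, a.view(-1, 1)).squeeze(1)
    with torch.no_grad():
        target = r + GAMMA * q_net(s2).max(1).values * (1 - done)
    loss = nn.functional.smooth_l1_loss(q_sa, target)  # Huber loss
    q_opt.zero_grad()
    loss.backward()
    q_opt.step()

# --- Policy-based: REINFORCE driven purely by cumulative rewards ---
policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
p_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, rewards):
    """One policy-gradient step: no temporal-difference target, the
    discounted return of the sampled route weights each log-probability."""
    returns, g = [], 0.0
    for r in reversed(rewards):              # cumulative returns-to-go
        g = r + GAMMA * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    logits = policy(torch.stack(states))
    log_p = torch.distributions.Categorical(logits=logits).log_prob(torch.stack(actions))
    loss = -(log_p * returns).mean()
    p_opt.zero_grad()
    loss.backward()
    p_opt.step()

# Toy usage with random data, just to show the call pattern.
for _ in range(40):
    replay.append((torch.randn(STATE_DIM), torch.tensor(0), torch.tensor(1.0),
                   torch.randn(STATE_DIM), torch.tensor(0.0)))
dqn_update()
reinforce_update([torch.randn(STATE_DIM)] * 3,
                 [torch.tensor(1), torch.tensor(2), torch.tensor(3)],
                 [-1.0, -2.0, -0.5])
```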
Experiments were conducted on CVRP instances with 20 and 50 customers, benchmarking against a greedy algorithm and a state-of-the-art attention-based solver. Both RL methods outperformed the greedy baseline, with REINFORCE achieving the better results, demonstrating its effectiveness in handling the exploration-exploitation trade-off. A pseudo actor-critic approach that combines the strengths of both methods, and inference-time ensembling of multiple RL networks, further improved performance, achieving tour costs within 12% and 7.9%, respectively, of the attention-based model on VRP-20.
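A simple way to realise the inference-time ensembling is to decode the same instance with every trained network and keep the cheapest route. The sketch below assumes a hypothetical rollout_fn(policy, instance) helper returning a route and its cost; it illustrates the idea rather than the exact procedure used here.

```python
import torch

def ensemble_route(instance, policies, rollout_fn):
    """Inference-time ensembling: decode one CVRP instance with each trained
    network and keep the cheapest route found. `rollout_fn` is a hypothetical
    helper returning (route, cost) for a given policy and instance."""
    best_route, best_cost = None, float("inf")
    with torch.no_grad():                      # pure inference, no gradients
        for policy in policies:
            route, cost = rollout_fn(policy, instance)
            if cost < best_cost:
                best_route, best_cost = route, cost
    return best_route, best_cost
```

Because each network explores the solution space differently, taking the per-instance minimum can only match or improve the cost of any single network at inference time.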
This work highlights the potential of RL for high-dimensional CVRP challenges, including generalizability and policy optimization. Future directions include integrating attention mechanisms to enhance model performance and exploring training-time ensembling to reduce variance. This research establishes RL as a scalable and efficient alternative for complex routing problems, advancing the field toward practical, real-world applications.