PyTSC: A Unified Platform for Multi-Agent Reinforcement Learning in Traffic Signal Control

Rohit Bokade Department of Mechanical and Industrial Engineering, Northeastern University, Boston, MA 02115, USA Xiaoning Jin Department of Mechanical and Industrial Engineering, Northeastern University, Boston, MA 02115, USA
(October 23, 2024)
Abstract

Multi-Agent Reinforcement Learning (MARL) presents a promising approach for addressing the complexity of Traffic Signal Control (TSC) in urban environments. However, existing platforms for MARL-based TSC research face challenges such as slow simulation speeds and convoluted, difficult-to-maintain codebases. To address these limitations, we introduce PyTSC, a robust and flexible simulation environment that facilitates the training and evaluation of MARL algorithms for TSC. PyTSC integrates multiple simulators, such as SUMO and CityFlow, and offers a streamlined API, empowering researchers to explore a broad spectrum of MARL approaches efficiently. PyTSC accelerates experimentation and provides new opportunities for advancing intelligent traffic management systems in real-world applications.

1 Introduction

Effective Traffic Signal Control (TSC) is fundamental to urban traffic management, responsible for guiding the movement of vehicles through intersections by controlling traffic lights. The primary goals of TSC are to minimize traffic congestion, enhance traffic flow, and improve safety for both vehicles and pedestrians. Poor TSC optimization leads to increased congestion, fuel consumption, and pollution. Longer wait times at signals lead to increased fuel consumption, which not only exacerbates environmental issues through higher emissions but also results in economic losses due to delays. Moreover, inefficient TSC negatively impacts the quality of life in urban areas, contributing to increased noise and air pollution.

Multi-Agent Reinforcement Learning (MARL) offers a promising approach to tackling these TSC challenges by allowing multiple agents to collaborate within a shared environment. In fully cooperative settings, agents work toward a common goal by interacting with their environment and with one another and refine their actions based on the feedback from the environment. MARL’s versatility is demonstrated by its successful application in various domains [1]. For instance, in robotics, MARL has been used to coordinate multiple robots in tasks such as search and rescue operations [2, 3]. These successes highlight MARL’s potential to solve complex, multi-faceted problems, making it an ideal candidate for optimizing TSC in dynamic and unpredictable urban environments.

1.1 Challenges in Current MARL Research for TSC

The application of MARL to TSC has seen notable advancements [4, 5, 6, 7]. Two simulators, SUMO [8] and CityFlow [9], are widely recognized in this domain, and several open-source tools have been developed to leverage these platforms [10, 11, 12]. Recent efforts have aimed at merging the TSC environments of both simulators and unifying domain metrics, standardizing evaluation criteria and providing a consistent framework for problem formulation in TSC research [12]. Benchmarks tailored for specific datasets in SUMO have also been developed, supporting a diverse range of MARL algorithms [11].

Despite progress in applying Multi-Agent Reinforcement Learning (MARL) to Traffic Signal Control (TSC), the research community still lacks a unified, modular, and extensible platform that meets the needs of modern MARL algorithms. Most existing TSC simulators are tightly coupled to their frameworks and are not designed to support the flexible integration of advanced MARL methodologies, particularly frameworks like Centralized Training Decentralized Execution (CTDE), which have gained significant traction in recent years.

While simulators such as SUMO and CityFlow are widely used, their respective TSC libraries are neither optimized for CTDE-based architectures nor compatible with popular MARL libraries like EPyMARL [13] and MARLLib [14]. This lack of compatibility limits experimentation and the exploration of powerful frameworks that can balance centralized learning with decentralized execution, which is crucial for efficient and scalable traffic management in real-world systems.

Furthermore, the absence of a streamlined and modular design in existing tools makes them challenging to extend and adapt for new research directions. Researchers often spend considerable time grappling with integration and code maintenance, rather than focusing on the development and testing of novel MARL algorithms. This inefficiency delays progress and discourages the widespread adoption of cutting-edge MARL techniques in TSC.

To address these gaps, we propose PyTSC, a library specifically designed to overcome these limitations. PyTSC offers a clean, modular, and extensible environment that seamlessly integrates with CTDE frameworks like EPyMARL and MARLLib, enabling researchers to quickly prototype, train, and evaluate MARL algorithms in a traffic control context. By providing a robust and flexible platform, PyTSC empowers the research community to explore novel TSC solutions, leading to faster experimentation, greater reproducibility, and ultimately, more effective traffic signal control strategies.

1.2 Contributions

This work introduces PyTSC, a flexible and efficient simulation environment designed to address these challenges in the MARL-TSC research domain. PyTSC fills a crucial gap in the TSC research ecosystem by providing a platform that simplifies integration, supports cross-simulator comparisons, and offers a clean interface for deploying advanced MARL methods like CTDE. The key contributions of this research are:

  1. 1.

    Compatibility with Multiple Simulators: PyTSC supports both SUMO and CityFlow simulators, offering a consistent API. This design facilitates the integration of additional simulators in the future.

  2. 2.

    Optimized for Speed: The Retriever module in PyTSC efficiently gathers required information from the simulator after each time step, minimizing simulator queries and enhancing simulation speeds111https://sumo.dlr.de/docs/FAQ.html#traci.

  3. 3.

    Leveraging Graphical MARL Techniques: The NetworkParser module processes network files, extracting data crucial for graphical methods MARL algorithm development (e.g. adjacency matrix, centrality measures, etc.), facilitating the use of advanced graph neural networks in TSC.

  4. 4.

    Dataset Aggregation: PyTSC aggregates commonly used datasets in MARL-TSC. The environment also features modules like GridGenerator and TripGenerator for synthetic network testing and trip generation, respectively.

  5. 5.

    Unified MARL Formulation for TSC: While we recognize that we did not develop the Dec-POMDP and Networked MMDP formulations, we advocate for their use in TSC research. These formulations, which allow for decentralized control schemes, are comprehensive and well-suited for deep MARL techniques in TSC. By promoting these formulations, we aim to standardize nomenclature and problem formulations in the TSC field, facilitating clearer communication and collaboration among researchers.

  6. 6.

    Experiments with CTDE MARL Frameworks: As demo, we present a experiments with several of state-of-the-art MARL techniques, which follow Centralized Training and Decentralized Execution (CTDE) paradigm, using the EPymarl library[13].

In summary, PyTSC is not merely a tool but a significant advancement in TSC research. By integrating MARL into a well-structured environment, PyTSC has the potential to redefine Traffic Signal Control research, propelling both academic inquiry and practical applications.

2 The PyTSC Framework

Refer to caption
Figure 1: Overview of PyTSC Architecture

The PyTSC Framework stands as a meticulously designed structure, bridging the gap between traffic simulations and Multi-Agent Reinforcement Learning (MARL). The library is available at https://github.com/rbokade/pytsc.This strategic design ensures that researchers can delve into algorithmic intricacies without the overhead of integration complexities. The framework’s key attributes include:

Support for Diverse Simulator Backends:

PyTSC integrates a consistent API for two renowned simulator backends: SUMO [8] and CityFlow [9]. Simulator specific classes like ConfigParser, Retriever, Simulator, TrafficSignal allow for processing the input from the simulators into a common format which can then be used in the TrafficSignalNetwork environment class. This uniform interface not only maintains consistency across simulators but also offers a foundation for researchers to incorporate additional simulator backends as needed.

Customizable:

The modular framework of PyTSC serves as a comprehensive testbed for experimenting. Researchers can experiment with various traffic signal network settings by choosing from existing modules or extend them to suit their own needs. For example, the TLSFreePhaseSelectLogic and TLSRoundRobinPhaseSelectLogic allow users to select either adaptive phase selection or fix it to a round robin strategy. Users can also modify the information required by the MARL algorithm by simply extending BaseObservationSpace, BaseActionSpace, BaseRewardFunction according to their needs.

Optimized Performance:

The framework is optimized to gather essential metrics from the backend simulator in a single query after each simulation step. This streamlined approach minimizes redundant queries, leading to faster simulation processes. For SUMO simulator backends, it uses subscriptions https://sumo.dlr.de/docs/FAQ.html#traci to further speed up simulations.

Integration of Static Network Features:

Before simulation commencement, the NetworkParser module parses all associated network files of the chosen simulator. This provides researchers with added network insights, such as centrality metrics or adjacency matrices, which can be pivotal for enhancing MARL algorithm performance in Traffic Signal Control.

Compatibility with MARL Training Frameworks:

PyTSC introduces a well-defined API under the module TrafficSignalNetwork that integrates seamlessly with established MARL libraries, such as rllib [15] and pymarl [16], facilitating their application in TSC.

3 Experiments

3.1 Traffic Signal Network Scenarios

We have curated 10 open source scenarios most commonly used by researchers while applying MARL techniques to TSC. These scenarios, widely adopted by researchers in the field, encompass both synthetic and real-world traffic networks and are compatible with both CityFlow and SUMO simulators. A detailed overview of these scenarios is presented in Table 1. footnotetext: https://github.com/cts198859/deeprl_network/ footnotetext: https://github.com/LucasAlegre/sumo-rl footnotetext: https://github.com/traffic-signal-control/sample-code/ footnotetext: https://github.com/Pi-Star-Lab/RESCO/

Refer to caption
Refer to caption
Figure 2: 2×2222\times 22 × 2 Grid SUMO (left) CityFlow (right)
Refer to caption
Refer to caption
Figure 3: 3×3333\times 33 × 3 Grid SUMO (left) CityFlow (right)
Refer to caption
(a) Cologne (3 traffic signals)
Refer to caption
(b) Ingolstadt
Refer to caption
(c) Pasubio
Refer to caption
(d) Cologne (8 traffic signals)
Figure 4: Real-world environments for SUMO
Refer to caption
(a) Jinan 3×4343\times 43 × 4
Refer to caption
(b) Hangzhou 4×4444\times 44 × 4
Figure 5: Real-world environments for CityFlow
Simulator Scenario Network Type Agent Type Total agents
SUMO 2×2222\times 22 × 2 grid Synthetic Homogeneous 4
3×3333\times 33 × 3 grid Synthetic Homogeneous 9
Cologne 3 Real-world Heterogeneous 3
Ingolstadt 7 Real-world Heterogeneous 7
Cologne 8 Real-world Heterogeneous 8
Pasubio Real-world Heterogeneous 8
CityFlow 2×2222\times 22 × 2 grid Synthetic Homogeneous 4
3×3333\times 33 × 3 grid Synthetic Homogeneous 9
Jinan (3×4343\times 43 × 4 grid) Real-world Homogeneous 12
Hangzhou (4×4444\times 44 × 4 grid) Real-world Homogeneous 16
Table 1: Scenarios included in PyTSC

The scenarios in Table 1 offer a mix of synthetic grid networks and real-world traffic settings. Synthetic grids, from 2×2222\times 22 × 2 to 3×3333\times 33 × 3, provide a controlled environment with homogeneous agents for testing MARL techniques. In contrast, real-world scenarios from cities like Cologne, Ingolstadt, and Monaco present urban complexities with heterogeneous agents, ranging from 3333 to 16161616. This diversity ensures thorough testing of MARL across various scales. Additionally, the scenarios’ compatibility with both SUMO and CityFlow allows researchers flexibility in simulation choices. Overall, these scenarios form a robust benchmarking platform for MARL in TSC, covering diverse network types and complexities.

3.2 Traffic Signals as Agents

In MARL under fully-cooperative settings, agents learn to achieve a common goal or maximize their individual rewards through interaction with the environment and each other. In the context of TSC, MARL provides a framework for developing traffic signal control strategies that can adapt to changing traffic patterns and optimize flow. Traffic signals are modeled as agents whose goal is to choose control the traffic lights to minimize congestion throughout the traffic signal network.

3.2.1 Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs)

Dec-POMDPs model multi-agent systems [17] where agents interact in a decentralized way under limited visibility. This is pertinent for analyzing traffic signal control.

Definition A Dec-POMDP is defined as a tuple N,S,{Ai},{Oi},T,{Ωi},R𝑁𝑆subscript𝐴𝑖subscript𝑂𝑖𝑇subscriptΩ𝑖𝑅\langle N,S,\{A_{i}\},\{O_{i}\},T,\{\Omega_{i}\},R\rangle⟨ italic_N , italic_S , { italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } , { italic_O start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } , italic_T , { roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } , italic_R ⟩, where:

  • N𝑁Nitalic_N: A finite set of n𝑛nitalic_n agents, N{1,,n}𝑁1𝑛N\equiv\{1,\ldots,n\}italic_N ≡ { 1 , … , italic_n }.

  • S𝑆Sitalic_S: A finite set of states that describe the global state of the environment.

  • {Ai}subscript𝐴𝑖\{A_{i}\}{ italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT }: A finite set of actions for each agent i𝑖iitalic_i, with the joint action space A{A1××An}𝐴subscript𝐴1subscript𝐴𝑛A\equiv\{A_{1}\times\ldots\times A_{n}\}italic_A ≡ { italic_A start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT × … × italic_A start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT }.

  • {Oi}subscript𝑂𝑖\{O_{i}\}{ italic_O start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT }: A finite set of observations for each agent i𝑖iitalic_i, with the joint observation space O{O1××On}𝑂subscript𝑂1subscript𝑂𝑛O\equiv\{O_{1}\times\ldots\times O_{n}\}italic_O ≡ { italic_O start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT × … × italic_O start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT }.

  • T:S×AΔ(S):𝑇maps-to𝑆𝐴Δ𝑆T:S\times A\mapsto\Delta(S)italic_T : italic_S × italic_A ↦ roman_Δ ( italic_S ): A state transition function that maps the current state and joint action to a probability distribution over next states.

  • {Ωi:S×AΔ(Oi)}conditional-setsubscriptΩ𝑖maps-to𝑆𝐴Δsubscript𝑂𝑖\{\Omega_{i}:S\times A\mapsto\Delta(O_{i})\}{ roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT : italic_S × italic_A ↦ roman_Δ ( italic_O start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) }: An observation function for each agent, mapping the current state and joint action to a probability distribution over individual observations.

  • R:S×A:𝑅maps-to𝑆𝐴R:S\times A\mapsto\mathbb{R}italic_R : italic_S × italic_A ↦ blackboard_R: A reward function that maps a global state and joint action to a real-valued reward.

By capturing the decentralized nature and partial observability inherent in urban traffic systems, Dec-POMDPs offer a foundation for designing MARL frameworks that can lead to more responsive and efficient traffic signal control.

3.2.2 Observation representation

Each traffic signal has a limited range of vision of 50 meters, within which it can obtain information related to the traffic flow. This is equivalent to the sensory information that can be obtained from practical common sensors. The observation for each traffic signal consists of: the number of vehicles {nl}l=1Lisuperscriptsubscriptsubscript𝑛𝑙𝑙1subscript𝐿𝑖\{n_{l}\}_{l=1}^{L_{i}}{ italic_n start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_l = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT, the average normalized speed of the vehicles {sl}l=1Lisuperscriptsubscriptsubscript𝑠𝑙𝑙1subscript𝐿𝑖\{s_{l}\}_{l=1}^{L_{i}}{ italic_s start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_l = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT, the number of halted vehicles (queue lengths) {ql}l=1Lisuperscriptsubscriptsubscript𝑞𝑙𝑙1subscript𝐿𝑖\{q_{l}\}_{l=1}^{L_{i}}{ italic_q start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_l = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT, and the current phaseID of the traffic signal, where LiLsubscript𝐿𝑖𝐿L_{i}\in Litalic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ italic_L are the incoming lanes for a traffic signal i𝑖iitalic_i and L𝐿Litalic_L is a set of all the lanes in the network.

3.2.3 Action Representation

For each traffic signal i𝑖iitalic_i, we define its action aisubscript𝑎𝑖a_{i}italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT as choosing one green phase from a list of available phases. A traffic signal can select any green phase from its list or keep its current one, but it must then follow the next yellow phase, which is enforced by the environment. The action selection interval and the yellow phases are fixed for a duration of 5 simulation seconds.

3.2.4 Reward

Various metrics are used for rewards in traffic signal control settings. In our study, we chose queue length qlsubscript𝑞𝑙q_{l}italic_q start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT as the performance metric of the traffic signal controller due to its simplistic nature and its property of representing an instantaneous feedback signal. We define the objective function as minimizing the number of vehicles stopped throughout the network

where rtsubscript𝑟𝑡r_{t}\in\mathbb{R}italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ blackboard_R is the global reward and lL𝑙𝐿l\in Litalic_l ∈ italic_L represents the lanes in the network.

3.3 MARL Frameworks

For benchmarking CTDE algorithms in TSC, we employ EPymarl, an extension of the widely recognized Pymarl library. EPymarl encompasses a broad spectrum of algorithms under the reinforcement learning paradigms of Q-learning, actor-critic, and policy gradient methods. Our selection focuses on the most prevalent MARL frameworks, as detailed in Table 2.

footnotetext: https://github.com/uoe-agents/epymarl/footnotetext: https://github.com/oxwhirl/pymarl
Algorithm Centralized training Off-/On-policy Value-based Policy-based
IQL
IA2C
VDN
QMIX
MAA2C
Table 2: MARL Algorithms for Benchmarking

Analyzing the algorithms presented in Table 2, we observe a diverse range of MARL techniques tailored for different problem settings. IQL, for instance, is a decentralized, value-based method that operates off-policy. In contrast, IA2C is both value and policy-based, functioning on-policy without centralized training. VDN and QMIX, while both being value-based, differ in their approach to centralized training, with QMIX employing it. MAA2C stands out as a versatile algorithm, incorporating both value and policy-based methods, operating on-policy, and utilizing centralized training. This selection ensures a comprehensive evaluation of MARL techniques across various training paradigms, policy orientations, and value determinations. By benchmarking these algorithms, we aim to provide insights into their applicability, strengths, and limitations within the context of TSC.

3.4 Evaluation Protocols

The performance of the Multi-Agent Reinforcement Learning (MARL) algorithms was evaluated through a carefully designed training and testing process. Each episode during training was constrained to 360 simulation seconds, which corresponds to 72 time steps. This time frame was selected to ensure a balance between simulation length and computational efficiency. Therefore, one simulation hour consisted of 10 episodes, which allowed the MARL algorithms to interact with the environment multiple times within a relatively short period (see Table 3 for a detailed breakdown of time steps and simulation periods).

Metric Time step Simulation seconds Simulation hours
Step 1 5 0.083
Episode limit 72 360 0.10
Training (length) 4.32M 21.6M 6000
Test (interval) 14400 7200 2
Test (length) 720 3600 1
Table 3: Breakdown of Hyperparameters Used for Benchmarking

In total, the algorithms were trained for 4.32 million time steps, which corresponds to 36,000 episodes. This extensive training schedule, equivalent to 6,000 hours of simulated traffic, provided ample opportunity for the agents to learn optimal strategies for traffic signal control. To assess the generalization capabilities of the trained models, evaluation was performed every 200 episodes. For each evaluation cycle, the models were tested over 10 episodes to ensure that performance was not merely due to overfitting but could be replicated under various conditions (see Table 3 for details on testing intervals and length).

To optimize the MARL algorithms, hyperparameter tuning was conducted on smaller synthetic grid networks, such as 2×2 and 3×3 grids. These simpler environments allowed for more efficient exploration of different hyperparameter combinations, ensuring that the best configurations were chosen before applying the algorithms to larger and more complex networks. The chosen hyperparameters for benchmarking are shown in Table 4.

Hyperparameter Value
Buffer size (episodes) 5000
Hidden dimension 64
Learning rate 0.0005
Evaluation epsilon 0.0
Epsilon anneal (steps) 50000/100000
Target update (episodes) 200
Entropy coeff. 0.01
Table 4: Hyperparameters

In addition to these learning metrics, several TSC-specific metrics were employed to capture the broader impact of the traffic signal control system. These included the total number of vehicles queued across the network, the average travel time, average occupancy, average speed, average delay, and average wait time.

All experiments were conducted on Northeastern University’s High-Performance Computing (HPC) Discovery cluster. Each experiment utilized 8 single-core CPUs, with 4 parallel environments running simultaneously to gather data efficiently. This setup allowed for significant computational power and parallelization, enabling the MARL algorithms to process multiple simulations concurrently and reduce the overall time required for training and evaluation.

4 Results

The performance of various MARL algorithms was evaluated across different environments using SUMO and CityFlow simulators. These environments ranged from simple synthetic grid networks to more complex real-world networks, such as Jinan, Hangzhou, Pasubio, and Cologne. The results highlight several key factors that affect algorithmic performance, including network topology and the simulation platform.

4.1 Performance on Synthetic Grid Networks

[Uncaptioned image]
[Uncaptioned image]

In synthetic grid networks (2x2 and 3x3), centralized algorithms like QMIX and MAA2C outperform decentralized approaches, particularly in the SUMO environment. As shown in Figure 4, MAA2C and QMIX exhibit higher rewards over time, indicating their ability to effectively manage traffic in complex scenarios where dynamic routing and multiple traffic flows are present. SUMO’s dynamic routing features allow centralized algorithms to optimize traffic signal timings more effectively, leading to reduced delays and shorter queue lengths.

Conversely, in CityFlow, where vehicle dynamics are simpler and there is no dynamic routing, the performance gap between centralized and decentralized methods narrows. All algorithms perform similarly, reflecting the reduced need for complex coordination in this simulation environment. This suggests that CityFlow, while computationally efficient, may not capture the same level of traffic dynamics that allow centralized algorithms to excel.

Table 5: Mean and Standard Error of Metrics for Various Controllers Across Scenarios
Metric MARL Controllers Rule-based Controllers
IQL IA2C VDN QMIX MAA2C Fixed Greedy Max Pres. SOTL
2×2222\times 22 × 2 SUMO
Queue 486.47 588.11 475.32 276.69 333.48 464.87 556.88 556.88 556.88
Delay 0.583 0.586 0.586 0.579 0.564 0.59 0.67 0.67 0.67
Travel Time 280.34 329.45 274.92 188.78 239.00 325.02 261.45 261.45 261.45
3×3333\times 33 × 3 SUMO
Queue 500.98 411.47 520.54 361.57 449.05 560.08 656.85 656.85 656.85
Delay 0.585 0.556 0.588 0.534 0.554 0.61 0.62 0.62 0.62
Travel Time 140.31 139.92 137.19 146.76 143.27 180.52 166.68 166.68 166.68
2×2222\times 22 × 2 CityFlow
Queue 222.72 201.71 239.34 224.97 205.05 314.08 281.75 261.41 377.43
Delay 0.606 0.600 0.611 0.605 0.598 0.6603 0.6466 0.6362 0.6993
Travel Time 223.13 191.88 239.19 227.97 209.88 325.02 219.48 203.31 437.33
3×3333\times 33 × 3 CityFlow
Queue 445.10 396.93 520.17 484.57 404.01 621.36 557.96 458.11 738.95
Delay 0.65 0.63 0.67 0.68 0.64 0.69 0.66 0.64 0.74
Travel Time 285.17 249.42 312.03 290.67 288.64 389.21 277.8 246.52 498.33

4.2 Impact of Real-World Topology on Algorithm Performance

The evaluation on real-world networks reveals important insights into how network topology influences algorithm performance.

[Uncaptioned image]
[Uncaptioned image]
Jinan and Hangzhou:

In these larger, more congested networks, centralized algorithms like QMIX and MAA2C outperform decentralized approaches such as IQL and IA2C. The results highlight the importance of centralized training across traffic signals is critical for managing complex flows of vehicles in dense urban environments. The superior performance of centralized methods in these scenarios reflects their ability to adapt traffic signals in a coordinated manner, optimizing routes and reducing overall delay.

Table 6: Mean and Standard Error of Metrics for Various Controllers Across Scenarios
Metric MARL Controllers Rule-based Controllers
IQL IA2C VDN QMIX MAA2C Fixed Greedy Max Pres. SOTL
Jinan
Queue 278.55 248.49 269.14 231.56 221.79 413.55 210.17 228.2 663.39
Delay 0.479 0.467 0.476 0.469 0.470 0.54 0.47 0.49 0.53
Travel Time 320.51 311.91 323.85 308.67 307.00 354.96 287.33 294.97 425.01
Hangzhou
Queue 286.11 203.24 234.48 208.95 211.18 301.93 171.16 177.93 354.57
Delay 0.585 0.548 0.565 0.554 0.559 0.63 0.56 0.58 0.61
Travel Time 344.72 315.13 327.71 319.03 319.11 356.65 292.77 296.73 364.87
[Uncaptioned image]
[Uncaptioned image]
Cologne (3 and 8 Traffic Signals):

The performance in the Cologne networks shows the influence of topology on algorithm effectiveness. In the simpler, 3-signal Cologne network (Figure LABEL:fig:cologne_network), decentralized algorithms perform competitively with centralized methods. The linear structure and reduced complexity of the network limit the need for extensive coordination, allowing decentralized approaches to function well. However, in the more complex 8-signal Cologne network (Figure LABEL:fig:cologne8_network), centralized methods like QMIX and MAA2C show clear advantages, managing traffic across the wider grid more effectively and leading to higher rewards and lower delays.

Ingolstadt and Pasubio:

In these irregular and sprawling networks, centralized methods outperform decentralized ones. The elongated structure of Ingolstadt (Figure LABEL:fig:ingolstadt_network) and the spread-out intersections in Pasubio (Figure LABEL:fig:pasubio_network) introduce challenges for traffic management, such as bottlenecks and uneven traffic distribution. Centralized algorithms, with their ability to coordinate across signals and optimize traffic flow holistically, significantly reduce queues and delays in these environments. In contrast, decentralized methods struggle to maintain consistent performance in such irregular topologies, resulting in higher travel times and delays.

Table 7: Mean and Standard Error of Metrics for Various Controllers Across Scenarios
Metric MARL Controllers Rule-based Controllers
IQL IA2C VDN QMIX MAA2C Fixed Greedy Max Pres. SOTL
Cologne 3
Queue 23.46 18.67 19.55 20.49 18.80 51.52 53.29 53.29 53.29
Delay 0.334 0.300 0.316 0.319 0.301 0.5 0.54 0.54 0.54
Travel Time 227.11 227.23 230.06 227.46 230.87 220.7 210.54 210.54 210.54
Cologne 8
Queue 19.95 18.32 20.77 17.35 17.39 41.71 73.87 73.87 73.87
Delay 0.190 0.185 0.184 0.171 0.183 0.26 0.3 0.3 0.3
Travel Time 356.59 356.23 356.08 356.96 356.52 358.14 339.59 339.59 339.59
Ingolstadt 7
Queue 18.45 17.46 20.35 17.38 19.19 107.06 51.8 50.56 44.16
Delay 0.184 0.169 0.190 0.173 0.177 0.42 0.37 0.36 0.33
Travel Time 192.82 193.35 193.04 193.03 192.74 174.66 183.37 184.47 186.09
Pasubio
Queue 335.29 340.11 307.60 297.79 322.94 304.82 330.26 330.26 330.26
Delay 0.35 0.33 0.30 0.29 0.30 0.34 0.39 0.39 0.39
Travel Time 303.67 310.88 322.29 321.03 327.31 326.0 292.09 292.09 292.09

4.3 Insights from Performance Metrics

Table 7 summarizes key performance metrics, such as queue lengths, delays, and travel times, across the evaluated networks. The results reveal several key trends:

  • Queue Length: Centralized algorithms generally achieve shorter queue lengths, particularly in more complex networks like Pasubio and Cologne 8, where coordination across intersections is critical for maintaining traffic flow.

  • Delays: Across all scenarios, centralized approaches reduce delays more effectively than decentralized ones. The difference is especially notable in SUMO environments, where dynamic routing allows centralized algorithms to adapt to real-time traffic conditions more effectively.

  • Travel Time: In networks with higher complexity, such as Jinan and Hangzhou, centralized approaches minimize travel time more effectively than decentralized methods. However, in simpler networks like CityFlow’s 2x2 grid, the travel times across all algorithms are comparable, suggesting that centralized coordination offers diminishing returns in less complex environments.

4.4 Topology-Dependent Performance

The topology of the traffic network plays a crucial role in determining the success of different MARL algorithms. In simpler, grid-like networks, such as the 3-signal Cologne setup, decentralized algorithms can perform on par with centralized methods. However, as network complexity increases, the advantages of centralized coordination become more pronounced. In networks like Pasubio and Ingolstadt, which feature irregular layouts and complex traffic flows, centralized approaches like QMIX and MAA2C outperform decentralized methods by a significant margin.

This suggests that the choice of algorithm should be informed by the specific characteristics of the traffic network. Centralized algorithms are more suitable for complex, irregular networks with high traffic density, while decentralized approaches may suffice in simpler, more uniform networks.

5 Conclusion

In this work, we introduced PyTSC, a versatile and extensible library designed to address the significant gaps in MARL-based Traffic Signal Control (TSC) research. By enabling compatibility with multiple simulators, such as SUMO and CityFlow, PyTSC provides a unified API that simplifies the development, testing, and benchmarking of Multi-Agent Reinforcement Learning (MARL) algorithms for TSC. Additionally, its optimized architecture facilitates faster simulation speeds, allowing researchers to focus on algorithmic innovations rather than technical integration.

PyTSC’s integration with modern CTDE (Centralized Training Decentralized Execution) frameworks, such as EPyMARL and MARLLib, allows researchers to prototype and evaluate advanced MARL techniques within a traffic control setting. This fills a critical gap in the existing research ecosystem, where tools are either too domain-specific or lack the flexibility required for seamless experimentation.

With its modular and extensible design, PyTSC establishes a foundation for more consistent, reproducible, and scalable MARL research in the realm of traffic signal control. This contributes directly to advancing smarter, more adaptive traffic management solutions. The performance evaluations across both synthetic and real-world traffic networks demonstrate the library’s effectiveness and versatility, underscoring its potential to drive innovation in TSC systems.

6 Future Work

Looking ahead, PyTSC opens several avenues for future development and exploration:

  • Incorporation of Additional MARL Algorithms: Future iterations of PyTSC will include policy-based MARL algorithms such as MADDPG, IPPO, and MAPPO, expanding its capabilities for more complex traffic control strategies.

  • Diverse Traffic Flow Generation: We aim to introduce more diverse traffic flow generation models, including both synthetic and real-world flow patterns, to better simulate the variety of traffic scenarios seen in urban environments.

  • Synthetic Network Generation Modules: Expanding the tools for generating synthetic traffic signal networks will enable researchers to test MARL algorithms in even more controlled and complex environments, beyond the existing benchmarks.

  • Tailored Neural Network Architectures: While the current version of PyTSC relies on out-of-the-box neural architectures, future updates will explore the integration of specialized models, such as Graph Neural Networks (GNNs), that can leverage the graphical structure of traffic networks to optimize decision-making.

  • Extending Simulator Compatibility: We also plan to integrate additional traffic simulators into the PyTSC framework, ensuring that the platform remains adaptable to emerging technologies and research needs in the field of intelligent transportation systems.

By continuously evolving PyTSC, we aim to provide a robust and dynamic platform that will foster innovation in MARL-based traffic signal control and further enhance the real-world applicability of these technologies.

References

  • [1] W. Du and S. Ding, “A survey on multi-agent deep reinforcement learning: from the perspective of challenges and applications,” Artificial Intelligence Review, vol. 54, pp. 3215–3238, 2021.
  • [2] T. T. Nguyen, N. D. Nguyen, and S. Nahavandi, “Deep reinforcement learning for multiagent systems: A review of challenges, solutions, and applications,” IEEE transactions on cybernetics, vol. 50, no. 9, pp. 3826–3839, 2020.
  • [3] L. Canese, G. C. Cardarilli, L. Di Nunzio, R. Fazzolari, D. Giardino, M. Re, and S. Spanò, “Multi-agent reinforcement learning: A review of challenges and applications,” Applied Sciences, vol. 11, no. 11, p. 4948, 2021.
  • [4] M. Noaeen, A. Naik, L. Goodman, J. Crebo, T. Abrar, Z. S. H. Abad, A. L. Bazzan, and B. Far, “Reinforcement learning in urban network traffic signal control: A systematic literature review,” Expert Systems with Applications, vol. 199, p. 116830, 2022.
  • [5] R. Chen, F. Fang, and N. Sadeh, “The real deal: A review of challenges and opportunities in moving reinforcement learning-based traffic signal control systems towards reality,” arXiv preprint arXiv:2206.11996, 2022.
  • [6] H. Wei, G. Zheng, V. Gayah, and Z. Li, “A survey on traffic signal control methods,” arXiv preprint arXiv:1904.08117, 2019.
  • [7] A. Haydari and Y. Yılmaz, “Deep reinforcement learning for intelligent transportation systems: A survey,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 1, pp. 11–32, 2020.
  • [8] P. A. Lopez, M. Behrisch, L. Bieker-Walz, J. Erdmann, Y.-P. Flötteröd, R. Hilbrich, L. Lücken, J. Rummel, P. Wagner, and E. Wießner, “Microscopic traffic simulation using sumo,” in The 21st IEEE International Conference on Intelligent Transportation Systems, IEEE, 2018.
  • [9] H. Zhang, S. Feng, C. Liu, Y. Ding, Y. Zhu, Z. Zhou, W. Zhang, Y. Yu, H. Jin, and Z. Li, “Cityflow: A multi-agent reinforcement learning environment for large scale city traffic scenario,” in The world wide web conference, pp. 3620–3624, 2019.
  • [10] L. N. Alegre, “SUMO-RL.” https://github.com/LucasAlegre/sumo-rl, 2019.
  • [11] J. Ault and G. Sharon, “Reinforcement learning benchmarks for traffic signal control,” in Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021.
  • [12] H. Mei, X. Lei, L. Da, B. Shi, and H. Wei, “Libsignal: An open library for traffic signal control,” arXiv preprint arXiv:2211.10649, 2022.
  • [13] G. Papoudakis, F. Christianos, L. Schäfer, and S. V. Albrecht, “Benchmarking multi-agent deep reinforcement learning algorithms in cooperative tasks,” arXiv preprint arXiv:2006.07869, 2020.
  • [14] S. Hu, Y. Zhong, M. Gao, W. Wang, H. Dong, Z. Li, X. Liang, Y. Yang, and X. Chang, “Marllib: Extending rllib for multi-agent reinforcement learning,” 2022.
  • [15] E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, K. Goldberg, J. E. Gonzalez, M. I. Jordan, and I. Stoica, “RLlib: Abstractions for distributed reinforcement learning,” in International Conference on Machine Learning (ICML), 2018.
  • [16] M. Samvelyan, T. Rashid, C. S. De Witt, G. Farquhar, N. Nardelli, T. G. Rudner, C.-M. Hung, P. H. Torr, J. Foerster, and S. Whiteson, “The starcraft multi-agent challenge,” arXiv preprint arXiv:1902.04043, 2019.
  • [17] F. A. Oliehoek and C. Amato, A concise introduction to decentralized POMDPs. Springer International Publishing, Cham, 2016.