Reinforcement Learning (RL)
Published on: 09 October 2025
Tags: #reinforcement-learning #rl #ai
1. The Core Reinforcement Learning Loop
```mermaid
graph TD
    subgraph Agent
        Policy
        Action
    end
    subgraph Environment
        StateReward["State & Reward"]
    end

    %% Define the correct flow of the loop
    StateReward -- Observes --> Policy
    Policy -- Decides --> Action
    Action -- Affects --> StateReward

    %% Add styling to match the original image
    style Agent fill:#f9f,stroke:#333,stroke-width:2px
    style Environment fill:#ccf,stroke:#333,stroke-width:2px
```
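In code, this loop is plain repeated interaction: the policy maps the observed state to an action, and the environment answers with the next state and a reward. A minimal sketch, assuming a Gymnasium-style environment is available locally; the policy here is just a callable supplied by the caller:

```python
import gymnasium as gym

def run_episode(env, policy):
    """Roll out one episode: observe state, act, receive reward, repeat."""
    state, _ = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)                                        # Agent: policy decides the action
        state, reward, terminated, truncated, _ = env.step(action)    # Environment: next state & reward
        total_reward += reward
        done = terminated or truncated
    return total_reward

# Example: a random policy on CartPole (illustrative choice of environment).
env = gym.make("CartPole-v1")
print(run_episode(env, policy=lambda s: env.action_space.sample()))
```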
2. Markov Decision Process (MDP)
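An MDP formalizes the interaction loop above as a tuple (S, A, P, R, γ): a set of states S, a set of actions A, transition probabilities P(s' | s, a), a reward function R(s, a), and a discount factor γ ∈ [0, 1). The Markov property means the next state depends only on the current state and action, not on the earlier history. The agent looks for a policy π(a | s) that maximizes the expected discounted return; in the standard formulation, the return and the Bellman expectation equation for the state-value function are:

$$
G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},
\qquad
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ R_{t+1} + \gamma\, V^{\pi}(S_{t+1}) \mid S_t = s \right].
$$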
3. Taxonomy of RL Algorithms
```mermaid
graph TD
    RL(Reinforcement Learning)
    VB(Value-Based)
    PB(Policy-Based)
    AC(Actor-Critic)

    RL --> VB
    RL --> PB
    RL --> AC

    subgraph sg_vb [Value-Based]
        direction LR
        QL(Q-Learning)
        DQN(Deep Q-Networks)
    end
    subgraph sg_pb [Policy-Based]
        direction LR
        PG(Policy Gradient)
        TRPO(TRPO)
        PPO(PPO)
    end
    subgraph sg_ac [Actor-Critic]
        direction LR
        DDPG(DDPG)
        SAC(Soft Actor-Critic)
    end

    VB --> QL
    VB --> DQN
    PB --> PG
    PG --> TRPO & PPO
    AC --> DDPG
    AC --> SAC
```
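As a concrete instance of the value-based branch, a minimal tabular Q-learning sketch is shown below. The environment interface (`num_states`, `num_actions`, `reset()`, `step()`) and the hyperparameters are assumptions for illustration only:

```python
import numpy as np

def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning; assumes a small discrete env with
    env.reset() -> state and env.step(a) -> (next_state, reward, done)."""
    Q = np.zeros((env.num_states, env.num_actions))
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection
            if np.random.rand() < epsilon:
                action = np.random.randint(env.num_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            # Q-learning update: bootstrap from the greedy next-state value
            td_target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
            Q[state, action] += alpha * (td_target - Q[state, action])
            state = next_state
    return Q
```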
4. Actor-Critic Architecture
```mermaid
graph LR
    %% 1. Define subgraphs and their contents
    subgraph sg_interaction [Interaction]
        Environment
    end
    subgraph sg_agent [Agent]
        %% No internal direction needed; it will inherit LR
        Critic("Critic<br/>Value Function Q or V")
        Actor("Actor<br/>Policy π")
    end

    %% 2. Define all connections between components
    Environment -- State --> Actor
    Actor -- Action --> Environment
    Environment -- "State, Reward" --> Critic
    Critic -- "TD Error / Advantage" --> Actor

    %% 3. Define and apply styling for the subgraphs
    classDef pinkBox fill:#fce8f5,stroke:#333,stroke-width:2px;
    classDef yellowBox fill:#ffffe0,stroke:#a8a8a8,stroke-width:1px;
    class sg_agent pinkBox;
    class sg_interaction yellowBox;
```
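A minimal sketch of one actor-critic update step, assuming small PyTorch networks for a 4-dimensional observation and 2 discrete actions; the network sizes, learning rates, and function names are illustrative, and the one-step TD error stands in for the advantage:

```python
import torch
import torch.nn as nn

# Illustrative networks for a toy discrete-action task (sizes are assumptions).
actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))   # logits over actions
critic = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))  # state value V(s)
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def actor_critic_step(state, action, reward, next_state, done, gamma=0.99):
    """One-step TD actor-critic update; `action` is an int, `done` is 1.0 or 0.0."""
    state = torch.as_tensor(state, dtype=torch.float32)
    next_state = torch.as_tensor(next_state, dtype=torch.float32)

    # Critic: regress V(s) toward the one-step TD target r + gamma * V(s').
    with torch.no_grad():
        td_target = reward + gamma * (1.0 - done) * critic(next_state)
    value = critic(state)
    td_error = td_target - value
    critic_loss = td_error.pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: raise log-probability of the taken action, weighted by the TD error.
    log_probs = torch.log_softmax(actor(state), dim=-1)
    actor_loss = -(log_probs[action] * td_error.detach()).squeeze()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```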
5. Deep Q-Network (DQN) Architecture
```mermaid
graph LR
    %% 1. Define the subgraphs and their internal nodes
    subgraph sg_loop [Loop]
        Env(Environment)
    end
    subgraph sg_agent [DQN Agent]
        QNet(Q-Network)
        TargetNet(Target Network)
        ReplayBuffer[(Replay Buffer)]
    end

    %% 2. Define all connections between the components
    Env -- State --> QNet
    QNet -- Action --> Env
    Env -- "(s, a, r, s')" --> ReplayBuffer
    ReplayBuffer -- "Sample Batch" --> QNet
    TargetNet -- "Provides Target Q-value" --> QNet
    QNet -- "Periodically copies weights" --> TargetNet

    %% 3. Define and apply styling for the subgraphs
    classDef yellowBox fill:#ffffe0,stroke:#a8a8a8,stroke-width:1px;
    class sg_loop,sg_agent yellowBox;

    %% 4. Style specific links based on their definition order
    %% Link 4 (5th link defined) is the TargetNet -> QNet connection
    %% Link 5 (6th link defined) is the QNet -> TargetNet connection
    linkStyle 4 stroke:blue,stroke-width:2px;
    linkStyle 5 stroke:red,stroke-width:2px,stroke-dasharray:5 5;
```
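A compressed Python sketch of the same components, assuming the same toy observation/action sizes as above; buffer capacity, batch size, and the target-sync schedule are illustrative, not prescribed values:

```python
import random
from collections import deque

import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())   # start with identical weights
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay_buffer = deque(maxlen=100_000)             # stores (s, a, r, s', done) tuples

def train_step(batch_size=32, gamma=0.99):
    """Sample a batch from the replay buffer and do one TD update of the Q-network."""
    batch = random.sample(replay_buffer, batch_size)
    s, a, r, s2, done = map(torch.tensor, zip(*batch))
    s, s2 = s.float(), s2.float()

    # Target Q-values come from the frozen target network.
    with torch.no_grad():
        target = r.float() + gamma * (1 - done.float()) * target_net(s2).max(dim=1).values

    q_values = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_values, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    """Periodically copy Q-network weights into the target network."""
    target_net.load_state_dict(q_net.state_dict())
```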
6. Abbreviations
- DDPG: Deep Deterministic Policy Gradient
- TRPO: Trust Region Policy Optimization
- PPO: Proximal Policy Optimization
- TD Error: Temporal Difference Error
- DQN: Deep Q-Network