core.agent

The agent module contains implementations of DQN and PPO agents.

Since models are trained on the server, all server agents have to support serializing their parameters into a format suitable for uploading them to the Redis database. To reduce bandwidth, agents can choose which parameters to serialize and which to keep local.
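The idea of serializing only a subset of the parameters can be sketched as follows. This is a minimal illustration, not the module's actual serialization format: the helper names `serialize_params` and `deserialize_params`, the use of `pickle`, and the parameter names are all assumptions, and plain lists stand in for tensors.

```python
import pickle


def serialize_params(state: dict, exclude_prefixes: tuple = ()) -> bytes:
    """Serialize a parameter dictionary, skipping keys with an excluded prefix."""
    # Keep only the entries a client needs for inference.
    kept = {k: v for k, v in state.items() if not k.startswith(exclude_prefixes)}
    return pickle.dumps(kept)


def deserialize_params(payload: bytes) -> dict:
    """Restore a parameter dictionary from its serialized form."""
    return pickle.loads(payload)


# Hypothetical parameter names; plain lists stand in for tensors.
state = {
    "dqn.weight": [0.1, 0.2],
    "dqn.bias": [0.0],
    "target_dqn.weight": [0.3, 0.4],
}
payload = serialize_params(state, exclude_prefixes=("target_dqn.",))
restored = deserialize_params(payload)
```

The resulting bytes object could be stored under a key in Redis. Because `pickle` can execute code on load, a sketch like this is only appropriate when the payload comes from a trusted server.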

class soulsai.core.agent.Agent(device: device = device(type='cpu'))

Base class for agents.

All agents should inherit from this class. Agents are Modules to allow for easy serialization and deserialization of their parameters. The model ID is used to keep track of the current version of the model. The update callback can be used to reset noisy networks after an update.

update_callback()

Update callback for networks with special requirements.

client_state_dict() → dict

Get the state dictionary of the agent.

By default, the state dictionary is the same as the state dictionary of the module. Depending on the inference requirements, the state dictionary can be modified to exclude unnecessary parameters.

Returns:

The state dictionary of the agent.

class soulsai.core.agent.DQNAgent(network_type: str, network_kwargs: dict, lr: float, gamma: float, multistep: int, grad_clip: float, q_clip: float, device: device)

Deep Q-learning agent class for training and sharing Q networks.

The agent uses a double Q-learning algorithm, where two Q networks are trained at the same time. In each update, one network is assigned to estimate the future state value for the TD error, and the other to estimate the current value. Only the network estimating the current value is updated.

train(batch: TensorDict, mask: Mask | None = None) → TensorDict

Train the agent with double DQN.

Calculates the TD error between the predictions of the train network and a target built from the data and a Q(s_{t+1}, a) estimate from the estimation network, then takes an optimization step for the train network. dqn1 and dqn2 are randomly assigned the roles of estimation and train network.

Parameters:
  • batch – A TensorDict training batch containing observations, actions, rewards etc.

  • mask – Optional mask for the value network.

Returns:

The batch, optionally with additional info from the training.
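The target computation described above can be sketched in plain Python. This is an illustration under assumptions, not the agent's implementation: it assumes the train network selects the greedy next action and the estimation network evaluates it (as in standard double Q-learning), that `rewards` already hold the discounted multistep return, and that the function name `double_q_targets` and all inputs are hypothetical.

```python
import random


def double_q_targets(rewards, terminated, q_train_next, q_est_next, gamma, multistep=1):
    """Compute TD targets r + gamma^n * Q_est(s', argmax_a Q_train(s', a))."""
    targets = []
    for r, done, q_train, q_est in zip(rewards, terminated, q_train_next, q_est_next):
        # Assumption: the train network selects the greedy next action ...
        best_action = max(range(len(q_train)), key=q_train.__getitem__)
        # ... and the estimation network evaluates it. No bootstrap after termination.
        bootstrap = 0.0 if done else q_est[best_action]
        targets.append(r + gamma**multistep * bootstrap)
    return targets


# Randomly assign the roles of train and estimation network for this update.
train_net, estimation_net = random.sample(["dqn1", "dqn2"], k=2)

targets = double_q_targets(
    rewards=[1.0, 0.5],
    terminated=[False, True],
    q_train_next=[[0.2, 0.8], [0.1, 0.3]],
    q_est_next=[[0.5, 0.4], [0.9, 0.7]],
    gamma=0.9,
)
```

The TD error for each sample is then the difference between its target and the train network's prediction for the action that was actually taken.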

update_callback()

Reset noisy networks after an update.

class soulsai.core.agent.DistributionalDQNAgent(network_type: str, network_kwargs: dict, lr: float, gamma: float, multistep: int, grad_clip: float, q_clip: float, tau: float, device: device)

Quantile regression (QR) DQN agent.

train(batch: TensorDict, mask: Mask | None = None) → TensorDict

Train the agent with quantile regression DQN.

Calculates the TD error between the predictions of the main network and a target built from the data and a Q(s_{t+1}, a) estimate from the target network, then takes an optimization step for the main network. The action for the next state is chosen by the main network, as in double DQN.

Parameters:
  • batch – A TensorDict training batch containing observations, actions, rewards etc.

  • mask – Optional mask for the value network.

Returns:

The batch, optionally with additional info from the training.
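For readers unfamiliar with quantile regression DQN, the loss it typically minimizes is the quantile Huber loss between predicted and target quantiles. The sketch below shows that loss for a single state-action pair in plain Python. It follows the standard QR-DQN formulation and is an assumption about what the agent computes, not a copy of its code; the function names and the `kappa` threshold are hypothetical, and the quantile levels used here are unrelated to the `tau` constructor argument.

```python
def huber(u, kappa=1.0):
    """Huber loss: quadratic near zero, linear beyond kappa."""
    a = abs(u)
    return 0.5 * u * u if a <= kappa else kappa * (a - 0.5 * kappa)


def quantile_huber_loss(pred_quantiles, target_quantiles, kappa=1.0):
    """Quantile Huber loss, summed over predictions and averaged over targets."""
    n = len(pred_quantiles)
    loss = 0.0
    for i, theta in enumerate(pred_quantiles):
        # Midpoint quantile level of the i-th predicted quantile.
        quantile_level = (2 * i + 1) / (2 * n)
        for target in target_quantiles:
            u = target - theta
            # Asymmetric weight: the two error signs are penalized differently.
            weight = abs(quantile_level - (1.0 if u < 0 else 0.0))
            loss += weight * huber(u, kappa) / kappa
    return loss / len(target_quantiles)


loss_match = quantile_huber_loss([1.0], [1.0])
loss_off = quantile_huber_loss([0.0], [2.0])
```

In a full training step, the target quantiles would come from the target network evaluated at the next-state action chosen by the main network.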

client_state_dict() → dict

Get the state dictionary of the agent.

Removes the target network, which is not needed for inference on the client.

Returns:

The state dictionary of the agent.

class soulsai.core.agent.DistributionalR2D2Agent(network_type: str, network_kwargs: dict, lr: float, gamma: float, multistep: int, grad_clip: float, q_clip: float, device: device)

Quantile regression (QR) R2D2 agent.

The agent uses a double Q-learning algorithm, where two Q networks are trained at the same time. The networks predict a distribution of quantiles for each action instead of single values. The networks also contain a recurrent layer to allow for a latent network state that compensates for potential non-Markovian elements in the environment.

train(obs: np.ndarray, actions: np.ndarray, rewards: np.ndarray, next_obs: np.ndarray, terminated: np.ndarray, action_masks: np.ndarray | None = None, weights: np.ndarray | None = None) → np.ndarray

Train the agent with quantile regression DQN and a target network.

Parameters:
  • obs – A batch of observations.

  • actions – A batch of actions.

  • rewards – A batch of rewards.

  • next_obs – A batch of next observations.

  • terminated – A batch of episode termination flags.

  • action_masks – Optional batch of action masks.

  • weights – Optional batch of weights for prioritized experience replay.

Returns:

The TD error for each sample in the batch.

class soulsai.core.agent.PPOAgent(actor_net: str, actor_net_kwargs: dict, critic_net: str, critic_net_kwargs: dict, actor_lr: float, critic_lr: float, device: device)

PPO agent for server-side training.

Uses a critic for generalized advantage estimation (see https://arxiv.org/pdf/2006.05990.pdf).
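As a rough illustration of how critic values feed into the advantage estimate, the following plain-Python sketch computes generalized advantage estimates for a single trajectory. The function name `gae`, its arguments, and the default values for `gamma` and `lam` are assumptions made for this example and are not taken from the PPOAgent implementation.

```python
def gae(rewards, values, last_value, terminated, gamma=0.99, lam=0.95):
    """Compute generalized advantage estimates by a backward pass over time."""
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        # Do not bootstrap or propagate advantages across episode ends.
        not_done = 0.0 if terminated[t] else 1.0
        # One-step TD residual using the critic's value estimates.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
        next_value = values[t]
    return advantages


advantages = gae(
    rewards=[1.0, 1.0],
    values=[0.5, 0.5],
    last_value=0.0,
    terminated=[False, True],
    gamma=1.0,
    lam=1.0,
)
```

With `gamma` and `lam` both set to 1, each advantage reduces to the remaining return minus the critic's value, which makes the example easy to check by hand.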

get_action(x: np.ndarray) → Tuple[int, float]

Get the action and the action probability.

Note

The probability is given as an actual probability, not a logit.

Parameters:

x – The network input.

Returns:

A tuple of the chosen action and its associated probability.
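The distinction between a probability and a logit can be made concrete with a small sketch. It assumes a discrete action space and an actor that outputs one logit per action. The helpers `softmax` and `sample_action` are hypothetical and only mirror the documented return value: an action index together with its probability.

```python
import math
import random


def softmax(logits):
    """Convert logits into probabilities that sum to one."""
    m = max(logits)  # Subtract the maximum for numerical stability.
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, rng=random):
    """Sample an action index and return it with its probability (not its logit)."""
    probs = softmax(logits)
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, probs[action]


# Three equal logits give a uniform distribution over three actions.
action, prob = sample_action([0.0, 0.0, 0.0], rng=random.Random(0))
```

Code that needs log-probabilities, for example to form a PPO probability ratio, would take the logarithm of the returned value.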

get_values(x: Tensor, requires_grad: bool = True) → Tensor

Get the state value for the input x.

Parameters:
  • x – Input tensor.

  • requires_grad – Disables the computation of gradients if false.

Returns:

The current state value.

get_probs(x: Tensor) → Tensor

Get the action probabilities for the input x.

Parameters:

x – Input tensor.

Returns:

The action probabilities.

update_callback()

Update callback after a training step to reset noisy nets if used.

client_state_dict() → dict

Get the state dictionary of the agent.

By default, the state dictionary is the same as the state dictionary of the module. Depending on the inference requirements, the state dictionary can be modified to exclude unnecessary parameters.

Returns:

The state dictionary of the agent.