MABWiser Public API

base_mab

This module defines the abstract base class for contextual multi-armed bandit algorithms.

class mabwiser.base_mab.BaseMAB(rng: mabwiser.utils._NumpyRNG, arms: List[Arm], n_jobs: int, backend: str = None)

Bases: object

Abstract base class for multi-armed bandits.

This module is not intended to be used directly; instead, it declares the basic skeleton of multi-armed bandits together with a set of parameters that are common to every bandit algorithm.

It declares abstract methods that sub-classes can override to implement specific bandit policies (see the sketch after this list) using:

  • __init__ constructor to initialize the bandit

  • add_arm method to add a new arm

  • fit method for training

  • partial_fit method for online learning

  • predict_expectations method to retrieve the expectation of each arm

  • predict method for testing to retrieve the best arm based on the policy

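The following is a minimal, hypothetical sketch of what this interface looks like for a simple greedy policy. It is illustrative only: it does not inherit from BaseMAB (real sub-classes must also implement internal hooks such as _uptake_new_arm), and the class name GreedyExample is not part of the library.

import numpy as np

class GreedyExample:
    """Hypothetical, standalone sketch of the interface above; not part of MABWiser."""

    def __init__(self, rng, arms, n_jobs=1, backend=None):
        self.rng = rng
        self.arms = list(arms)
        self.n_jobs = n_jobs
        self.backend = backend
        self.arm_to_expectation = dict.fromkeys(self.arms, 0.0)

    def add_arm(self, arm, binarizer=None, scaler=None):
        # New arms start with zero expectation, mirroring BaseMAB.add_arm.
        self.arms.append(arm)
        self.arm_to_expectation[arm] = 0.0

    def fit(self, decisions, rewards, contexts=None):
        decisions, rewards = np.asarray(decisions), np.asarray(rewards)
        for arm in self.arms:
            arm_rewards = rewards[decisions == arm]
            self.arm_to_expectation[arm] = arm_rewards.mean() if arm_rewards.size else 0.0

    def partial_fit(self, decisions, rewards, contexts=None):
        # Naive online update: refit using only the new batch.
        self.fit(decisions, rewards, contexts)

    def predict_expectations(self, contexts=None):
        return dict(self.arm_to_expectation)

    def predict(self, contexts=None):
        return max(self.arm_to_expectation, key=self.arm_to_expectation.get)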

rng

The random number generator.

Type

np.random.RandomState

arms

The list of all arms.

Type

List

n_jobs

This is used to specify how many concurrent processes/threads should be used for parallelized routines. Default value is set to 1. If set to -1, all CPUs are used. If set to -2, all CPUs but one are used, and so on.

Type

int

backend

Specify a parallelization backend implementation supported in the joblib library. Supported options are:

  • “loky” used by default, can induce some communication and memory overhead when exchanging input and output data with the worker Python processes.

  • “multiprocessing” previous process-based backend based on multiprocessing.Pool. Less robust than loky.

  • “threading” is a very low-overhead backend but it suffers from the Python Global Interpreter Lock if the called function relies a lot on Python objects.

Default value is None. In this case the default backend selected by joblib will be used.

Type

str, optional

arm_to_expectation

The dictionary of arms (keys) to their expected rewards (values).

Type

Dict[Arm, float]

add_arm(arm: Arm, binarizer: Callable = None, scaler: Callable = None) → NoReturn

Introduces a new arm to the bandit.

Adds the new arm with zero expectations and calls the _uptake_new_arm() function of the sub-class.

abstract fit(decisions: numpy.ndarray, rewards: numpy.ndarray, contexts: Optional[numpy.ndarray] = None) → NoReturn

Abstract method.

Fits the multi-armed bandit to the given decision and reward history and corresponding contexts if any.

abstract partial_fit(decisions: numpy.ndarray, rewards: numpy.ndarray, contexts: Optional[numpy.ndarray] = None) → NoReturn

Abstract method.

Updates the multi-armed bandit with the given decision and reward history and corresponding contexts if any.

abstract predict(contexts: Optional[numpy.ndarray] = None) → Arm

Abstract method.

Returns the predicted arm.

abstract predict_expectations(contexts: Optional[numpy.ndarray] = None) → Dict[Arm, Union[int, float]]

Abstract method.

Returns a dictionary from arms (keys) to their expected rewards (values).

mab

This module defines the public interface of the MABWiser Library, providing access to the following classes:

  • MAB

  • LearningPolicy

  • NeighborhoodPolicy

class mabwiser.mab.LearningPolicy

Bases: tuple

class EpsilonGreedy(epsilon: Union[int, float] = 0.05)

Bases: tuple

Epsilon Greedy Learning Policy.

This policy selects the arm with the highest expected reward with probability 1 - \(\epsilon\), and with probability \(\epsilon\) it selects an arm at random for exploration.

epsilon

The probability of selecting a random arm for exploration. Integer or float. Must be between 0 and 1. Default value is 0.05.

Type

Num

Example

>>> from mabwiser.mab import MAB, LearningPolicy
>>> arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab = MAB(arms, LearningPolicy.EpsilonGreedy(epsilon=0.25), seed=123456)
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm1'
property epsilon

Alias for field number 0

class LinTS(alpha: Union[int, float] = 1.0, l2_lambda: Union[int, float] = 1.0, arm_to_scaler: Dict[Arm, Callable] = None)

Bases: tuple

LinTS Learning Policy.

For each arm, LinTS trains a ridge regression and creates a multivariate normal distribution for the coefficients, using the calculated coefficients as the mean and the covariance calculated as:

\[\alpha^{2} (x_i^{T}x_i + \lambda * I_d)^{-1}\]

The normal distribution is randomly sampled to obtain expected coefficients for the ridge regression for each prediction.

\(\alpha\) is a factor used to adjust how conservative the estimate is. Higher \(\alpha\) values promote more exploration.

The multivariate normal distribution uses Cholesky decomposition to guarantee deterministic behavior. This method requires that the covariance is a positive definite matrix. To ensure this is the case, alpha and l2_lambda are required to be greater than zero.
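As an illustration of the formula above, the following NumPy sketch reproduces the sampling step for a single arm. The data and the names X, y and x_new are made up for the example; this is not the library's internal implementation.

import numpy as np

alpha, l2_lambda = 1.0, 1.0
X = np.array([[0, 1, 2, 3], [1, 2, 3, 0], [3, 2, 1, 0]], dtype=float)  # contexts observed for one arm
y = np.array([20.0, 17.0, 9.0])                                        # rewards for those contexts

d = X.shape[1]
A_inv = np.linalg.inv(X.T @ X + l2_lambda * np.eye(d))  # (x_i^T x_i + lambda * I_d)^{-1}
beta_mean = A_inv @ X.T @ y                              # ridge regression coefficients (the mean)
cov = (alpha ** 2) * A_inv                               # covariance from the formula above

rng = np.random.default_rng(123456)
beta_sample = rng.multivariate_normal(beta_mean, cov, method='cholesky')  # positive definite, so Cholesky works

x_new = np.array([3.0, 2.0, 0.0, 1.0])
expected_reward = x_new @ beta_sample  # the arm with the highest sampled value is selected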

alpha

The multiplier to determine the degree of exploration. Integer or float. Must be greater than zero. Default value is 1.0.

Type

Num

l2_lambda

The regularization strength. Integer or float. Must be greater than zero. Default value is 1.0.

Type

Num

arm_to_scaler

Standardize context features by arm. Dictionary mapping each arm to a scaler object. It is assumed that the scaler objects are already fit and will only be used to transform context features. Default value is None.

Type

Dict[Arm, Callable]

Example

>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> contexts = [[0, 1, 2, 3], [1, 2, 3, 0], [2, 3, 1, 0], [3, 2, 1, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.LinTS(alpha=0.25))
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[3, 2, 0, 1]])
'Arm2'
property alpha

Alias for field number 0

property arm_to_scaler

Alias for field number 2

property l2_lambda

Alias for field number 1

class LinUCB(alpha: Union[int, float] = 1.0, l2_lambda: Union[int, float] = 1.0, arm_to_scaler: Dict[Arm, Callable] = None)

Bases: tuple

LinUCB Learning Policy.

This policy trains a ridge regression for each arm. Then, given a context, it predicts a regression value and calculates the upper confidence bound of that prediction. The arm with the highest upper bound is selected.

The UCB for each arm is calculated as:

\[UCB = x_i \beta + \alpha \sqrt{(x_i^{T}x_i + \lambda * I_d)^{-1}x_i}\]

Where \(\beta\) is the matrix of the ridge regression coefficients, \(\lambda\) is the regularization strength, and \(I_d\) is a \(d \times d\) identity matrix, where \(d\) is the number of features in the context data.

\(\alpha\) is a factor used to adjust how conservative the estimate is. Higher \(\alpha\) values promote more exploration.
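A similar NumPy sketch of the bound for a single arm, reading the square-root term as the new context weighted by the inverse regularized design matrix (the data and names are made up; not MABWiser internals):

import numpy as np

alpha, l2_lambda = 1.0, 1.0
X = np.array([[0, 1, 2, 3], [1, 2, 3, 0], [3, 2, 1, 0]], dtype=float)  # contexts observed for one arm
y = np.array([20.0, 17.0, 9.0])                                        # rewards for those contexts

d = X.shape[1]
A_inv = np.linalg.inv(X.T @ X + l2_lambda * np.eye(d))  # (x_i^T x_i + lambda * I_d)^{-1}
beta = A_inv @ X.T @ y                                   # ridge regression coefficients

x_new = np.array([3.0, 2.0, 0.0, 1.0])
ucb = x_new @ beta + alpha * np.sqrt(x_new @ A_inv @ x_new)  # prediction plus exploration bonus
# The arm with the highest ucb value is selected.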

alpha

The parameter to control the exploration. Integer or float. Cannot be negative. Default value is 1.0.

Type

Num

l2_lambda

The regularization strength. Integer or float. Cannot be negative. Default value is 1.0.

Type

Num

arm_to_scaler

Standardize context features by arm. Dictionary mapping each arm to a scaler object. It is assumed that the scaler objects are already fit and will only be used to transform context features. Default value is None.

Type

Dict[Arm, Callable]

Example

>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> contexts = [[0, 1, 2, 3], [1, 2, 3, 0], [2, 3, 1, 0], [3, 2, 1, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.LinUCB(alpha=1.25))
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[3, 2, 0, 1]])
'Arm2'
property alpha

Alias for field number 0

property arm_to_scaler

Alias for field number 2

property l2_lambda

Alias for field number 1

class Popularity

Bases: tuple

Randomized Popularity Learning Policy.

Returns a randomized popular arm for each prediction. The probability of selecting each arm is weighted by its mean reward. It assumes that the rewards are non-negative.

The probability of selection is calculated as:

\[P(arm) = \frac{ \mu_i } { \Sigma{ \mu } }\]

where \(\mu_i\) is the mean reward for that arm.

Example

>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab = MAB(list_of_arms, LearningPolicy.Popularity())
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm1'
class Random

Bases: tuple

Random Learning Policy.

Returns a random arm for each prediction. Each arm is selected uniformly at random.

Example

>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab = MAB(list_of_arms, LearningPolicy.Random())
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm2'
class Softmax(tau: Union[int, float] = 1)

Bases: tuple

Softmax Learning Policy.

This policy selects each arm with a probability proportionate to its average reward. The selection probability of each arm is calculated with a softmax function as:

\[P(arm) = \frac{ e ^ \frac{\mu_i - \max{\mu}}{ \tau } } { \Sigma{e ^ \frac{\mu - \max{\mu}}{ \tau }} }\]

where \(\mu_i\) is the mean reward for that arm and \(\tau\) is the “temperature” to determine the degree of exploration.
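A quick numeric illustration of the formula (the mean rewards below are made up):

import numpy as np

tau = 1.0
means = np.array([15.33, 25.0])       # mean reward per arm, e.g. Arm1 and Arm2
z = (means - means.max()) / tau       # (mu_i - max(mu)) / tau
probs = np.exp(z) / np.exp(z).sum()   # selection probability of each arm
# Lower tau concentrates probability on the best arm; higher tau flattens the distribution.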

tau

The temperature to control the exploration. Integer or float. Must be greater than zero. Default value is 1.

Type

Num

Example

>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab = MAB(list_of_arms, LearningPolicy.Softmax(tau=1))
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm2'
property tau

Alias for field number 0

class ThompsonSampling(binarizer: Callable = None)

Bases: tuple

Thompson Sampling Learning Policy.

This policy creates a beta distribution for each arm and then randomly samples from these distributions. The arm with the highest sample value is selected.

Notice that rewards must be binary to create beta distributions. If rewards are not binary, see the binarizer function.

binarizer

If rewards are not binary, a binarizer function is required. Given an arm decision and its corresponding reward, the binarizer function returns True/False or 0/1 to denote whether the decision counts as a success, i.e., True/1 based on the reward or False/0 otherwise.

The function signature of the binarizer is:

binarize(arm: Arm, reward: Num) -> True/False or 0/1

Type

Callable

Example

>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [1, 1, 1, 0]
>>> mab = MAB(list_of_arms, LearningPolicy.ThompsonSampling())
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm2'
>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> arm_to_threshold = {'Arm1':10, 'Arm2':10}
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [10, 20, 15, 7]
>>> def binarize(arm, reward): return reward > arm_to_threshold[arm]
>>> mab = MAB(list_of_arms, LearningPolicy.ThompsonSampling(binarizer=binarize))
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm2'
property binarizer

Alias for field number 0

class UCB1(alpha: Union[int, float] = 1)

Bases: tuple

Upper Confidence Bound1 Learning Policy.

This policy calculates an upper confidence bound for the mean reward of each arm. It greedily selects the arm with the highest upper confidence bound.

The UCB for each arm is calculated as:

\[UCB = \mu_i + \alpha \times \sqrt[]{\frac{2 \times log(N)}{n_i}}\]

Where \(\mu_i\) is the mean for that arm, \(N\) is the total number of trials, and \(n_i\) is the number of times the arm has been selected.

\(\alpha\) is a factor used to adjust how conservative the estimate is. Higher \(\alpha\) values promote more exploration.
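A small NumPy illustration of the bound (the counts and means below are made up):

import numpy as np

alpha = 1.0
means = np.array([15.33, 25.0])   # mean reward per arm, e.g. Arm1 and Arm2
counts = np.array([3, 1])         # times each arm has been selected
N = counts.sum()                  # total number of trials

ucb = means + alpha * np.sqrt(2 * np.log(N) / counts)
# The arm with the highest ucb value is selected greedily.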

alpha

The parameter to control the exploration. Integer or float. Cannot be negative. Default value is 1.

Type

Num

Example

>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab = MAB(list_of_arms, LearningPolicy.UCB1(alpha=1.25))
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm2'
property alpha

Alias for field number 0

class mabwiser.mab.MAB(arms: List[Arm], learning_policy: Union[mabwiser.mab.EpsilonGreedy, mabwiser.mab.Popularity, mabwiser.mab.Random, mabwiser.mab.Softmax, mabwiser.mab.ThompsonSampling, mabwiser.mab.UCB1, mabwiser.mab.LinTS, mabwiser.mab.LinUCB], neighborhood_policy: Union[None, mabwiser.mab.LSHNearest, mabwiser.mab.Clusters, mabwiser.mab.KNearest, mabwiser.mab.Radius] = None, seed: int = 123456, n_jobs: int = 1, backend: str = None)

Bases: object

MABWiser: Contextual Multi-Armed Bandit Library

MABWiser is a research library for fast prototyping of multi-armed bandit algorithms. It supports context-free, parametric and non-parametric contextual bandit models.

arms

The list of all of the arms available for decisions. Arms can be integers, strings, etc.

Type

list

learning_policy

The learning policy.

Type

LearningPolicy

neighborhood_policy

The neighborhood policy.

Type

NeighborhoodPolicy

is_contextual

True if contextual policy is given, false otherwise. This is a read-only data field.

Type

bool

seed

The random seed to initialize the internal random number generator. This is a read-only data field.

Type

numbers.Rational

n_jobs

This is used to specify how many concurrent processes/threads should be used for parallelized routines. Default value is set to 1. If set to -1, all CPUs are used. If set to -2, all CPUs but one are used, and so on.

Type

int

backend

Specify a parallelization backend implementation supported in the joblib library. Supported options are:

  • “loky” used by default, can induce some communication and memory overhead when exchanging input and output data with the worker Python processes.

  • “multiprocessing” previous process-based backend based on multiprocessing.Pool. Less robust than loky.

  • “threading” is a very low-overhead backend but it suffers from the Python Global Interpreter Lock if the called function relies a lot on Python objects.

Default value is None. In this case the default backend selected by joblib will be used.

Type

str, optional

Examples

>>> from mabwiser.mab import MAB, LearningPolicy
>>> arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab = MAB(arms, LearningPolicy.EpsilonGreedy(epsilon=0.25), seed=123456)
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm1'
>>> mab.add_arm('Arm3')
>>> mab.partial_fit(['Arm3'], [30])
>>> mab.predict()
'Arm3'
>>> from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
>>> arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1', 'Arm2']
>>> rewards = [20, 17, 25, 9, 11]
>>> contexts = [[0, 0, 0], [1, 0, 1], [0, 1, 1], [0, 0, 0], [1, 1, 1]]
>>> contextual_mab = MAB(arms, LearningPolicy.EpsilonGreedy(), NeighborhoodPolicy.KNearest(k=3))
>>> contextual_mab.fit(decisions, rewards, contexts)
>>> contextual_mab.predict([[1, 1, 0], [1, 1, 1], [0, 1, 0]])
['Arm2', 'Arm2', 'Arm2']
>>> contextual_mab.add_arm('Arm3')
>>> contextual_mab.partial_fit(['Arm3'], [30], [[1, 1, 1]])
>>> contextual_mab.predict([[1, 1, 1]])
'Arm3'
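Parallelism is configured at construction time through the n_jobs and backend arguments. A minimal sketch, assuming the threading backend suits the workload:

from mabwiser.mab import MAB, LearningPolicy

# Use all CPUs with joblib's threading backend for the parallelized routines.
parallel_mab = MAB(['Arm1', 'Arm2'], LearningPolicy.EpsilonGreedy(epsilon=0.25),
                   n_jobs=-1, backend='threading')
parallel_mab.fit(['Arm1', 'Arm1', 'Arm2', 'Arm1'], [20, 17, 25, 9])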
add_arm(arm: Arm, binarizer: Callable = None, scaler: Callable = None) → NoReturn

Adds an arm to the list of arms.

Incorporates the arm into the learning and neighborhood policies with no training data.

Parameters
  • arm (Arm) – The new arm to be added.

  • binarizer (Callable) – The new binarizer function for Thompson Sampling.

  • scaler (Callable) – A scaler object from sklearn.preprocessing.

Returns

Return type

No return.

Raises
  • TypeError – For ThompsonSampling, binarizer must be a callable function.

  • TypeError – The standard scaler object must have a transform method.

  • TypeError – The standard scaler object must be fit with calculated mean_ and var_ attributes.

  • ValueError – A binarizer function was provided but the learning policy is not Thompson Sampling.

  • ValueError – The arm already exists.

  • ValueError – The arm is None.

  • ValueError – The arm is NaN.

  • ValueError – The arm is Infinity.
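For example, when the learning policy is Thompson Sampling with non-binary rewards, the new arm needs a binarizer as well. A short sketch with illustrative thresholds:

from mabwiser.mab import MAB, LearningPolicy

arm_to_threshold = {'Arm1': 10, 'Arm2': 10, 'Arm3': 15}  # illustrative thresholds

def binarize(arm, reward):
    return reward > arm_to_threshold[arm]

mab = MAB(['Arm1', 'Arm2'], LearningPolicy.ThompsonSampling(binarizer=binarize))
mab.fit(['Arm1', 'Arm1', 'Arm2', 'Arm1'], [10, 20, 15, 7])
mab.add_arm('Arm3', binarizer=binarize)  # register a binarizer for the new arm
mab.partial_fit(['Arm3'], [30])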

fit(decisions: Union[List[Arm], numpy.ndarray, pandas.core.series.Series], rewards: Union[List[Union[int, float]], numpy.ndarray, pandas.core.series.Series], contexts: Union[None, List[List[Union[int, float]]], numpy.ndarray, pandas.core.series.Series, pandas.core.frame.DataFrame] = None) → NoReturn

Fits the multi-armed bandit to the given decisions, their corresponding rewards and contexts, if any.

Validates arguments and raises exceptions in case there are violations.

This function makes the following assumptions:
  • each decision corresponds to an arm of the bandit.

  • there are no None, Nan, or Infinity values in the contexts.

Parameters
  • decisions (Union[List[Arm], np.ndarray, pd.Series]) – The decisions that are made.

  • rewards (Union[List[Num], np.ndarray, pd.Series]) – The rewards that are received corresponding to the decisions.

  • contexts (Union[None, List[List[Num]], np.ndarray, pd.Series, pd.DataFrame]) – The context under which each decision is made. Default value is None, i.e., no contexts.

Returns

Return type

No return.

Raises
  • TypeError – Decisions and rewards are not given as list, numpy array or pandas series.

  • TypeError – Contexts is not given as None, list, numpy array, pandas series or data frames.

  • ValueError – Length mismatch between decisions, rewards, and contexts.

  • ValueError – Fitting contexts data when there is no contextual policy.

  • ValueError – Contextual policy when fitting no contexts data.

  • ValueError – Rewards contain None, Nan, or Infinity.

property learning_policy

Creates named tuple of the learning policy based on the implementor.

Returns

Return type

The learning policy.

Raises

NotImplementedError – MAB learning_policy property not implemented for this learning policy.

property neighborhood_policy

Creates named tuple of the neighborhood policy based on the implementor.

Returns

Return type

The neighborhood policy.

partial_fit(decisions: Union[List[Arm], numpy.ndarray, pandas.core.series.Series], rewards: Union[List[Union[int, float]], numpy.ndarray, pandas.core.series.Series], contexts: Union[None, List[List[Union[int, float]]], numpy.ndarray, pandas.core.series.Series, pandas.core.frame.DataFrame] = None) → NoReturn

Updates the multi-armed bandit with the given decisions, their corresponding rewards and contexts, if any.

Validates arguments and raises exceptions in case there are violations.

This function makes the following assumptions:
  • each decision corresponds to an arm of the bandit.

  • there are no None, Nan, or Infinity values in the contexts.

Parameters
  • decisions (Union[List[Arm], np.ndarray, pd.Series]) – The decisions that are made.

  • rewards (Union[List[Num], np.ndarray, pd.Series]) – The rewards that are received corresponding to the decisions.

  • contexts (Union[None, List[List[Num]], np.ndarray, pd.Series, pd.DataFrame]) – The context under which each decision is made. Default value is None, i.e., no contexts.

Returns

Return type

No return.

Raises
  • TypeError – Decisions and rewards are not given as list, numpy array or pandas series.

  • TypeError – Contexts is not given as None, list, numpy array, pandas series or data frames.

  • ValueError – Length mismatch between decisions, rewards, and contexts.

  • ValueError – Fitting contexts data when there is no contextual policy.

  • ValueError – Contextual policy when fitting no contexts data.

  • ValueError – Rewards contain None, Nan, or Infinity.

predict(contexts: Union[None, List[Union[int, float]], List[List[Union[int, float]]], numpy.ndarray, pandas.core.series.Series, pandas.core.frame.DataFrame] = None) → Union[Arm, List[Arm]]

Returns the “best” arm (or arms list if multiple contexts are given) based on the expected reward.

The definition of the best depends on the specified learning policy. Contextual learning policies and neighborhood policies require contexts data in training. In testing, they return the best arm given new context(s).

Parameters

contexts (Union[None, List[Num], List[List[Num]], np.ndarray, pd.Series, pd.DataFrame]) – The context under which each decision is made. Default value is None. Contexts should be None for context-free bandits and are required for contextual bandits.

Returns

Return type

The recommended arm or recommended arms list.

Raises
  • TypeError – Contexts is not given as None, list, numpy array, pandas series or data frames.

  • ValueError – Predicting with contexts data when there is no contextual policy.

  • ValueError – Contextual policy when predicting with no contexts data.

predict_expectations(contexts: Union[None, List[Union[int, float]], List[List[Union[int, float]]], numpy.ndarray, pandas.core.series.Series, pandas.core.frame.DataFrame] = None) → Union[Dict[Arm, Union[int, float]], List[Dict[Arm, Union[int, float]]]]

Returns a dictionary of arms (key) to their expected rewards (value).

Contextual learning policies and neighborhood policies require contexts data for expected rewards.

Parameters

contexts (Union[None, List[Num], List[List[Num]], np.ndarray, pd.Series, pd.DataFrame]) – The context for the expected rewards. Default value is None. Contexts should be None for context-free bandits and are required for contextual bandits.

Returns

Return type

The dictionary of arms (key) to their expected rewards (value), or a list of such dictionaries.

Raises
  • TypeError – Contexts is not given as None, list, numpy array or pandas data frames.

  • ValueError – Predicting with contexts data when there is no contextual policy.

  • ValueError – Contextual policy when predicting with no contexts data.
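A short sketch of retrieving expectations rather than a single best arm (the returned values depend on the fitted policy and are not shown here):

from mabwiser.mab import MAB, LearningPolicy

mab = MAB(['Arm1', 'Arm2'], LearningPolicy.EpsilonGreedy(epsilon=0.25), seed=123456)
mab.fit(['Arm1', 'Arm1', 'Arm2', 'Arm1'], [20, 17, 25, 9])

expectations = mab.predict_expectations()          # a dict such as {'Arm1': ..., 'Arm2': ...}
best_arm = max(expectations, key=expectations.get)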

class mabwiser.mab.NeighborhoodPolicy

Bases: tuple

class Clusters(n_clusters: Union[int, float] = 2, is_minibatch: bool = False)

Bases: tuple

Clusters Neighborhood Policy.

Clusters is a k-means clustering approach that uses the observations from the closest cluster with a learning policy. Supports KMeans and MiniBatchKMeans.

n_clusters

The number of clusters. Integer. Must be at least 2. Default value is 2.

Type

Num

is_minibatch

Boolean flag to use MiniBatchKMeans or not. Default value is False.

Type

bool

Example

>>> from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
>>> list_of_arms = [1, 2, 3, 4]
>>> decisions = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
>>> rewards = [0, 1, 1, 0, 0, 0, 0, 1, 1, 1]
>>> contexts = [[0, 1, 2, 3, 5], [1, 1, 1, 1, 1], [0, 0, 1, 0, 0], [0, 2, 2, 3, 5], [1, 3, 1, 1, 1], [0, 0, 0, 0, 0], [0, 1, 4, 3, 5], [0, 1, 2, 4, 5], [1, 2, 1, 1, 3], [0, 2, 1, 0, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.EpsilonGreedy(epsilon=0), NeighborhoodPolicy.Clusters(3))
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[0, 1, 2, 3, 5], [1, 1, 1, 1, 1]])
[3, 1]
property is_minibatch

Alias for field number 1

property n_clusters

Alias for field number 0

class KNearest(k: int = 1, metric: str = 'euclidean')

Bases: tuple

KNearest Neighborhood Policy.

KNearest is a nearest neighbors approach that selects the k-nearest observations to be used with a learning policy.

k

The number of neighbors to select. Integer value. Must be greater than zero. Default value is 1.

Type

int

metric

The metric used to calculate distance. Accepts any of the metrics supported by scipy.spatial.distance.cdist. Default value is Euclidean distance.

Type

str

Example

>>> from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
>>> list_of_arms = [1, 2, 3, 4]
>>> decisions = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
>>> rewards = [0, 1, 1, 0, 0, 0, 0, 1, 1, 1]
>>> contexts = [[0, 1, 2, 3, 5], [1, 1, 1, 1, 1], [0, 0, 1, 0, 0], [0, 2, 2, 3, 5], [1, 3, 1, 1, 1], [0, 0, 0, 0, 0], [0, 1, 4, 3, 5], [0, 1, 2, 4, 5], [1, 2, 1, 1, 3], [0, 2, 1, 0, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.EpsilonGreedy(epsilon=0), NeighborhoodPolicy.KNearest(2, "euclidean"))
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[0, 1, 2, 3, 5], [1, 1, 1, 1, 1]])
[1, 1]
property k

Alias for field number 0

property metric

Alias for field number 1

class LSHNearest(n_dimensions: int = 5, n_tables: int = 3, no_nhood_prob_of_arm: Optional[List] = None)

Bases: tuple

Locality-Sensitive Hashing Approximate Nearest Neighbors Policy.

LSHNearest is a nearest neighbors approach that uses locality sensitive hashing with a simhash to select observations to be used with a learning policy.

For the simhash, contexts are projected onto a hyperplane of n_context_cols x n_dimensions and each column of the hyperplane is evaluated for its sign, giving an ordered array of binary values. This is converted to a base 10 integer used as the hash code to assign the context to a hash table. This process is repeated for a specified number of hash tables, where each has a unique, randomly-generated hyperplane. To select the neighbors for a context, the hash code is calculated for each hash table and any contexts with the same hashes are selected as the neighbors.

As with the radius or k value for other nearest neighbors algorithms, selecting the best number of dimensions and tables requires tuning. For the dimensions, a good starting point is to use the log of the square root of the number of rows in the training data. This will give you sqrt(n_rows) number of hashes.

The number of dimensions and number of tables have inverse effects from each other on the number of empty neighborhoods and average neighborhood size. Increasing the dimensionality decreases the number of collisions, which increases the precision of the approximate neighborhood but also potentially increases the number of empty neighborhoods. Increasing the number of hash tables increases the likelihood of capturing neighbors the other random hyperplanes miss and increases the average neighborhood size. It should be noted that the fit operation is O(2**n_dimensions).
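The hashing step described above can be sketched in NumPy as follows (illustrative only, not the library's internal code):

import numpy as np

n_dimensions, n_context_cols = 5, 4
rng = np.random.default_rng(123456)

# One randomly generated hyperplane per hash table: n_context_cols x n_dimensions.
hyperplane = rng.standard_normal((n_context_cols, n_dimensions))

context = np.array([3.0, 2.0, 0.0, 1.0])
bits = (context @ hyperplane) >= 0                            # sign of each projected column
hash_code = int(''.join('1' if b else '0' for b in bits), 2)  # ordered bits as a base-10 integer

# Contexts that receive the same hash_code in a table fall into the same bucket;
# their rows are the neighbors used by the learning policy.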

n_dimensions

The number of dimensions to use for the hyperplane. Integer value. Must be greater than zero. Default value is 5.

Type

int

n_tables

The number of hash tables. Integer value. Must be greater than zero. Default value is 3.

Type

int

no_nhood_prob_of_arm

The probabilities associated with each arm. Used to select random arm if a prediction context has no neighbors. If not given, a uniform random distribution over all arms is assumed. The probabilities should sum up to 1.

Type

None or List

Example

>>> from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
>>> list_of_arms = [1, 2, 3, 4]
>>> decisions = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
>>> rewards = [0, 1, 1, 0, 0, 0, 0, 1, 1, 1]
>>> contexts = [[0, 1, 2, 3, 5], [1, 1, 1, 1, 1], [0, 0, 1, 0, 0], [0, 2, 2, 3, 5], [1, 3, 1, 1, 1], [0, 0, 0, 0, 0], [0, 1, 4, 3, 5], [0, 1, 2, 4, 5], [1, 2, 1, 1, 3], [0, 2, 1, 0, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.EpsilonGreedy(epsilon=0), NeighborhoodPolicy.LSHNearest(5, 3))
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[0, 1, 2, 3, 5], [1, 1, 1, 1, 1]])
[3, 1]
property n_dimensions

Alias for field number 0

property n_tables

Alias for field number 1

property no_nhood_prob_of_arm

Alias for field number 2

class Radius(radius: Union[int, float] = 0.05, metric: str = 'euclidean', no_nhood_prob_of_arm: Optional[List] = None)

Bases: tuple

Radius Neighborhood Policy.

Radius is a nearest neighborhood approach that selects the observations within a given radius to be used with a learning policy.

radius

The maximum distance within which to select observations. Integer or Float. Must be greater than zero. Default value is 0.05.

Type

Num

metric

The metric used to calculate distance. Accepts any of the metrics supported by scipy.spatial.distance.cdist. Default value is Euclidean distance.

Type

str

no_nhood_prob_of_arm

The probabilities associated with each arm. Used to select random arm if a prediction context has no neighbors. If not given, a uniform random distribution over all arms is assumed. The probabilities should sum up to 1.

Type

None or List

Example

>>> from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
>>> list_of_arms = [1, 2, 3, 4]
>>> decisions = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
>>> rewards = [0, 1, 1, 0, 0, 0, 0, 1, 1, 1]
>>> contexts = [[0, 1, 2, 3, 5], [1, 1, 1, 1, 1], [0, 0, 1, 0, 0], [0, 2, 2, 3, 5], [1, 3, 1, 1, 1], [0, 0, 0, 0, 0], [0, 1, 4, 3, 5], [0, 1, 2, 4, 5], [1, 2, 1, 1, 3], [0, 2, 1, 0, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.EpsilonGreedy(epsilon=0), NeighborhoodPolicy.Radius(2, "euclidean"))
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[0, 1, 2, 3, 5], [1, 1, 1, 1, 1]])
[3, 1]
property metric

Alias for field number 1

property no_nhood_prob_of_arm

Alias for field number 2

property radius

Alias for field number 0

simulator

This module provides a simulation utility for comparing algorithms and hyper-parameter tuning.

class mabwiser.simulator.Simulator(bandits: List[tuple], decisions: Union[List[Arm], numpy.ndarray, pandas.core.series.Series], rewards: Union[List[Union[int, float]], numpy.ndarray, pandas.core.series.Series], contexts: Union[None, List[List[Union[int, float]]], numpy.ndarray, pandas.core.series.Series, pandas.core.frame.DataFrame] = None, scaler: callable = None, test_size: float = 0.3, is_ordered: bool = False, batch_size: int = 0, evaluator: callable = <function default_evaluator>, seed: int = 123456, is_quick: bool = False, log_file: str = None, log_format: str = '%(asctime)s %(levelname)s %(message)s')

Bases: object

Multi-Armed Bandit Simulator.

This utility runs a simulation using historic data and a collection of multi-armed bandits from the MABWiser library, or custom bandits that extend the BaseMAB class in MABWiser.

It can be used to run a simple simulation with a single bandit or to compare multiple bandits for policy selection, hyper-parameter tuning, etc.

Nearest Neighbor bandits that use the default Radius and KNearest implementations from MABWiser are converted to custom versions that share distance calculations to speed up the simulation. These custom versions also track statistics about the neighborhoods that can be used in evaluation.

The results can be accessed through the arm_to_stats, bandit_to_predictions, bandit_to_confusion_matrices, and bandit_to_arm_to_stats properties described below.

When using partial fitting, an additional confusion matrix is calculated for all predictions after all of the batches are processed.

A log of the simulation tracks the experiment progress.

bandits

A list of tuples of the name of each bandit and the bandit object.

Type

list[(str, bandit)]

decisions

The complete decision history to be used in train and test.

Type

array

rewards

The complete reward history to be used in train and test.

Type

array

contexts

The complete context history to be used in train and test.

Type

array

scaler

A scaler object from sklearn.preprocessing.

Type

scaler

test_size

The size of the test set, given as a fraction of the data.

Type

float

is_ordered

Whether to use a chronological division for the train-test split. If false, uses sklearn’s train_test_split.

Type

bool

batch_size

The size of each batch for online learning.

Type

int

evaluator

The function for evaluating the bandits. Values are stored in bandit_to_arm_to_stats_avg. Must have the function signature function(arms_to_stats_train: dictionary, predictions: list, decisions: np.ndarray, rewards: np.ndarray, metric: str).

Type

callable

is_quick

Flag to skip neighborhood statistics.

Type

bool

logger

The logger object.

Type

Logger

arms

The list of arms used by the bandits.

Type

list

arm_to_stats_total

Descriptive statistics for the complete data set.

Type

dict

arm_to_stats_train

Descriptive statistics for the training data.

Type

dict

arm_to_stats_test

Descriptive statistics for the test data.

Type

dict

bandit_to_arm_to_stats_avg

Descriptive statistics for the predictions made by each bandit based on means from training data.

Type

dict

bandit_to_arm_to_stats_min

Descriptive statistics for the predictions made by each bandit based on minimums from training data.

Type

dict

bandit_to_arm_to_stats_max

Descriptive statistics for the predictions made by each bandit based on maximums from training data.

Type

dict

bandit_to_confusion_matrices

The confusion matrices for each bandit.

Type

dict

bandit_to_predictions

The prediction for each item in the test set for each bandit.

Type

dict

bandit_to_expectations

The arm_to_expectations for each item in the test set for each bandit. For context-free bandits, there is a single dictionary for each batch.

Type

dict

bandit_to_neighborhood_size

The number of neighbors in each neighborhood for each row in the test set. Calculated when using a Radius neighborhood policy, or a custom class that inherits from it. Not calculated when is_quick is True.

Type

dict

bandit_to_arm_to_stats_neighborhoods

The arm_to_stats for each neighborhood for each row in the test set. Calculated when using Radius or KNearest, or a custom class that inherits from one of them. Not calculated when is_quick is True.

Type

dict

test_indices

The indices of the rows in the test set. If input was not zero-indexed, these will reflect their position in the input rather than actual index.

Type

list

Example

>>> from mabwiser.mab import MAB, LearningPolicy
>>> arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab1 = MAB(arms, LearningPolicy.EpsilonGreedy(epsilon=0.25), seed=123456)
>>> mab2 = MAB(arms, LearningPolicy.EpsilonGreedy(epsilon=0.30), seed=123456)
>>> bandits = [('EG 25%', mab1), ('EG 30%', mab2)]
>>> offline_sim = Simulator(bandits, decisions, rewards, test_size=0.5, batch_size=0)
>>> offline_sim.run()
>>> offline_sim.bandit_to_arm_to_stats_avg['EG 30%']['Arm1']
{'count': 1, 'sum': 9, 'min': 9, 'max': 9, 'mean': 9.0, 'std': 0.0}
get_arm_stats(decisions: numpy.ndarray, rewards: numpy.ndarray) → dict

Calculates descriptive statistics for each arm in the provided data set.

Parameters
  • decisions (np.ndarray) – The decisions to filter the rewards.

  • rewards (np.ndarray) – The rewards to get statistics about.

Returns

  • Arm_to_stats dictionary.

  • Dictionary has the format {arm: {‘count’, ‘sum’, ‘min’, ‘max’, ‘mean’, ‘std’}}

static get_stats(rewards: numpy.ndarray) → dict

Calculates descriptive statistics for the given array of rewards.

Parameters

rewards (np.ndarray) – Array of rewards for a single arm.

Returns

  • A dictionary of descriptive statistics.

  • Dictionary has the format {‘count’, ‘sum’, ‘min’, ‘max’, ‘mean’, ‘std’}

plot(metric: str = 'avg', is_per_arm: bool = False) → NoReturn

Generates a plot of the cumulative sum of the rewards for each bandit. Simulation must be run before calling this method.

Parameters
  • metric (str) – The bandit_to_arm_to_stats to use to generate the plot. Must be ‘avg’, ‘min’, or ‘max’.

  • is_per_arm (bool) – Whether to plot each arm separately or use an aggregate statistic.

Raises
  • AssertionError – Descriptive statistics for predictions are missing.

  • TypeError – Metric must be a string.

  • TypeError – The per_arm flag must be a boolean.

  • ValueError – The metric must be one of avg, min or max.

Returns

Return type

None

run() → NoReturn

Run simulator

Runs a simulation concurrently for all bandits in the bandits list.

Returns

Return type

None

mabwiser.simulator.default_evaluator(arms: List[Arm], decisions: numpy.ndarray, rewards: numpy.ndarray, predictions: List[Arm], arm_to_stats: dict, stat: str, start_index: int, nn: bool = False) → dict

Default evaluation function.

Calculates predicted rewards for the test batch based on predicted arms. When the predicted arm is the same as the historic decision, the historic reward is used. When the predicted arm is different, the mean, min or max reward from the training data is used. If using Radius or KNearest neighborhood policy, the statistics from the neighborhood are used instead of the entire training set.

The simulator supports custom evaluation functions, but they must have this signature to work with the simulation pipeline.

Parameters
  • arms (list) – The list of arms.

  • decisions (np.ndarray) – The historic decisions for the batch being evaluated.

  • rewards (np.ndarray) – The historic rewards for the batch being evaluated.

  • predictions (list) – The predictions for the batch being evaluated.

  • arm_to_stats (dict) – The dictionary of descriptive statistics for each arm to use in evaluation.

  • stat (str) – Which metric from arm_to_stats to use. Takes the values ‘min’, ‘max’, ‘mean’.

  • start_index (int) – The index of the first row in the batch. For offline simulations it is 0. For online simulations it is batch size * batch number. Used to select the correct index from arm_to_stats if there are separate entries for each row in the test set.

  • nn (bool) – Whether the results are from one of the simulator custom nearest neighbors implementations.

Returns

  • An arm_to_stats dictionary for the predictions in the batch.

  • Dictionary has the format {arm: {‘count’, ‘sum’, ‘min’, ‘max’, ‘mean’, ‘std’}}
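A minimal sketch of a custom evaluator with the same signature. It reuses the historic reward when the prediction matches the decision and falls back to the training statistic otherwise; the name simple_evaluator and its exact logic are illustrative, not part of MABWiser:

import numpy as np
from typing import List

def simple_evaluator(arms: List, decisions: np.ndarray, rewards: np.ndarray,
                     predictions: List, arm_to_stats: dict, stat: str,
                     start_index: int, nn: bool = False) -> dict:
    """Illustrative custom evaluator; start_index and nn belong to the required signature but are unused here."""
    arm_to_rewards = {arm: [] for arm in arms}
    for decision, reward, prediction in zip(decisions, rewards, predictions):
        if prediction == decision:
            arm_to_rewards[prediction].append(reward)                          # use the observed reward
        else:
            arm_to_rewards[prediction].append(arm_to_stats[prediction][stat])  # fall back to the training stat

    def describe(values):
        if not values:
            return {'count': 0, 'sum': 0, 'min': 0, 'max': 0, 'mean': 0, 'std': 0}
        arr = np.asarray(values, dtype=float)
        return {'count': arr.size, 'sum': arr.sum(), 'min': arr.min(),
                'max': arr.max(), 'mean': arr.mean(), 'std': arr.std()}

    return {arm: describe(values) for arm, values in arm_to_rewards.items()}

Such a function can then be passed to the Simulator through its evaluator argument.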

utils

This module provides a number of constants and helper functions.

mabwiser.utils.Arm(x)

Arm type is defined as integer, float, or string.

class mabwiser.utils.Constants

Bases: tuple

Constant values used by the modules.

default_seed = 123456

The default random seed.

distance_metrics = ['braycurtis', 'canberra', 'chebyshev', 'cityblock', 'correlation', 'cosine', 'dice', 'euclidean', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean']

The distance metrics supported by neighborhood policies.

mabwiser.utils.Num

Num type is defined as integer or float.

alias of Union[int, float]

mabwiser.utils.argmax(dictionary: Dict[Arm, Union[int, float]]) → Arm

Returns the first key with the maximum value.

mabwiser.utils.check_false(expression: bool, exception: Exception) → NoReturn

Checks that given expression is false, otherwise raises the given exception.

mabwiser.utils.check_true(expression: bool, exception: Exception) → NoReturn

Checks that given expression is true, otherwise raises the given exception.

mabwiser.utils.create_rng(seed: int) → mabwiser.utils._BaseRNG

Returns an rng object

Parameters

seed (int) – the seed of the rng

Returns

out – An rng object that implements the base rng class

Return type

_BaseRNG

mabwiser.utils.reset(dictionary: Dict, value) → NoReturn

Maps every key to the given value.
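A short usage sketch of these helpers (the dictionary values are illustrative):

from mabwiser.utils import argmax, check_false, check_true, create_rng, reset

expectations = {'Arm1': 0.8, 'Arm2': 0.5}
best = argmax(expectations)          # 'Arm1', the first key with the maximum value

check_true(0 <= 0.05 <= 1, ValueError("epsilon must be between 0 and 1"))
check_false(best is None, ValueError("no arm was selected"))

rng = create_rng(seed=123456)        # an rng object implementing the base rng class

reset(expectations, 0)               # every key now maps to 0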