Model Selection
This example demonstrates how to perform online model selection using a multi-armed bandit approach on the Phishing dataset.
Models and Optimizers
We use the following models and components:
- LogisticRegression: Two logistic regression models, each with a different SGD optimizer:
  - log1 uses sgd1 (learning rate 0.0001)
  - log2 uses sgd2 (learning rate 0.00001)
- BanditClassifier: Combines multiple models and selects the best one online using a bandit policy.
- EpsilonGreedy: The bandit policy used to balance exploration and exploitation, with parameters for epsilon, decay, burn-in, and random seed.
We also use the following metric for evaluation:
- Accuracy
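To make the policy concrete, here is a minimal pure-Python sketch of how an epsilon-greedy policy with decay and burn-in picks an arm. This is an illustration only, not the beaver implementation; `pick_arm`, its signature, and the decay scheme are assumptions for the sketch.

```python
import random

def pick_arm(step, rewards, counts,
             epsilon=0.1, decay=0.001, burn_in=20, rng=random.Random(42)):
    """Pick which model (arm) to use at a given step.

    rewards[i] / counts[i] is the running mean reward of arm i.
    """
    n_arms = len(counts)
    if step < burn_in:                  # burn-in: try every arm in turn
        return step % n_arms
    # one common decay scheme: exploration rate shrinks over time
    eps = epsilon / (1 + decay * step)
    if rng.random() < eps:              # explore: pick a random arm
        return rng.randrange(n_arms)
    # exploit: arm with the best running mean reward (unseen arms count as 0)
    means = [rewards[i] / counts[i] if counts[i] else 0.0 for i in range(n_arms)]
    return max(range(n_arms), key=means.__getitem__)
```

During burn-in every model gets a turn; afterwards the exploration probability shrinks as more samples arrive, so the policy increasingly exploits the best-performing model.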
Preprocessing
- StandardScaler is used to standardize features for the dataset.
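For intuition, online standardization can be sketched in pure Python using Welford's algorithm to track a running mean and variance per feature. `RunningScaler` below is a hypothetical illustration, not the actual StandardScaler implementation:

```python
class RunningScaler:
    """Online standardization: maintains a running mean/variance per feature
    (Welford's algorithm) so each sample can be scaled as it streams in."""

    def __init__(self):
        self.n = 0
        self.mean = {}
        self.m2 = {}  # sum of squared deviations from the running mean

    def learn_one(self, x):
        self.n += 1
        for key, value in x.items():
            mean = self.mean.get(key, 0.0)
            delta = value - mean
            self.mean[key] = mean + delta / self.n
            self.m2[key] = self.m2.get(key, 0.0) + delta * (value - self.mean[key])

    def transform_one(self, x):
        out = {}
        for key, value in x.items():
            var = self.m2.get(key, 0.0) / self.n if self.n else 0.0
            std = var ** 0.5
            out[key] = (value - self.mean.get(key, 0.0)) / std if std else 0.0
        return out
```

Unlike a batch scaler, this never sees the full dataset: the mean and variance estimates simply improve as more samples arrive.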
Beaver File Structure
Let's see how we would write the beaver file:
Connector
We start by defining the connector, specifying the Kafka bootstrap servers and security protocol.
connector {
  bootstrap_servers = "localhost:39092"
  security_protocol = "plaintext"
  consumer_group = "model_selection_models"
  auto_offset_reset = "earliest"
}
Models and Optimizers
We define the optimizers, logistic regression models, bandit policy, and the bandit classifier:
optimizer <SGD> sgd1
  params:
    lr = 0.0001
optimizer <SGD> sgd2
  params:
    lr = 1e-05
algorithm <LogisticRegression> log1
  params:
    optimizer = sgd1
algorithm <LogisticRegression> log2
  params:
    optimizer = sgd2
algorithm <EpsilonGreedy> epsilon
  params:
    epsilon = 0.1,
    decay = 0.001,
    burn_in = 20,
    seed = 42
algorithm <BanditClassifier> bandit
  params:
    models = [log1, log2],
    metric = acc,
    policy = epsilon
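Conceptually, the bandit classifier wires these pieces together: at each step the policy picks one model, that model predicts and learns, and the prediction's correctness is fed back as the reward. The toy sketch below (with a hypothetical `MajorityModel` and `bandit_select`, in pure Python) illustrates the loop, not the real implementation:

```python
import random

class MajorityModel:
    """Trivial online classifier: predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = {}
    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else 0
    def learn(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

def bandit_select(stream, models, epsilon=0.1, burn_in=20, seed=42):
    """Epsilon-greedy model selection over an online stream; returns accuracy."""
    rng = random.Random(seed)
    rewards = [0.0] * len(models)
    counts = [0] * len(models)
    correct = 0.0
    for t, (x, y) in enumerate(stream):
        if t < burn_in:                      # burn-in: cycle through models
            arm = t % len(models)
        elif rng.random() < epsilon:         # explore: random model
            arm = rng.randrange(len(models))
        else:                                # exploit: best mean reward so far
            means = [rewards[i] / counts[i] if counts[i] else 0.0
                     for i in range(len(models))]
            arm = max(range(len(models)), key=means.__getitem__)
        y_pred = models[arm].predict(x)
        reward = float(y_pred == y)          # reward = was the prediction right?
        rewards[arm] += reward
        counts[arm] += 1
        correct += reward
        models[arm].learn(x, y)              # only the chosen model learns
    return correct / len(stream)
```

Note that only the selected model learns from each sample, which is what makes bandit-based selection cheaper than training all candidates in parallel.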
Preprocessing
We define the preprocessor:
preprocessor <StandardScaler> scaler
Metric
We define the evaluation metric:
metric <Accuracy> acc
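Streaming metrics like Accuracy are updated one prediction at a time rather than computed over a fixed test set. A minimal, hypothetical version of such a metric:

```python
class RunningAccuracy:
    """Streaming accuracy: updated one (y_true, y_pred) pair at a time."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, y_true, y_pred):
        self.correct += int(y_true == y_pred)
        self.total += 1
        return self

    def get(self):
        return self.correct / self.total if self.total else 0.0
```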
Data
We define the data source and specify the target feature and preprocessor:
data Phishing {
  input_topic = "Phishing"
  features:
    target_feature = is_phishing
    preprocessors = scaler
}
Pipeline
Finally, we define the pipeline that brings everything together:
pipeline banditPipeline {
  output_topic = "banditPipeline"
  data = Phishing
  algorithm = bandit
  metrics = acc
}
And that's it! This setup automatically selects and adapts between models in an online fashion, optimizing performance on the Phishing dataset with a bandit-based selection strategy. Putting it all together, the complete beaver file reads:
connector {
  bootstrap_servers = "localhost:39092"
  security_protocol = "plaintext"
  consumer_group = "model_selection_models"
  auto_offset_reset = "earliest"
}
optimizer <SGD> sgd1
  params:
    lr = 0.0001
optimizer <SGD> sgd2
  params:
    lr = 1e-05
algorithm <LogisticRegression> log1
  params:
    optimizer = sgd1
algorithm <LogisticRegression> log2
  params:
    optimizer = sgd2
preprocessor <StandardScaler> scaler
metric <Accuracy> acc
algorithm <EpsilonGreedy> epsilon
  params:
    epsilon = 0.1,
    decay = 0.001,
    burn_in = 20,
    seed = 42
algorithm <BanditClassifier> bandit
  params:
    models = [log1, log2],
    metric = acc,
    policy = epsilon
data Phishing {
  input_topic = "Phishing"
  features:
    target_feature = is_phishing
    preprocessors = scaler
}
pipeline banditPipeline {
  output_topic = "banditPipeline"
  data = Phishing
  algorithm = bandit
  metrics = acc
}