World Happiness Report
This example demonstrates how to build and evaluate multiple regression pipelines using the World Happiness Report dataset. The goal is to predict the happiness score of a country based on various features.
Data Sourceâ
- Input Topic:
World_Happiness_Report
- Target Feature:
score
(the happiness score to predict) - Dropped Feature:
overall_rank
(not used for prediction)
Preprocessingâ
- Feature Selection: Numeric (
int
,float
) and string (str
) features are selected separately. - Scaling: Numeric features are standardized using
StandardScaler
. - Encoding: String features are one-hot encoded with
OneHotEncoder
.
Algorithmsâ
Three regression algorithms are used:
- LinearRegression: A simple linear model with a specified learning rate for the intercept.
- HoeffdingAdaptiveTreeRegressor: An adaptive tree-based regressor suitable for streaming data.
- KNNRegressor: A k-nearest neighbors regressor.
Pipelinesâ
Each pipeline uses the same data and preprocessing steps but a different regression algorithm:
- linearRegPipeline: Uses
LinearRegression
. - hoeffTreePipeline: Uses
HoeffdingAdaptiveTreeRegressor
. - knnRegPipeline: Uses
KNNRegressor
.
Each pipeline outputs predictions to its own topic and evaluates performance using three metrics:
- MAE (Mean Absolute Error)
- MSE (Mean Squared Error)
- R2 (R-squared Score)
Beaver File Structureâ
Connectorâ
We start by defining the connector, specifying the Kafka bootstrap servers and security protocol.
connector {
bootstrap_servers = "localhost:39092"
security_protocol = "plaintext"
consumer_group = 'world-happiness'
auto_offset_reset = "earliest"
}
Modelsâ
We define the regression algorithms:
algorithm <LinearRegression> linearReg
params:
intercept_lr=0.1
algorithm <HoeffdingAdaptiveTreeRegressor> hoeffTree
params:
grace_period=50,
model_selector_decay=0.3,
seed=0
algorithm <KNNRegressor> knnReg
Feature Selection and Preprocessingâ
We select numeric and string features separately and apply the appropriate preprocessors:
composer <SelectType> select
params:
(int , float)
composer <SelectType> selectstr
params:
str
preprocessor <OneHotEncoder> encoder
preprocessor <StandardScaler> scaler
Metricsâ
We define the evaluation metrics:
metric <MAE> mae
metric <MSE> mse
metric <R2> r2
Dataâ
We define the data source, drop the overall_rank
feature, and specify the target and preprocessors:
data World_Happiness_Report {
input_topic = "World_Happiness_Report"
features:
drop_features = overall_rank
target_feature = score
preprocessors = select | scaler + selectstr | encoder
}
Pipelinesâ
We define three pipelines, each using a different regression algorithm:
pipeline linearRegPipeline {
output_topic = 'linearRegPipeline'
data = World_Happiness_Report
algorithm = linearReg
metrics = mae , mse , r2
}
pipeline hoeffTreePipeline {
output_topic = 'hoeffTreePipeline'
data = World_Happiness_Report
algorithm = hoeffTree
metrics = mae , mse , r2
}
pipeline knnRegPipeline {
output_topic = 'knnRegPipeline'
data = World_Happiness_Report
algorithm = knnReg
metrics = mae , mse , r2
}
With this configuration, you can efficiently compare multiple regression approaches on the World Happiness Report dataset and gain insights into which model performs best for predicting happiness scores.
connector {
bootstrap_servers = "localhost:39092"
security_protocol = "plaintext"
consumer_group = 'world-happiness'
auto_offset_reset = "earliest"
}
algorithm <LinearRegression> linearReg
params:
intercept_lr=0.1
algorithm <HoeffdingAdaptiveTreeRegressor> hoeffTree
params:
grace_period=50,
model_selector_decay=0.3,
seed=0
algorithm <KNNRegressor> knnReg
composer <SelectType> select
params:
(int , float)
composer <SelectType> selectstr
params:
str
preprocessor <OneHotEncoder> encoder
preprocessor <StandardScaler> scaler
metric <MAE> mae
metric <MSE> mse
metric <R2> r2
data World_Happiness_Report {
input_topic = "World_Happiness_Report"
features:
drop_features = overall_rank
target_feature = score
preprocessors = select | scaler + selectstr | encoder
}
pipeline linearRegPipeline {
output_topic = 'linearRegPipeline'
data = World_Happiness_Report
algorithm = linearReg
metrics = mae , mse , r2
}
pipeline hoeffTreePipeline {
output_topic = 'hoeffTreePipeline'
data = World_Happiness_Report
algorithm = hoeffTree
metrics = mae , mse , r2
}
pipeline knnRegPipeline {
output_topic = 'knnRegPipeline'
data = World_Happiness_Report
algorithm = knnReg
metrics = mae , mse , r2
}
A complete figure of Beaver Model can be seen below