Heart Failure

This example uses the Heart Failure dataset, which contains 11 clinical features for predicting heart disease events. We will use classification methods to predict whether or not a patient has an underlying heart disease (binary classification). We will use 3 models:

KNNClassifier
AMFClassifier and
LogisticRegression

We will also preprocess the data. In particular we will use

StandardScaler for numeric (int and float) features and
OneHotEncoder for string (str) features
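
To make these two preprocessing steps concrete, here is a minimal standard-library sketch of what they do to a single record. It is not River's or Beaver's implementation: the RunningScaler class, the one_hot function, and the Age and Sex feature names are made up for illustration.

```python
import math


class RunningScaler:
    """Standardises numeric features with a running mean and variance (Welford's method)."""

    def __init__(self):
        self.n = {}
        self.mean = {}
        self.m2 = {}  # running sum of squared deviations from the mean

    def learn_one(self, x):
        for key, value in x.items():
            n = self.n.get(key, 0) + 1
            mean = self.mean.get(key, 0.0)
            delta = value - mean
            mean += delta / n
            self.n[key] = n
            self.mean[key] = mean
            self.m2[key] = self.m2.get(key, 0.0) + delta * (value - mean)

    def transform_one(self, x):
        out = {}
        for key, value in x.items():
            n = self.n.get(key, 0)
            var = self.m2.get(key, 0.0) / n if n else 0.0
            # A feature with no spread yet is mapped to 0.0 to avoid dividing by zero.
            out[key] = (value - self.mean.get(key, 0.0)) / math.sqrt(var) if var > 0 else 0.0
        return out


def one_hot(x):
    """Turn each string feature into a `name_value: 1` indicator."""
    return {f"{key}_{value}": 1 for key, value in x.items()}


scaler = RunningScaler()
for record in ({"Age": 40}, {"Age": 60}):
    scaler.learn_one(record)

scaled = scaler.transform_one({"Age": 60})  # one standard deviation above the mean
encoded = one_hot({"Sex": "M"})
```

The scaler updates its statistics one record at a time, which is why it suits streaming data: no pass over the full dataset is needed.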

Finally, to evaluate the models we will use 3 metrics:

Accuracy
Recall
ROCAUC
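
For readers unfamiliar with these metrics, the following standard-library sketch shows how each one could be computed. The class and function names are ours and the labels and scores are invented; the real metric classes may differ in detail (for example, ROC AUC is written here as a simple batch computation over all positive/negative pairs).

```python
class Accuracy:
    """Fraction of predictions that match the true label."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, y_true, y_pred):
        self.correct += int(y_true == y_pred)
        self.total += 1

    def get(self):
        return self.correct / self.total if self.total else 0.0


class Recall:
    """Fraction of true positives that the model actually found."""

    def __init__(self):
        self.tp = 0
        self.fn = 0

    def update(self, y_true, y_pred):
        if y_true:
            if y_pred:
                self.tp += 1
            else:
                self.fn += 1

    def get(self):
        seen = self.tp + self.fn
        return self.tp / seen if seen else 0.0


def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y]
    neg = [s for y, s in zip(y_true, scores) if not y]
    if not pos or not neg:
        return 0.0  # undefined without both classes; 0.0 is an arbitrary choice here
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


accuracy, recall = Accuracy(), Recall()
for y_true, y_pred in zip([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]):
    accuracy.update(y_true, y_pred)
    recall.update(y_true, y_pred)

auc = roc_auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6])
```

Note that accuracy and recall consume predicted labels, while ROC AUC needs a score or probability for the positive class.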

Let's see how we would write the Beaver file:

Connector

We start with the connector. Two variables are required: bootstrap_servers (the server to connect to) and security_protocol (plaintext or ssl). All other variables are optional.

connector {
        bootstrap_servers = "localhost:39092"
        security_protocol = "plaintext"
        consumer_group = 'heart-failure'
        auto_offset_reset = "earliest"
}
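
The connector's field names resemble standard Kafka consumer settings. Assuming that is what they correspond to (this mapping is our assumption, not something stated by Beaver), the sketch below translates a connector block into dotted-key consumer settings and checks that the two required variables are present. The to_consumer_config helper is hypothetical.

```python
def to_consumer_config(connector):
    """Map connector fields to dotted Kafka-style keys (assumed mapping)."""
    key_map = {
        "bootstrap_servers": "bootstrap.servers",
        "security_protocol": "security.protocol",
        "consumer_group": "group.id",
        "auto_offset_reset": "auto.offset.reset",
    }
    required = ("bootstrap_servers", "security_protocol")
    missing = [name for name in required if name not in connector]
    if missing:
        raise ValueError(f"missing required connector variables: {missing}")
    return {key_map[name]: value for name, value in connector.items() if name in key_map}


config = to_consumer_config({
    "bootstrap_servers": "localhost:39092",
    "security_protocol": "plaintext",
    "consumer_group": "heart-failure",
    "auto_offset_reset": "earliest",
})
```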

Models

The models can be defined in whatever order you want. The only prerequisite is that if a model is used as a parameter on another model, you need to define this model first.

So in our case we have 3 algorithms (KNNClassifier, AMFClassifier, LogisticRegression), 1 optimizer (SGD), and 2 composers of type SelectType, one for picking the numeric features and one for picking the str features. Note: if you select values of a certain type to process them but also want to keep the other types, you need to explicitly select the others as well and pass them through; otherwise River will throw an error. If you don't want the other types, it's better to keep only the features of the type you want; otherwise River expects a preprocessor for those features too.
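
The effect of selecting by type can be seen in a few lines of plain Python. This is only a sketch of which features each selection keeps; it does not reproduce River's error behaviour, and the select_type function and the example record are invented for illustration.

```python
def select_type(x, *types):
    """Keep only the features whose value is an instance of one of the given types."""
    return {key: value for key, value in x.items() if isinstance(value, types)}


record = {"Age": 40, "Oldpeak": 1.5, "Sex": "M"}

# Selecting one group alone drops every feature of another type.
numbers_only = select_type(record, int, float)
strings_only = select_type(record, str)

# Keeping both groups means selecting each one explicitly and merging the results.
both = {**numbers_only, **strings_only}
```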

Then we have our 2 preprocessors (OneHotEncoder, StandardScaler) and finally our 3 metrics (Accuracy, Recall, ROCAUC). The models can be defined as follows:

algorithm <KNNClassifier> knn


algorithm <AMFClassifier> amf
    params:
        n_estimators=10,
        use_aggregation=True,
        dirichlet=0.5,
        seed=1

optimizer <SGD> sgd
    params:
        lr = 0.1

algorithm <LogisticRegression> logistic
    params:
        optimizer = sgd

composer <SelectType> select
    params:
        (int , float)

composer <SelectType> selectstr
    params:
        str


metric <Accuracy> accuracy

metric <Recall> recall

metric <ROCAUC> roc

preprocessor <OneHotEncoder> encoder
preprocessor <StandardScaler> scaler

As you can see, the order is free apart from the optimizer, which is defined before the logistic regression model that uses it.
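
The same define-before-use rule shows up if we sketch the optimizer and the logistic regression model in plain Python: the optimizer object has to exist before it can be passed as a parameter. The classes below are a simplified standard-library illustration (no intercept, no regularisation), not River's implementation.

```python
import math


class SGD:
    """Plain stochastic gradient descent with a fixed learning rate."""

    def __init__(self, lr):
        self.lr = lr

    def step(self, weights, gradient):
        for name, g in gradient.items():
            weights[name] = weights.get(name, 0.0) - self.lr * g


class LogisticRegression:
    """Online logistic regression that delegates weight updates to an optimizer."""

    def __init__(self, optimizer):
        self.optimizer = optimizer
        self.weights = {}

    def predict_proba_one(self, x):
        z = sum(self.weights.get(name, 0.0) * value for name, value in x.items())
        z = max(-30.0, min(30.0, z))  # clamp to keep math.exp from overflowing
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        # Gradient of the log loss with respect to each weight: (p - y) * x_i
        error = self.predict_proba_one(x) - y
        gradient = {name: error * value for name, value in x.items()}
        self.optimizer.step(self.weights, gradient)


sgd = SGD(lr=0.1)                             # defined first...
logistic = LogisticRegression(optimizer=sgd)  # ...then used as a parameter

before = logistic.predict_proba_one({"x": 1.0})
logistic.learn_one({"x": 1.0}, 1)
after = logistic.predict_proba_one({"x": 1.0})
```

After one positive example the predicted probability for the same input moves above its starting value of 0.5, by an amount controlled by lr.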

After that we need to define the data that we will use. We define the input source, the features that we want (in our case we define only our target feature), and the preprocessors that are going to be used. To chain preprocessors into a pipeline we use the | symbol, while to indicate that they are applied to different data we use the + symbol.

data Heart_Failure_Prediction {

    input_topic = "Heart_Failure_Prediction"
    features:
        target_feature = HeartDisease
    preprocessors = select | scaler + selectstr | encoder

}
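
The | and + operators in the preprocessors line can be mimicked in plain Python with operator overloading. The Step class below is a hypothetical illustration, and the scaler and encoder are trivial stand-ins. One detail is worth noting: in Python, + binds more tightly than |, so the two chains need explicit parentheses here, whereas the Beaver line is written without them and presumably groups each | chain before combining them with +.

```python
class Step:
    """Wraps a dict -> dict function so steps can be combined with | and +."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, x):
        return self.fn(x)

    def __or__(self, other):
        # a | b: the output of a becomes the input of b
        return Step(lambda x: other(self(x)))

    def __add__(self, other):
        # a + b: both steps see the same input and their outputs are merged
        return Step(lambda x: {**self(x), **other(x)})


select = Step(lambda x: {k: v for k, v in x.items() if isinstance(v, (int, float))})
selectstr = Step(lambda x: {k: v for k, v in x.items() if isinstance(v, str)})
scaler = Step(lambda x: {k: v / 10 for k, v in x.items()})       # stand-in for real scaling
encoder = Step(lambda x: {f"{k}_{v}": 1 for k, v in x.items()})  # stand-in for one-hot encoding

preprocessors = (select | scaler) + (selectstr | encoder)
result = preprocessors({"Age": 40, "Sex": "M"})
```

Each branch only ever sees the features its selector lets through, so the numeric stand-in never receives a string and the encoder never receives a number.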

Finally we define our pipelines. For each pipeline we need to define:

The output source (optional)
The data we will use (in our case Heart_Failure_Prediction)
The algorithm of the pipeline
The metrics that we will use for evaluating the model

Our 3 pipelines are:

pipeline KNNClassifierPipeline {
    output_topic = 'KNNClassifierPipeline'
    data = Heart_Failure_Prediction
    algorithm = knn
    metrics = accuracy , recall , roc
}

pipeline amfClassifierPipeline {
    output_topic = 'amfClassifierPipeline'
    data = Heart_Failure_Prediction
    algorithm = amf
    metrics = accuracy , recall , roc
}

pipeline logisticPipeline {
    output_topic = 'logisticPipeline'
    data = Heart_Failure_Prediction
    algorithm = logistic
    metrics = accuracy , recall , roc
}
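
How a pipeline scores its algorithm is not spelled out here. A common approach for streaming models, and the one we assume in this sketch, is to predict each incoming record first, update the metric, and only then learn from it. The MajorityClass model and the run_pipeline function are hypothetical stand-ins written with the standard library only.

```python
from collections import Counter


class MajorityClass:
    """Toy classifier that always predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict_one(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn_one(self, x, y):
        self.counts[y] += 1


def run_pipeline(stream, model):
    """Predict-then-learn loop that returns the accuracy over scored records."""
    correct = total = 0
    for x, y in stream:
        y_pred = model.predict_one(x)
        if y_pred is not None:  # nothing to score before the first example
            correct += int(y_pred == y)
            total += 1
        model.learn_one(x, y)
    return correct / total if total else 0.0


stream = [({}, 1), ({}, 1), ({}, 0), ({}, 1)]
score = run_pipeline(stream, MajorityClass())
```

Because every record is scored before the model has seen its label, the metric reflects performance on unseen data without a separate test set.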

And that's it! The final Beaver file is:

connector {
        bootstrap_servers = "localhost:39092"
        security_protocol = "plaintext"
        consumer_group = 'heart-failure'
        auto_offset_reset = "earliest"
}

algorithm <KNNClassifier> knn


algorithm <AMFClassifier> amf
    params:
        n_estimators=10,
        use_aggregation=True,
        dirichlet=0.5,
        seed=1

optimizer <SGD> sgd
    params:
        lr = 0.1

algorithm <LogisticRegression> logistic
    params:
        optimizer = sgd

composer <SelectType> select
    params:
        (int , float)

composer <SelectType> selectstr
    params:
        str


metric <Accuracy> accuracy

metric <Recall> recall

metric <ROCAUC> roc

preprocessor <OneHotEncoder> encoder
preprocessor <StandardScaler> scaler

data Heart_Failure_Prediction {

    input_topic = "Heart_Failure_Prediction"
    features:
        target_feature = HeartDisease
    preprocessors = select | scaler + selectstr | encoder

}


pipeline KNNClassifierPipeline {
    output_topic = 'KNNClassifierPipeline'
    data = Heart_Failure_Prediction
    algorithm = knn
    metrics = accuracy , recall , roc
}

pipeline amfClassifierPipeline {
    output_topic = 'amfClassifierPipeline'
    data = Heart_Failure_Prediction
    algorithm = amf
    metrics = accuracy , recall , roc
}

pipeline logisticPipeline {
    output_topic = 'logisticPipeline'
    data = Heart_Failure_Prediction
    algorithm = logistic
    metrics = accuracy , recall , roc
}

The Beaver model figure can be seen below:

[Figure: beaver-model]
