Beaver

A textual domain-specific language for machine learning applications on data streams

Posted by Jason Kakandris on August 13, 2025 · 16 mins read

Hello and welcome to another blog by Jason. It’s been quite some time since I last posted. I am currently working on my internship and I haven’t been able to keep up with posting. But today is a really important day, because I finally finished writing my thesis! And what better day to tell you about it: Beaver! So let’s dive into it.

What is Beaver

Beaver is a DSL (domain-specific language) designed for online machine learning on live data (data streams). It is a declarative language aimed at simplifying the production of Python code that encapsulates the entire infrastructure, from connecting to a database to displaying the results. It can be used by software engineers and data analysts, as well as people who are not coding experts.

Beaver tools

Beaver uses multiple tools to achieve this level of integration. Below I explain each tool used in my thesis, to give a better picture of how it all functions.

TextX

First and foremost is TextX. TextX is a meta-language (i.e. a language for language definition) for domain-specific language (DSL) specification in Python. Using this tool, I am able to define the grammar of my language and create the necessary entities that comprise it. In the case of Beaver we have multiple entities:

  • ModelNames: Defining the available algorithm classes that can be used
  • ModelModules: Defining the modules of a group of ModelNames
  • ModelGroups: Groups of Modules for better categorization
  • ModelTypes: Defining the structure of each ModelGroup inside a Beaver file (.bvr). Each model type has 3 parameters:
      - type: the name of the River algorithm class
      - name: the id of the object
      - params: a list of parameters of the class (optional)
  • Model: Parent class of ModelTypes. Provides an abstract class to be used by other entities
  • DataModel: Subclass of Model that defines the available ModelTypes that can be used for processing the data.
  • Multiple types of variables such as Lists, Tuples, Dicts, References to other models etc
  • Comments
  • Connector: The component used for connecting to a database (usually Kafka)
  • Features: The features of the data. They are separated into 4 main types:
      - keep_features: features that we want to keep
      - drop_features: features that we want to be deleted
      - generated_features: complex features generated from the features of the data
      - target_features: the y value of the data

  • assignments & expressions: for defining the generated_features
  • Data: Used for defining the data: from the table to the features we defined earlier and the preprocessors applied to the data
  • ProcList: A component defining complex preprocessing steps in series or in parallel
  • Pipeline: The main system defining all the steps: the data we want, the preprocessing steps, the algorithm used, the output topic and the metrics used for evaluating the algorithm.
  • BeaverModel: Defining the structure of a .bvr file

All of these are used to create a friendlier and easier experience for the end user.
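To make this more concrete, here is a hypothetical sketch of what a textX rule for an entity like ModelType might look like. The rule name matches Beaver’s entity, but the concrete syntax below is illustrative, not Beaver’s actual grammar:

```
// Hypothetical textX rule sketch — not Beaver's real grammar.
// 'name=ID' captures an identifier into the 'name' attribute;
// '(...)?' marks the params section as optional.
ModelType:
    'model' name=ID '{'
        'type' ':' type=ID
        ('params' ':' params=ParamList)?
    '}'
;
```

TextX then builds a metamodel from such rules and instantiates Python objects (with `name`, `type`, `params` attributes) directly from a parsed .bvr file.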

Jinja

Jinja is the translator in this project. It is used to generate the Python file from the Beaver file. To do this, I have defined all the necessary transformations, from the module imports and the class definitions to the pipeline creation and the graphs displayed in the dashboard. Everything has to be carefully assembled to produce a correct translation without any errors.
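As a toy illustration of the idea (not Beaver’s actual templates), a Jinja template can turn a parsed model spec into a line of River code. The template text and the `spec` dict below are invented for the example:

```python
from jinja2 import Template  # pip install jinja2

# Hypothetical template sketch: renders a model spec into Python source.
TEMPLATE = Template(
    "from river import linear_model\n"
    "\n"
    "model = linear_model.{{ spec['type'] }}("
    "{% for k, v in spec['params'].items() %}"
    "{{ k }}={{ v }}{% if not loop.last %}, {% endif %}"
    "{% endfor %})\n"
)

spec = {"type": "LinearRegression", "params": {"l2": 0.01}}
code = TEMPLATE.render(spec=spec)
print(code)  # → model = linear_model.LinearRegression(l2=0.01)
```

Beaver’s real templates do the same kind of thing at a much larger scale: imports, class definitions, the pipeline and the dashboard are all rendered from the parsed .bvr model.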

Kafka

Kafka is a distributed event streaming platform. It’s my data distributor: the producers of data (APIs, sensors etc.) send data to Kafka, and the consumers (ML algorithms) consume that data for learning. It is the main infrastructure point of my thesis, as it stores all the data consumed and produced. It should be said that Kafka can’t be used as a database: it stores data for a limited amount of time (which can be configured) and doesn’t have the capabilities of a database. In my thesis it is used as storage, since I haven’t integrated a database into my stream, but one could be added in later updates. In Beaver, the Confluent Kafka library is used, as it is easier to use, better documented and, most importantly, a native Python client, meaning there is no need for awkward Java wrappers that may or may not work.

For the Kafka infrastructure I use a Docker Compose file that contains 3 brokers (for storing data), 3 controllers (for coordinating where data is stored) and 1 Kafka UI. More info in the Docker section below.

Docker

I have talked about Docker in previous blog posts, so I won’t go into much detail about what it is and what it does. But I will tell you how I use it in Beaver. Docker is how I manage and build my Kafka infrastructure. For the 3 controllers, 3 brokers and the Kafka UI to communicate effectively, the best solution was to create a Compose file defining every image and adding them all to the same network, so that the containers can reach each other by name. This way there are no communication issues between them. Furthermore, you can expose outside connections to the containers. For example, I can allow connections from outside my network when I am not at home, and add a SASL configuration if necessary to verify the user with a username and password. Because I wanted to make my Kafka cluster as simple as possible, I haven’t added any security configurations. The cluster should be considered a development cluster only, not production ready. In contrast, the connection class of Beaver has the ability to connect via SASL, so it is ready to be used in production.
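To show the shape of the setup, here is a heavily trimmed-down Compose sketch with just one broker, one controller and the UI on a shared network. Service names and environment variables are illustrative, and a real KRaft deployment needs several more variables (node ids, quorum voters, listeners) that are omitted here:

```yaml
# Trimmed-down sketch — my real file has 3 brokers and 3 controllers.
services:
  controller-1:
    image: apache/kafka:latest
    environment:
      KAFKA_PROCESS_ROLES: controller   # plus node id, quorum voters, ...
    networks: [kafka-net]
  broker-1:
    image: apache/kafka:latest
    environment:
      KAFKA_PROCESS_ROLES: broker       # plus listeners, advertised listeners, ...
    networks: [kafka-net]
  kafka-ui:
    image: provectuslabs/kafka-ui:latest
    networks: [kafka-net]
networks:
  kafka-net:                            # shared network: containers reach each other by name
```

Because every service sits on `kafka-net`, the broker can address the controller as `controller-1` and the UI can address the broker as `broker-1`, with no extra wiring.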

Quixstreams

Quixstreams is a Python-native library for processing data on Kafka. It is the “pipes” of my pipeline: the way of connecting Kafka with consumers (and producers) and preprocessing the data arriving at the models. It is a really versatile tool, with abilities such as windowing, filtering and projecting. The community and the developers are really helpful (they will answer any of your questions on their Slack team) and I really enjoy using it, since it was the only native Python library I found for my use case. Apache Flink (which was my first choice) is written in Java and uses wrappers for its Python version, and I didn’t manage to make it work. If you need a tool for a use case like mine, I wholeheartedly recommend Quixstreams.
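To give a feel for what filtering, projecting and windowing mean on a stream, here is a stdlib-only toy (this is not the Quixstreams API, just the concepts it provides, applied to an in-memory list standing in for a Kafka topic):

```python
from collections import defaultdict

def process(stream, window_size=3):
    """Toy stream processor: filter bad records, project the fields we
    care about, and average values in fixed-size tumbling windows per key."""
    windows = defaultdict(list)
    results = []
    for record in stream:
        if record.get("value") is None:              # filtering: drop bad records
            continue
        key, value = record["key"], record["value"]  # projection: keep two fields
        windows[key].append(value)
        if len(windows[key]) == window_size:         # tumbling window closes
            results.append((key, sum(windows[key]) / window_size))
            windows[key].clear()
    return results

stream = [
    {"key": "sensor-1", "value": 10, "ts": 0},
    {"key": "sensor-1", "value": None, "ts": 1},     # dropped by the filter
    {"key": "sensor-1", "value": 20, "ts": 2},
    {"key": "sensor-1", "value": 30, "ts": 3},
]
print(process(stream))  # → [('sensor-1', 20.0)]
```

In the real pipeline, Quixstreams does this continuously over Kafka topics instead of a finite list, and hands the preprocessed records to the model.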

River

Now we have come to one of the most important parts of Beaver. River is a Python library for online machine learning algorithms. It is the most complete and popular tool for online learning, and it is highly versatile. With multiple classes of algorithms, ranging from classification and regression to anomaly and drift detection, it is the tool used for learning and predicting in Beaver. As of now, it is not fully supported in Beaver: some families like anomaly detection are not well displayed in Beaver’s dashboard, and others like neural networks are prone to bugs, since Beaver does not use epochs to train them. I plan to support most, if not all, of the algorithms in future updates. I think River is an excellent tool for machine learning on live data and I highly recommend it. I hope it gets the support it needs in further updates, since it is one of the only libraries in this sector of machine learning.
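The core idea of online learning is that the model sees one example at a time, predicts, and updates immediately. Here is a stdlib-only toy that mimics the learn_one/predict_one style of API that River popularized (this is my own sketch, not River’s implementation):

```python
class OnlineLinearRegression:
    """Toy online model in the learn_one/predict_one style (not River itself).
    Features are dicts, and weights are updated one example at a time via SGD."""

    def __init__(self, lr=0.05):
        self.lr = lr
        self.weights = {}
        self.bias = 0.0

    def predict_one(self, x):
        return sum(self.weights.get(k, 0.0) * v for k, v in x.items()) + self.bias

    def learn_one(self, x, y):
        err = self.predict_one(x) - y            # prediction error on this example
        for k, v in x.items():                   # single SGD step per record
            self.weights[k] = self.weights.get(k, 0.0) - self.lr * err * v
        self.bias -= self.lr * err
        return self

# Stream of examples following y = 3*t + 1; the model learns as data arrives.
model = OnlineLinearRegression()
for step in range(2000):
    x = {"t": (step % 10) / 10}
    model.learn_one(x, 3 * x["t"] + 1)

print(round(model.predict_one({"t": 0.5}), 2))   # close to the true value 2.5
```

This one-record-at-a-time loop is exactly why online learning fits data streams: no batch, no epochs, just a continuous predict-then-update cycle as records arrive from Kafka.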

Plotly & Dash

Plotly and Dash are the latest additions to the Beaver language. They are the tools for displaying metrics and other useful info for the user on a webpage. Beaver shows metric updates, the classes and the number of items in each class, the true and predicted y values on the same axes, and more. This is really important for displaying how algorithms learn in real time.
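What the dashboard plots is essentially a running metric, one point per consumed record. A minimal stdlib sketch of that kind of metric (not Beaver’s or River’s actual metric classes):

```python
class RunningAccuracy:
    """Running classification accuracy, updated one prediction at a time."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, y_true, y_pred):
        self.total += 1
        self.correct += int(y_true == y_pred)
        return self

    def get(self):
        return self.correct / self.total if self.total else 0.0

metric = RunningAccuracy()
history = []
for y_true, y_pred in [(1, 1), (0, 1), (1, 1), (0, 0)]:
    metric.update(y_true, y_pred)
    history.append(metric.get())   # one data point per record, as on the dashboard
print(history)  # → [1.0, 0.5, 0.6666666666666666, 0.75]
```

A Dash graph then simply redraws `history` as it grows, which is how you can watch the algorithm learn live.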

Working with Beaver

To learn how to use Beaver you can read the documentation page. The repo also has a README file for quick setup.

Writing a Beaver file

I tried to make writing a Beaver file as easy as possible. For a .bvr file to be valid, you need to structure it the way BeaverModel is defined, meaning you need:

  1. 1 connector
  2. 1+ models
  3. 1+ data
  4. 1+ pipelines
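Put together, a .bvr file has roughly the following shape. The concrete syntax below is a hypothetical sketch to show the four required sections; the real grammar, keywords and field names may differ:

```
# Hypothetical .bvr sketch — section names follow the entities above,
# but the exact syntax is illustrative, not Beaver's real grammar.
connector:
    broker: "localhost:9092"

model:
    name: my_model
    type: LinearRegression

data:
    topic: sensor_readings
    keep_features: [temperature, humidity]
    target_features: [failure]

pipeline:
    data: sensor_readings
    model: my_model
    metrics: [MAE]
```

The pipeline ties everything together: it references a data definition and a model by name, and lists the metrics to evaluate.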

Beaver CLI

For a better user experience, I have developed a Beaver CLI. Using this program you can validate and analyze your .bvr file and generate your .py file.

You can validate your model using:

python beaver_cli.py validate --input examples/linear.bvr --verbose

If you want to analyze it, meaning check it for any issues and find your .bvr file’s complexity score, you can use:

python beaver_cli.py analyze --input examples/linear.bvr

To generate your code (this also validates it), you can run:

# Basic generation with validation
python beaver_cli.py generate --input examples/linear.bvr --output my_pipeline.py

For help and documentation:

python beaver_cli.py help

# Or for specific commands
python beaver_cli.py generate --help

Issues and Improvements

Of course every project has its downsides. For now, Beaver, as mentioned, does not fully support River, so you might encounter issues while using it. I highly recommend opening an issue to let me know about the bug so I can fix it in a later update. Furthermore, the dashboard does not completely support all the algorithm types, meaning sometimes you might not get all the useful info you need from the dashboard. Again, I recommend opening an issue on GitHub to let me know which feature you want added in later updates. Finally, one major downside is that, as of right now, I only support Python. This was my intention from the beginning, since I don’t have the capacity to write a Beaver translation for another language, but I am open to suggestions and to adding more contributors to Beaver.

Final thoughts

As my thesis comes to an end, I would like to share a few of the things that I learnt throughout this journey.

First of all, in order to learn something you need to combine theoretical understanding and practical knowledge. I know this may sound obvious, but I think a lot of people miss this balance. Reading books, documentation, papers etc. is good for gaining an understanding of the formulas and theorems used in an ML algorithm or pipeline, but you also need to implement it, or at least see it in action, to fully grasp how it works. On the other hand, making small projects and developing a prototype is a really nice way of learning the configuration of a class, pipeline or model, but you also need at least a basic understanding of the background theory that supports your project. I am not saying you need to become an expert, but a basic understanding of both worlds is necessary.

In addition, long-term projects are better than short-term experiments. I am not saying that the latter are not useful, but you need to have an end goal, something that you want to achieve, in order to be ambitious and have the drive to learn more about a subject. You can start with small projects, but from my experience, I have learnt more either by having a long-term goal or by wanting to solve an everyday issue of mine. Having to run a project for 11 straight months will improve your planning, organization and separation of concerns. You learn how to divide a big problem into smaller ones, how to combat issues, how to learn new things and apply them in your project and, most importantly, how to ask for help and communicate your problems. I think the last one is usually overlooked; from my experience, people have lost the ability to communicate and understand other people’s emotions, which makes us more egotistical, so we neither ask for help nor help other people. I believe that having a really long project, about which you know little to nothing at the beginning, humbles you and makes you value communication more.

Last but not least, to make something of quality that consumes a good amount of your time, you need to really like what you do. I don’t mean it should be the only thing in your life, that is unhealthy and you will burn out, but you need to be curious, to want to learn more about the subject, to value feedback and to want to improve your project. A thesis is like your baby. You develop it step by step, watch it grow and get bigger, fail and succeed, until it finally matures enough that you can be proud of it.

Beaver is my baby. I really enjoyed developing it and I hope I still have time in the future to keep updating it. I am glad I found something I really enjoyed doing and I hope it becomes a valuable tool for other people too.

Thank you for your time ❤️