Although AutoML rose to popularity a few years ago, the ealy work on AutoML dates back to the early 90’s when scientists published the first papers on hyperparameter optimization. It was in 2014 when ICML organized the first AutoML workshop that AutoML gained the attention of ML developers. One of the major focuses over the years of AutoML is the hyperparameter search problem, where the model implements an array of optimization methods to determine the best performing hyperparameters in a large hyperparameter space for a particular machine learning model. Another method commonly implemented by AutoML models is to estimate the probability of a particular hyperparameter being the optimal hyperparameter for a given machine learning model. The model achieves this by implementing Bayesian methods that traditionally use historical data from previously estimated models, and other datasets. In addition to hyperparameter optimization, other methods try to select the best models from a space of modeling alternatives.
In this article, we will cover LightAutoML, an AutoML system developed primarily for a European company operating in the finance sector along with its ecosystem. The LightAutoML framework is deployed across various applications, and the results demonstrated superior performance, comparable to the level of data scientists, even while building high-quality machine learning models. The LightAutoML framework attempts to make the following contributions. First, the LightAutoML framework was developed primarily for the ecosystem of a large European financial and banking institution. Owing to its framework and architecture, the LightAutoML framework is able to outperform state of the art AutoML frameworks across several open benchmarks as well as ecosystem applications. The performance of the LightAutoML framework is also compared against models that are tuned manually by data scientists, and the results indicated stronger performance by the LightAutoML framework.
This article aims to cover the LightAutoML framework in depth, and we explore the mechanism, the methodology, the architecture of the framework along with its comparison with state of the art frameworks. So let’s get started.
Although researchers first started working on AutoML in the mid and early 90’s, AutoML attracted a major chunk of the attention over the last few years, with some of the prominent industrial solutions implementing automatically build Machine Learning models are Amazon’s AutoGluon, DarwinAI, H20.ai, IBM Watson AI, Microsoft AzureML, and a lot more. A majority of these frameworks implement a general purpose AutoML solution that develops ML-based models automatically across different classes of applications across financial services, healthcare, education, and more. The key assumption behind this horizontal generic approach is that the process of developing automatic models remains identical across all applications. However, the LightAutoML framework implements a vertical approach to develop an AutoML solution that is not generic, but rather caters to the needs of individual applications, in this case a large financial institution. The LightAutoML framework is a vertical AutoML solution that focuses on the requirements of the complex ecosystem along with its characteristics. First, the LightAutoML framework provides fast and near optimal hyperparameter search. Although the model does not optimize these hyperparameters directly, it does manage to deliver satisfactory results. Furthermore, the model keeps the balance between speed and hyperparameter optimization dynamic, to ensure the model is optimal on small problems, and fast enough on larger ones. Second, the LightAutoML framework limits the range of machine learning models purposefully to only two types: linear models, and GBMs or gradient boosted decision trees, instead of implementing large ensembles of different algorithms. The primary reason behind limiting the range of machine learning models is to speed up the execution time of the LightAutoML framework without affecting the performance negatively for the given type of problem and data. Third, the LightAutoML framework presents a unique method of choosing preprocessing schemes for different features used in the models on the basis of certain selection rules and meta-statistics. The LightAutoML framework is evaluated on a wide range of open data sources across a wide range of applications.
LightAutoML : Methodology and Architecture
The LightAutoML framework consists of modules known as Presets that are dedicated for end to end model development for typical machine learning tasks. At present, the LightAutoML framework supports Preset modules. First, the TabularAutoML Preset focuses on solving classical machine learning problems defined on tabular datasets. Second, the White-Box Preset implements simple interpretable algorithms such as Logistic Regression instead of WoE or Weight of Evidence encoding and discretized features to solve binary classification tasks on tabular data. Implementing simple interpretable algorithms is a common practice to model the probability of an application owing to the interpretability constraints posed by different factors. Third, the NLP Preset is capable of combining tabular data with NLP or Natural Language Processing tools including pre-trained deep learning models and specific feature extractors. Finally, the CV Preset works with image data with the help of some basic tools. It is important to note that although the LightAutoML model supports all four Presets, the framework only uses the TabularAutoML in the production-level system.
The typical pipeline of the LightAutoML framework is included in the following image.
Each pipeline contains three components. First, Reader, an object that receives task type and raw data as input, performs crucial metadata calculations, cleans the initial data, and figures out the data manipulations to be performed before fitting different models. Next, the LightAutoML inner datasets contain CV iterators and metadata that implement validation schemes for the datasets. The third component are the multiple machine learning pipelines stacked and/or blended to get a single prediction. A machine learning pipeline within the architecture of the LightAutoML framework is one of multiple machine learning models that share a single data validation and preprocessing scheme. The preprocessing step may have up to two feature selection steps, a feature engineering step or may be empty if no preprocessing is needed. The ML pipelines can be computed independently on the same datasets and then blended together using averaging (or weighted averaging). Alternatively, a stacking ensemble scheme can be used to build multi level ensemble architectures.
LightAutoML Tabular Preset
Within the LightAutoML framework, TabularAutoML is the default pipeline, and it is implemented in the model to solve three types of tasks on tabular data: binary classification, regression, and multi-class classification for a wide array of performance metrics and loss functions. A table with the following four columns: categorical features, numerical features, timestamps, and a single target column with class labels or continuous value is feeded to the TabularAutoML component as input. One of the primary objectives behind the design of the LightAutoML framework was to design a tool for fast hypothesis testing, a major reason why the framework avoids using brute-force methods for pipeline optimization, and focuses only on efficiency techniques and models that work across a wide range of datasets.
Auto-Typing and Data Preprocessing
To handle different types of features in different ways, the model needs to know each feature type. In the situation where there is a single task with a small dataset, the user can manually specify each feature type. However, specifying each feature type manually is no longer a viable option in situations that include hundreds of tasks with datasets containing thousands of features. For the TabularAutoML Preset, the LightAutoML framework needs to map features into three classes: numeric, category, and datetime. One simple and obvious solution is to use column array data types as actual feature types, that is, to map float/int columns to numeric features, timestamp or string, that could be parsed as a timestamp — to datetime, and others to category. However, this mapping is not the best because of the frequent occurrence of numeric data types in category columns.
Validation Schemes
Validation schemes are a vital component of AutoML frameworks since data in the industry is subject to change over time, and this element of change makes IID or Independent Identically Distributed assumptions irrelevant when developing the model. AutoML models employ validation schemes to estimate their performance, search for hyperparameters, and out-of-fold prediction generation. The TabularAutoML pipeline implements three validation schemes:
- KFold Cross Validation: KFold Cross Validation is the default validation scheme for the TabularAutoML pipeline including GroupKFold for behavioral models, and stratified KFold for classification tasks.
- Holdout Validation : The Holdout validation scheme is implemented if the holdout set is specified.
- Custom Validation Schemes: Custom validation schemes can be created by users depending on their individual requirements. Custom Validation Schemes include cross-validation, and time-series split schemes.
Feature Selection
Although feature selection is a crucial aspect of developing models as per industry standards since it facilitates reduction in inference and model implementation costs, a majority of AutoML solutions do not focus much on this problem. On the contrary, the TabularAutoML pipeline implements three feature selection strategies: No selection, Importance cut off selection, and Importance-based forward selection. Out of the three, Importance cut off selection feature selection strategy is default. Furthermore, there are two primary ways to estimate feature importance: split-based tree importance, and permutation importance of GBM model or gradient boosted decision trees. The primary aim of importance cutoff selection is to reject features that are not helpful to the model, allowing the model to reduce the number of features without impacting the performance negatively, an approach that might speed up model inference and training.
The above image compares different selection strategies on binary bank datasets.
Hyperparameter Tuning
The TabularAutoML pipeline implements different approaches to tune hyperparameters on the basis of what is tuned.
- Early Stopping Hyperparameter Tuning selects the number of iterations for all models during the training phase.
- Expert System Hyperparameter Tuning is a simple way to set hyperparameters for models in a satisfactory fashion. It prevents the final model from a high decrease in score compared to hard-tuned models.
- Tree Structured Parzen Estimation or TPE for GBM or gradient boosted decision tree models. TPE is a mixed tuning strategy that is the default choice in the LightAutoML pipeline. For each GMB framework, the LightAutoML framework trains two models: the first gets expert hyperparameters, the second is fine-tuned to fit into the time budget.
- Grid Search Hyperparameter Tuning is implemented in the TabularAutoML pipeline to fine-tune the regularization parameters of a linear model alongside early stopping, and warm start.
The model tunes all the parameters by maximizing the metric function, either defined by the user or is default for the solved task.
LightAutoML : Experiment and Performance
To evaluate the performance, the TabularAutoML Preset within the LightAutoML framework is compared against already existing open source solutions across various tasks, and cements the superior performance of the LightAutoML framework. First, the comparison is carried out on the OpenML benchmark that is evaluated on 35 binary and multiclass classification task datasets. The following table summarizes the comparison of the LightAutoML framework against existing AutoML systems.
As it can be seen, the LightAutoML framework outperforms all other AutoML systems on 20 datasets within the benchmark. The following table contains the detailed comparison in the dataset context indicating that the LightAutoML delivers different performance on different classes of tasks. For binary classification tasks, the LightAutoML falls short in performance, whereas for tasks with a high amount of data, the LightAutoML framework delivers superior performance.
The following table compares the performance of LightAutoML framework against AutoML systems on 15 bank datasets containing a set of various binary classification tasks. As it can be observed, the LightAutoML outperforms all AutoML solutions on 12 out of 15 datasets, a win percentage of 80.
Final Thoughts
In this article we have talked about LightAutoML, an AutoML system developed primarily for a European company operating in the finance sector along with its ecosystem. The LightAutoML framework is deployed across various applications, and the results demonstrated superior performance, comparable to the level of data scientists, even while building high-quality machine learning models. The LightAutoML framework attempts to make the following contributions. First, the LightAutoML framework was developed primarily for the ecosystem of a large European financial and banking institution. Owing to its framework and architecture, the LightAutoML framework is able to outperform state of the art AutoML frameworks across several open benchmarks as well as ecosystem applications. The performance of the LightAutoML framework is also compared against models that are tuned manually by data scientists, and the results indicated stronger performance by the LightAutoML framework.