Hyperparameters for AI models are the levers that can be adjusted to affect training time, performance and accuracy to create better models. But testing the performance of different lever combinations, a process known as hyperparameter optimization, comes at a cost in both compute and human labor.
Consider hyperparameters the building blocks of AI models. Unlike model parameters, which are learned from the data during training, hyperparameters are set beforehand and govern how the model learns, such as how fast or slow a model should move toward the optimal values. Capacity (the number of parameters) is determined by the model structure, while flexibility (the hypothesis set) is determined by the machine learning algorithm. Hyperparameters can take the form of continuous values, like the learning rate in neural network training; discrete values, like the number of layers in the neural network; or categorical choices, like the activation function.
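As a minimal illustration (the names and values below are invented for this example, not a specific library's API), all three kinds can appear in a single training configuration:

```python
# Illustrative hyperparameter configuration for training a small neural
# network. These values are fixed before training begins.
hyperparameters = {
    "learning_rate": 0.001,  # continuous: size of each update step
    "num_layers": 3,         # discrete: depth of the network
    "activation": "relu",    # categorical: which activation function to use
}

# By contrast, model parameters (the weights) are learned from the data
# during training rather than set by the developer.
```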
The importance and potential impact of adjusting hyperparameters depends on industry and use case, but experts said tuning during the design process — rather than retrofitting parameters — enables easier control.
Hyperparameter tuning should be part of the design process
Hyperparameter tuning should be integrated in the research pipeline along with feature engineering, preprocessing and feature transformation. Hyperparameters need to be tuned alongside a variety of project-specific factors. For example, developers may find that a model that works well for predicting customer behavior in New York performs better when the hyperparameters are retuned before training a new model for customers in California.
Enterprises must also sort out when it makes sense to readjust hyperparameters as part of ongoing maintenance. AI model accuracy and quality tend to decrease over time as data changes, according to Ryohei Fujimaki, CEO of data science platform dotData. Hyperparameter tuning, along with retraining, is then required to restore similar accuracy. But if this process relies on manual effort, it is hard to continuously retrain models.
“Make [hyperparameter tuning] part of your design process, not an afterthought,” said Vivek Vaid, CTO of FourKites, a supply chain software provider.
How Zillow approaches hyperparameters
Real estate firm Zillow has embedded hyperparameter optimization into its development process to tune a set of algorithms for various use cases.
“Allocating time for hyperparameter optimization during model development and preparation for deployment can therefore tune performance for a specific task and achieve a significant lift in metrics,” said Ondrej Linda, senior manager of applied science at Zillow.
Linda and his team found that while the same machine learning algorithm might be applied in different areas such as sort ranking, home similarity prediction or email click-through prediction, it is unlikely the same hyperparameters would yield the best performance across these use cases.
For example, a machine learning algorithm is more likely to overfit on training data with a lot of noise and will therefore require careful hyperparameter tuning. However, the same algorithm may work fine without hyperparameter tuning on cleaner sets of data with less noise. It can be challenging to make hyperparameter tuning a regular practice since it’s frequently deprioritized for more immediate needs, Linda said.
“Since it is difficult to predict how difficult a learning task is, performing hyperparameter tuning in all cases is a good habit,” Linda said.
He believes it is important to invest in ways to make hyperparameter tuning a natural part of the training pipeline. When hyperparameter tuning is treated as an afterthought, research into how much hyperparameters can affect the model stalls; retrofitting parameters yields model-specific fixes rather than genuine exploration. “The faster we can make the hyperparameter tuning process [part of the process], the larger the parameter space we will be able to explore,” Linda said.
Linda recommended finding ways to reduce the search for different combinations using tools like Bayesian search to prioritize more promising combinations rather than just exhaustively searching through combinations. He suggested testing different hyperparameters on a subset of the training data to see if things are moving in the right direction. If they are, data scientists can follow up with the full set of training data to identify the final configuration.
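A minimal sketch of that subset-first workflow, using random search and a stand-in scoring function in place of real model training (the objective and parameter ranges here are invented purely for illustration):

```python
import random

random.seed(0)

def evaluate(config, data):
    # Stand-in for training and validating a model. In practice this would
    # fit a model with `config` on `data` and return a validation score.
    lr, depth = config
    return -(lr - 0.1) ** 2 - 0.01 * (depth - 4) ** 2 + 0.001 * len(data)

full_data = list(range(10_000))
subset = full_data[:1_000]  # small, cheap screening set

# Sample candidate hyperparameter combinations at random.
candidates = [(random.uniform(0.001, 0.5), random.randint(1, 10))
              for _ in range(50)]

# Stage 1: rank every candidate on the subset to see which look promising.
screened = sorted(candidates, key=lambda c: evaluate(c, subset), reverse=True)

# Stage 2: re-evaluate only the top few on the full training data to pick
# the final configuration.
finalists = screened[:5]
best = max(finalists, key=lambda c: evaluate(c, full_data))
```

Bayesian optimization libraries such as Optuna or scikit-optimize can replace the random sampling in stage 1, prioritizing promising regions of the search space instead of sampling blindly.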
Balance complexity and training time
David Yakobovitch, principal data scientist at technology education company Galvanize, said many data science platforms are getting better at using AutoML techniques to evaluate new combinations of hyperparameters earlier in the training process. This saves companies from spending too much money on training models that won’t pan out.
Hyperparameters play a role in two key areas: model complexity and model training time.
Model complexity may relate to factors like the depth or structure of a neural network or the number of trees and the depth of each tree in a random forest. Simpler models won’t learn as much, while overly complex models may overfit the training data, causing the model to predict poorly on unseen data.
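That tradeoff is easy to see with a toy complexity hyperparameter. In the sketch below, polynomial degree stands in for model complexity (a one-dimensional analogue of network depth or tree depth); the data and degree choices are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.size)  # noisy signal

x_train, y_train = x[::2], y[::2]   # even points used for training
x_test, y_test = x[1::2], y[1::2]   # odd points held out for evaluation

def held_out_mse(degree):
    # `degree` is the complexity hyperparameter: higher means more flexible.
    coeffs = np.polyfit(x_train, y_train, degree)
    pred = np.polyval(coeffs, x_test)
    return float(np.mean((pred - y_test) ** 2))

underfit_mse = held_out_mse(1)   # too simple to capture the sine shape
balanced_mse = held_out_mse(5)   # flexible enough to fit the signal
overfit_mse = held_out_mse(15)   # flexible enough to chase the noise
```

The degree-1 model's held-out error is dominated by the structure it cannot represent; very high degrees risk fitting the noise between training points instead.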
“Machine learning scientists must use prudence on model selection based on type of application and [the] size of training data, since there is a tradeoff between model complexity and infrastructure cost,” FourKites’ Vaid said.
Applications in less computationally intensive areas such as income prediction or loan default prediction benefit from the use of simpler models like linear or tree-based models. Other applications such as computer vision or natural language processing (NLP) tend to work better with complex models like dense neural networks.
There are various tradeoffs in adjusting hyperparameters that affect model training time. If the learning rate configured through hyperparameters is set very low, the model takes too long to converge to an optimal solution, said Ramesh Hariharan, CTO of LatentView Analytics. If it is set too high, the model may not converge at all.
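The effect can be demonstrated with gradient descent on the simplest possible objective, f(x) = x², whose gradient is 2x (a toy stand-in for real model training):

```python
def minimize(lr, steps=50, x0=10.0):
    """Run gradient descent on f(x) = x**2 with learning rate `lr`."""
    x = x0
    for _ in range(steps):
        x -= lr * (2 * x)  # each step multiplies x by (1 - 2 * lr)
    return x

well_tuned = minimize(lr=0.1)   # converges quickly toward the minimum at 0
too_low = minimize(lr=0.001)    # heads the right way, but very slowly
too_high = minimize(lr=1.1)     # overshoots further each step and diverges
```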
Even in simple neural networks, the modeler needs to specify numerous hyperparameters — learning rate, number of hidden layers and units, activation functions, batch size, epochs, regularization and dropout — before training the model on the data.
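For a concrete sense of how many knobs that is, scikit-learn's MLPClassifier exposes most of the hyperparameters listed above as constructor arguments (it has no dropout setting; the values below are illustrative, not recommendations):

```python
from sklearn.neural_network import MLPClassifier

# Each argument is a hyperparameter that must be fixed before training.
clf = MLPClassifier(
    hidden_layer_sizes=(64, 32),  # number and width of hidden layers
    activation="relu",            # activation function
    learning_rate_init=0.001,     # initial learning rate
    batch_size=32,                # minibatch size
    max_iter=20,                  # maximum passes over the data (epochs)
    alpha=1e-4,                   # L2 regularization strength
)
```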
“The bigger challenge is that there are many hyperparameters, and the data scientist needs to find the right combination of settings that works best for the given situation,” Hariharan said.
How, when and why to tune hyperparameters
Some of the biggest considerations in hyperparameter tuning lie in figuring out how to do it, when it will be done and why it is done.
How will tuning be done? Will it be conducted manually — through a methodical process such as a grid search — or via automation techniques? What are the tradeoffs to each approach? You will also need to consider time constraints, compute and tooling costs, and required accuracy for your model.
When will hyperparameter tuning be done in the development cycle? One approach is to first optimize your feature set and model, and then tune your hyperparameters. You could also tinker with the hyperparameters in tandem with iterating your features and model.
Why is hyperparameter tuning being done? Developers need to understand what they are tuning for upfront, so they are solving appropriate problems for the business. The motivation for tuning should go back to business impact, said Arijit Sengupta, founder and CEO of AI software company Aible.
Hyperparameter tuning must be contextualized through business goals, because a model tuned for accuracy assumes all costs and benefits are equal. In business, the benefit of a correct prediction is almost never equal to the cost of a wrong prediction.
“When you optimize hyperparameters, make sure that you’re optimizing for business impact such as more revenue, lower cost, less waste and fraud, and fewer customers lost to churn, not model accuracy,” Sengupta said.
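A toy fraud-detection example makes the point. With asymmetric costs (the figures below are invented), the more accurate model can still be the more expensive one:

```python
# Invented, illustrative costs: missing a fraudulent transaction is far
# more expensive than flagging a legitimate one for review.
MISSED_FRAUD_COST = 1000
FALSE_ALARM_COST = 10

def business_cost(missed_fraud, false_alarms):
    return missed_fraud * MISSED_FRAUD_COST + false_alarms * FALSE_ALARM_COST

# Out of 1,000 transactions, model A makes 30 errors and model B makes 50,
# so model A is more "accurate" -- but its errors are the costly kind.
cost_a = business_cost(missed_fraud=25, false_alarms=5)   # 25,050
cost_b = business_cost(missed_fraud=5, false_alarms=45)   # 5,450
```

Tuned for accuracy alone, model A wins; tuned for business impact, model B is nearly five times cheaper.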
Common hyperparameter techniques
Rosaria Silipo, principal data scientist at KNIME, a data science software provider, said some of the most common optimization strategies include the following:
- Grid search is a basic brute-force strategy. If you do not know which values to tune, you try them all within a range using a fixed step.
- Random search selects the values of the hyperparameters randomly and is particularly efficient when some hyperparameters affect the final metric more than others.
- Hill climbing is a strategy where at each iteration, the next value is selected in the best direction of the hyperparameter space. If no neighbors improve the final metric, the optimization loop stops.
- Bayesian optimization selects the next hyperparameter value based on the previous iterations, like the hill climbing strategy. Unlike hill climbing, however, Bayesian optimization looks at past iterations globally and not only at the last one.
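The first two strategies can be sketched in a few lines of plain Python, with an invented scoring function standing in for a full train-and-validate run (hill climbing and Bayesian optimization need more machinery, typically a library such as Optuna or scikit-optimize):

```python
import itertools
import random

random.seed(42)

def score(lr, depth):
    # Stand-in for training and validating a model; in this toy objective
    # the best combination is lr=0.1, depth=4.
    return -(lr - 0.1) ** 2 - 0.05 * (depth - 4) ** 2

# Grid search: brute force over fixed steps in every dimension.
lrs = [0.01, 0.05, 0.1, 0.2]
depths = [2, 4, 6, 8]
grid_best = max(itertools.product(lrs, depths), key=lambda c: score(*c))

# Random search: draw the same number of configurations at random, which
# tends to cover influential hyperparameters more finely than a fixed grid.
samples = [(random.uniform(0.01, 0.2), random.choice(depths))
           for _ in range(len(lrs) * len(depths))]
random_best = max(samples, key=lambda c: score(*c))
```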