Opportunistic View Materialization with Deep Reinforcement Learning

Materialized views, or pre-computed query results, can greatly speed up query answering in read-mostly database systems. In this project, we explore opportunistic materialization, where the database identifies important intermediate results during query execution and caches them for future use. Since real systems have storage constraints, the crucial problem is designing view selection and eviction strategies.

It is tricky to design such strategies by hand. Classical buffer management heuristics, like LFU or LRU, consider only usage rates, not the impact a view has on the queries in a specific workload. Other heuristics, like HAWC [1] and Recycler [2], can be sensitive to query optimizer cost estimates, heuristic design choices, or both. Rather than hand-coding a heuristic, can we let the database self-optimize its own materialized view cache?

Inspired by recent work that uses deep RL to optimize queries more accurately by learning from execution feedback [3], we apply a similar approach to opportunistic materialization. The system starts with a random selection strategy and observes which choices improve actual query runtimes. As it makes decisions, it builds a model that correlates those choices with features of the created views, the workload, and the system state. Decisions that helped in the past are remembered; those that hurt are forgotten.

The set of possible views is huge, and directly applying standard RL algorithms can take a very long time to converge because aggregate query runtimes are a weak performance signal. Our insight is that better view selection policies can be trained effectively with a novel asynchronous RL algorithm that runs paired counterfactual experiments during system idle time to evaluate the incremental value of persisting particular views: it runs a query with and without the view and uses the difference in performance as the feedback signal.
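As a toy illustration of this feedback loop (not DQM's actual implementation; the class name, the running-average scoring, and the capacity-based eviction rule are all assumptions for exposition), a cache could score each view by its observed counterfactual time savings and evict the lowest-scoring view when storage is full:

```python
from collections import defaultdict

class CounterfactualViewCache:
    """Toy sketch: score views by counterfactual time savings,
    evict the lowest-scoring view when the cache is full."""

    def __init__(self, capacity, execute):
        self.capacity = capacity          # max number of cached views
        self.execute = execute            # runs a plan, returns elapsed seconds
        self.cached = set()
        self.reward = defaultdict(float)  # running average of time saved
        self.trials = defaultdict(int)

    def probe(self, query, view, rewrite):
        """Paired counterfactual experiment, run during idle time."""
        baseline = self.execute(query)               # without the view
        with_view = self.execute(rewrite(query, view))  # rewritten to use it
        saved = baseline - with_view                 # positive = view helped
        n = self.trials[view] = self.trials[view] + 1
        self.reward[view] += (saved - self.reward[view]) / n
        return saved

    def admit(self, view):
        """Cache the view, evicting the least useful one if necessary."""
        if view in self.cached:
            return
        if len(self.cached) >= self.capacity:
            worst = min(self.cached, key=lambda v: self.reward[v])
            if self.reward[worst] >= self.reward[view]:
                return                    # new view is not better; skip it
            self.cached.discard(worst)
        self.cached.add(view)
```

In the real system, `execute` would submit a plan to the engine and the scores would feed a learned model rather than a simple running average.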

In our research prototype, DQM, we focus on inner-join views. We implemented a view miner that generates view candidates by mining the logical plans of queries, a query rewriter that rewrites a query to use a view, and a view manager that creates and drops views. We integrated the system with SparkSQL and built our deep RL agent for view selection and eviction in Python; it communicates with the Spark environment via a RESTful API.
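To make candidate generation concrete, here is a hedged sketch of what a view miner might enumerate. The function name and the subset-enumeration strategy are illustrative assumptions, not DQM's actual miner, which works over logical query plans and carries the join predicates along with each candidate:

```python
from itertools import combinations

def mine_join_views(join_tables, max_size=3):
    """Illustrative miner: treat every subset of 2..max_size joined
    tables as a candidate inner-join view (join predicates omitted)."""
    candidates = []
    for k in range(2, min(max_size, len(join_tables)) + 1):
        for subset in combinations(sorted(join_tables), k):
            candidates.append(frozenset(subset))
    return candidates
```

Even this toy shows why selection matters: the number of candidates grows combinatorially with the number of joined tables, so the agent cannot simply materialize everything.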

We evaluate DQM with workloads derived from the Join Order Benchmark and TPC-DS. Results suggest that DQM is robust across different workloads and, after training, performs competitively with a hypothetical near-optimal baseline. DQM is especially powerful when stochastic effects in the system (e.g., maintenance) introduce additional costs.

Moving Beyond “1 Device, 1 Model”

Multi-tenancy, where multiple computing applications are served by the same infrastructure, ensures efficient resource utilization in public and private cloud systems. This is especially important for long-running jobs like machine learning model training, whose models power everything from facial recognition to natural language understanding. Training a modern ML model is amazingly expensive, and industrial infrastructures commonly devote hundreds to thousands of GPUs to it.

Today’s popular model training frameworks insist on exclusive use of allocated devices: a “1 device, 1 model” mantra. While this design paradigm makes scheduling and resource management easy (or at least easy’ish), it is not sustainable as the number of different models we train outpaces our hardware resources. But before we can build systems that share devices, a first question is how to train a model in dynamic environments with changing or pre-empted resources.

In our first project, we look at the problem of auto-scaling Stochastic Gradient Descent (SGD)-based training tasks. If a training task observes that additional threads are available, should it try to request those resources? Counter-intuitively, doing so may hurt model performance. SGD repeatedly samples random training data and calculates a sample approximation of the gradient (the direction in which to update the model to lower the prediction error). It takes a step in the direction of the gradient, updates the model, and repeats until convergence. Intuitively, as long as the updates point in the right direction on average and the approximation error is small relative to the step size, SGD converges. Low-level concerns such as the level of parallelism and the frequency of synchronization can affect the optimization algorithm by increasing the number of conflicting updates (making each step less exact) or by forcing inconvenient synchronization (making each exact step take longer).
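The sample-a-batch, step, repeat loop described above can be written in a few lines. This is a textbook serial sketch with hypothetical names (no parallelism or synchronization, which are exactly the concerns the project studies):

```python
import random

def sgd(grad, w0, data, batch_size=4, lr=0.1, steps=200, seed=0):
    """Minimal SGD: sample a random batch, average the per-example
    gradients, and step against that average direction."""
    rng = random.Random(seed)
    w = w0
    for _ in range(steps):
        batch = [rng.choice(data) for _ in range(batch_size)]
        g = sum(grad(w, x) for x in batch) / batch_size  # noisy gradient
        w -= lr * g                                      # take a step
    return w

# Example: minimize the average of (w - x)^2 over the data,
# whose optimum is the data mean (2.5 here).
data = [1.0, 2.0, 3.0, 4.0]
w_hat = sgd(lambda w, x: 2.0 * (w - x), 0.0, data)
```

Note how `batch_size` and `lr` interact: a smaller batch makes each gradient noisier, so a given step size becomes effectively more aggressive, which is the tension auto-scaling must manage.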

Not surprisingly, to safely auto-scale SGD one must make these repeated steps “less aggressive” when there are more conflicts. We develop a principled theory that relates the sampling batch size, the number of threads, the step size, and the momentum parameter: scaling along one axis requires an adjustment along the others. Our system, which we call OPT^2, automatically makes these adjustments and evaluates their effects to ensure stable and accurate model training.
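To illustrate the kind of coupling involved (these are common back-of-the-envelope rules, not OPT^2's actual theory), consider two simple adjustments: scale the step size linearly with the batch size, since a larger batch averages away more gradient noise, and preserve the effective step size lr / (1 - momentum) when the momentum parameter changes:

```python
def rescale_for_batch(lr, batch_size, new_batch_size, max_lr=1.0):
    """Linear scaling heuristic: a k-times larger batch tolerates a
    k-times larger step, capped so the update never gets too aggressive."""
    return min(lr * new_batch_size / batch_size, max_lr)

def rescale_for_momentum(lr, momentum, new_momentum):
    """Keep the effective step size lr / (1 - momentum) unchanged
    when the momentum parameter changes."""
    return lr * (1.0 - new_momentum) / (1.0 - momentum)
```

The point of the toy is the shape of the dependency, not the exact constants: touching any one of batch size, thread count, step size, or momentum forces compensating changes to the others.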

Threads are only one form of contended resource. We also recently explored training models under dynamic and stringent memory constraints. To accommodate these constraints, we gracefully degrade the precision of the forward and backward passes of a Convolutional Neural Network. The approximation focuses on the convolutional layers, where we degrade precision in the frequency domain (akin to what happens in JPEG compression). Because such networks already favor low-frequency components, the accuracy of the trained model degrades smoothly with the degree of approximation. We find reductions in memory usage and floating point operations competitive with reduced-precision arithmetic, but with the added advantages of increased training stability and a full continuum of compression ratios.
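A one-dimensional toy conveys the idea of frequency-domain truncation (this is not our CNN implementation; the naive DFT and the `keep` cutoff are illustrative assumptions). Transform a signal, zero out everything above a cutoff frequency, and transform back:

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform (O(n^2), for illustration only)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def idft(X):
    """Inverse DFT, returning the real part of each sample."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

def band_limit(x, keep):
    """Zero all but the `keep` lowest frequencies (counting the
    mirrored negative-frequency bins), akin to JPEG-style truncation."""
    X = dft(x)
    n = len(X)
    for k in range(n):
        freq = min(k, n - k)  # distance from the DC component
        if freq >= keep:
            X[k] = 0.0
    return idft(X)
```

A smooth (low-frequency) signal passes through nearly unchanged while a rapidly oscillating one is suppressed, which is why networks that already favor low-frequency structure degrade so gracefully under this compression.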

This form of training, which we call band-limiting, may additionally provide a new perspective on adversarial robustness. Adversarial attacks on neural networks tend to involve high-frequency perturbations of the input data. By truncating the convolutional operations in the frequency domain, we may introduce an inherent robustness to high-frequency attacks. Our experiments suggest that band-limited training produces models that reject such noise better than their full-spectrum counterparts.

So what have we learned? Dynamic resource management during model training is hard, but the mathematical properties of machine learning models may give us more flexibility than we are used to. Multi-tenant model training systems will have to come to terms with such problems at the intersection of machine learning, signal processing, and distributed systems.