
Adaptive Caching Using Sub-query Fragmentation for Reduction in Data Transfers from Distributed Databases

Kuppili Venkata, S., Keppens, J. and Musial, K.

Proceedings of the Astronomical Data Analysis Software and Systems (ADASS) Conference XXV. TBD.

2016

Abstract

One of the challenges of Big Data is transferring massive amounts of data from data servers to users. Unless data transfers are planned, organized and regulated carefully, they can become a bottleneck and may force changes to the way databases are queried and even to the design of the backend data structures. This problem is pronounced in virtual observatories, where data must be brought together from multiple astronomical databases around the world. To reduce data transfers in this setting, we propose an adaptive middleware cache based on a sub-query caching technique. Sub-query caching fragments a query into smaller sub-queries, where a sub-query is defined as a part of the query that can be separated out as a block of data or a data object. The technique applies association rules over the database-specific data localization during query processing to identify the optimum grain of data to cache. Because the query is cached as smaller objects, the processing cost of joins is reduced. Data transfers are also reduced, because parts of the query are already present in the cache. We simulate a distributed database environment that incorporates the key features of a real-life scenario, with multiple user groups querying through a common query interface. Synthetic query sets with varied complexity and sub-query repetition are generated as input. Initial experiments with these input sets showed a considerable reduction in response time when using our approach compared to the full-query cache method. We used association algorithms and decision trees for cache training and maintenance. Experiments showed that our fully trained cache requires fewer data transfers than transferring entire columns of data from the data server to the middleware location.
Future work includes (i) developing a mobile architecture with central control for cache units, based on data popularity derived from data usage patterns, and (ii) query approximation to estimate exactly how much data is needed.
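
The difference between caching whole queries and caching sub-query fragments can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Fragment` and `FragmentCache` classes, the `fake_remote_fetch` function, the table names and the row counts are all hypothetical, and the association-rule and decision-tree training described in the abstract is left out. The sketch only shows how a fragment shared by two different queries is transferred once.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fragment:
    """Hypothetical sub-query: one table, selected columns, optional predicate."""
    table: str
    columns: tuple
    predicate: str = ""


@dataclass
class FragmentCache:
    """Middleware cache keyed on sub-query fragments rather than whole queries."""
    store: dict = field(default_factory=dict)
    transferred_rows: int = 0  # rows pulled from the (simulated) remote servers
    hits: int = 0              # fragments served from the cache

    def get(self, fragment, fetch):
        if fragment in self.store:
            self.hits += 1
            return self.store[fragment]
        rows = fetch(fragment)  # cache miss: transfer from the data server
        self.transferred_rows += len(rows)
        self.store[fragment] = rows
        return rows


def fake_remote_fetch(fragment):
    """Stand-in for a remote database call; row counts are made up."""
    sizes = {"stars": 100, "galaxies": 50, "spectra": 200}
    return [(fragment.table, i) for i in range(sizes[fragment.table])]


def run_query(fragments, cache):
    """Answer a query by resolving each of its fragments through the cache."""
    return {f: cache.get(f, fake_remote_fetch) for f in fragments}


stars = Fragment("stars", ("id", "ra", "dec"), "mag < 12")
galaxies = Fragment("galaxies", ("id", "redshift"))
spectra = Fragment("spectra", ("star_id", "flux"))

cache = FragmentCache()
run_query([stars, galaxies], cache)  # both fragments are fetched remotely
run_query([stars, spectra], cache)   # stars is reused; only spectra is fetched

print(cache.transferred_rows, cache.hits)  # 350 1
```

With these invented sizes, the fragment cache transfers 350 rows for the two queries. A cache keyed on the full query text would miss on the second query, because it differs from the first, and would transfer 100 + 50 + 100 + 200 = 450 rows. Real savings depend on how often sub-queries repeat across the workload.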