{ "id": "1707.04566", "version": "v1", "published": "2017-07-14T17:28:28.000Z", "updated": "2017-07-14T17:28:28.000Z", "title": "Pushing the Limits of Online Auto-tuning: Machine Code Optimization in Short-Running Kernels", "authors": [ "Fernando Endo", "Damien Couroussé", "Henri-Pierre Charles" ], "comment": "Extension of a Conference Paper published in the proceedings of MCSoC-16: IEEE 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip, Lyon, France, 2016", "categories": [ "cs.PF" ], "abstract": "We propose an online auto-tuning approach for computing kernels. Unlike existing online auto-tuners, which regenerate code with long compilation chains from the source to the binary code, our approach consists in deploying auto-tuning directly at the level of machine code generation. This allows auto-tuning to pay off in very short-running applications. As a proof of concept, our approach is demonstrated on two benchmarks, which run for only hundreds of milliseconds to a few seconds. In a CPU-bound kernel, the average speedups achieved are 1.10 to 1.58 depending on the target micro-architecture, up to 2.53 in the most favourable conditions (all run-time overheads included). In a memory-bound kernel, less favourable to our runtime auto-tuning optimizations, the average speedups are 1.04 to 1.10, up to 1.30 in the best configuration. Despite the short execution times of our benchmarks, the overhead of our runtime auto-tuning is only 0.2 to 4.2% of the total application execution times. By simulating the CPU-bound application on 11 different CPUs, we showed that, despite the clear hardware disadvantage of In-Order (io) cores vs. 
Out-of-Order (ooo) equivalent cores, online auto-tuning in io CPUs obtained an average speedup of 1.03 and an energy efficiency improvement of 39% over the SIMD reference in ooo CPUs.", "revisions": [ { "version": "v1", "updated": "2017-07-14T17:28:28.000Z" } ], "analyses": { "keywords": [ "machine code optimization", "online auto-tuning", "short-running kernels", "average speedup", "total application execution times" ], "tags": [ "conference paper" ], "note": { "typesetting": "TeX", "pages": 0, "language": "en", "license": "arXiv", "status": "editable" } } }