Optimizing OpenVX Graphs for Data Movement
  • Madushan Abeysinghe, University of South Carolina
  • Jesse Villarreal, Texas Instruments
  • Jason D. Bakos, University of South Carolina

Corresponding Author: [email protected]

Abstract

This paper describes a method for automatically transforming the structure and characteristics of an image processing dataflow graph to improve performance and/or lower memory utilization relative to the baseline tools. Embedded image processing applications are often executed on Digital Signal Processors or their modern equivalent, Visual Processor Units. The software usually performs a series of pixel-level operations for basic color conversion, channel extraction and combining, arithmetic, and filtering. These steps can often be described efficiently as a graph. For this reason, standard libraries such as OpenVX are used; they provide a graph-based programming model in which the nodes are chosen from a repertoire of common pixel-level operations and the edges represent the flow of images as they progress through the processing stages. Generally speaking, each node is processed sequentially in the order implied by the data dependencies defined by the graph structure, with all intermediate values stored in external memory. In the proposed framework, we developed performance models for both the direct memory access subsystem and the L1 data cache. These models allow selected intermediate values to be stored in on-chip scratchpad memory and allow the most appropriate tile size to be chosen. In this way, we decompose the graph into fused sets of nodes whose internal edges are associated with on-chip buffers, and the tile size is optimized for each fused set. In this paper, we describe our performance models and our approach to graph decomposition and tile size selection. The proposed performance models are accurate to within 2% on average, and the overall graph optimization approach achieves an average speedup of 1.3× and allows average DRAM utilization to be reduced from 100% to as low as 15%.
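
To make the idea of fusing nodes and choosing a tile size concrete, the following is a minimal Python sketch. It is not the authors' framework or performance model: the `Edge` class, the `dram_bytes` and `largest_fitting_tile` helpers, the three-node pipeline, the byte counts, the 128 KiB scratchpad, and the "largest tile that fits" rule are all assumptions made for illustration. The paper selects tiles using DMA and L1 cache performance models, which this toy capacity check does not attempt to reproduce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """An image flowing between two nodes (hypothetical toy representation)."""
    src: str
    dst: str
    bytes_per_pixel: int


def dram_bytes(io_bytes_per_pixel, edges, fused, num_pixels):
    """Toy DRAM traffic estimate (an assumption, not the paper's model).

    Graph inputs and outputs are transferred once. Every internal edge that
    is NOT fused is written once by its producer and read once by its
    consumer. Fused edges stay in on-chip buffers and cost no DRAM traffic.
    """
    unfused = sum(e.bytes_per_pixel for e in edges if e not in fused)
    return (io_bytes_per_pixel + 2 * unfused) * num_pixels


def largest_fitting_tile(io_bytes_per_pixel, fused, scratchpad_bytes, candidates):
    """Return the largest candidate (width, height) tile that fits on chip.

    Assumption: I/O tiles are double-buffered so DMA can overlap compute,
    while each fused internal edge needs one tile-sized buffer.
    Returns None when no candidate fits.
    """
    per_pixel = 2 * io_bytes_per_pixel + sum(e.bytes_per_pixel for e in fused)
    fitting = [(w, h) for (w, h) in candidates
               if w * h * per_pixel <= scratchpad_bytes]
    return max(fitting, key=lambda t: t[0] * t[1]) if fitting else None


# Illustrative three-node pipeline: color convert -> channel extract -> filter.
edges = [
    Edge("color_convert", "channel_extract", 3),
    Edge("channel_extract", "gaussian", 1),
]
io_bpp = 3 + 1          # 3-byte RGB input plus 1-byte filtered output
pixels = 640 * 480

baseline = dram_bytes(io_bpp, edges, fused=set(), num_pixels=pixels)
fused_all = dram_bytes(io_bpp, edges, fused=set(edges), num_pixels=pixels)
tile = largest_fitting_tile(io_bpp, set(edges), 128 * 1024,
                            [(32, 32), (64, 64), (128, 64), (128, 128)])

print(baseline, fused_all, tile)
```

Under these assumed numbers, fusing both internal edges cuts the estimated DRAM traffic to one third of the baseline, and the 128×64 tile is the largest candidate whose buffers fit in the assumed scratchpad. A real selection would also weigh DMA transfer efficiency and cache behavior, as the abstract describes.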
21 Feb 2024: Submitted to TechRxiv
22 Feb 2024: Published in TechRxiv