Optimizations for the Qualcomm platform

The following describes SagivTech’s experience developing an OpenCL implementation of a bilateral filter on the Sony Xperia Z1 mobile platform, which uses Qualcomm’s Adreno 330 GPU. We walk through the optimization steps we applied to the reference OpenCL code shown on the previous page.

The Qualcomm platform

The following information and code have been tested on the Sony Xperia Z1 mobile phone. We chose this model because it contains Qualcomm’s Adreno 330 GPU, on which we run the bilateral filter code.

Here’s a short summary of Sony’s Xperia Z1 OpenCL properties:

OpenCL on the GPU

Benchmark methodology

To benchmark the code, I used three image sizes:

256 x 256
512 x 512
1024 x 1024

Before the real kernel timings are measured on the GPU, a warm-up is performed: the kernel runs on the GPU 10 times and the timing results are ignored. Only then is the kernel run 50 to 100 times in a row, and the average time of that test is the one reported in the performance table below. This method aims to filter out spikes and behind-the-scenes initialization from the measurements. Even so, the timings still fluctuate from run to run, so the values in the table are the average of several runs.
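The warm-up and averaging procedure can be sketched as a small host-side helper. This is a minimal illustration, not the harness used for the measurements: the name `average_kernel_time_ms` and the `kernel` callback are hypothetical, and the callback is assumed to launch the kernel and block until it has finished.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Runs `kernel` warmup_runs times without timing it, then timed_runs times
// with timing, and returns the mean time per timed run in milliseconds.
// `kernel` is a hypothetical stand-in for "enqueue the OpenCL kernel and
// wait for it to finish" (for example clEnqueueNDRangeKernel + clFinish).
double average_kernel_time_ms(const std::function<void()>& kernel,
                              int warmup_runs, int timed_runs) {
    assert(warmup_runs >= 0 && timed_runs > 0);

    // Warm-up: these runs are deliberately left out of the measurement.
    for (int i = 0; i < warmup_runs; ++i) {
        kernel();
    }

    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < timed_runs; ++i) {
        kernel();
    }
    const auto stop = clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / timed_runs;
}
```

With the numbers from the text, a call would look like `average_kernel_time_ms(run_filter, 10, 50)`, where `run_filter` is whatever callable launches the filter. Repeating that call several times and averaging the results mirrors the "average of several runs" step.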

OpenCL code

To explore the benefits the Adreno 330 GPU can offer on the Sony Xperia Z1 device, I started with a simple, non-optimized kernel. The kernel reads the data from global memory and processes one pixel per work-item. This base reference code, the starting point for the optimizations described below, was presented in the previous section.
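For orientation, the work done by a single work-item in such a naive kernel can be sketched on the CPU as follows. This is plain C++, not the OpenCL reference code from the previous section; the function name, the parameters, and the clamp-to-edge border handling are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// CPU sketch of the work one work-item does in a naive bilateral filter:
// every neighbour is read straight from the full image (the analogue of
// global memory) and both Gaussian weights are recomputed for each read.
float bilateral_pixel(const std::vector<float>& img, int width, int height,
                      int x, int y, int radius,
                      float sigma_spatial, float sigma_range) {
    assert(static_cast<int>(img.size()) == width * height);
    const float center = img[y * width + x];
    float sum = 0.0f;
    float weight_sum = 0.0f;

    for (int dy = -radius; dy <= radius; ++dy) {
        for (int dx = -radius; dx <= radius; ++dx) {
            // Assumed border policy: clamp coordinates to the image edge.
            const int nx = std::min(std::max(x + dx, 0), width - 1);
            const int ny = std::min(std::max(y + dy, 0), height - 1);
            const float value = img[ny * width + nx];

            const float dist2 = static_cast<float>(dx * dx + dy * dy);
            const float diff = value - center;
            const float w =
                std::exp(-dist2 / (2.0f * sigma_spatial * sigma_spatial)) *
                std::exp(-(diff * diff) / (2.0f * sigma_range * sigma_range));

            sum += w * value;
            weight_sum += w;
        }
    }
    // The centre pixel always contributes a weight of 1, so weight_sum > 0.
    return sum / weight_sum;
}
```

The two costs this sketch makes visible, repeated reads of neighbouring pixels and repeated `exp` evaluations, are the ones the later phases target.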

The next optimization, Phase 2, was to remove the Gaussian private variable and instead have the kernel code read the values from an image prepared in the initialization phase. You should replace the following original line in the kernel

with these three lines of code

As can be seen in the performance summary table below, the Phase 2 changes dramatically improved overall performance.
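The idea behind Phase 2, computing the Gaussian values once at initialization and only looking them up inside the kernel, can be sketched on the host side as follows. This is not the original kernel change: the helper names are hypothetical, and the sketch assumes the precomputed values are the spatial Gaussian weights of a square window.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Builds a (2*radius+1) x (2*radius+1) table of spatial Gaussian weights
// once, at initialization. In an OpenCL port this buffer would be uploaded
// to an image object that the kernel reads instead of evaluating exp().
std::vector<float> make_gaussian_table(int radius, float sigma) {
    assert(radius >= 0 && sigma > 0.0f);
    const int side = 2 * radius + 1;
    std::vector<float> table(static_cast<std::size_t>(side) * side);

    for (int dy = -radius; dy <= radius; ++dy) {
        for (int dx = -radius; dx <= radius; ++dx) {
            const float dist2 = static_cast<float>(dx * dx + dy * dy);
            table[(dy + radius) * side + (dx + radius)] =
                std::exp(-dist2 / (2.0f * sigma * sigma));
        }
    }
    return table;
}

// Lookup used in the inner loop: one read replaces one exp() evaluation.
inline float gaussian_weight(const std::vector<float>& table, int radius,
                             int dx, int dy) {
    const int side = 2 * radius + 1;
    return table[(dy + radius) * side + (dx + radius)];
}
```

Because the table depends only on the window radius and sigma, it is built once per filter configuration rather than once per pixel.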

The next optimization, Phase 3, introduced the use of local memory. The kernel first reads the data into local memory, and the nested for loops then use the data from local memory instead of reading it repeatedly from global memory. Below is the modified kernel code:

As can be seen in the performance summary table below, the local memory changes made in Phase 3 yielded some nice speed-ups across all image sizes.
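The staging step behind the local memory optimization can be illustrated with a CPU sketch. It is not the Phase 3 kernel: the function name, the tile parameters, and the clamp-to-edge border policy are assumptions. The returned buffer plays the role of the work-group's local memory, holding the tile plus a halo of `radius` pixels on each side.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// CPU sketch of the cooperative load a work-group performs before filtering.
// The work-group covers a tile_w x tile_h block of pixels starting at
// (tile_x, tile_y); the staged buffer also includes a radius-wide halo so
// every neighbour needed by the filter is available locally.
std::vector<float> load_tile_with_halo(const std::vector<float>& img,
                                       int width, int height,
                                       int tile_x, int tile_y,
                                       int tile_w, int tile_h, int radius) {
    assert(static_cast<int>(img.size()) == width * height);
    const int local_w = tile_w + 2 * radius;
    const int local_h = tile_h + 2 * radius;
    std::vector<float> local(static_cast<std::size_t>(local_w) * local_h);

    for (int ly = 0; ly < local_h; ++ly) {
        for (int lx = 0; lx < local_w; ++lx) {
            // Assumed border policy: clamp coordinates to the image edge.
            const int gx = std::min(std::max(tile_x + lx - radius, 0), width - 1);
            const int gy = std::min(std::max(tile_y + ly - radius, 0), height - 1);
            local[ly * local_w + lx] = img[gy * width + gx];
        }
    }
    return local;
}
```

In an actual kernel the two loops are split across the work-items of the group, each loading a few elements, and a `barrier(CLK_LOCAL_MEM_FENCE)` separates the load from the filtering loops. Each global pixel is then fetched roughly once per work-group instead of once per neighbouring work-item.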

In Phase 4, we changed the source of the loads into local memory: instead of reading the input from a global memory buffer, the kernel reads it from an image. The performance gains, as expected, were minor. Here’s the code that loads the data from the input image to local memory.

OpenCL performance summary on the GPU

The following table summarizes our different optimization phases for the bilateral filter running on the Adreno 330 found in the Sony Xperia Z1 mobile device.

Performance summary

The following table summarizes the best timings measured on both the CPU and the GPU, so that the benefit of running on the GPU can be seen directly. To get the most accurate timing measurement, we averaged the GPU timings over 50 runs. Running 8 native CPU threads instead of 4 did not yield better performance, as one might expect.

Conclusion

As the benchmarks show, the bilateral filter running on the GPU is about 26 to 32 times faster than the same filter running on the CPU using 4 threads. Once the bilateral filter, or any other compute-intensive task, runs on the GPU, the CPU is free to respond to other tasks while the GPU handles the heavy computation faster, improving overall system performance and efficiency.

This project is partially funded by the European Union under the 7th Research Framework, programme FET-Open SME, Grant agreement no. 309169.

Written by: Eyal Hirsch, GPU Computing Expert, Mobile GPU Leader, SagivTech.
