Profiling

Profiling is about analyzing and optimizing some program resource, which may be memory, CPU cycles or similar measures. Although the first measure that comes to our (and the white rabbit's) mind is execution speed, we may sometimes favor other constraints. Memory limited embedded platforms, for example, may prefer memory optimization in trade for slightly longer execution time. Profilers try to observe program performance measures and report them, so that programmers can tune as they wish.

Here, the critical word is “observing”. Ever since the introduction of relativity, we know that observing something can have the strange side effect of changing the very thing we are trying to observe.

We will use a simple program to profile. It simply generates pseudo RTP flows: first it generates single tone packets of 20 ms and encodes them in G.711 / G.722, after that timestamp, SSRC and other fields are added to form the RTP packet. These RTP packets are wrapped in UDP and IP. Finally, they are either sent over the network (to simulate drlink active recording) or saved as pcap files (to simulate offline passive recording). This simple program was prepared to test our VoIP voice recorder SistasSVR, and may be investigated at https://github.com/sifaserdarozen/DDGen. Knowing the internals, I expect the bottlenecks to be in the encoding, tone generation and checksum calculation parts, but let’s see if this is the case.

Stack sampling

A simple and fast method is “stack sampling”. If you are able to sample your stack, preferably at the times you think your program slows down, you can roughly spot the places where your program needs modification. Following Nyquist and Bayes, if you sample frequently enough you get a pretty clear picture. This method is also called the “poor man’s profiler” and it is worth having a look at http://poormansprofiler.org/

Just do not forget to compile with debug symbols (“-g”) before taking a stack snapshot of your program. For example, here “ddgen” is started to generate 500 simultaneous calls, and the generated IP packets are sent to the voice recorder drlink ports ().

profiler - ddgen program execution

It can be seen that the program takes about 65% of the CPU cycles of the VM.

profiler - top pid and cpu usage of ddgen
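
If you want to reproduce this, the per process CPU usage can be checked with standard tools, for example:

pidof ddgen
top -p $(pidof ddgen)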

Manual samples may be taken by passing the process id to gdb:

profiler - manual sampling stack of ddgen
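
For reference, a manual sample boils down to attaching gdb to the process, dumping all thread backtraces and detaching:

gdb -p $(pidof ddgen)
(gdb) thread apply all bt
(gdb) detach
(gdb) quit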

This bash script is taken from http://poormansprofiler.org/. Slightly modifying it will do the job.

profiler - poor mans profiler script
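
For completeness, an adaptation of the script from poormansprofiler.org may look like the following; the only changes are the process name (“ddgen”) and the sample count / sleep time:

#!/bin/bash
nsamples=10
sleeptime=1
pid=$(pidof ddgen)

for x in $(seq 1 $nsamples)
do
    # dump all thread backtraces of the running process
    gdb -ex "set pagination 0" -ex "thread apply all bt" -batch -p $pid
    sleep $sleeptime
done | \
awk '
  BEGIN { s = ""; }
  /^Thread/ { print s; s = ""; }
  /^\#/ { if (s != "") { s = s "," $4 } else { s = $4 } }
  END { print s }' | \
sort | uniq -c | sort -r -n -k 1,1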

and we see that the encoder is high on the list.

profiler - script sampling stack of ddgen

Sampling profilers may be divided into two categories: time sampling and CPU sampling. CPU samplers take a snapshot of the stack every fixed number of CPU instructions, therefore they emphasise CPU bottlenecked portions. Time samplers, on the other hand, take a snapshot of the stack at regular time intervals. As a result, time samplers are able to detect input / output, CPU, shared resource lock and cache issues. The poor man’s profiler is an example of time sampling, and whenever feasible, time profilers should be preferred.

Perf

perf, being a system wide utility, may also be used to get program specific details, as we will see. First check that perf is available; otherwise install it, or better, have puppet install it if you provision your machines with a configuration manager (for example through vagrant).


profiler - installing linux-tools
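
On an Ubuntu based VM, installing the linux-tools package that matches the running kernel is usually enough (assuming an apt based system):

sudo apt-get install linux-tools-common linux-tools-$(uname -r)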

First monitor the whole system; as an example, for all CPUs (-a), collecting the stack traces (-g), for a duration of 60 seconds:

profiler - perf report generating
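
The corresponding command looks roughly like this:

sudo perf record -a -g -- sleep 60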

Then display the results. We see that the major contributors to ddgen execution cycles are the math functions and the G.711 encoder.

profiler - perf report
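
Displaying the collected data, and optionally dumping it to a text file:

sudo perf report
sudo perf report --stdio > perf_results.txt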

perf is based on perf_events coming from the linux kernel. It tries to abstract away CPU differences and uses event based sampling. At event counter overflow an interrupt is raised, and the program counter is recorded. More information may be obtained from https://perf.wiki.kernel.org/index.php/Tutorial and Brendan Gregg’s blog. Mr Gregg has awesome videos on youtube focusing on system profiling, such as https://www.youtube.com/watch?v=FJW8nGV4jxY and https://www.youtube.com/watch?v=zrr2nUln9Kk

It is also possible to collect information about a single process, instead of the whole system, by giving a process identifier:


profiler - perf process id
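
That is, something along the lines of:

sudo perf record -p $(pidof ddgen) -g -- sleep 60
sudo perf report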

Gprof

gprof comes in the binutils package. It is called a call graph profiler, and in order to use it, files should be compiled and linked with the “-pg” flag.
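
As a sketch (the file names below are placeholders, not DDGen’s actual build commands), both the compile and the link steps get “-pg”:

g++ -g -pg -c main.cpp -o main.o
g++ -g -pg main.o -o ddgen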

The program is then executed as usual. The outputs are collected into gmon.out.

profiler - gprof execution

The resulting call graph may be seen through “gprof executable gmon.out”, and if it is desired to redirect it to a file, “gprof executable gmon.out > gprof_results.txt”.

profiler - gprof report

We see that the tone generation and encoding parts take most of the cycles. However, the math function sin is hidden from the hotspots, since gprof only accounts for code compiled with “-pg” and time spent inside the shared libm call does not show up as a separate hotspot.

valgrind

Valgrind is a well known dynamic analysis toolset. Here, we will use the callgrind tool of valgrind as a profiler. Callgrind runs the application on a virtual processor emulated by valgrind. The result is fairly good precision of the profiling data even on short runs, since every instruction is measured instead of sampled. The downside of this emulation is a huge runtime overhead; more than a 10x slowdown is not uncommon.

To start, just run the program under callgrind.


profiler - valgrind profile
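
In other words (pass whatever arguments you normally give to ddgen):

valgrind --tool=callgrind ./ddgen <usual ddgen arguments>

The output is written to callgrind.out.<pid> in the current directory.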

As can be seen from the result above, the output is pretty cryptic. We are lucky to have a tool that displays it in a more user friendly way: kcachegrind. Opening the log file with kcachegrind reveals the following:

profiler - valgrind kcachegrind results
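
That is, simply:

kcachegrind callgrind.out.<pid>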

Kcachegrind also provides a pretty good call graph with associated costs:

profiler - valgrind kcachegrind callgrind

Results

Although there are variations between the profilers used, when we investigate the results we see that the bottlenecks are the sinusoidal generation and encoding parts. Contrary to my a priori assumption, the cost of the checksum calculation parts turns out not to be significant.

 
