Due: 10:30 pm Tuesday 11 December 2018
Summary: You will use hardware counters to investigate program performance.
Collaboration: You will work during lab in randomly assigned pairs. You must complete the work together in these pairs, whether during or after the scheduled lab time.
Submitting: Submit files named report.pdf and references.txt to the appropriate Assignment in PioneerWeb. Be sure to follow the submission guidelines and use precisely these file names. Only one partner should submit.
In this lab, you will learn how to use hardware performance counters to monitor the execution of real programs on real, complex hardware. These counters can give you insight into how a program executes and why performance may differ between two similar versions of the same program.
You will not need to write code, check off tasks, or submit any circuit files for this lab. Instead, you will write a brief report on each of the two performance mysteries. Organize your report into two sections, one for part A and one for part B. Within each section, address the following requirements:
Your report should include a detailed explanation of your performance “diagnosis” that relies on concepts from this course. For example, consider a hypothetical question about two versions of a program with different cache behavior. It is not enough simply to say “version one has more cache misses, so it is slower.” Use your understanding of how the processor will execute each program to explain why one version causes more cache misses, and why those cache misses will hurt performance.
Your report should also include the experimental data used as evidence in support of your explanations. You should therefore carefully record values as you proceed with your diagnoses. Your presentation of this data in the report should make it easy to identify and compare values.
Please submit your final report as a PDF. Reports in Word documents, text files, or any other format will not be accepted.
Hardware performance counters can count a wide variety of events inside a processor as it executes your programs. We will use the Linux perf tool to access the counters to determine why one version of a program runs faster than another. The perf tool has many sub-commands, but we will use just four:
perf list: prints the list of events that perf can count.
perf stat: runs a program and prints out the number of times each counted event occurred.
perf record: runs a program; each time the counted event occurs, perf will record which instruction was executing at the time. This can give you a sense of which parts of a program are “responsible” for the events.
perf report: after running a program under perf record, use this sub-command to view the results of the execution. You can navigate by function name, then “annotate” a function to see which instructions account for what fraction of the events.
We’ll walk through an example use of each command in the next section.
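Before the walkthrough, one quick note: perf stat can also be run with no -e option, in which case it prints a default set of counters chosen by the tool. A minimal sketch, where ./your-program is a placeholder for any executable:
$ perf stat ./your-program
The exact set of default counters varies by machine and perf version, but it is a convenient first look at a program.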
We have two very similar versions of a program that have wildly different performance:
a-v1.c

#include <stdlib.h>
#include <time.h>

#define ITERATIONS 10000
#define ARRAY_SIZE 32768

int main(int argc, char** argv) {
  srand(time(NULL));

  int data[ARRAY_SIZE];
  for(int i=0; i<ARRAY_SIZE; i++) {
    data[i] = rand() % 256;
  }

  for(int i=0; i<ITERATIONS; i++) {
    int sum = 0;
    for(int j=0; j<ARRAY_SIZE; j++) {
      if(data[j] >= 128) {
        sum += data[j];
      }
    }
  }

  return 0;
}

To compile this program, copy the above code to a file a-v1.c and run the following command:
$ gcc --std=c11 -g -o a-v1 a-v1.c
a-v2.c

#include <stdlib.h>
#include <time.h>

#define ITERATIONS 10000
#define ARRAY_SIZE 32768

int compare(const void* ap, const void* bp) {
  int a = *(int*)ap;
  int b = *(int*)bp;
  if(a < b) return -1;
  else if(a == b) return 0;
  else return 1;
}

int main(int argc, char** argv) {
  srand(time(NULL));

  int data[ARRAY_SIZE];
  for(int i=0; i<ARRAY_SIZE; i++) {
    data[i] = rand() % 256;
  }

  qsort(data, ARRAY_SIZE, sizeof(int), compare);

  for(int i=0; i<ITERATIONS; i++) {
    int sum = 0;
    for(int j=0; j<ARRAY_SIZE; j++) {
      if(data[j] >= 128) {
        sum += data[j];
      }
    }
  }

  return 0;
}

To compile this program, copy the above code to a file a-v2.c and run the following command:
$ gcc --std=c11 -g -o a-v2 a-v2.c
Now that we have two similar versions of the program, let’s see how they perform. You can run the programs with the time command to measure runtime:
$ time ./a-v1
real 0m2.215s
user 0m2.212s
sys 0m0.000s
$ time ./a-v2
real 0m0.684s
user 0m0.676s
sys 0m0.004s
Run these programs on your workstation using the time utility and verify this coarse measure of performance.
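Runtimes vary somewhat from run to run, so it can help to time each version a few times. A minimal sketch using a bash loop (three runs each is an arbitrary choice):
$ for i in 1 2 3; do time ./a-v1; done
$ for i in 1 2 3; do time ./a-v2; done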
You should pay attention to the “real” time; this is the wall-clock time that elapsed while the program ran. Oddly enough, version 1 of this program takes more than three times as long to run as version 2. We’ll use performance counters to figure out why.
perf

The first thing to check is whether the two programs execute dramatically different numbers of instructions. First, we need to figure out which event to count.
$ perf list
List of pre-defined events (to be used in -e):
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
cache-references [Hardware event]
cache-misses [Hardware event]
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
...
As you can see, there is an event called instructions. Type q to quit out of perf list. We’ll use this event with perf stat:
$ perf stat -e instructions ./a-v1
Performance counter stats for 'a-v1':
3,281,876,619 instructions
1.879405634 seconds time elapsed
$ perf stat -e instructions ./a-v2
Performance counter stats for 'a-v2':
3,297,032,646 instructions
0.727112668 seconds time elapsed
Use perf stat to report the instruction count and times on your workstation.
Even though version 2 runs faster, it executes slightly more instructions than version 1, so instruction count cannot explain the performance difference. Look through the other events in the perf list output and try them to see if you can find one that differs hugely between the two versions. You probably won’t know what all of these events are, but many should be familiar. I am happy to answer questions about events.
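Note that perf stat accepts a comma-separated list of events, which lets you compare counts gathered in a single run. A sketch using two events we have already seen; substitute whichever events you want to investigate:
$ perf stat -e cycles,instructions ./a-v1
$ perf stat -e cycles,instructions ./a-v2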
perf record

Once you find an event whose counts differ wildly between the two versions, you have a clue about the root cause of the performance difference. To dig into a performance event, we first use the perf record sub-command. While the cycles event doesn’t give us a good sense of why the program ran the way it did, we’ll use it as an example:
$ perf record -e cycles ./a-v1
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.352 MB perf.data ( samples) ]
This command runs the program and produces output in the perf.data file. We can look at the data using the perf report tool.
$ perf report
Samples: 7K of event 'cycles', Event count (approx.): 6637428286
99.94% a-v1 a-v1 [.] main
0.03% a-v1 libc-2.19.so [.] __random
0.02% a-v1 [kernel.kallsyms] [k] truncate_inode_pages_final
0.00% a-v1 [kernel.kallsyms] [k] __inode_permission
0.00% a-v1 [kernel.kallsyms] [k] native_write_msr_safe
This shows that 99.94% of the sampled cycles occurred in the main function of the program a-v1 (not a huge surprise there). If you use the arrow keys to select the line for the main function and press Enter, you can view more detail. Select the “Annotate main” option, which gives you output that should look something like this:
main /tmp/performance-counters/a-v1
│ addl $0x1,-0x4(%rbp)
│5c: cmpl $0x7fff,-0x4(%rbp)
│ ↑ jle 32
│ data[i] = rand() % 256;
│ }
│
│ for(int i=0; i<ITERATIONS; i++) {
│ movl $0x0,-0x8(%rbp)
│ ↓ jmp af
│ int sum = 0;
│6e: movl $0x0,-0xc(%rbp)
│ for(int j=0; j<ARRAY_SIZE; j++) {
│ movl $0x0,-0x10(%rbp)
│ ↓ jmp a2
│ if(data[j] >= 128) {
4.90 │7e: mov -0x10(%rbp),%eax
0.07 │ cltq
1.18 │ mov -0x20010(%rbp,%rax,4),%eax
19.37 │ cmp $0x7f,%eax
│ ↓ jle 9e
│ sum += data[j];
11.73 │ mov -0x10(%rbp),%eax
11.57 │ cltq
1.01 │ mov -0x20010(%rbp,%rax,4),%eax
6.95 │ add %eax,-0xc(%rbp)
│ data[i] = rand() % 256;
│ }
│
│ for(int i=0; i<ITERATIONS; i++) {
│ int sum = 0;
│ for(int j=0; j<ARRAY_SIZE; j++) {
19.40 │9e: addl $0x1,-0x10(%rbp)
13.79 │a2: cmpl $0x7fff,-0x10(%rbp)
10.02 │ ↑ jle 7e
│
│ for(int i=0; i<ARRAY_SIZE; i++) {
│ data[i] = rand() % 256;
│ }
│
│ for(int i=0; i<ITERATIONS; i++) {
│ addl $0x1,-0x8(%rbp)
0.01 │af: cmpl $0x270f,-0x8(%rbp)
│ ↑ jle 6e
│ sum += data[j];
│ }
│ }
│ }
│
│ return 0;
│ mov $0x0,%eax
│ }
│ leaveq
│ ← retq
This output interleaves assembly with the C code it was generated from. While the output is a bit hard to follow, you can generally find the C code for a set of instructions just before those instructions. The numbers in the left column show what percentage of cycles were spent executing each instruction over the entire run. Once you’ve found an event that is significantly different between the two programs, this view should help you pinpoint the exact line of code or instruction where that event is occurring. When you’re done with this view, type q twice to quit out of perf report. Note that you can use this view for any event of interest, not just CPU cycles.
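The record-and-report workflow is the same for any countable event, not just cycles. A sketch, where EVENT is a placeholder for an event name taken from your perf list output:
$ perf record -e EVENT ./a-v1
$ perf report
If you prefer a non-interactive summary, perf report --stdio prints the report to the terminal instead of opening the browsing interface.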
Using the perf tool, explain why version 1 is so much slower than version 2. You should be specific: you should identify a meaningful event that explains the difference and be able to explain why this event is occurring.
Report your findings, describe the process you used to diagnose the issue, and explain your conclusion.
Here are two versions of another program:
b-v1.c

#include <stdlib.h>
#include <time.h>

#define ARRAY_SIZE 16384

int data[ARRAY_SIZE][ARRAY_SIZE];

int main(int argc, char** argv) {
  srand(time(NULL));

  for(int i=0; i<ARRAY_SIZE; i++) {
    for(int j=0; j<ARRAY_SIZE; j++) {
      data[i][j] = rand() % 256;
    }
  }

  return 0;
}

Copy this code to the file b-v1.c and compile with the command:
$ gcc --std=c11 -g -o b-v1 b-v1.c
b-v2.c

#include <stdlib.h>
#include <time.h>

#define ARRAY_SIZE 16384

int data[ARRAY_SIZE][ARRAY_SIZE];

int main(int argc, char** argv) {
  srand(time(NULL));

  for(int j=0; j<ARRAY_SIZE; j++) {
    for(int i=0; i<ARRAY_SIZE; i++) {
      data[i][j] = rand() % 256;
    }
  }

  return 0;
}

Copy this code to the file b-v2.c and compile with the command:
$ gcc --std=c11 -g -o b-v2 b-v2.c
As with our first pair of programs, these two versions are similar but have very different performance:
$ time ./b-v1
real 0m2.049s
user 0m1.828s
sys 0m0.216s
$ time ./b-v2
real 0m3.765s
user 0m3.544s
sys 0m0.216s
Use hardware performance counters and the perf tool to identify the root cause of the performance difference. It is a given that the two programs will run for different numbers of cycles; your explanation will need to give a sense of why this is happening. If you can find the place where the programs differ in their performance, you may be able to narrow down the set of possible explanations; one possible starting point is sketched below.
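One possible starting point (not the only reasonable one): compare a pair of basic counts across the two versions, then record cycles on the slower version and annotate main to see where the time goes.
$ perf stat -e cycles,instructions ./b-v1
$ perf stat -e cycles,instructions ./b-v2
$ perf record -e cycles ./b-v2
$ perf report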
Report your findings, describe the process you used to diagnose the issue, and explain your conclusion.
Copyright © 2018 Charlie Curtsinger and Jerod Weinman
This work is licensed under a Creative Commons Attribution NonCommercial ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.