Due: 10:30 pm Tuesday 11 December 2018
Summary: You will use hardware counters to investigate program performance.
Collaboration: You will work during lab in randomly assigned pairs. You must complete the work together in these pairs, whether during or after the scheduled lab time.
Submitting: Submit files named report.pdf and references.txt to the appropriate Assignment in PioneerWeb. Be sure to follow the submission guidelines and use precisely these file names. Only one partner should submit.
In this lab, you will learn how to use hardware performance counters to monitor the execution of real programs on real, complex hardware. These counters can give you insight into how a program executes and why performance may differ between two similar versions of the same program.
You will not need to write code, check off tasks, or submit any circuit files for this lab. Instead, you will write a brief report on each of the two performance mysteries. Organize your report into two sections, one for part A and one for part B. Within each section, address the following requirements:
Your report should include a detailed explanation of your performance “diagnosis” that relies on concepts from this course. For example, consider a hypothetical question about two versions of a program with different cache behavior. It is not enough simply to say “version one has more cache misses, so it is slower.” Use your understanding of how the processor will execute each program to explain why one version causes more cache misses, and why those cache misses will hurt performance.
Your report should also include the experimental data used as evidence in support of your explanations. You should therefore carefully record values as you proceed with your diagnoses. Your presentation of this data in the report should make it easy to identify and compare values.
Please submit your final report as a PDF. Reports in Word documents, text files, or any other format will not be accepted.
Hardware performance counters can count a wide variety of events inside a processor as it executes your programs. We will use the Linux perf tool to access the counters to determine why one version of a program runs faster than another. The perf tool has many sub-commands, but we will use just four:
perf list: prints the list of events that perf can count.
perf stat: runs a program and prints out the number of times each counted event occurred.
perf record: runs a program; each time the counted event occurs, perf will record which instruction was executing at the time. This can give you a sense of which parts of a program are “responsible” for the events.
perf report: after running a program under perf record, use this sub-command to view the results of the execution. You can navigate by function name, then “annotate” a function to see which instructions account for what fraction of the events.
We’ll walk through an example use of each command in the next section.
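Before the walkthrough, one quick note: perf stat can also be run with no -e option, in which case it prints a default set of counters chosen by the tool. A minimal sketch, where ./your-program is a placeholder for any executable:
$ perf stat ./your-program
The exact set of default counters varies by machine and perf version, but it is a convenient first look at a program.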
We have two very similar versions of a program that have wildly different performance:
a-v1.c

#include <stdlib.h>
#include <time.h>

#define ITERATIONS 10000
#define ARRAY_SIZE 32768

int main(int argc, char** argv) {
  srand(time(NULL));

  int data[ARRAY_SIZE];
  for(int i=0; i<ARRAY_SIZE; i++) {
    data[i] = rand() % 256;
  }

  for(int i=0; i<ITERATIONS; i++) {
    int sum = 0;
    for(int j=0; j<ARRAY_SIZE; j++) {
      if(data[j] >= 128) {
        sum += data[j];
      }
    }
  }

  return 0;
}

To compile this program, copy the above code to a file a-v1.c and run the following command:
$ gcc --std=c11 -g -o a-v1 a-v1.c
a-v2.c

#include <stdlib.h>
#include <time.h>

#define ITERATIONS 10000
#define ARRAY_SIZE 32768

int compare(const void* ap, const void* bp) {
  int a = *(int*)ap;
  int b = *(int*)bp;
  if(a < b) return -1;
  else if(a == b) return 0;
  else return 1;
}

int main(int argc, char** argv) {
  srand(time(NULL));

  int data[ARRAY_SIZE];
  for(int i=0; i<ARRAY_SIZE; i++) {
    data[i] = rand() % 256;
  }

  qsort(data, ARRAY_SIZE, sizeof(int), compare);

  for(int i=0; i<ITERATIONS; i++) {
    int sum = 0;
    for(int j=0; j<ARRAY_SIZE; j++) {
      if(data[j] >= 128) {
        sum += data[j];
      }
    }
  }

  return 0;
}

To compile this program, copy the above code to a file a-v2.c and run the following command:
$ gcc --std=c11 -g -o a-v2 a-v2.c
Now that we have two similar versions of the program, let’s see how they perform. You can run the programs with the time command to measure runtime:
$ time ./a-v1
real 0m2.215s
user 0m2.212s
sys 0m0.000s
$ time ./a-v2
real 0m0.684s
user 0m0.676s
sys 0m0.004s
Run these programs on your workstation using the time utility and verify this coarse measure of performance.
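Runtimes vary somewhat from run to run, so it can help to time each version a few times. A minimal sketch using a bash loop (three runs each is an arbitrary choice):
$ for i in 1 2 3; do time ./a-v1; done
$ for i in 1 2 3; do time ./a-v2; done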
You should pay attention to the “real” time; this is the wall-clock time that elapsed while the program ran. Oddly enough, version 1 of this program takes more than three times as long to run as version 2. We’ll use performance counters to figure out why.
perf

The first thing to check is whether the two programs execute dramatically different numbers of instructions. First, we need to figure out which event to count.
$ perf list
List of pre-defined events (to be used in -e):
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
cache-references [Hardware event]
cache-misses [Hardware event]
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
...
As you can see, there is an event called instructions. Type q to quit out of perf list. We’ll use this event with perf stat:
$ perf stat -e instructions ./a-v1
Performance counter stats for 'a-v1':
3,281,876,619 instructions
1.879405634 seconds time elapsed
$ perf stat -e instructions ./a-v2
Performance counter stats for 'a-v2':
3,297,032,646 instructions
0.727112668 seconds time elapsed
Use perf stat to report the instruction count and times on your workstation.
Even though version 2 runs faster, it executes slightly more instructions than version 1, so instruction count cannot explain the performance difference. Look through the other events in the perf list output and try them to see if you can find one that differs hugely between the two versions. You probably won’t know what all of these events are, but many should be familiar. I am happy to answer questions about events.
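Note that perf stat accepts a comma-separated list of events, which lets you compare counts gathered in a single run. A sketch using two events we have already seen; substitute whichever events you want to investigate:
$ perf stat -e cycles,instructions ./a-v1
$ perf stat -e cycles,instructions ./a-v2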
perf record

Once you find an event whose counts differ wildly between the two versions, you have a clue about the root cause of the performance difference. To dig into a performance event, we first use the perf record sub-command. While the cycles event doesn’t give us a good sense of why the program ran the way it did, we’ll use it as an example:
$ perf record -e cycles ./a-v1
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.352 MB perf.data ( samples) ]
This command runs the program and produces output in the perf.data file. We can look at the data using the perf report tool.
$ perf report
Samples: 7K of event 'cycles', Event count (approx.): 6637428286
99.94% a-v1 a-v1 [.] main
0.03% a-v1 libc-2.19.so [.] __random
0.02% a-v1 [kernel.kallsyms] [k] truncate_inode_pages_final
0.00% a-v1 [kernel.kallsyms] [k] __inode_permission
0.00% a-v1 [kernel.kallsyms] [k] native_write_msr_safe
This shows that 99.94% of the sampled cycles occurred in the main function of the program a-v1 (not a huge surprise there). If you use the arrow keys to select the line for the main function and press Enter, you can view more detail. Select the “Annotate main” option, which gives you output that should look something like this:
main /tmp/performance-counters/a-v1
│ addl $0x1,-0x4(%rbp)
│5c: cmpl $0x7fff,-0x4(%rbp)
│ ↑ jle 32
│ data[i] = rand() % 256;
│ }
│
│ for(int i=0; i<ITERATIONS; i++) {
│ movl $0x0,-0x8(%rbp)
│ ↓ jmp af
│ int sum = 0;
│6e: movl $0x0,-0xc(%rbp)
│ for(int j=0; j<ARRAY_SIZE; j++) {
│ movl $0x0,-0x10(%rbp)
│ ↓ jmp a2
│ if(data[j] >= 128) {
4.90 │7e: mov -0x10(%rbp),%eax
0.07 │ cltq
1.18 │ mov -0x20010(%rbp,%rax,4),%eax
19.37 │ cmp $0x7f,%eax
│ ↓ jle 9e
│ sum += data[j];
11.73 │ mov -0x10(%rbp),%eax
11.57 │ cltq
1.01 │ mov -0x20010(%rbp,%rax,4),%eax
6.95 │ add %eax,-0xc(%rbp)
│ data[i] = rand() % 256;
│ }
│
│ for(int i=0; i<ITERATIONS; i++) {
│ int sum = 0;
│ for(int j=0; j<ARRAY_SIZE; j++) {
19.40 │9e: addl $0x1,-0x10(%rbp)
13.79 │a2: cmpl $0x7fff,-0x10(%rbp)
10.02 │ ↑ jle 7e
│
│ for(int i=0; i<ARRAY_SIZE; i++) {
│ data[i] = rand() % 256;
│ }
│
│ for(int i=0; i<ITERATIONS; i++) {
│ addl $0x1,-0x8(%rbp)
0.01 │af: cmpl $0x270f,-0x8(%rbp)
│ ↑ jle 6e
│ sum += data[j];
│ }
│ }
│ }
│
│ return 0;
│ mov $0x0,%eax
│ }
│ leaveq
│ ← retq
This output interleaves assembly with the C code it was generated from. While the output is a bit hard to follow, you can generally find the C code for a set of instructions just before those instructions. The numbers in the left column show what percentage of cycles were spent executing each instruction over the entire run. Once you’ve found an event that is significantly different between the two programs, this view should help you pinpoint the exact line of code or instruction where that event is occurring. When you’re done with this view, type q twice to quit out of perf report. Note that you can use this view for any event of interest, not just CPU cycles.
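The record-and-report workflow is the same for any countable event, not just cycles. A sketch, where EVENT is a placeholder for an event name taken from your perf list output:
$ perf record -e EVENT ./a-v1
$ perf report
If you prefer a non-interactive summary, perf report --stdio prints the report to the terminal instead of opening the browsing interface.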
Using the perf tool, explain why version 1 is so much slower than version 2. You should be specific: you should identify a meaningful event that explains the difference and be able to explain why this event is occurring.
Report your findings, describe the process you used to diagnose the issue, and explain your conclusion.
Here are two versions of another program:
b-v1.c

#include <stdlib.h>
#include <time.h>

#define ARRAY_SIZE 16384

int data[ARRAY_SIZE][ARRAY_SIZE];

int main(int argc, char** argv) {
  srand(time(NULL));

  for(int i=0; i<ARRAY_SIZE; i++) {
    for(int j=0; j<ARRAY_SIZE; j++) {
      data[i][j] = rand() % 256;
    }
  }

  return 0;
}

Copy this code to the file b-v1.c and compile with the command:
$ gcc --std=c11 -g -o b-v1 b-v1.c
b-v2.c

#include <stdlib.h>
#include <time.h>

#define ARRAY_SIZE 16384

int data[ARRAY_SIZE][ARRAY_SIZE];

int main(int argc, char** argv) {
  srand(time(NULL));

  for(int j=0; j<ARRAY_SIZE; j++) {
    for(int i=0; i<ARRAY_SIZE; i++) {
      data[i][j] = rand() % 256;
    }
  }

  return 0;
}

Copy this code to the file b-v2.c and compile with the command:
$ gcc --std=c11 -g -o b-v2 b-v2.c
As with our first pair of programs, these two versions are similar but have very different performance:
$ time ./b-v1
real 0m2.049s
user 0m1.828s
sys 0m0.216s
$ time ./b-v2
real 0m3.765s
user 0m3.544s
sys 0m0.216s
Use hardware performance counters and the perf tool to identify the root cause of the performance difference. It is a given that the two programs will run for different numbers of cycles; your explanation will need to give a sense of why this is happening. If you can find the place where the programs differ in their performance, you may be able to narrow down the set of possible explanations; one possible starting point is sketched below.
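One possible starting point (not the only reasonable one): compare a pair of basic counts across the two versions, then record cycles on the slower version and annotate main to see where the time goes.
$ perf stat -e cycles,instructions ./b-v1
$ perf stat -e cycles,instructions ./b-v2
$ perf record -e cycles ./b-v2
$ perf report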
Report your findings, describe the process you used to diagnose the issue, and explain your conclusion.
Copyright © 2018 Charlie Curtsinger and Jerod Weinman
This work is licensed under a Creative Commons Attribution NonCommercial ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.