Lab: I/O Buffering
CSC 213 - Operating Systems and Parallel Algorithms - Weinman
- Summary:
- You will measure the effects of system calls and buffering
on file I/O.
- Assigned:
- Tuesday 27 November
- Due:
- 10:30 PM Monday 3 December
- Objectives:
-
- Appreciate the cost of system calls and the advantages of buffering,
another flavor of locality.
- Practice empirical research methods with system behavior benchmarking.
- Develop quantitative and analytical writing skills.
- Collaboration:
- You will complete this lab in teams
as assigned by the instructor.
Background
*nix systems provide two primary ways to do I/O. The usual functions open(2),
read(2), write(2),
and close(2) perform input and output on
files, among other devices. These are system calls that trap to the
kernel and are typically referred to as unbuffered I/O routines.
The standard C library also provides buffered I/O routines,
fopen(3), getc(3),
putc(3), fread(3),
and fclose(3) among others, which are a
further abstraction built atop the system calls above.
Preliminaries
- Read the man pages for the four system calls open(2),
read(2), write(2),
and close(2) to gain an understanding of
how to use them.
- Read the man pages for fopen(3), fread(3),
and fclose(3) to gain an understanding
of how to use them.
- The examples head-1.c and
head-2.c demonstrate some
of these calls (the former being unbuffered, the latter being buffered).
Read over these two examples and be sure you understand how they work.
Exercises
Part A: Reading files
- Write a program unbufread.c with a function
-
long unbufread( int fd, size_t count )
using the unbuffered I/O operation read(2)
to read through all the data in the input file associated with the
file descriptor fd. Note that in particular, your function
should use count as the number of bytes to be read each time
a call to read is made (i.e., the buffer size).
The return value should be the total number of calls made to read,
or -1 if an error occurs, in which case you should print an appropriate
error message to stderr using fprintf(3)
or perror(3).
Note that the unsigned integral type size_t is defined in
the standard header <sys/types.h>, which is pulled in by
<unistd.h>, the header that read(2) requires.
- Your command-line program should then take a buffer size count and
a filename as arguments, call the function above, and report
- the number of reads made by your unbufread function until
the end of file was reached,
- the total wall clock time (see gettimeofday(3)
and timersub(3)),
- the user CPU time (see getrusage(2)), and
- the system CPU time (see getrusage(2)).
You do not need to fork any separate processes. For example,
-
$ ./unbufread 10 /tmp/amt3.log
2415 0.001778 0.000712 0.002383
$ ./unbufread 16 /etc/ssh_host_dsa_key
Unable to open input file: Permission denied
$ ./unbufread 16 /blah
Unable to open input file: No such file or directory
- Write an analogous program bufread.c with a function
-
long bufread( FILE* stream, size_t size )
using the buffered library call fread(3)
to read through the input file associated with the file stream pointer
stream. Note in particular that your function should read
size bytes per call to fread.
The return value should be the total number of calls made to fread,
or -1 if an error occurs, in which case you should print an appropriate
error message to stderr using fprintf(3)
or perror(3).
- Your command-line program should function the same as before, but
call bufread. For example,
-
$ ./bufread 10 /tmp/amt3.log
2414 0.000202 0.000798 0.001681
Part B: Empirical analysis
- To effectively test the performance of these programs, we will need
a relatively large file. To avoid swamping the home directory server
and testing network effects more than filesystem/kernel effects, you
will create this file in the /tmp directory on your workstation's
local disk; create a 100,000,000 byte file using the following command:
-
head -c100000000 /dev/urandom > /tmp/bigfile
Explain how this command works. You will need to do a bit of research
on shell commands and Linux pseudo-devices.
- Run each of your programs with read buffer sizes of 2^0 through 2^20
bytes. That is: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048,
4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, and 1048576.
Collect the output data in a text file or spreadsheet.
- Using your favorite graphing program (a spreadsheet, gnuplot,
MATLAB, etc.), create graphs of the (three) times per read
and times per byte for each method as the read size changes. You will
need to decide whether to use linear or logarithmic axes.
- Organize your findings in a short report that (like your image restoration
analysis) should:
- Introduce the task, question, or problem and your purpose.
- Share the details of the file you used, and anything you may have
learned about the filesystem on the machine used for the experiments.
- Present the results in an organized fashion.
- Summarize the results. Write a paragraph to describe the patterns
you see.
- Draw conclusions. Write a paragraph explaining what you can conclude
about the results. The goal of the conclusions is to try to explain
why the patterns appeared. Why do the results for read
vary so much? Why do the results for read and fread
differ? What can you conclude about how fread works?
What does the system time curve for read
tell you about the file system?
- Make recommendations. What advice would you give to a programmer (e.g.,
your future self) who needs to do I/O efficiently?
Note that performing the experiment and summarizing the results are
separate steps and both come before you draw conclusions. To present
honest and understandable results, we must present the basic data
first (so the reader can draw their own conclusions) before we insert
our bias.
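Collecting the data by hand for 21 buffer sizes and two programs is tedious; a
short shell loop can do it. This is only a sketch: it assumes the two binaries
are built in the current directory, that /tmp/bigfile exists, and the output
filenames unbufread.dat and bufread.dat are arbitrary choices of mine.

```shell
# Run both programs at every power-of-two buffer size from 2^0 to 2^20,
# prepending the size to each output line for easy graphing.
for prog in unbufread bufread; do
  : > "$prog.dat"                       # start a fresh data file
  size=1
  while [ "$size" -le 1048576 ]; do
    printf '%s ' "$size" >> "$prog.dat"
    "./$prog" "$size" /tmp/bigfile >> "$prog.dat"
    size=$((size * 2))
  done
done
```

The resulting space-separated files load directly into a spreadsheet or
gnuplot.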
What to turn in
- Your C files and a Makefile.
- A single pdf containing (merged)
- Your enscripted C programs
- Transcript of compilation and example/test runs of both programs
- Your report from Part B
Extra Credit
The following are a few extra credit possibilities.
- Measure system call overhead. Write a program that performs the same
lightweight system call repeatedly (perhaps 1000 times). Compute the
average time to establish a baseline for the time to do a system call.
Good system calls include read (from /dev/null so there is no actual
I/O), write (to /dev/null), getpid, and gettimeofday; feel free to
try others such as synchronization operations. Also provide an option
to call a function that does nothing but return, so that you can compare
the cost of a system call to the cost of an ordinary function call.
- Measure the cost of creating threads vs. forking processes. To isolate
this cost, the child should exit immediately; the parent will need
to wait until the child exits. You may go on to estimate the incremental
cost of calling exec(2) by comparing the cost of forking and immediately
exiting with the cost of forking and then calling exec(2).
- Reading a file a second time, even using exactly the same code, can
give different results due to caching. In the benchmarking world,
these are called warm cache vs. cold cache effects. Devise an experiment
to determine if such an effect is happening. If you want to try this,
please come talk with me about your ideas.
- As an alternative to writing code, find an interesting systems research
paper that reports on a benchmarking experiment. First summarize the
paper's approach and main findings, and then give your reactions.
Do they follow the experimental method? Be sure to cite your source.
Acknowledgment:
This assignment was inspired by Figure 3.1 of Advanced Programming
in the UNIX Environment, by W.
Richard Stevens. It is based on a version by Jerod Weinman given in
a prior offering of CSC 213, with some text in Part B from a similar
assignment by Janet Davis, which she credits as "adapted from
one given by Barton Miller at the University of Wisconsin, Madison,"
and the extra credit "borrowed from Fred Kuhn at Washington University
in St. Louis."
The background (including head-1.c and
head-2.c) and Part A are:
Copyright © 2012 Jerod
Weinman.
This work is licensed under a Creative
Commons Attribution-Noncommercial-Share Alike 3.0 United States License.