Linux Benchmarking HOWTO: The Linux Benchmarking Toolkit (LBT)

3. The Linux Benchmarking Toolkit (LBT)

I will propose a basic benchmarking toolkit for Linux. This is a preliminary version of a comprehensive Linux Benchmarking Toolkit, to be expanded and improved. Take it for what it's worth, i.e. as a proposal. If you don't think it is a valid test suite, feel free to email me your criticisms and I will be glad to make the changes and improve it if I can. Before getting into an argument, however, read this HOWTO and the mentioned references: informed criticism is welcome, empty criticism is not.

3.1 Rationale

This is just common sense:

It should not take a whole day to run. When it comes to comparative benchmarking (various runs), nobody wants to spend days trying to figure out the fastest setup for a given system. Ideally, the entire benchmark set should take about 15 minutes to complete on an average machine.
All source code for the software used must be freely available on the Net, for obvious reasons.
Benchmarks should provide simple figures reflecting the measured performance.
There should be a mix of synthetic benchmarks and application benchmarks (with separate results, of course).
Each synthetic benchmark should exercise a particular subsystem to its maximum capacity.
Results of synthetic benchmarks should not be averaged into a single figure of merit (that defeats the whole idea behind synthetic benchmarks, with considerable loss of information).
Application benchmarks should consist of commonly executed tasks on Linux systems.

3.2 Benchmark selection

I have selected five different benchmark suites, trying as much as possible to avoid overlap in the tests:

1. Kernel 2.0.0 (default configuration) compilation using gcc.
2. Whetstone version 10/03/97 (latest version by Roy Longbottom).
3. xbench-0.2 (with fast execution parameters).
4. UnixBench benchmarks version 4.01 (partial results).
5. BYTE Magazine's BYTEmark benchmarks beta release 2 (partial results).

For tests 4 and 5, "(partial results)" means that not all results produced by these benchmarks are considered.

3.3 Test duration

Kernel 2.0.0 compilation: 5 - 30 minutes, depending on the real performance of your system.
Whetstone: 100 seconds.
Xbench-0.2: < 1 hour.
UnixBench benchmarks version 4.01: approx. 15 minutes.
BYTE Magazine's BYTEmark benchmarks: approx. 10 minutes.

3.4 Comments

Kernel 2.0.0 compilation:

What: it is the only application benchmark in the LBT.
The code is widely available (i.e. I finally found some use for my old Linux CD-ROMs).
Most linuxers recompile the kernel quite often, so it is a significant measure of overall performance.
The kernel is large and gcc uses a large chunk of memory: this attenuates the L2 cache size bias that small tests suffer from.
It does frequent I/O to disk.
Test procedure: get a pristine 2.0.0 source, compile with default options (make config, press Enter repeatedly). The reported time should be the time spent on compilation only, i.e. from the moment you type make zImage; do not include make dep or make clean. Note that the default target architecture for the kernel is the i386, so if compiled on another architecture, gcc too should be set to cross-compile, with i386 as the target architecture.
Results: compilation time in minutes and seconds (please don't report fractions of seconds).
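The timing rule above can be scripted. What follows is only a minimal sketch, not part of the LBT itself: it assumes a Python 3 interpreter is available, and the helper names format_minutes_seconds and time_command are hypothetical. It measures the wall-clock time of a single command and rounds the result to whole seconds, as requested for the report.

```python
import subprocess
import time


def format_minutes_seconds(elapsed_seconds):
    """Round to whole seconds and format as 'M min SS s' (no fractions)."""
    total = int(round(elapsed_seconds))
    minutes, seconds = divmod(total, 60)
    return f"{minutes} min {seconds:02d} s"


def time_command(argv, cwd=None):
    """Run a command to completion and return its wall-clock time in seconds."""
    start = time.monotonic()
    # check=True raises an exception if the command fails, so a broken
    # build is never reported as a valid timing.
    subprocess.run(argv, cwd=cwd, check=True)
    return time.monotonic() - start


# Example use (not executed here). Assumption: source_dir points to a
# configured kernel tree where make dep and make clean are already done.
#
#     elapsed = time_command(["make", "zImage"], cwd=source_dir)
#     print("Kernel compilation time:", format_minutes_seconds(elapsed))
```

Only the make zImage step goes through time_command, so make dep and make clean stay out of the reported figure.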

Whetstone:

What: measures pure floating point performance with a short, tight loop. The source (in C) is quite readable and it is very easy to see which floating-point operations are involved.
Shortest test in the LBT :-).
It's an "Old Classic" test: comparable figures are available, its flaws and shortcomings are well known.
Test procedure: the newest C source should be obtained from Aburto's site. Compile and run in double precision mode. Specify gcc and -O2 as precompiler and precompiler options, and define POSIX 1 to specify machine type.
Results: a floating-point performance figure in MWIPS.

Xbench-0.2:

What: measures X server performance.
The xStones measure provided by xbench is a weighted average of several tests indexed to an old Sun station with a single-bit-depth display. Hmmm... it is questionable as a test of modern X servers, but it's still the best tool I have found.
Test procedure: compile with -O2. We specify a few options for a shorter run: ./xbench -timegoal 3 > results/name_of_your_linux_box.out. To get the xStones rating, we must run an awk script; the simplest way is to type make summary.ms. Check the summary.ms file: the xStone rating for your system is in the last column of the line with your machine name specified during the test.
Results: an X performance figure in xStones.
Note: this test, as it stands, is outdated. It should be re-coded.

UnixBench version 4.01:

What: measures overall Unix performance. This test will exercise the file I/O and kernel multitasking performance.
I have discarded all arithmetic test results, keeping only the system-related test results.
Test procedure: make with -O2. Execute with ./Run -1 (run each test once). You will find the results in the ./results/report file. Calculate the geometric mean of the EXECL THROUGHPUT, FILECOPY 1, 2, 3, PIPE THROUGHPUT, PIPE-BASED CONTEXT SWITCHING, PROCESS CREATION, SHELL SCRIPTS and SYSTEM CALL OVERHEAD indexes.
Results: a system index.
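The geometric mean of the nine indexes can be computed by hand or with a few lines of code. The sketch below assumes Python 3; the index values are invented for illustration only and the dictionary keys are merely labels for the tests named above, so copy the real figures from ./results/report yourself.

```python
import math


def geometric_mean(values):
    """Return the geometric mean of a sequence of positive numbers."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("need at least one value, and all must be positive")
    # Averaging the logarithms is equivalent to taking the n-th root of
    # the product, without the risk of overflow in the product.
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical index values, NOT real measurements.
indexes = {
    "EXECL THROUGHPUT": 12.0,
    "FILECOPY 1": 9.5,
    "FILECOPY 2": 10.1,
    "FILECOPY 3": 8.7,
    "PIPE THROUGHPUT": 15.3,
    "PIPE-BASED CONTEXT SWITCHING": 14.8,
    "PROCESS CREATION": 11.2,
    "SHELL SCRIPTS": 7.9,
    "SYSTEM CALL OVERHEAD": 13.4,
}

system_index = geometric_mean(indexes.values())
print(f"UnixBench 4.01 system index: {system_index:.1f}")
```

A geometric mean is used rather than an arithmetic one because the indexes are ratios to a baseline machine; one very large index then cannot dominate the result.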

BYTE Magazine's BYTEmark benchmarks:

What: provides a good measure of CPU performance. Here is an excerpt from the documentation: "These benchmarks are meant to expose the theoretical upper limit of the CPU, FPU, and memory architecture of a system. They cannot measure video, disk, or network throughput (those are the domains of a different set of benchmarks). You should, therefore, use the results of these tests as part, not all, of any evaluation of a system."
I have discarded the FPU test results since the Whetstone test is just as representative of FPU performance.
I have split the integer tests in two groups: those more representative of memory-cache-CPU performance and the CPU integer tests.
Test procedure: make with -O2. Run the test with ./nbench > myresults.dat or similar. Then, from myresults.dat, calculate the geometric mean of the STRING SORT, ASSIGNMENT and BITFIELD test indexes: this is the memory index. Calculate the geometric mean of the NUMERIC SORT, IDEA, HUFFMAN and FP EMULATION test indexes: this is the integer index.
Results: a memory index and an integer index calculated as explained above.
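The grouping of the BYTEmark tests into the two LBT indexes can be sketched as follows. This assumes Python 3.8 or later (for statistics.geometric_mean); the function name lbt_bytemark_indexes is hypothetical, and the test indexes are expected to be typed in by hand from myresults.dat, since this sketch does not try to parse the nbench output.

```python
from statistics import geometric_mean

# Test groups as defined in the procedure above.
MEMORY_TESTS = ("STRING SORT", "ASSIGNMENT", "BITFIELD")
INTEGER_TESTS = ("NUMERIC SORT", "IDEA", "HUFFMAN", "FP EMULATION")


def lbt_bytemark_indexes(test_indexes):
    """Return (memory_index, integer_index) from a dict of test indexes.

    Entries for other tests (e.g. the discarded FPU tests) are ignored;
    a missing required test raises KeyError.
    """
    memory = geometric_mean([test_indexes[name] for name in MEMORY_TESTS])
    integer = geometric_mean([test_indexes[name] for name in INTEGER_TESTS])
    return memory, integer


# Hypothetical index values, NOT real measurements.
example = {
    "STRING SORT": 1.8,
    "ASSIGNMENT": 2.1,
    "BITFIELD": 2.4,
    "NUMERIC SORT": 2.0,
    "IDEA": 2.6,
    "HUFFMAN": 2.2,
    "FP EMULATION": 1.9,
}

memory_index, integer_index = lbt_bytemark_indexes(example)
print(f"BYTEmark memory index:  {memory_index:.2f}")
print(f"BYTEmark integer index: {integer_index:.2f}")
```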

3.5 Possible improvements

The ideal benchmark suite would run in a few minutes, with synthetic benchmarks testing every subsystem separately and application benchmarks providing results for different applications. It would also automatically generate a complete report and eventually email the report to a central database on the Web.

We are not really interested in portability here, but it should at least run on all recent (> 2.0.0) versions and flavours (i386, Alpha, Sparc...) of Linux.

If anybody has any idea about benchmarking network performance in a simple, easy and reliable way, with a short (less than 30 minutes to set up and run) test, please contact me.

3.6 LBT Report Form

Besides the tests, the benchmarking procedure would not be complete without a form describing the setup, so here it is (following the guidelines from comp.benchmarks.faq):

LINUX BENCHMARKING TOOLKIT REPORT FORM

CPU 
== 
Vendor: 
Model: 
Core clock: 
Motherboard vendor: 
Mbd. model: 
Mbd. chipset: 
Bus type: 
Bus clock: 
Cache total: 
Cache type/speed: 
SMP (number of processors):

RAM 
==== 
Total: 
Type: 
Speed:

Disk 
==== 
Vendor: 
Model: 
Size: 
Interface: 
Driver/Settings:

Video board 
=========== 
Vendor: 
Model: 
Bus:
Video RAM type: 
Video RAM total: 
X server vendor: 
X server version: 
X server chipset choice: 
Resolution/vert. refresh rate: 
Color depth:

Kernel 
===== 
Version: 
Swap size:

gcc 
=== 
Version: 
Options: 
libc version:

Test notes 
==========

RESULTS 
======== 
Linux kernel 2.0.0 Compilation Time: (minutes and seconds) 
Whetstones: results are in MWIPS. 
Xbench: results are in xStones. 
Unixbench Benchmarks 4.01 system INDEX:  
BYTEmark integer INDEX:
BYTEmark memory INDEX:

Comments* 
========= 
* This field is included for possible interpretations of the results, and as 
such, it is optional. It could be the most significant part of your report, 
though, especially if you are doing comparative benchmarking.

3.7 Network performance tests

Testing network performance is a challenging task since it involves at least two machines, a server and a client machine, hence twice the time to set up and many more variables to control. On an Ethernet network, I guess your best bet would be the ttcp package. (to be expanded)

3.8 SMP tests

SMP tests are another challenge, and any benchmark specifically designed for SMP testing will have a hard time proving itself valid in real-life settings, since algorithms that can take advantage of SMP are hard to come by. It seems later versions of the Linux kernel (> 2.1.30 or around that) will do "fine-grained" multiprocessing, but I have no more information than that for the moment.

According to David Niemi, "... shell8 part of the Unixbench 4.01 benchmarks does a good job at comparing similar hardware/OS in SMP and UP modes."