Our question
of whether you trust BYTEmark benchmarks more than SPEC benchmarks
generated a lot of mail and one thing is clear - benchmarks
are part faith and alchemy, and not all science. There are apparently
many variables that can affect how a benchmark program will
perform and most programs can only give you a rough approximation
of comparative performance. Application benchmarks are in general
a better indicator than synthetic benchmarks such as BYTEmark
and SPEC. But even application benchmarks can give a somewhat
misleading picture of comparative performance depending on what
hardware or OS is being stressed. Benchmarks are a crude roadmap.
But even a crude roadmap is better than none when you are trying
to make a purchasing decision. Consult as many sources of performance
information as possible when considering a purchase (we
have listed all the sites we know of with benchmark and performance
information on our links page) and remember that the only benchmarks
that really count are the ones you perform when you run your hardware
through the tasks you need to get done. If you are not satisfied, most
products come with a 30-day money-back guarantee!
Below we post some of the responses we got to our BYTEmark
vs SPEC question:
Thanks for running some numbers. As you
note at the beginning, SPEC was designed to keep the processor
turning. In other words, these tests are designed not to cause
a backlog in the register-starved and low cache Pentiums and
Pentium IIs, so that the maximum amount of processor time
is available. They go up to, but do not cross, the
line where the x86 architecture has trouble. This is done
by keeping the data flow into the processor at a low level
and keeping a steady flow of instructions that use the same
data over and over again.
On the other hand, ByteMark tests are designed to cover a
continuum of cases from the above situation to where even
32 register systems are stressed. These tests focus not on
keeping the processor going, but on how much traffic can get
through to it. Traffic jams in the feeding system cause
down time for the processor.
Real world tests show similarity to both types of tests,
sometimes favoring the SPEC and sometimes favoring the ByteMark,
but mostly in between.
Another thing to remember is that the weak point of the G3
is the FPU. The G3 is a major improvement over the 603 in
the integer suite, but not in the FPU. It also has improved
caching schemes to support the chip. It is in integer tests
under ByteMark conditions that G3s perform at about twice
the level of the Pentium II. Floating point tests (like your
Mathematica 3.0 tests) generally produce speeds between
1 and 1.5 times those of the Pentium II at the same MHz. Also, loading
and storing of registers is an "integer" function
in Pentiums, but G3s have a separate execution unit for loading
and storing. Loading and storing not only occur more frequently
on the Pentiums; the Pentiums also take a bigger hit for these
functions than G3s do.
In answer to your question of which is a better test, it
is better to do both and understand that these are the extremes.
Each real world application will almost always fall in between,
depending on how much data needs to go to the processor.
The fact remains that under ideal situations, where data
flows easily to the processor, the G3 still gets about 50%
better integer performance and about the same FP performance.
When the data flows increase, the Pentium II's feeding system
gets overwhelmed well before the G3's does. For integer tests,
the G3 is about 50% faster during ideal flow situations, and
often more than 100% faster at higher data feeding rates.
A couple more things about your tests that you might think
about.
Your 300 MHz systems both have 66 MHz front-end buses (ratio
of about 4.5:1), thus memory to processor/cache speeds are
the same.
I don't think this is true of your 400 MHz systems. The Pentium
II/400s use a dual 50 MHz bus (similar to the 2x50 MHz bus
on the MACh 5 604e; in the big picture these have almost the
same throughput as a 100 MHz bus). These have an effective
4:1 ratio. The 400 MHz G3 probably still has a 66 MHz bus for
a 6:1 ratio, although it may have a 72 MHz bus for a 5:1 ratio.
Either way, flow from memory to the cache/processor is faster
on the Pentium II than on the G3. Thus you can expect the G3
to take a small hit for this mismatch.
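To make the ratios above concrete, here is a minimal Python sketch that divides core clock by bus clock. The function name clock_ratio and the list layout are illustrative only, and the bus speeds are the figures quoted or guessed at in this letter, not measured values. Note that 400 MHz on a 72 MHz bus works out nearer 5.6:1 than 5:1.

```python
# Core-to-bus clock ratios for the configurations discussed in the letter.
# Bus speeds are the letter's own figures or guesses, not measurements.

def clock_ratio(core_mhz: float, bus_mhz: float) -> float:
    """Return how many core clock cycles elapse per bus clock cycle."""
    if bus_mhz <= 0:
        raise ValueError("bus speed must be positive")
    return core_mhz / bus_mhz

configs = [
    ("300 MHz CPU, 66 MHz bus", 300, 66),
    ("400 MHz Pentium II, 2x50 MHz bus (treated as 100 MHz)", 400, 100),
    ("400 MHz G3, 66 MHz bus", 400, 66),
    ("400 MHz G3, 72 MHz bus", 400, 72),
]

for label, core, bus in configs:
    print(f"{label}: {clock_ratio(core, bus):.2f}:1")
```

A higher ratio means the processor waits more of its own cycles for each trip to memory, which is the "small hit" described above.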
I understand that the next generation systems (Katmai Pentium
IIs and AltiVec G4s) will have true 100 MHz buses early next
year. But while the Pentium II adds Katmai's New Instructions
(for 3-D FP calculations), the G4 will also have a real
FPU (a la the 604e), still better feeding mechanisms (up to
2 MB L2 cache), AltiVec (superior to KNI) and improved multiprocessor
capability. If the G3s do better than the Pentium IIs now,
just wait until February and watch the gap widen, this time
particularly on the floating point side.
Mike Johnson
Neither benchmark is particularly effective.
Especially for the last generation or two of processors,
and even more so in the future, processor speed is very tightly
linked to memory architecture. Memory bandwidth starvation
easily dominates all other factors when it is present. So
for real world performance results, a benchmark must move
reasonably large amounts of data to and from memory, even
when focusing on just the CPU, because memory bandwidth to
CPU is the critical performance factor right now.
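One rough way to see the effect described here is to move a fixed amount of data through working sets of different sizes and watch the throughput change once the buffer no longer fits in cache. The sketch below is only an illustration in Python: the names copy_throughput and sweep and the buffer sizes are made up for this example, interpreter overhead will blur the small-buffer numbers, and results will differ from machine to machine.

```python
import time

def copy_throughput(size_bytes: int, total_bytes: int = 64 * 1024 * 1024) -> float:
    """Copy a buffer of size_bytes repeatedly; return MB moved per second."""
    buf = bytearray(size_bytes)
    # Move roughly the same total amount of data for every buffer size.
    repeats = max(1, total_bytes // size_bytes)
    start = time.perf_counter()
    for _ in range(repeats):
        _copy = bytes(buf)  # one full pass over the working set
    elapsed = time.perf_counter() - start
    return (repeats * size_bytes) / (1024 * 1024) / elapsed

def sweep(sizes_kb=(16, 256, 1024, 8192, 32768)) -> dict:
    """Measure copy throughput for each working-set size given in KB."""
    return {kb: copy_throughput(kb * 1024) for kb in sizes_kb}

if __name__ == "__main__":
    for kb, mbps in sweep().items():
        print(f"{kb:>6} KB working set: {mbps:,.0f} MB/s")
```

A benchmark whose whole data set fits in the smallest of these sizes never leaves the cache, which is the concern raised about small benchmarks.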
How big is the memory footprint of Bytemark? Is it greater
than the 1 MB of many caches today? More than that, the Bytemark
is extraordinarily sensitive to compiler fiddlings. It is
much more a toy benchmark than SPEC, which is the result of
many years of industry benchmarking experience. Certain SPEC
results require standard compiler optimizations, and all SPEC
results require published compiler settings. This avoids the
abuse of compiler vendors recognizing the benchmark and generating
special code just for it (which would give it a benefit you
couldn't realize in real programs, unless your real programs
exactly correspond to the benchmark). There is a whole benchmarking
FAQ which you should read to get an idea of how people lie
using benchmarks (including Apple).
But is SPEC any better? Yes, in some ways. It has a larger
body of code, made up of pieces of commonly used programs _under
Unix_. The SPEC body has a lot of rules to safeguard the credibility
of SPEC results.
No, in other ways. SPEC is expensive, so it isn't available
to the general hobbyist. And it only runs under Unix. Whose
SPEC results are you quoting? How many of the G3 SPECs were
produced on PowerMacs? None were produced under MacOS; how
do you know if the speed of the G3 under some flavor of Unix
(AIX, LinuxPPC, MkLinux) corresponds in any way to its speed
under MacOS? And if the SPEC was produced on an IBM workstation,
how can you use results generated with a completely different
memory hierarchy and hardware? If the SPEC is merely estimated,
what the heck does that mean?
Now, since we persist in wanting to compare processors, and
Macs against PCs, we need some method of comparison. Do what
most reputable people do: compare benchmarks, both Bytemarks
and SPEC and whatever else comes along. Note their deficiencies,
and balance those results with the following application comparisons.
Then compare performance in popular applications; try hard
to find parts of the applications that are performance bound.
For example, scrolling is often a poor test, because the scroll
speed is often limited not by the processor but by the application
code (to avoid scrolling too fast).
What are you asking for in a CPU benchmark? Perhaps this:
given equivalent supporting hardware (memory, logic set, etc),
which CPU performs best. This probably varies for different
performance parameters of the supporting hardware. So, then,
maybe this: which CPU has the most potential speed, given
ideal supporting hardware? But what if a CPU requires much
more complex support logic to provide the same speed? What
if that level of support logic is not available, except on
reference machines, and most machines use far inferior support
logic?
What if comparing CPU benchmarks is like comparing NBA players
based upon their free-throw ability, or comparing swimmers
based upon their lung capacity, or comparing runners based
on the power of their hamstring muscles? In other words, it
produces results almost totally unrelated to the original
question, and ignores many dominant factors.
Food for thought: no current benchmark measures interface
response time, which is what we perceive as speed. We throw
raw performance at the problem and assume that will solve
our problem, while application and OS developers are making
the problem worse with each update.
-- MattLangford
There are two ways to evaluate a system: low
level and high level.
Low level analysis looks at moving bytes, modifying memory,
and very basic activities like sorting or complex math (integer,
floating point, array...).
High level analysis looks at things like how fast the system
boots, how fast a specific program like Mathematica or Word
runs, or how fast the Browser moves.
I think that the Bytemark is the better low level benchmark.
It seems to test more generic system activities like sorting
and bit twiddling.
The SPEC benchmark seems more complex than the Bytemark and
I don't trust it as much because it seems so tied to the compiler
and algorithms. If I were on the Wintel side I'd dedicate
substantial resources to make sure that compilers would be
optimised to deliver maximum performance for these obscure
problems. I'd bet that the Weather Service doesn't use the
same compiler and hardware that this benchmark is run on.
How many home users would use this weather simulation? This
specific test, sexy as it sounds, is irrelevant. I bet that
given enough money any of the tested platforms could be optimized
to score highest. A bad compiler and algorithm could kill
even the best hardware.
It's like comparing cars: tell me the horsepower, the torque,
the type of suspension, not how fast driver X could drive
it through his favorite country road. I think the Bytemark
is the more accurate test of a system's raw potential.
Hi, I would just like you to know that I
believe the BYTEmark tests are more accurate than the SPEC
tests. I own a 333 G3 and compared it to a Hewlett Packard
333 Pentium MMX and there is simply no comparison. It is literally
like comparing night and day. Everything from opening windows
to performing complex tasks in Photoshop is faster. Anyone
who thinks that Wintel boxes are faster, or that there is
no noticeable difference, should sit down behind a new G3 and
do some work. Not just scroll Word documents or open windows,
but actually do some "real" work. I think that they
would find that the G3 processor is far better than anything
Intel is producing, whether it's a PII or MMX.
Wade
...from my real-life experience I can tell that the Pentium
MMX 200 (64 MB RAM) I use every day on Win98 is really a dead
snail compared to the 604e 200 MHz (128 MB RAM) on OS 8.5
that I also use every day. In some cases even Virtual PC on
this Mac is faster than the real Pentium on Windows (Director
projectors play faster on VPC). And since this is a real-life
comparison between a 604e and a Pentium MMX, I guess that it
won't be that different when comparing the next generation
processors on both platforms: a 750 and a Pentium II. Actually,
I do believe that the performance increase from 604e to 750
(maybe except for the FPU which seems to be performing almost
equally on 604e and 750) is greater than from Pentium MMX
to Pentium II.
Kilian
I don't completely trust either test, but
I am more skeptical about the SPECMark bench.
The reason for my skepticism about the SPECMark suite is the
fact that it is easy for Intel to write a compiler which is
optimized for this test suite. Intel has done this.
It has been noted by several magazines how Intel's SPECMark
performance numbers suddenly jumped over 20%! SPECMark numbers
for the Pentium 60 and 66 MHz processors went up over 20% despite
the fact that they were no longer on the market. So how does
a processor jump 20% in performance even though it is not
in production? Simple: Intel started using an "internal"
compiler which was optimized for the SPECMark suite. It is
interesting to note that Intel would not allow ANYONE to use the
compiler or test it against the results of another compiler.
jason h
I wanted to reply to some of the other quotes
you posted on your site, because I believe they were misinformed.
I'll try this once, and then probably let it drop.
Mike Johnson writes, "SPEC was designed to keep the
processor turning. In other words, these tests are designed
not to cause a backlog in the register-starved and low cache
Pentiums and Pentium IIs, so that the maximum amount of processor
time is available."
SPEC was around long before the Pentiums; it was NOT
designed with the x86 family in mind, nor does it especially
favor them. Its roots are in Unix workstations, VAX machines
(and an older, sorry benchmark, MIPS), and Cray supercomputers,
far from the WinTel world.
"This is done by keeping the data flow into the processor
at a low level and keeping a steady flow of instructions that
use the same data over and over again."
This might be a general way of saying "cache-friendly"
or that it doesn't stress the memory hierarchy (L1 cache,
backside/L2, etc) too much. Neither benchmark does that particularly
well, but SPEC isn't worse than Bytemark in this regard. For
example, part of SPEC is the gcc compiler, which can stress
the memory hierarchy more than most programs. "A steady
flow of instructions that use the same data over and over
again" is true on parts of both benchmarks, and false
on other parts.
I agree with Mr. Johnson's comments about FPU performance,
and also about the memory buses. In general, though, sharp
PC manufacturers such as Dell tend to incorporate higher speed
buses to memory sooner than Apple. If a 66 MHz bus is all Apple
is selling, but Dell already has a 100 MHz bus (such as in the
Dimension XPS-R systems right now), isn't it fair to use currently
shipping boxes?
"Real world performance" has little in common with
a CPU benchmark; in the real world it is both possible and
common for Pentium systems running Windows to far outperform
the G3 PowerMac running MacOS--Microsoft Excel, for instance.
The point I'm making isn't that one system is faster, it's
that you don't use CPU benchmarks to compare systems. If you
are trying to compare CPUs, use CPU benchmarks. If you are
comparing complete computer systems running certain applications,
compare the speed of those applications on those systems.
Similarly, "Wade" and "Kilian" use their
real-world systems comparisons to validate a CPU benchmark?
This is nonsensical. Using a CPU benchmark to compare Macs
vs. PCs is misusing the benchmark. Use it to compare Pentiums
vs. PPCs only, and don't generalize the results. It's great
that in the applications they use daily the Macs are faster--that
is all the benchmark you need. Such a result is far more telling
than a toy CPU benchmark.
But don't assume that such experience means that the PPC
is generally faster than the Pentium or Pentium II. Perhaps
the applications or file formats or graphics/movie playing
software were better written for one platform. Perhaps they
stress something other than the processor. Perhaps the observers
were noticing things that have nothing to do with CPU speed
at all (user interface response time, graphics card drawing
speed, disk speed, virtual memory performance, and so on).
Perhaps they are more skilled at optimizing one system over
the other, or perhaps they are willing to spend more on one
system than the other. Perhaps they are so emotionally invested
in the advantages of one system that they minimize its shortcomings
and key on its superiorities--ignore the waits, and harp on
the quicknesses. None of this has anything to do with the
performance of the CPU. But it is more relevant than the CPU
speed in making a computer system purchase.
Lastly, the section below my quote and jason h both claim
that SPEC is more sensitive to compiler opts than Bytemark.
This is not true. The "20%" Intel jump on SPEC results
was an Intel mistake, which they admitted and settled claims
for. The Intel reference compiler is available, despite jason
h's claims. If he can pony up the serious cash, he can buy
a copy.
It is not easy to write a compiler just for SPEC; in fact,
there are several provisions of SPEC which make this difficult,
as I mentioned in my first post. (System specification, using
published commandline switches on a commercially available
compiler, using real application cores, using many different
cores, etc.) On the other hand, if a benchmark is often used
to compare systems, it behooves the compiler writer and hardware
manufacturer to do as well as possible on the benchmark.
A more apt criticism is that _I_ can verify Bytemarks by
downloading and running it. Whether my verification is worth
anything _to a CPU benchmark_ using a cheap compiler and probably
cheap hardware and not being skilled in setting up either
system is another question. It may well be worth more to me
in making a computer _system_ purchase, though, since I'll
be using my cheap compiler (hoping the programs I use don't
though) and cheap hardware and lack of skill. But that doesn't
mean the CPU is slower because I can't take advantage of it,
right?
Or another apt criticism, which the person quoted below me
seems to make, is that SPEC is composed of Unix application
cores. These _are_ commonly used Unix apps, or demanding Unix
apps, but have little to do with what the average PC or Mac
users are running on their boxes. But then, Bytemark isn't
too much more applicable.
Question: SPEC maintains the SPEC benchmark, updating it
when it becomes too small to adequately stress CPUs, protecting
it from outrageous abuses, and so on. Who is maintaining Bytemark?
Matt
Hi-
The only perfect benchmark is the program you plan to run,
because ultimately the only metric which really matters is
how much time you spend waiting. If you're one of those few
people whose principal waster of time is a program of your
own devising, then you can really test different hardware
(and OSs and compilers) head to head. Similarly if you spend
all of your time in Photoshop or Premiere, that should be
your benchmark.
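For the case of a program of your own, a head-to-head test can be as simple as timing the job you actually wait on and running the same script on each machine. What follows is a minimal Python sketch of that idea, not a rigorous harness; time_task and my_workload are hypothetical names, and the sort is only a stand-in for your real work.

```python
import statistics
import time

def time_task(task, runs: int = 5) -> float:
    """Run task() several times and return the median wall-clock seconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        samples.append(time.perf_counter() - start)
    # The median is less sensitive than the mean to one slow outlier run.
    return statistics.median(samples)

def my_workload():
    # Hypothetical stand-in: replace with the job you actually wait on.
    return sorted(range(200_000, 0, -1))

if __name__ == "__main__":
    print(f"median: {time_task(my_workload):.4f} s")
```

Taking the median of a few runs keeps one stray slow run (a disk hiccup, another program waking up) from deciding the comparison.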
Failing such direct tests, the best you can do is figure
out which of the many synthetic benchmarks is the most like
your application. If you do scientific programming, then SPEC
is a decent benchmark, because it's composed of little scientific
problems. For other large computing tasks (where you leave
the machine alone for hours), SPEC is probably also a decent
benchmark. So when I'm buying workstations I pay attention
to SPEC or even LINPACK.
ByteMarks are designed more like the little computing tasks
which happen between mouse clicks, so this might be a better
benchmark for "conventional" applications. However,
in this regard its UI independence is a liability, because
for typical users it's the time to pull down the menu and
redraw the screen that is the limit, not a flurry of LU decompositions
(solving matrix equations). However, if you build a benchmark
with a UI, you're no longer testing the processor, but the whole
package (OS, disk, processor, graphics, etc.), so such benchmarks,
MacBench for example, are really only good for comparing machines
with the same OS. There is talk of using the cross-platformness
of Linux as a test, but even then you're talking about different
levels of patching and optimization across hardware lines.
Furthermore, it's unclear how Linux performance maps to,
say, that of Word.
In short, it's easy to say which machine runs program X faster,
whether program X is a real world application or some benchmark
made up to "simulate" a class of real world applications.
But since the answers are different for each X, the notion
of a faster machine depends heavily on what you do with it.
There is no "one size fits all" way to rank machines.
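The point that the answer changes with X can be shown with a toy calculation. In the Python sketch below the two machines, their timings, and the usage mixes are all invented for illustration; none of it is measured data. The same pair of machines ranks differently depending on how much integer versus floating point work the user does.

```python
# Hypothetical seconds per task on two made-up machines; not measured data.
TIMES = {
    "machine_a": {"integer": 10.0, "floating_point": 20.0},
    "machine_b": {"integer": 15.0, "floating_point": 12.0},
}

def total_time(times: dict, mix: dict) -> float:
    """Weighted time for one machine, given how often each task type runs."""
    return sum(times[task] * weight for task, weight in mix.items())

def fastest(mix: dict) -> str:
    """Name of the machine with the lowest weighted time for this mix."""
    return min(TIMES, key=lambda name: total_time(TIMES[name], mix))

# A mostly-integer user and a mostly-floating-point user get different answers.
print(fastest({"integer": 0.9, "floating_point": 0.1}))  # machine_a
print(fastest({"integer": 0.1, "floating_point": 0.9}))  # machine_b
```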
Raph
Benchmarks are very difficult things to
compare - it's like comparing houses or cars. There are so
many different things to measure, and a great deal of honest
difference of opinion on what's important. There's also a
long history of fakery, mismeasurement, and puffery.
The SPEC suite of benchmarks is a well-tested, public (more
or less) set that's been run on many, many processors, going
back into the 1980s. It was one of the first benchmark sets
to demand that everyone must report all the sub-measurements,
and document exactly what was tested. It's a benchmark of
processor + compiler mostly, not testing, for example, graphics
performance or disk performance. The best use of SPEC is probably
when you're intending to select a processor for a board you're
going to design, or to get a rough idea of how a processor
compares to many others. SPECmarks can improve by using a
better compiler or memory system. This means that there really
isn't one SPECmark number for a given processor - you really
must include the particular motherboard and compiler used
as part of the system that's being tested. A historical problem
with SPEC on Intel has been that Intel measures CPUs on specially
built motherboards, with extremely expensive memory systems,
and a compiler that is used for nothing but benchmarks. I
don't know what they do now. Other vendors tend to use something
more like what a user might buy.
ByteMarks are a different set that also attempts to measure
CPU performance, typically run on real systems that you can
purchase. It hasn't been as widely tested. The lack of testing
means two things - it may have some flaws that SPEC doesn't,
and the results haven't been cooked by manufacturers as much.
As an analogy, imagine benchmarks on who builds the fastest
car - Fevy, Chord, or Volksiat. Fevy says they have the fastest,
because their engineers have shown on a dynamometer that their
engine is the most powerful. Chord disagrees, saying that
their cars all can go from 0-60 in 4 seconds. Volksiat says
both are wrong, that the better aerodynamics of their cars
mean that, even with less horsepower, they can reach a top
speed higher than either Fevy or Chord. Who's right? It depends
on what your particular need is, just as it does with computers.
You can tell that all the cars are fast, probably much faster
than every economy car.
On the particular question of whether Pentiums or PowerPCs
are faster, both sets of benchmarks have some value. Both
seem to show that PowerPCs are somewhat faster at the same
clock speed, and that both Pentiums and PowerPCs can run mighty
fast. That's probably all you can confidently say.
Bill Paulson