The MYTH of MIPS

McColley Systems Software Inc.
If it's assembler, WE CAN HELP!
"Elegant Solutions for Your Processing Needs"




MIPS - everyone who is anyone in information processing knows what MIPS are. The mainframe world has been using the term for roughly 25 years or more now. For the uninitiated, MIPS is of course an acronym, like so many things; this one stands for Millions of (assembler or machine) Instructions Per Second. It is the term most often used to distill into a single sentence, a single graspable concept if you will, the relative processing power of one mainframe computer compared to any other mainframe computer. Every manager, salesman, and even technician knows and uses the term. Yet it is seldom if ever safe to reduce the relative merits or capabilities of a large and complex system to a single term, let alone a term that provides what appears to be a very specific relative index value. Beyond the fact that oversimplification is itself a significant problem, there is the underlying inconvenient fact that a MIP is not a MIP - it is a mythical numeric value prepared so that salespeople can more easily converse with the management that must make buying decisions.

Wow, what a claim. "The entire culture is wrong? This must be a bogus claim," you say. Well, let's get down to what any good HLASM or assembler programmer should already know. A simple fact that should be self-evident, but apparently is not, is this: all instructions were not created equal. Some assembler instructions do very little work - important work, but very little work. Some assembler instructions do a great deal of work. Some assembler instructions perform a variable amount of work. That rather messes with the idea that the "machine" is basically a very simple machine: you load it with a program, and for each turn of the crank one instruction is performed; the faster you turn the crank, the more instructions get performed, the higher your MIPS rating, and the more the salesman can justify charging the manager, who also wants to think of the processor complex as a simple, easy to understand machine.

IBM used to provide timing information for each of the assembler instructions, a long time ago when things were so very much simpler than they are today. They really cannot do that anymore, for a number of reasons, but the biggest is probably that the results are not repeatable. The time it takes to execute any single instruction can be significantly altered, primarily by the system cache and by other clever engineering devices that are intended to speed up the processor - most of the time. For instance, a feature called branch prediction saves time and is right almost all of the time, but when it is wrong it can increase processing time significantly.

Let's take a couple of very basic, fairly simple instructions as a basis that may be a bit easier to swallow than just taking my statements at face value. I would like to compare the " LA " Load Address instruction with the " MVC " Move Character instruction. These are certainly not exotic instructions; in fact they are probably two of the most commonly used of all instructions, and both have been around since the original 360 instruction set was introduced. I contend, and can prove, that the " LA " instruction operates much more quickly than the " MVC " instruction - and I'm not talking about something like a piddly little ten or fifteen percent difference, although that would be significant. I'm talking about the LA instruction executing something like 10 TIMES or so faster than the MVC instruction. So we are not talking about minor variations between instructions, but massive ones - ones that every good assembler or HLASM programmer should be aware of so that they can design and code efficiently.
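If you would like to convince yourself, the claim is easy enough to test. The fragment below is only a sketch of my own; the labels, register numbers and loop count are invented, and it assumes a base register and addressability are already established. It brackets a large number of LA executions with STCK (Store Clock); the same harness can be rerun with an MVC of a print line in place of the LA. The BCT loop has its own overhead, so an empty loop should be timed as well and subtracted.

* Illustrative timing sketch only - labels invented for this article.
         STCK  TODSTART            TOD clock value before the loop
         LA    2,1000000           iteration count in register 2
TIMELOOP LA    3,4(4,5)            the instruction being measured
         BCT   2,TIMELOOP          decrement reg 2, repeat until zero
         STCK  TODSTOP             TOD clock value after the loop
* (TODSTOP - TODSTART) divided by the iteration count approximates
* the cost of one LA plus the BCT overhead.
* Storage definitions shown here for brevity; in a real program they
* belong with the rest of the program's data.
TODSTART DS    D                   doubleword for Store Clock
TODSTOP  DS    D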

Why the significant difference in speed? There are two major reasons: the first is simply the amount of "work" accomplished by each instruction, and the second is the type of work each instruction performs, that is, what each instruction examines or modifies, and where. Let's take a quick look at each of these components.

The amount of work that each instruction performs affects its speed, so let's see what each of these instructions actually does. The " LA " instruction loads an address into a single register. To develop the address that is loaded, the instruction must "fetch" three values and logically add them together: the value in a base register, the value in an index register, and a displacement value taken directly from the instruction itself. The resulting address is then placed into the target register, completing the process. Since we are quantifying the amount of work being done, note that the " LA " instruction deals with a 12-bit (one and a half byte) displacement embedded in the instruction and with the low-order 31 or 24 bits of the registers it works with. For this explanation we will assume it deals with a full 31 bits, which is a "bit" less than a full four bytes of data.

The " MVC " Move Character instruction is used to move bytes of data from one location in storage to another, and a single MVC can move anywhere from 1 to 256 bytes. Let's assume we are using it to clear a print line that is 133 bytes long. To determine where the data is moved from, and where it is moved to, the MVC instruction must retrieve the value of a register for each of the two addresses to be developed (from and to), and add each of those values to a from and a to displacement, each again 12 bits (one and a half bytes) long. After the two addresses are developed, a final value, the length, is extracted from the instruction itself. So far this is all just setup; the value we expect from the instruction comes when a byte is fetched from the "from" location and placed in the "to" location, the internal addresses are incremented, and the process is repeated the number of times specified by the length in the MVC instruction. You can see that the MVC instruction does almost twice the work of the LA instruction before the first byte of data is ever retrieved from storage to be moved to its new location, and as we will see in a few sentences, the really time consuming part of all of this is the data movement, not the address arithmetic used to develop the addresses. You should already see that instructions do varying amounts of work; viewed this closely, it should be clear that the volume of work performed by different instructions is not equal. The differences between LA and MVC are typical, not extreme. There are extreme examples, where specialized instructions perform a hundred or more times as much work as the simple MVC we just detailed, but for the sake of this discussion we are sticking to typical, commonly used instructions.
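To put concrete syntax to that description, here is a small illustrative fragment; the labels are invented for this example, and a base register is assumed to be established.

* Illustrative fragment - labels invented for this article.
* LA: base (reg 5) + index (reg 4) + 12-bit displacement (8) into reg 3.
* Pure address arithmetic - no storage operand is touched.
         LA    3,8(4,5)            develop an address, load it into reg 3
* MVC: two storage operands; the length (133) sits in the instruction,
* derived by the assembler from the length of PRTLINE.
         MVC   PRTLINE,BLANKS      clear the 133-byte print line
* Storage definitions (normally grouped with the program's other data):
PRTLINE  DS    CL133               the output print line
BLANKS   DC    CL133' '            a constant of 133 blanks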

I also said that the differences in speed are not restricted simply to the amount of work each instruction does, but extend to the TYPE of work it does. As we look closer and closer, the matter just seems to get more and more complex. The work the instructions of our programs do can be very basically described as moving data, comparing data, basic mathematical functions (addition, subtraction, etc.) and control functions, that is, deciding which instruction to execute next, or possibly changing some basic control information such as the addressing mode. Described even more basically, we can boil it down to accessing data: either we just get a copy of it, or we replace it. Where we get this data from, or where we update it, can make a huge difference in the time required to perform a unit of work.

To properly explain this component of speed we have to look at basic processor architecture for just a second - don't worry, we won't look long enough to affect your vision permanently. We will start from the inside and work our way out. Inside the processor itself - which is separate from the real memory associated with the computer - are the internal registers. Registers are used for all manner of things, everything from addressing to counters; in fact the most commonly used set of registers is officially known as the "General Purpose" registers. Everything that occurs in the computer really happens here, inside the processor that contains the registers: all comparisons, all math functions, everything. Even moving data from one location in storage to another requires the data to move through the processor. A complex and extensive system of multilevel cache exists expressly for the purpose of getting the right data into and out of the processor as quickly as possible, but for now we only need to remember that everything really happens inside the processor, and everything that happens inside the processor happens very fast. Register arithmetic, because the registers are inside the processor, is one of the fastest of all processes that can occur.

As we proceed beyond the processor itself we find processor storage. I have already mentioned memory cache; there is also a difference between accessing virtual and real memory, and most systems people can go on at length describing the paging process, but for our purposes we can simply consider memory as a homogeneous storage area outside the processor that, in general, takes somewhere on the order of ten times as long to access as data in registers. There are obviously differences from model to model, and by actual storage type, etc., but the fastest possible memory location takes far longer to access than any register value - that is the salient point to remember. Proceeding even further away from the processor is the channel subsystem, through which we can (eventually) access data on tape or DASD storage. The same comparison holds for external data versus processor memory as held between processor memory and the registers inside the processor. So we have locations where data can be accessed from, or stored into, that can be classified in order of relative speed: registers are the fastest, then processor memory, and finally external storage devices (no matter how fast), which are the slowest of all. As we move from one type of storage area to the next we observe very significant differences in speed: no external storage device even comes close to the speed of real memory, and likewise no real memory even comes close to the speed observed when accessing data in registers.
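As a rough illustration of those relative costs (the label and register numbers here are mine, invented for the example), the same addition can be done against a register or against storage:

* Illustrative fragment - label invented for this article.
* Both operands inside the processor: register-to-register, fastest case.
         AR    3,4                 add the contents of reg 4 to reg 3
* One storage operand: the fullword must come in from memory (or cache).
         A     3,COUNT             add the fullword COUNT to reg 3
* Getting COUNT out to tape or DASD is not a single instruction at all;
* it goes through an access method and the channel subsystem, and is
* slower again by a similar or larger margin.
COUNT    DC    F'1'                a fullword in processor storage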

Ok, now back to our concrete example, LA vs. MVC. Based on what we now understand about the processor architecture, the LA instruction's slowest step is probably moving the LA instruction itself from storage (where the program resides) into the processor where it is executed. Of course the movement of the instruction from memory into the processor is common to all instructions, so we can safely ignore it, as long as we understand that it too is a movement of data from storage into the processor and is a relatively slow process. Once the instruction has been brought into the processor, the arithmetic, the register fetches and the eventual register store, even extracting the displacement value embedded in the instruction, all occur inside the processor - at processor speeds, a very fast instruction indeed. The MVC instruction, on the other hand, after it has been fetched into the processor, must perform address arithmetic to develop the address to move data to and the address to move data from, and then the movement of data occurs - and remember, movement of data within processor memory occurs at a completely different order of magnitude of speed. A reasonable estimate of the speed of the MVC instruction can therefore "throw out" the entire process of developing addresses and rely entirely on the number of bytes of data to be moved, which is where the vast majority of the time is spent for an MVC instruction. So you see, an MVC instruction that only needs to move 16 bytes of data could execute in roughly half the time of a move of 32 bytes, and certainly much, much faster than a move of an entire 133-byte print line, for instance.
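A quick sketch of that last point (field names invented, with PRTLINE and BLANKS as in the earlier fragment): the three moves below differ only in the number of bytes moved, and that length is where the time goes.

* Illustrative fragment - labels invented for this article.
         MVC   SHORTFLD,SRCAREA    16 bytes moved
         MVC   MEDFLD,SRCAREA      32 bytes - very roughly twice the time
         MVC   PRTLINE,BLANKS      133 bytes - the print line clear
SHORTFLD DS    CL16                16-byte target
MEDFLD   DS    CL32                32-byte target
SRCAREA  DC    CL32' '             source field, long enough for both moves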

Now, to continue the explanation of differences in instruction speed, you can break most instructions down into classes. First are those that execute entirely within the processor, such as LA, AR, SR, MR, etc.; I would also throw in those instructions that use an "immediate" value, one embedded directly inside the instruction, since the instruction itself is inside the processor at the time it is executed. So you can see that an " AHI " ADD HALFWORD IMMEDIATE instruction, which adds a halfword taken from the instruction to a register, will execute in about the same length of time as an LA or AR instruction, since all the pieces necessary to perform the function exist within the processor - the register to be added to as well as the embedded halfword addend.
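For instance (register numbers and values invented for illustration), none of the following touch storage for their data; everything they need is in a register or in the instruction itself:

* Illustrative fragment - register numbers chosen for this article.
         LA    3,132               load the constant 132 into reg 3
         AR    3,4                 add reg 4 to reg 3
         SR    5,5                 clear reg 5 by subtracting it from itself
         AHI   3,100               add the immediate halfword 100 to reg 3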

The next slowest class of instructions would be those that make a single storage access, such as ST, MVI, L, etc. I should mention that the direction of data movement, whether from processor to memory or from memory to processor, does not cause any significant difference in processing speed, so we can concentrate on the number of accesses and the amount of data accessed.
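A few members of this class, again with invented labels:

* Illustrative fragment - labels invented for this article.
         L     3,FULLWORD          one storage fetch: load a fullword into reg 3
         ST    3,SAVEWORD          one storage store: put reg 3 back into memory
         MVI   FLAG,X'FF'          one storage store: the byte comes from the instruction
FULLWORD DC    F'0'                fullword source
SAVEWORD DS    F                   fullword target
FLAG     DS    X                   one-byte flag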

The slowest of all instructions, classified as we have done here, are those that make two (or more) storage accesses. Instructions belonging to this class include MVC, PACK, UNPK, CLC, TR, etc.
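A few examples of this class, with invented labels (the translate table is left empty here purely for brevity):

* Illustrative fragment - labels invented for this article.
         MVC   OUTAREA,INAREA      fetch from one area, store into another
         CLC   OUTAREA,INAREA      fetch from two areas to compare them
         PACK  DWORK,ZONED         zoned source fetched, packed result stored
         TR    OUTAREA,TRTABLE     each byte fetched, looked up in the table, stored back
OUTAREA  DS    CL80                80-byte work area
INAREA   DC    CL80' '             80-byte source area
ZONED    DC    CL5'12345'          zoned-decimal source field
DWORK    DS    D                   doubleword for the packed result
TRTABLE  DC    256X'00'            translate table (contents omitted)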

Of course, moving data to an external medium such as tape or DASD becomes much more complex than anything that can be done with a single instruction, but physical I/O to external media constitutes yet another category of functions, much slower than everything else.

Long ago when I first started in this business, and the earth was still cooling, IBM provided us with the 'speed' of individual instructions on a new model of processor - which is yet another testament to the fact that all instructions are not equal. It really is not possible for them to do that any longer, primarily because of the features I am about to describe: they can cause an instruction to execute at different apparent speeds depending on where the data it accesses is located and what the access pattern for that data is. It is also true that the apparent speed of an instruction can be affected by the instructions located just before, or just after, the one we would like to measure. So to summarize: not only do different instructions take different lengths of time to execute under ideal circumstances, but the same instruction will operate at different speeds at different times because of the engineering 'assists' I am about to describe.

The engineers that design the newest processors are extremely creative. So that memory accesses occur faster than they otherwise would, cache was built into the processor design; the more data that can be brought into this expensive, special purpose storage area, the quicker the processor can access it - up to a point, of course. On the newest machines, program instructions and program data are cached independently of each other, and furthermore there are multiple levels of each type of cache, data or instruction. Like all engineering tricks, this works most of the time. Unfortunately, a programmer who is unaware of this cache can unintentionally write code that invalidates the instruction cache, forcing it to be reloaded (perhaps over and over again) and seriously degrading the speed of his or her program.

I should also mention that modern processors use what is known as an instruction pipeline; the last specs I read described a six-level-deep pipeline. What this means is that up to six different program instructions can be executing simultaneously: while one instruction is having a base/index/displacement value resolved into an address, another may be decoding the opcode and updating the PSW, while yet another is retrieving data from the internal cache. Sometimes an instruction can execute so quickly that it completes before physically preceding instructions, and if an earlier physical instruction is eventually processed that would have affected the result of an already completed (but logically following) instruction, the original results must be thrown out and redone. Again, I'm not trying to worry you that your instructions may not be executed in order - they will always APPEAR to be executed in the proper order, whether they really were or not. Nor am I trying to worry you about the internal functions of the ALU (arithmetic logic unit) of a processor. What I am trying to show you is that it is a VERY, VERY COMPLEX matter to try to determine a base instruction speed, or even the speed of a single given instruction, when the speed at which it executes can be influenced by the locality of its data and even by the preceding or following instructions in any piece of code. Fortunately, most of the cases where poor performance comes from code written without regard to these engineering tricks (or assists) that speed up the processor are avoided by simply using reentrant coding techniques.
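As one concrete illustration of the instruction cache point (labels and register numbers are mine, not taken from any particular program): the old self-modifying trick of storing a length byte into an MVC in the instruction stream, compared with the reentrant EX form, which supplies the length without touching the program. The two alternatives are shown together only for comparison.

* Illustrative fragment - labels invented for this article.
* Non-reentrant: the STC stores into the instruction stream, so the
* cache line holding MOVEIT must be invalidated and refetched.
         STC   5,MOVEIT+1          overlay the MVC length byte (length-1 in reg 5)
MOVEIT   MVC   OUTREC(0),INREC     length filled in at run time
*
* Reentrant: EX supplies the length from reg 5 to the executed copy of
* the target instruction; the program itself is never modified.
         EX    5,MOVEMVC           execute the MVC below with length-1 from reg 5
MOVEMVC  MVC   OUTREC(0),INREC     subject of the EX, never stored into
OUTREC   DS    CL256               target area
INREC    DC    CL256' '            source area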

You should now have a firm grasp of WHY all instructions are not equal, and therefore why a MIP is not a MIP, but rather a clever handle that is of general use mostly to sales people and upper level managers. A real and serious discussion of the speed of any given processor or processor complex depends heavily on your individual workload, and is best approached by your capacity and performance analysts with sophisticated modeling tools that will, if used properly, result in an excellent guess as to the actual throughput you will see in your shop for your particular workload - where processor speed is but one of many components used to come to a reasonable conclusion.

I once knew a wise old guy who had a hand-printed sign on his desk that read:

"ALL COMPLEX DIFFICULT TO UNDERSTAND PROBLEMS HAVE ... SIMPLE, LOGICAL, EASY TO UNDERSTAND
WRONG ANSWERS"

Trying to assign a MIPS rating to a new processor is a lot like that - simple, logical, easy to understand, but basically wrong. Oh, and look - I still have that sign on my desk....

For more information you may want to start with CMG, or perhaps Cheryl Watson's excellent web site.

In the meantime, if you have a piece of code that needs to be written in assembler (HLASM) for speed or efficiency, and you need someone who understands these differences - remember -

"IF IT'S ASSEMBLER, WE CAN HELP!"

Use this link to Contact Us if your e-mail system is on this machine.

Stephen.McColley@MVSProgrammer.com





Last updated August 2011 - copyright © 2011 McColley Systems Software Inc.