CPU-Central

AMD has announced that it commenced initial shipments of AMD-K6 processors based on 0.25-micron process technology during the recently completed quarter (Q4, 1997). The initial shipments, which were fabricated in the company's Submicron Development Center (SDC), included both 266- and 233-megahertz versions of the product.

The 0.25-micron process in the SDC also enabled shipments of the first mobile version of the AMD-K6 processor, used at 233 megahertz in the new Compaq Presario model 1621 notebook PC, which has already been announced.

The K6 is based on NexGen's 686 design: when AMD ran into problems with its K5, it bought out NexGen, acquiring both the 686 design and the team that created it. Compared to the K5, the K6 has MMX, more on-chip cache (32K instruction and 32K data), can process more instructions in parallel, and runs at higher clock speeds. For the K6, AMD dropped the P-Rating system it used for the K5 in favor of the standard MHz rating. The K6 is better than the Pentium MMX in just about every way, falls only a few percent short of the Pentium II in most tests, and costs a lot less than either of these CPUs. However, as with all other non-Intel CPUs, the K6's FP performance is lower than that of a Pentium MMX, which leads to slightly lower performance in FPU-intensive applications such as Quake. That said, the FP performance difference between the Pentium MMX and the K6 is very small.

The K6 is highly compatible with current Socket 7 motherboards. The 166 and 200 MHz versions run at 2.9 V split voltage, which you can find on all new motherboards. The K6 233 is supposed to get 3.2 V, but some people say it runs fine at 2.9 V as well; however, don't quote me on this.

Technical Features

Specifications	AMD-K6
x86 Decoders	2 sophisticated, 1 long, 1 vector
Decode Bandwidth: 32-bit typ/max	1.9/2.0
Decode Bandwidth: 16-bit typ/max	1.8/2.0
Average RISCops/x86: 32-bit code	1.2 (lower is better)
Average RISCops/x86: 16-bit code	1.5 (lower is better)
Maximum ROp Issue Rate	6
Speculative Execution	Yes
Out of Order Execution	Yes
Physical Registers	48
Centralized Buffer max/active	24/18
FPU Multiply/ADD Latency	2/2
Pipeline Stages	6
Misaligned Loads	1 cycle penalty
Branch History Table	8192 entries
Branch Prediction Accuracy	95%
Misprediction Penalty	1-4 (Short Pipeline)
Instruction/Data TLB	64/128 entries
L1 Instruction-Cache	32KB +Predecode 2-Way Set-Assoc.
L1 Data Cache	32KB, 2-Way Set-Assoc. (Load+Store)/cycle
Local Bus Bandwidth	528 MB/sec
Local Bus Latency	2 clocks

Innovative RISC86 Microarchitecture

The AMD-K6 processor's RISC86 microarchitecture features a decoupled decode/execution superscalar design that provides enhanced sixth-generation performance and full x86 binary software compatibility. State-of-the-art design techniques include multiple x86 instruction decode, single-clock internal RISC operations, out-of-order execution, data forwarding, speculative execution, and register renaming. The AMD-K6 processor contains parallel decoders, a centralized RISC86 operation scheduler, and seven execution units that support superscalar operation of x86 instructions. These elements are packed into a highly efficient six-stage pipeline.

AMD's innovative RISC86 microarchitecture implements the x86 instruction set by internally decoding x86 instructions into RISC86 operations that directly support the x86 instruction set while adhering to the RISC performance principles of fixed-length encoding, regularized instruction fields, and a large register set. The RISC86 microarchitecture enables higher processor core performance and promotes straightforward extensibility in future designs. Rather than directly executing complex x86 instructions, which have lengths of 1 to 15 bytes, the AMD-K6 processor executes the simpler, fixed-length RISC86 opcodes, while maintaining instruction coding efficiencies found in x86 programs.

The AMD-K6 processor's advanced branch prediction logic implements an 8,192-entry branch history table, a branch target cache, and a return address stack. These design techniques combine to deliver a prediction rate better than 95 percent.

Decoders. Decoding of x86 instructions begins as the on-chip instruction cache is filled. Predecode logic determines the length of an x86 instruction on a byte-by-byte basis. This predecode information is stored, along with the x86 instructions, in the instruction cache, to be used later by the decoders. The decoders translate up to two x86 instructions per clock into RISC86 operations. These instructions are categorized into three decode types: short, long, and vector.

Scheduler/Instruction Control Unit. The centralized scheduler or buffer is managed by the Instruction Control Unit (ICU). The ICU buffers and manages up to 24 RISC86 operations at a time. The 24-operation buffer size is optimized for the efficient use of the processor's six-stage RISC86 pipeline and seven parallel execution units. The scheduler accepts up to four RISC86 operations at a time from the decoders. The ICU can simultaneously issue up to six RISC86 operations per clock to the execution units.

Registers. When managing the 24 RISC86 operations, the scheduler uses 48 physical registers contained within the RISC86 microarchitecture. These registers are located in a general register file and are grouped as 24 general registers, plus 24 renaming registers.

Branch Logic. The AMD-K6 processor uses dynamic branch logic to minimize delays due to the branch instructions common in x86 software. The processor's sophisticated dynamic branch logic consists of a branch history/prediction table, a branch target cache, and a return address stack. The processor implements a two-level branch prediction scheme based on an 8,192-entry branch history table, which stores prediction information used to predict conditional branches. Since the branch history table does not store predicted target addresses, special address Arithmetic Logic Units (ALUs) calculate target addresses on the fly during instruction decode. The branch target cache augments predicted branch performance by avoiding a one-clock cache fetch penalty. This specialized target cache supplies the first 16 bytes of target instructions to the decoders when the branches are predicted.

Cache, Instruction Prefetch, and Predecode Bits

The AMD-K6 processor's writeback L1 cache features a separate 32-Kbyte instruction cache and a 32-Kbyte data cache with two-way set associativity. The cache lines are prefetched from main memory using an efficient pipelined burst transaction. As the instruction cache is filled, each instruction byte is analyzed for instruction boundaries using predecoding logic. This technique enables the decoders to efficiently decode multiple instructions in a single pipeline stage.

Cache. The processor's cache design uses a sectored organization. Each sector consists of 64 bytes configured as two 32-byte cache lines, which share a common tag but have separate pairs of MESI (Modified, Exclusive, Shared, Invalid) bits that track the state of each cache line.
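As a rough illustration of the sectored layout, the sketch below splits an address into a tag, a set index, a line-within-sector selector, and a byte offset. The set count follows from the figures quoted above (32 Kbytes, two-way, 64-byte sectors gives 256 sets). The bit ordering and every name in the code are assumptions made for illustration, not AMD's documented layout.

```python
CACHE_BYTES = 32 * 1024   # per cache, from the text
WAYS = 2                  # two-way set associative, from the text
SECTOR_BYTES = 64         # one tag per sector, from the text
LINE_BYTES = 32           # two lines per sector, from the text

# 32 KB / (2 ways * 64-byte sectors) = 256 sets
NUM_SETS = CACHE_BYTES // (WAYS * SECTOR_BYTES)


def split_address(addr):
    """Split an address into (tag, set_index, line_in_sector, byte_offset).

    Assumed bit layout (illustrative only): byte offset in bits 0-4,
    line selector in bit 5, set index in the next 8 bits, tag above that.
    """
    byte_offset = addr % LINE_BYTES
    line_in_sector = (addr // LINE_BYTES) % (SECTOR_BYTES // LINE_BYTES)
    set_index = (addr // SECTOR_BYTES) % NUM_SETS
    tag = addr // (SECTOR_BYTES * NUM_SETS)
    return tag, set_index, line_in_sector, byte_offset
```

Under these assumptions, addresses 0x1000 and 0x1020 map to the same tag and set and differ only in the line selector, which is why the two lines can share one tag while keeping separate MESI bits.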

Cache Misses. If an instruction or data cache line required for execution does not reside in the processor's L1 cache, the processor performs a burst cache-line fill from memory. To maximize efficiency of this operation, the processor identifies which of the four quadwords in the cache-line contains the required data or instruction. That quadword is the first to be returned to the L1 cache, thus enabling the processor to continue execution as soon as possible. This technique of varying the burst order improves performance by minimizing execution latency after an L1 cache miss.
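The requested-quadword-first idea can be sketched as follows. The text only states that the needed quadword comes back first; the order of the remaining three quadwords shown here is an assumption, using the interleaved (XOR) burst pattern of Pentium-class buses. The function name is hypothetical.

```python
QUADWORD_BYTES = 8
LINE_BYTES = 32


def burst_order(miss_addr):
    """Return the four quadword addresses of a line fill, requested one first.

    Assumption: after the first quadword, the rest follow the interleaved
    (XOR) order used on Pentium-class buses. The source text does not
    spell this order out.
    """
    line_base = miss_addr & ~(LINE_BYTES - 1)
    # Offset of the requested quadword within the line: 0x00, 0x08, 0x10 or 0x18.
    first = miss_addr & (LINE_BYTES - 1) & ~(QUADWORD_BYTES - 1)
    return [line_base + (first ^ step) for step in (0x00, 0x08, 0x10, 0x18)]
```

For a miss at 0x1014, the requested data sits in the quadword at 0x1010, so that quadword leads the burst and the processor can resume before the rest of the line arrives.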

Prefetching. The AMD-K6 performs cache prefetching for sector replacements only. As a result, the required cache line is filled first, followed by a prefetch of the second cache line. From the perspective of the external bus, the two cache-line fills typically appear as two 32-byte burst read cycles occurring back-to-back or, if allowed, as pipelined cycles.

Predecode Bits. Decoding of x86 instructions is particularly difficult because these variable-length instructions can be from 1 to 15 bytes long. Predecode logic supplies the predecode bits associated with each instruction byte. Among other things, the predecode bits indicate the number of bytes to the start of the next x86 instruction. These bits are stored in an extended instruction cache beside each x86 instruction byte. The predecode bits are passed with the instruction bytes to the decoders where they assist with parallel x86 instruction decoding, thus improving the decoding bandwidth.
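To make the role of the predecode bits concrete, here is a minimal sketch that models only the "bytes to the start of the next instruction" information mentioned above. The real predecode encoding carries more than this and is not described in the text, so the representation and function names below are assumptions.

```python
def predecode(lengths):
    """For each instruction byte, give the distance to the next instruction start.

    `lengths` lists x86 instruction lengths (1 to 15 bytes) in program order.
    """
    bits = []
    for length in lengths:
        if not 1 <= length <= 15:
            raise ValueError("x86 instructions are 1 to 15 bytes long")
        # First byte is `length` bytes from the next instruction; last byte is 1.
        bits.extend(range(length, 0, -1))
    return bits


def instruction_starts(bits):
    """Recover instruction start offsets using only the predecode values."""
    starts, pos = [], 0
    while pos < len(bits):
        starts.append(pos)
        pos += bits[pos]
    return starts
```

With this information stored beside each byte, a decoder that knows where one instruction starts can find the start of the next without re-examining the instruction bytes, which is what allows two instructions to be located in the same clock.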

Instruction Fetch and Decode

Instruction Fetch. The AMD-K6 processor can fetch up to 16 bytes per clock out of the instruction cache or branch target cache. The fetched information goes into a 16-byte instruction buffer that feeds directly into the decoders. Fetching can occur along a single execution stream with up to seven outstanding branches taken. Instruction fetch logic can retrieve any 16 contiguous bytes of information within a 32-byte boundary. No additional penalty occurs when the 16 bytes of instructions lie across a cache line boundary. The instruction bytes are loaded into the instruction buffer as they are consumed by the decoders.

Instruction Decode. The decode logic is designed to decode multiple x86 instructions per clock cycle. The decode logic accepts x86 instruction bytes and their predecode bits from the instruction buffer, locates the actual instruction boundaries, and generates RISC86 operations from these x86 instructions. RISC86 operations are fixed-format internal instructions, and most execute in a single clock. RISC86 operations combine to perform every function of the x86 instruction set. Some x86 instructions are decoded into as few as zero RISC86 operations or one RISC86 operation. More complex x86 instructions are decoded into several RISC86 operations.

The AMD-K6 processor uses a combination of decoders to convert x86 instructions into RISC86 operations. The hardware includes four decoders: two parallel short decoders, one long decoder, and one vector decoder.

All of the common, and a few of the uncommon, floating-point instructions are hardware decoded as short decodes. This decode generates a RISC86 floating-point operation and, optionally, an associated floating-point load or store operation. Floating-point or ESC (Escape) instruction decode is only allowed in the first short decoder, but non-ESC instructions (excluding MMX instructions) can be decoded simultaneously by the second short decoder.

All MMX instructions are hardware decoded as short decodes. This MMX instruction decode generates a RISC86 MMX operation and, optionally, an associated MMX load or store operation. MMX instruction decode is only allowed in the first short decoder, but instructions other than MMX and ESC instructions can be decoded simultaneously by the second short decoder.

Centralized Scheduler

The scheduler is the heart of the AMD-K6 processor. It contains the logic needed to manage out-of-order execution, data forwarding, register renaming, simultaneous issue and retirement of multiple RISC86 operations, and speculative execution. The scheduler's RISC86 operation buffer can hold up to 24 operations. The scheduler can simultaneously issue a RISC86 operation to any available execution unit (store, load, branch, integer, integer/multimedia, or floating point). In total, the scheduler can issue up to six and retire up to four RISC86 operations per clock.

The scheduler and its operation buffer can examine an x86 instruction window equal to 12 x86 instructions at one time. This advantage stems from the fact that the scheduler operates on the RISC86 operations in parallel and allows the AMD-K6 processor to perform dynamic on-the-fly instruction code scheduling for optimized execution. Although the scheduler can issue RISC86 operations for out-of-order execution, it always retires x86 instructions in order.
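One way to arrive at the 12-instruction window from figures quoted elsewhere in this article is sketched below. It assumes the 24-operation buffer fills in groups of four operations (the most the scheduler accepts per clock) and that each group comes from at most two x86 instructions (the most the decoders translate per clock). That grouping is an assumption used for the arithmetic, not something the text states outright.

```python
BUFFER_OPS = 24            # scheduler buffer size, from the text
OPS_PER_DECODE_CYCLE = 4   # operations accepted from the decoders per clock, from the text
X86_PER_DECODE_CYCLE = 2   # x86 instructions decoded per clock, from the text


def max_x86_window():
    """Largest number of x86 instructions the buffer can represent at once."""
    decode_groups = BUFFER_OPS // OPS_PER_DECODE_CYCLE   # 24 / 4 = 6 groups
    return decode_groups * X86_PER_DECODE_CYCLE          # 6 * 2 = 12 instructions
```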

Execution Units

The AMD-K6 processor contains seven independent execution units, each capable of handling the RISC86 operations.

Branch-Prediction Logic

The AMD-K6 processor's sophisticated branch logic is designed to minimize or hide the impact of changes in program flow. Branches in x86 code fit two categories: unconditional branches (which always change program flow) and conditional branches (which may or may not divert program flow). When a conditional branch is not taken, the processor continues decoding and executing the next instructions in memory. Typical applications have up to 10 percent unconditional branches and another 10-20 percent conditional branches. The AMD-K6 branch logic has been designed to handle this type of program behavior and its effects on instruction execution (i.e., stalls due to delayed instruction fetching and draining of the pipeline).

Branch History Table. The AMD-K6 processor handles unconditional branches without any penalty by redirecting instruction fetching to the target address of the unconditional branch. However, conditional branches require the use of the AMD-K6 processor's built-in dynamic branch-prediction mechanism. A two-level adaptive history algorithm is implemented in an 8,192-entry branch history table, which stores executed branch information, predicts individual branches, and predicts the behavior of groups of branches. To accommodate this large branch history table, the AMD-K6 processor does not store predicted target addresses; instead, the branch target addresses are calculated on the fly using ALUs during the decode stage.
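The general shape of a two-level adaptive predictor can be sketched as follows. Only the 8,192-entry table size comes from the text. The 9-bit global history, the XOR indexing, and the 2-bit saturating counters are assumptions chosen for illustration and should not be read as the K6's actual design.

```python
class TwoLevelPredictor:
    """Two-level adaptive predictor sketch with an 8,192-entry counter table."""

    TABLE_ENTRIES = 8192  # from the text

    def __init__(self, history_bits=9):
        # history_bits is an assumed value, chosen only for illustration.
        self.history_bits = history_bits
        self.history = 0                           # level 1: recent branch outcomes
        self.counters = [1] * self.TABLE_ENTRIES   # level 2: 2-bit counters, start weakly not-taken

    def _index(self, branch_addr):
        # Assumed indexing: XOR the branch address with the global history.
        return (branch_addr ^ self.history) % self.TABLE_ENTRIES

    def predict(self, branch_addr):
        """Return True if the branch is predicted taken."""
        return self.counters[self._index(branch_addr)] >= 2

    def update(self, branch_addr, taken):
        """Train the counter for this branch and record the outcome in the history."""
        i = self._index(branch_addr)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask
```

Because the table index depends on recent outcomes as well as on the branch itself, a predictor of this kind can learn repeating patterns, such as a branch that alternates between taken and not taken, that a single counter per branch would mispredict about half the time.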

Branch Target Cache. To avoid a one-clock fetch penalty with a branch prediction, a built-in branch target cache supplies the first 16 bytes of instructions directly to the instruction buffer. The branch target cache is organized as 16 entries of 16 bytes. In total, the branch prediction logic achieves branch prediction rates greater than 95 percent.

Return Address Stack. The return address stack is designed to optimize CALL and RET pairs. To save space, software is typically compiled with subroutines that are frequently called from various places in a program. Entry into the subroutine occurs with the execution of a CALL instruction, which pushes the address of the instruction following the CALL onto the stack. When the processor encounters a RET instruction, the branch logic pops the address from the stack and begins fetching from that location. To avoid the latency of main memory accesses during CALL and RET operations, the return address stack caches the pushed addresses.
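A minimal sketch of the mechanism is shown below. The depth of 16 and the overwrite-oldest policy on overflow are hypothetical, since the text does not give the stack size or its overflow behavior.

```python
class ReturnAddressStack:
    """Small on-chip cache of return addresses, used to predict RET targets."""

    def __init__(self, depth=16):
        # depth is a hypothetical value; the text does not state the stack size.
        self.depth = depth
        self.entries = []

    def on_call(self, return_addr):
        """Record the address of the instruction following a CALL."""
        if len(self.entries) == self.depth:
            self.entries.pop(0)   # assumed policy: drop the oldest entry when full
        self.entries.append(return_addr)

    def on_ret(self):
        """Return the predicted RET target, or None if nothing is cached."""
        return self.entries.pop() if self.entries else None
```

Nested calls unwind in last-in, first-out order, so the most recent CALL always supplies the prediction for the next RET without a trip to main memory.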

Branch Execution Unit. This unit enables efficient speculative execution, allowing the processor to execute instructions beyond conditional branches before knowing whether the branch prediction was correct. The AMD-K6 processor does not permanently update the x86 registers or memory locations until all speculatively executed conditional branch instructions are resolved. The AMD-K6 processor can support up to seven outstanding branches.

Overclockability: Given a good heat sink and fan (the chips run very hot because of the high voltages they use), the K6 can be very overclockable.

3D Performance: The FPU performance of the K6 is not quite as good as that of the Pentium MMX; however, the additional improvements in the K6 significantly help it perform well in 3D applications.

Overall Performance: The performance of the K6 is outstanding. Its business application performance is top notch, and is only outdone (slightly) by the Pentium II. The addition of MMX instructions and strong FPU performance also help this chip in multimedia applications.

Upgradability: The motherboards these CPUs are used with definitely require split voltages, which leads to better support for future CPUs. Also, the K6 tends to require newer motherboards that support higher voltage settings, and newer processors will probably require these settings.

Compatibility: The K6 is compatible with virtually every piece of software that the Intel Pentium is compatible with, and I haven't heard any complaints of software incompatibility problems as yet.

Price: For the very reasonable price you pay for this CPU, you get an excellent business performer and a relatively good gaming performer. I would recommend this processor to anyone who wants high-end performance but doesn't want to spend more than necessary.