Will HBM replace DDR and become Computer Memory?

Alan   utmel.com   2021-10-21 17:38:58


Topics covered in this article:

What is HBM?

Can HBM memory match the CPU?

Disadvantages of HBM

Is HBM suitable for PC memory?


What is HBM?

HBM (High Bandwidth Memory) is a type of memory chip (i.e., "RAM") for CPUs and GPUs. In essence, multiple DRAM dies are stacked together and packaged alongside the GPU or CPU, forming a large-capacity, high-bit-width memory array.


HBM plan view

The die in the middle is the GPU/CPU, and the four smaller dies on either side are stacks of DRAM. Each stack generally holds 2, 4, or 8 DRAM layers; the first HBM generation topped out at four layers per three-dimensional stack.

HBM is no longer uncommon as GPU memory. Many people know that HBM is expensive, so even though it is not rare, it appears only on high-end products, such as Nvidia's data-center GPUs. AMD's use of HBM on consumer GPUs is one of the few exceptions.

Some gamers know HBM as a high-speed memory whose bandwidth far exceeds DDR/GDDR, built internally from 3D-stacked DRAM. Some PC users have wondered whether HBM could serve as main memory in ordinary desktop and notebook products. The cost is high, but this industry has plenty of enthusiasts willing to pay; besides, GPUs already use HBM, don't they?

Can HBM memory match the CPU?

CPUs paired with HBM do in fact exist. The A64FX chip in Fujitsu's supercomputer Fugaku is paired with HBM2 memory, and Intel's upcoming Sapphire Rapids Xeon processors will ship in an HBM version next year. NEC's SX-Aurora TSUBASA is another example.

So pairing a CPU with HBM is at least feasible (although, strictly speaking, chips such as A64FX go beyond the scope of a conventional CPU). Still, these products target data centers and HPC. Is it simply the price that keeps HBM out of the consumer market? That may be an important reason, and perhaps close to the root cause. This article uses HBM as a starting point to discuss the characteristics and usage scenarios of this kind of memory, and whether it will ever replace the DDR memory so common in computers today.


Looking at HBM from above, source: Fujitsu

In its common form, HBM appears on the package surface as a few dies sitting very close to the main chip (CPU or GPU). In the picture above, the A64FX is surrounded by four packages, all of which are HBM memory. This arrangement is quite different from ordinary DDR memory.

One characteristic of HBM is that it achieves higher transmission bandwidth than DDR/GDDR in a smaller footprint and (in part) with higher efficiency. Each HBM package stacks multiple DRAM dies, so it is also a 3D structure: the DRAM dies are connected by TSVs (Through-Silicon Vias) and microbumps. Below the stacked DRAM dies sits an HBM controller logic die, and the bottom of the stack is interconnected with the CPU/GPU through a base die and a silicon interposer.


Looking at HBM from the side, source: AMD

From this structure it is not hard to see that the interconnect width is much larger than that of DDR/GDDR: the number of contacts underneath can far exceed the number of traces connecting DDR memory to a CPU. The physical scale of HBM2's PHY interface is not on the same level as a DDR interface; HBM2's connection density is much higher. In terms of bit width, each DRAM die contributes two 128-bit channels, so a stack four DRAM dies high is 1024 bits wide. Many GPUs and CPUs place four such HBM stacks around the main die, for a total width of 4096 bits.

For comparison, each GDDR5 channel is 32 bits wide, so 16 channels total only 512 bits. The current mainstream second generation, HBM2, can stack up to 8 DRAM dies per stack, improving both capacity and speed. Each HBM2 stack supports up to 1024 data pins, and each pin can run at 2000 Mbit/s, for a total bandwidth of 256 GByte/s. At 2400 Mbit/s per pin, one HBM2 stack delivers 307 GByte/s.
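These per-stack figures follow directly from pin count and per-pin rate. As a quick sanity check (the helper function below is my own illustration, not anything from a vendor datasheet):

```python
def hbm_stack_bandwidth_gbytes(pins: int, pin_rate_mbits: float) -> float:
    """Aggregate stack bandwidth in GByte/s: pins * (Mbit/s per pin), bits -> bytes."""
    return pins * pin_rate_mbits / 8 / 1000

# 1024 data pins at 2000 Mbit/s per pin
print(hbm_stack_bandwidth_gbytes(1024, 2000))  # 256.0 GByte/s
# 1024 data pins at 2400 Mbit/s per pin
print(hbm_stack_bandwidth_gbytes(1024, 2400))  # 307.2 GByte/s
```

Multiplying by four stacks gives the 1 TByte/s-class totals seen on HBM2-equipped accelerators.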


Source: Synopsys

The picture above is a comparison of DDR, LPDDR, GDDR, and HBM from Synopsys. The Max I/F BW column shows that the other contenders are not in the same order of magnitude as HBM2. Such bandwidth is a breath of fresh air for highly parallel workloads such as scientific computing, computer vision, and AI. Intuitively, because HBM sits so close to the main chip, it can also achieve higher transmission efficiency (in energy consumed per bit transferred, HBM2 indeed holds a large advantage).

Set aside cost and total capacity for a moment: if HBM really were used as main memory in a personal computer, wouldn't it be perfect?

Disadvantages of HBM

Poor flexibility

HBM was first initiated by AMD in 2008, with the original goal of reining in the power consumption and physical size of computer memory. Over the following years AMD worked through the technical problems of die stacking, then found industry partners with experience in stacking storage media, including SK Hynix, as well as manufacturers in the interposer and packaging fields.

HBM was first manufactured by SK Hynix in 2013, the same year it was adopted as JEDEC standard JESD235. The first GPU to use HBM was AMD's Fiji (Radeon R9 Fury X) in 2015. The following year Samsung began mass production of HBM2; NVIDIA's Tesla P100 was the first GPU to use HBM2.

HBM's form factor reveals its first shortcoming: a lack of flexibility in system configuration. For PCs, expandable memory has long been a conventional capability, but HBM is packaged together with the main chip, so there is no possibility of expansion; the specification is fixed at the factory. This goes further than today's notebooks with DDR soldered to the motherboard: HBM is integrated by the chip manufacturer itself, leaving even less flexibility, especially for OEMs.

For most chip manufacturers selling processors to the mass market (including the infrastructure market), cost and other considerations make it unlikely they would launch SKUs in many different memory capacities. These manufacturers already offer numerous configuration models (consider the many models of Intel Core processors); further subdividing them by memory capacity could make the manufacturing cost hard to sustain.

Capacity is too small

The second problem with HBM is that its capacity is more limited than DDR's. A single HBM package can stack 8 DRAM dies; at 8 Gbit per die, 8 dies give 8 GByte. A supercomputing chip like A64FX provides four HBM interfaces, that is, four stacks, for a total of 32 GByte per chip.

Such a capacity is small by DDR standards. Ordinary consumer PCs commonly carry more than 32 GByte of memory. Not only do PC and server motherboards offer many expandable memory slots, but some DDR4/5 DIMMs also stack DRAM dies themselves. Using relatively high-end stacked DRAM dies, 2-rank RDIMMs (registered DIMMs) reach 128 GByte per module; with the 96 DIMM slots of a high-end server, that is up to 12 TByte.
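The capacity gap can be checked with the same back-of-envelope arithmetic (the figures come from the text above; the helper function is my own illustration):

```python
def hbm_capacity_gbytes(gbit_per_die: int, dies_per_stack: int, stacks: int) -> float:
    """Total HBM capacity in GByte across all stacks (8 bits per byte)."""
    return gbit_per_die * dies_per_stack * stacks / 8

# A64FX-style configuration: four 8-high stacks of 8 Gbit dies
print(hbm_capacity_gbytes(8, 8, 4))  # 32.0 GByte

# Server DDR for comparison: 96 slots of 128 GByte RDIMMs
print(128 * 96 / 1024)  # 12.0 TByte
```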


HBM DRAM die source: Wikipedia

Of course, HBM and DDR can also be mixed: HBM2 provides high bandwidth at small capacity, while DDR4 provides somewhat lower bandwidth at large capacity. From a system-design perspective, the HBM2 attached to the processor then behaves more like an L4 cache.
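A crude way to see why such tiering helps is to treat HBM hits like cache hits. The model and all the numbers below are placeholders chosen purely for illustration, not measured latencies of any real system:

```python
def avg_access_ns(hbm_hit_rate: float, hbm_ns: float, ddr_ns: float) -> float:
    """Average access latency when HBM fronts DDR like a cache.

    A hit is served from HBM; a miss pays the HBM lookup plus the DDR access.
    """
    return hbm_hit_rate * hbm_ns + (1 - hbm_hit_rate) * (hbm_ns + ddr_ns)

# Hypothetical numbers: 90% of accesses served from the HBM tier
print(avg_access_ns(0.9, 100.0, 80.0))  # 108.0 ns
```

With a high hit rate, most accesses also enjoy HBM's bandwidth, while DDR supplies the bulk of the capacity.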

High access latency

For PCs, an important reason HBM has not been adopted as CPU main memory is its high latency. Although many popular-science articles claim its latency is good, and Xilinx described the latency of its HBM-equipped FPGAs as similar to DDR, the "latency" in these articles may not all refer to the same thing.

Contemporary DDR memory is generally marked with a CL rating (CAS latency, the number of clock cycles required for column addressing, indicating read latency). The CAS latency discussed here is the waiting time between issuing the read command (the Column Address Strobe) and the data being ready.

After the memory controller tells the memory which location it needs to access, several cycles pass before the location is reached and the controller's command is executed. CL is the most important parameter in memory latency, but note that it is measured in cycles: the absolute delay is the cycle count multiplied by the time per cycle (the higher the operating frequency, the shorter each cycle).
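The cycles-times-period relationship is easy to make concrete. For a double-data-rate interface the I/O clock runs at half the data rate (the function and the example figures here are illustrative):

```python
def cas_latency_ns(cl_cycles: int, data_rate_mtps: float) -> float:
    """Absolute CAS latency: CL cycles times the I/O clock period.

    For double data rate, clock (MHz) = data rate (MT/s) / 2.
    """
    clock_mhz = data_rate_mtps / 2
    return cl_cycles * (1000.0 / clock_mhz)

print(cas_latency_ns(16, 3200))  # e.g. DDR4-3200 CL16 -> 10.0 ns
print(cas_latency_ns(14, 2000))  # fewer cycles, but a slower clock -> 14.0 ns
```

This is why a lower CL number alone says nothing: a memory running at a much lower clock can have fewer latency cycles yet a longer absolute delay.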


GDDR5 and HBM

As mentioned earlier, one of HBM's defining traits is its ultra-wide interconnect, and that width dictates that HBM's transmission frequency cannot be very high; otherwise the total power consumption and heat would be unsupportable, and such a high total bandwidth would not even be needed.

HBM's frequency is indeed much lower than DDR/GDDR's. Samsung's earlier Flarebolt HBM2 transferred 2 Gbit/s per pin, corresponding to a clock of roughly 1 GHz; later products raised that to 1.2 GHz. Samsung noted that doing so required reducing clock interference among more than 5000 parallel TSVs and adding more heat-dissipation bumps between the DRAM dies to relieve thermal problems. In the figure above, AMD lists HBM's clock as just 500 MHz.
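The trade is essentially wide-and-slow versus narrow-and-fast: total bandwidth is bus width times per-pin rate, so HBM can clock low and still win. A small illustrative comparison (the 7 Gbit/s GDDR5 figure is a typical per-pin rate, not tied to a specific product):

```python
def bus_bandwidth_gbytes(width_bits: int, pin_rate_gbits: float) -> float:
    """Total bandwidth in GByte/s: bus width times per-pin rate, bits -> bytes."""
    return width_bits * pin_rate_gbits / 8

# Wide and slow: one HBM stack, 1024 bits at 1 Gbit/s per pin (500 MHz DDR clock)
print(bus_bandwidth_gbytes(1024, 1.0))  # 128.0 GByte/s
# Narrow and fast: a 32-bit GDDR5 channel at 7 Gbit/s per pin
print(bus_bandwidth_gbytes(32, 7.0))    # 28.0 GByte/s
```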

Is HBM suitable for PC memory?

High bandwidth plus high latency makes HBM very well suited as GPU memory: games and graphics processing are highly parallel tasks with predictable access patterns, a class of workload that demands bandwidth but is not very sensitive to latency. That is why HBM appears on high-end GPUs, and by the same logic it suits HPC and AI computing. This is also why A64FX and the next-generation Xeon processors, although CPUs, still choose HBM as memory.

For personal computers, however, the CPU's workload is highly unpredictable, full of random memory accesses, and inherently latency-sensitive; the demand for low latency often outweighs the demand for high bandwidth, to say nothing of HBM's cost. This means that, at least in the short term, HBM is unlikely to replace DDR in PCs. The question is much like asking whether GDDR could serve as PC main memory.

In the long run, though, no one can say. As noted above, a hybrid solution is possible, and the storage hierarchy itself is changing significantly. For example, we recently wrote about AMD stacking a processor's L3 cache up to 192 MB. On-die cache exists precisely to hide external memory latency, so as processor caches grow ever larger, the latency demands placed on system memory may relax.


