What is "cache-friendly" code?

What is the difference between "cache-unfriendly" code and "cache-friendly" code?

How can I make sure I write cache-efficient code?

Reposted from: https://stackoverflow.com/questions/16699247/what-is-a-cache-friendly-code

lrony* 2013-05-22 18:39
Preliminaries

On modern computers, only the lowest-level memory structures (the registers) can move data around in a single clock cycle. However, registers are very expensive, and most CPU cores have fewer than a few dozen of them (a few hundred to perhaps a thousand bytes in total). At the other end of the memory spectrum (DRAM), memory is very cheap (literally millions of times cheaper) but takes hundreds of cycles to deliver data after a request. Bridging this gap between super fast and expensive and super slow and cheap are the cache memories, named L1, L2 and L3 in order of decreasing speed and cost. The idea is that most executing code hits a small set of variables often and the rest (a much larger set of variables) infrequently. If the processor can't find the data in the L1 cache, it looks in the L2 cache. If the data is not there, it looks in the L3 cache, and if it is not there either, in main memory. Each of these "misses" is expensive in time.

(The analogy: cache memory is to system memory as system memory is to hard disk storage. Hard disk storage is super cheap, but very slow.)

Caching is one of the main methods to reduce the impact of latency. To paraphrase Herb Sutter (cfr. links below): increasing bandwidth is easy, but we can't buy our way out of latency.

Data is always retrieved through the memory hierarchy (smallest == fastest to slowest). A cache hit/miss usually refers to a hit/miss in the highest level of cache in the CPU -- by highest level I mean the largest == slowest. The cache hit rate is crucial for performance, since every cache miss results in fetching data from RAM (or worse ...) which takes a lot of time (hundreds of cycles for RAM, tens of millions of cycles for HDD). In comparison, reading data from the (highest level) cache typically takes only a handful of cycles.

In modern computer architectures, the performance bottleneck is leaving the CPU die (e.g. accessing RAM or higher). This will only get worse over time. Raising the processor frequency is currently no longer an effective way to increase performance; the problem is memory access. Hardware design efforts in CPUs therefore currently focus heavily on optimizing caches, prefetching, pipelines and concurrency. For instance, modern CPUs spend around 85% of the die on caches and up to 99% on storing and moving data!

There is quite a lot to be said on the subject. Here are a few great references about caches, memory hierarchies and proper programming:

Agner Fog's page. In his excellent documents, you can find detailed examples covering languages ranging from assembly to C++.

If you are into videos, I strongly recommend having a look at Herb Sutter's talk on machine architecture (youtube) (specifically check 12:00 and onwards!).

Slides about memory optimization by Christer Ericson (director of technology @ Sony)

LWN.net's article "What every programmer should know about memory"

Main concepts for cache-friendly code

A very important aspect of cache-friendly code is the principle of locality, the goal of which is to place related data close together in memory to allow efficient caching. In terms of the CPU cache, it's important to be aware of cache lines to understand how this works: How do cache lines work?

The following particular aspects are of high importance to optimize caching:

Temporal locality: when a given memory location was accessed, it is likely that the same location is accessed again in the near future. Ideally, this information will still be cached at that point.

Spatial locality: this refers to placing related data close to each other. Caching happens on many levels, not just in the CPU. For example, when you read from RAM, a larger chunk of memory is typically fetched than what was specifically asked for, because very often the program will require that data soon. HDD caches follow the same line of thought. Specifically for CPU caches, the notion of cache lines is important.

Use appropriate C++ containers

A simple example of cache-friendly versus cache-unfriendly is C++'s std::vector versus std::list. Elements of a std::vector are stored in contiguous memory, and as such accessing them is much more cache-friendly than accessing elements in a std::list, which stores its content all over the place. This is due to spatial locality.
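To make the container comparison concrete, here is a minimal sketch. The function names sum_vector and sum_list are made up for this illustration. Both functions do the same arithmetic; the only difference is the memory layout the iteration walks over.

```cpp
#include <list>
#include <numeric>
#include <vector>

// Sums a contiguous container. Consecutive elements share cache lines,
// so one fetch from memory typically brings in several upcoming elements.
long long sum_vector(const std::vector<int>& values) {
    return std::accumulate(values.begin(), values.end(), 0LL);
}

// Sums a node-based container. Each step follows a pointer to a node that
// may live anywhere on the heap, so a step can miss the cache.
long long sum_list(const std::list<int>& values) {
    return std::accumulate(values.begin(), values.end(), 0LL);
}
```

Timing both on a container with a few million elements is a reasonable way to see the effect on your own machine. The exact ratio depends on the allocator, the element size and the hardware, so no numbers are claimed here.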

A very nice illustration of this is given by Bjarne Stroustrup in this youtube clip (thanks to @Mohammad Ali Baydoun for the link!).

Don't neglect the cache in data structure and algorithm design

Whenever possible, try to adapt your data structures and order of computations in a way that allows maximum use of the cache. A common technique in this regard is cache blocking (Archive.org version), which is of extreme importance in high-performance computing (cfr. for example ATLAS).
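As a rough sketch of what cache blocking can look like, the function below transposes a square row-major matrix tile by tile, so that the source and destination elements of one tile stay in cache while it is processed. The name transpose_blocked and the default tile size of 32 are assumptions for this illustration, not values from any particular library. A good tile size depends on the cache sizes of the target machine.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Transposes an n x n matrix stored in row-major order, one tile at a time.
// Precondition (assumed, not checked): src.size() == n * n.
std::vector<double> transpose_blocked(const std::vector<double>& src,
                                      std::size_t n,
                                      std::size_t block = 32) {
    if (block == 0) block = 1;  // avoid an endless loop on a zero tile size
    std::vector<double> dst(n * n);
    for (std::size_t ii = 0; ii < n; ii += block) {
        for (std::size_t jj = 0; jj < n; jj += block) {
            const std::size_t i_end = std::min(ii + block, n);
            const std::size_t j_end = std::min(jj + block, n);
            // Work on one block x block tile before moving to the next.
            for (std::size_t i = ii; i < i_end; ++i) {
                for (std::size_t j = jj; j < j_end; ++j) {
                    dst[j * n + i] = src[i * n + j];
                }
            }
        }
    }
    return dst;
}
```

A naive transpose strides through dst with a step of n elements, touching a new cache line on almost every write once n is large. The tiled version limits those strided writes to a small region that can stay cached.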

Know and exploit the implicit structure of data

Another simple example, which many people in the field sometimes forget, is column-major (e.g. Fortran, MATLAB) vs. row-major ordering (e.g. C, C++) for storing two-dimensional arrays. For example, consider the following matrix:

1 2
3 4

In row-major ordering, this is stored in memory as 1 2 3 4; in column-major ordering this would be stored as 1 3 2 4. It is easy to see that implementations which do not exploit this ordering will quickly run into (easily avoidable!) cache issues. Unfortunately, I see stuff like this very often in my domain (machine learning). @MatteoItalia showed this example in more detail in his answer.

When fetching a certain element of a matrix from memory, elements near it will be fetched as well and stored in a cache line. If the ordering is exploited, this will result in fewer memory accesses (because the next few values which are needed for subsequent computations are already in a cache line).

For simplicity, assume the cache comprises a single cache line which can contain 2 matrix elements and that when a given element is fetched from memory, the next one is too. Say we want to take the sum over all elements in the example 2x2 matrix above (let's call it M):

Exploiting the ordering (e.g. varying the column index fastest in C++):

M[0][0] (memory) + M[0][1] (cached) + M[1][0] (memory) + M[1][1] (cached) = 1 + 2 + 3 + 4 --> 2 cache hits, 2 memory accesses

Not exploiting the ordering (e.g. varying the row index fastest in C++):

M[0][0] (memory) + M[1][0] (memory) + M[0][1] (memory) + M[1][1] (memory) = 1 + 3 + 2 + 4 --> 0 cache hits, 4 memory accesses

In this simple example, exploiting the ordering approximately doubles execution speed (since memory access requires many more cycles than computing the sums). In practice the performance difference can be much larger.
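The two traversal orders from the example can be sketched as follows for a row-major matrix stored in a flat std::vector. The function names are made up for this illustration. Both return the same sum; only the order in which memory is touched differs.

```cpp
#include <cstddef>
#include <vector>

// m holds a rows x cols matrix in row-major order: element (i, j) is at
// index i * cols + j.

// Column index varies fastest: memory is read sequentially.
double sum_row_order(const std::vector<double>& m,
                     std::size_t rows, std::size_t cols) {
    double total = 0.0;
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            total += m[i * cols + j];
    return total;
}

// Row index varies fastest: consecutive reads are cols elements apart,
// so for large matrices each read can land on a different cache line.
double sum_column_order(const std::vector<double>& m,
                        std::size_t rows, std::size_t cols) {
    double total = 0.0;
    for (std::size_t j = 0; j < cols; ++j)
        for (std::size_t i = 0; i < rows; ++i)
            total += m[i * cols + j];
    return total;
}
```

For the 2x2 matrix M above, both functions return 10. The difference only becomes measurable once the matrix no longer fits in cache.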

Avoid unpredictable branches

Modern architectures feature pipelines, and compilers are becoming very good at reordering code to minimize delays due to memory access. When your critical code contains (unpredictable) branches, it is hard or impossible to prefetch data. This will indirectly lead to more cache misses.

This is explained very well here (thanks to @0x90 for the link): Why is it faster to process a sorted array than an unsorted array?
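One way to sidestep a data-dependent branch is to replace it with arithmetic. The sketch below is modelled loosely on the kind of loop discussed in the linked question; the function names and the threshold parameter are made up for this illustration. Note that an optimizing compiler may already turn the first version into branch-free code on its own, so treat this as an illustration of the idea and measure before rewriting real code.

```cpp
#include <vector>

// Data-dependent branch: on unsorted input the outcome of the comparison
// is hard for the branch predictor to guess.
long long sum_large_branchy(const std::vector<int>& data, int threshold) {
    long long sum = 0;
    for (int x : data) {
        if (x >= threshold) {
            sum += x;
        }
    }
    return sum;
}

// Branch-free variant: the comparison yields 0 or 1, which is used as a
// multiplier, so every iteration executes the same instructions.
long long sum_large_branchless(const std::vector<int>& data, int threshold) {
    long long sum = 0;
    for (int x : data) {
        sum += static_cast<long long>(x) * (x >= threshold);
    }
    return sum;
}
```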

Avoid virtual functions

In the context of C++, virtual methods represent a controversial issue with regard to cache misses (a general consensus exists that they should be avoided when possible in terms of performance). Virtual functions can induce cache misses during lookup, but this only happens if the specific function is not called often (otherwise it would likely be cached), so this is regarded as a non-issue by some. For reference about this issue, check out: What is the performance cost of having a virtual method in a C++ class?

Common problems

A common problem in modern architectures with multiprocessor caches is called false sharing. This occurs when processors work on different data that happens to reside in the same cache line. Each write by one processor invalidates that line in the other processors' caches, even though they use a different part of it, so the line is transferred back and forth again and again. Effectively, different threads make each other wait by inducing cache misses in this situation. See also (thanks to @Matt for the link): How and when to align to cache line size?
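A common remedy for false sharing is to align per-thread data to cache line boundaries. The sketch below only shows the memory layout. It assumes a 64-byte cache line, which is typical for current x86 CPUs but not universal, and all type names are made up for this illustration.

```cpp
#include <cstddef>

// Assumption: 64-byte cache lines. Check the target CPU before relying on it.
constexpr std::size_t kCacheLineSize = 64;

// Two counters packed next to each other. They will most likely share a
// cache line, so two threads updating a and b separately would keep
// invalidating each other's copy of that line.
struct PackedCounters {
    long a = 0;
    long b = 0;
};

// A counter aligned (and therefore padded) to a full cache line.
struct alignas(kCacheLineSize) PaddedCounter {
    long value = 0;
};

// Each counter now starts on its own cache line, so one thread can update
// a.value while another updates b.value without false sharing.
struct PaddedCounters {
    PaddedCounter a;
    PaddedCounter b;
};
```

The price is memory: each PaddedCounter occupies a whole cache line. C++17 also defines std::hardware_destructive_interference_size in the header new as a portable hint for this distance, although not every standard library provides it.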

An extreme symptom of poor caching in RAM (which is probably not what you mean in this context) is so-called thrashing. This occurs when the process continuously generates page faults (i.e. accesses pages which are not currently resident in physical memory), which require disk access.