doujiang7258
2018-05-10 13:47 · 71 views

Iterating over a linked list is slower in C++ than in Go

EDIT: After getting some feedback, I created a new example which should be more reproducible.

I've been writing a project in C++ that involves lots of linked list iteration. To get a benchmark, I rewrote the code in Go. Surprisingly I've found that the Go implementation runs consistently faster by ~10%, even after passing the -O flag to clang++. Probably I'm just missing some obvious optimization in C++ but I've been banging my head against a wall for a while with various tweaks.

Here's a simplified version, with identical implementations in C++ and Go where the Go program runs faster. All it does is create a linked list with 3000 nodes, and then time how long it takes to iterate over this list 1,000,000 times (7.5 secs in C++, 6.8 in Go).

C++:

#include <iostream>
#include <chrono>

using namespace std;
using ms = chrono::milliseconds;

struct Node {
    Node *next;
    double age;
};

// Global linked list of nodes
Node *nodes = nullptr;

void iterateAndPlace(double age) {
    Node *node = nodes;
    Node *prev = nullptr;

    while (node != nullptr) {
        // Just to make sure that age field is accessed
        if (node->age > 99999) {
            break;
        }

        prev = node;
        node = node->next;
    }

    // Arbitrary action to make sure the compiler
    // doesn't optimize away this function
    prev->age = age;
}

int main() {
    Node x = {};
    std::cout << "Size of struct: " << sizeof(x) << "\n"; // 16 bytes

    // Fill in global linked list with 3000 dummy nodes
    for (int i=0; i<3000; i++) {
        Node* newNode = new Node;
        newNode->age = 0.0;
        newNode->next = nodes;
        nodes = newNode;
    }

    auto start = chrono::steady_clock::now();
    for (int i=0; i<1000000; i++) {
        iterateAndPlace(100.1);
    }

    auto end = chrono::steady_clock::now();
    auto diff = end - start;
    std::cout << "Elapsed time is :  "<< chrono::duration_cast<ms>(diff).count()<<" ms "<<endl;
}

Go:

package main
import (
    "time"
    "fmt"
    "unsafe"
)

type Node struct {
    next *Node
    age float64
}

var nodes *Node = nil

func iterateAndPlace(age float64) {
    node := nodes
    var prev *Node = nil

    for node != nil {
        if node.age > 99999 {
            break
        }
        prev = node
        node = node.next
    }

    prev.age = age
}

func main() {
    x := Node{}
    fmt.Printf("Size of struct: %d\n", unsafe.Sizeof(x)) // 16 bytes

    for i := 0; i < 3000; i++ {
        newNode := new(Node)
        newNode.next = nodes
        nodes = newNode
    }

    start := time.Now()
    for i := 0; i < 1000000; i++ {
        iterateAndPlace(100.1)
    }
    fmt.Printf("Time elapsed: %s\n", time.Since(start))
}

Output from my Mac:

$ go run minimal.go
Size of struct: 16
Time elapsed: 6.865176895s

$ clang++ -std=c++11 -stdlib=libc++ minimal.cpp -O3; ./a.out
Size of struct: 16
Elapsed time is :  7524 ms

Clang version:

$ clang++ --version
Apple LLVM version 8.0.0 (clang-800.0.42.1)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

EDIT: UKMonkey brought up the fact that the nodes may be contiguously allocated in Go but not C++. To test this, I allocated contiguously in C++ with a vector, and this did not change the runtime:

// Fill in global linked list with 3000 contiguous dummy nodes
vector<Node> vec;
vec.reserve(3000);
for (int i=0; i<3000; i++) {
    vec.push_back(Node());
}

nodes = &vec[0];
Node *curr = &vec[0];
for (int i=1; i<3000; i++) {
    curr->next = &vec[i];
    curr = curr->next;
    curr->age = 0.0;
}

I checked that the resulting linked list is indeed contiguous:

std::cout << &nodes << " " << &nodes->next << " " << &nodes->next->next << " " << &nodes->next->next->next << "\n";
0x1032de0e0 0x7fb934001000 0x7fb934001010 0x7fb934001020

2 answers

  • Accepted answer
    douju4594 douju4594 2018-05-10 21:33

    Preface: I am not a C++ expert or assembly expert. But I know a little bit of them, enough to be dangerous, perhaps.

    So I was piqued and decided to take a look at the assembler generated for the Go, and followed it up with checking it against the output for clang++.

    High Level Summary

    Later on here, I go through the assembler output for both languages in x86-64 assembler. The fundamental "critical section" of code in this example is a very tight loop. For that reason, it's the largest contributor to the time spent in the program.

    Why tight loops matter: modern CPUs can usually execute instructions faster than the values those instructions need to reference (e.g., for comparisons) can be loaded from memory. To achieve the blazing speeds they do, CPUs perform a number of tricks, including pipelining, branch prediction, and more. Tight loops are often the bane of pipelining, and branch prediction can realistically be only marginally helpful when there is a dependency relationship between values.

    Fundamentally, the traversal loop has four main chunks:

    1. If `node` is null, exit the loop.
    2. If `node.age` > 99999, exit the loop.
    3a. set prev = node
    3b. set node = node.next
    

    Each of these are represented by several assembler instructions, but the chunks as output by the Go and C++ are ordered differently. The C++ effectively does it in order 3a, 1, 2, 3b. The Go version does it in order 3, 2, 1. (it starts the first loop on segment 2 to avoid the assignment happening before the null checks)

    In actuality, clang++ outputs a couple fewer instructions than Go and should perform fewer RAM accesses (at the cost of one more floating-point register). One might imagine that executing almost the same instructions, just in different orders, should take the same amount of time, but that doesn't take into account pipelining and branch prediction.

    Takeaways

    One might be tempted to hand-optimize this code and write assembly if it were a critical but small loop. Ignoring the obvious reasons (it's riskier, more complex, and more prone to bugs), there's also the fact that while the Go-generated code was faster on the two Intel x86-64 processors I tested it with, it's possible that an AMD processor would give the opposite result. It's also possible that the N+1th-gen Intel would give different results.

    My full investigation follows below:


    The investigation

    NOTE: I've snipped the examples as short as possible, including truncating filenames and removing excess fluff from the assembly listings, so your output may look slightly different from mine. But anyway, I continue.

    So I ran go build -gcflags -S main.go to get this assembly listing, and I'm only really looking at iterateAndPlace.

    "".iterateAndPlace STEXT nosplit size=56 args=0x8 locals=0x0
        00000 (main.go:16)   TEXT    "".iterateAndPlace(SB), NOSPLIT, $0-8
        00000 (main.go:16)   FUNCDATA    $0, gclocals·2a5305abe05176240e61b8620e19a815(SB)
        00000 (main.go:16)   FUNCDATA    $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        00000 (main.go:17)   MOVQ    "".nodes(SB), AX
        00007 (main.go:17)   MOVL    $0, CX
        00009 (main.go:20)   JMP 20
        00011 (main.go:25)   MOVQ    (AX), DX
        00014 (main.go:25)   MOVQ    AX, CX
        00017 (main.go:25)   MOVQ    DX, AX
        00020 (main.go:20)   TESTQ   AX, AX
        00023 (main.go:20)   JEQ 44
        00025 (main.go:21)   MOVSD   8(AX), X0
        00030 (main.go:21)   MOVSD   $f64.40f869f000000000(SB), X1
        00038 (main.go:21)   UCOMISD X1, X0
        00042 (main.go:21)   JLS 11
        00044 (main.go:21)   MOVSD   "".age+8(SP), X0
        00050 (main.go:28)   MOVSD   X0, 8(CX)
        00055 (main.go:29)   RET
    

    In case you lost context, I'll paste the original listing with the line numbers here:

    16  func iterateAndPlace(age float64) {
    17      node := nodes
    18      var prev *Node = nil
    19
    20      for node != nil {
    21          if node.age > 99999 {
    22              break
    23          }
    24          prev = node
    25          node = node.next
    26      }
    27
    28      prev.age = age
    29  }
    

    A few interesting things I noticed immediately:

    1. It's not generating any code for line 24, prev = node. That is because the compiler realized the assignment can be cheated: in traversing to get node.next it uses the CX register, which holds the value of prev. This is probably a nice redundancy the SSA compiler can detect and eliminate.
    2. The if statement checking node.age is reordered to come after the node = node.next stuff, and is skipped on the first iteration. You can think of it as more like a do..while loop in that case. Overall minor, since it only really changes the first iteration. But maybe that's enough?

    So let's jump over to the C++ assembly, which you get from clang++ -S -mllvm --x86-asm-syntax=intel -O3 minimal.cpp.

    .quad   4681608292164698112     ## double 99999
    # note I snipped some stuff here
    __Z15iterateAndPlaced:                  ## @_Z15iterateAndPlaced
    ## BB#0:
        push    rbp
    Lcfi0:
        .cfi_def_cfa_offset 16
    Lcfi1:
        .cfi_offset rbp, -16
        mov rbp, rsp
    Lcfi2:
        .cfi_def_cfa_register rbp
        mov rcx, qword ptr [rip + _nodes]
        xor eax, eax
        movsd   xmm1, qword ptr [rip + LCPI0_0] ## xmm1 = mem[0],zero
        .p2align    4, 0x90
    LBB0_2:                                 ## =>This Inner Loop Header: Depth=1
        mov rdx, rax
        mov rax, rcx
        movsd   xmm2, qword ptr [rax + 8] ## xmm2 = mem[0],zero
        ucomisd xmm2, xmm1
        ja  LBB0_3
    ## BB#1:                                ##   in Loop: Header=BB0_2 Depth=1
        mov rcx, qword ptr [rax]
        test    rcx, rcx
        mov rdx, rax
        jne LBB0_2
    LBB0_3:
        movsd   qword ptr [rdx + 8], xmm0
        pop rbp
        ret
    

    This is really interesting. The assembly generated is overall fairly similar (ignoring the minor differences in assembler syntax): it made the same optimization of not assigning prev. Furthermore, the C++ eliminates the need to load 99999 on every comparison (the Go version loads it right before the comparison each time).


    For replication purposes, versions of things I used (on an x86-64 darwin mac on OSX High Sierra)

    $ go version
    go version go1.9.3 darwin/amd64
    
    $ clang++ --version
    Apple LLVM version 9.0.0 (clang-900.0.39.2)
    Target: x86_64-apple-darwin17.4.0
    
  • dousha1394 dousha1394 2018-05-10 21:36

    I think the problem is the code generated by clang. My results are:

    6097 ms with clang
    5106 ms with gcc
    5219 ms with go

    So I disassembled the binaries. The code generated when the age field is not accessed is the same for clang and gcc, but when the age field is accessed, the code clang generates is a little bit worse than what gcc generates.
    This is the code generated by clang:

    [screenshot of clang disassembly]

    From gcc:

    [screenshot of gcc disassembly]

    And the last one is the Go version:

    [screenshot of Go disassembly]

    As you can see, the code is pretty much the same in all three, but in the clang version the two extra mov instructions at the beginning of the loop raise the instruction count above the gcc version, which slows it down a little. I think the biggest performance impact comes from the ucomisd instruction in the clang version, because it uses a memory operand (an extra memory access). Sorry for my bad English; I hope it's understandable.

