dt4233
2018-05-10 22:36

Linked list iteration in C++ is slower than in Go with analogous memory access

  • c++
  • optimization
Accepted

In a variety of contexts I've observed that linked list iteration is consistently slower in C++ than in Go by 10-15%. My first attempt at resolving this mystery on Stack Overflow is here. The example I coded up was problematic because:

1) memory access was unpredictable because of heap allocations, and

2) because there was no actual work being done, some people's compilers were optimizing away the main loop.

To resolve these issues I have a new program with implementations in C++ and Go. The C++ version takes 1.75 secs compared to 1.48 secs for the Go version. This time, I do one large heap allocation before timing begins and use it to operate an object pool from which I release and acquire nodes for the linked list. This way the memory access should be completely analogous between the two implementations.

Hopefully this makes the mystery more reproducible!

C++:

#include <iostream>
#include <sstream>
#include <fstream>
#include <string>
#include <vector>
#include <stdexcept>
#include <boost/timer.hpp>

using namespace std;

struct Node {
    Node *next; // 8 bytes
    int age;   // 4 bytes
};

// Object pool, where every free slot points to the previous free slot
template<typename T, int n>
struct ObjPool
{
    typedef T*       pointer;
    typedef pointer* metapointer;

    ObjPool() :
        _top(NULL),
        _size(0)
    {
        pointer chunks = new T[n];
        for (int i=0; i < n; i++) {
            release(&chunks[i]);
        }
    }

    // Give an available pointer back to the object pool
    void release(pointer ptr)
    {
        // Store the current pointer at the given address
        *(reinterpret_cast<metapointer>(ptr)) = _top;

        // Advance the pointer
        _top = ptr;

        // Increment the size
        ++_size;
    }

    // Pop an available pointer off the object pool for program use
    pointer acquire(void)
    {
        if(_size == 0){throw std::out_of_range("");}

        // Pop the top of the stack
        pointer retval = _top;

        // Step back to the previous address
        _top = *(reinterpret_cast<metapointer>(_top));

        // Decrement the size
        --_size;

        // Return the next free address
        return retval;
    }

    unsigned int size(void) const {return _size;}

protected:
    pointer _top;

    // Number of free slots available
    unsigned int _size;
};

Node *nodes = nullptr;
ObjPool<Node, 1000> p;

void processAge(int age) {
    // If the pool has no free slots left, pop the head of the linked list and
    // release it back to the pool
    if (p.size() == 0) {
        Node *head = nodes;
        nodes = nodes->next;
        p.release(head);
    }

    // Insert the new Node with given age in global linked list. The linked list is sorted by age, so this requires iterating through the nodes.
    Node *node = nodes;
    Node *prev = nullptr;
    while (true) {
        if (node == nullptr || age < node->age) {
            Node *newNode = p.acquire();
            newNode->age = age;
            newNode->next = node;

            if (prev == nullptr) {
                nodes = newNode;
            } else {
                prev->next = newNode;
            }

            return;
        }

        prev = node;
        node = node->next;
    }
}

int main() {
    Node x = {};
    std::cout << "Size of struct: " << sizeof(x) << "\n"; // 16 bytes

    boost::timer t;
    for (int i=0; i<1000000; i++) {
        processAge(i);
    }

    std::cout << t.elapsed() << "\n";
}

Go:

package main

import (
    "time"
    "fmt"
    "unsafe"
)

type Node struct {
    next *Node // 8 bytes
    age int32 // 4 bytes
}

// Every free slot points to the previous free slot
type NodePool struct {
    top *Node
    size int
}

func NewPool(n int) NodePool {
    p := NodePool{nil, 0}
    slots := make([]Node, n, n)
    for i := 0; i < n; i++ {
        p.Release(&slots[i])
    }

    return p
}

func (p *NodePool) Release(l *Node) {
    // Store the current top at the given address
    *((**Node)(unsafe.Pointer(l))) = p.top
    p.top = l
    p.size++
}

func (p *NodePool) Acquire() *Node {
    if p.size == 0 {
        fmt.Printf("Attempting to pop from empty pool!\n")
    }
    retval := p.top

    // Step back to the previous address in stack of addresses
    p.top = *((**Node)(unsafe.Pointer(p.top)))
    p.size--
    return retval
}

func processAge(age int32) {
    // If the pool has no free slots left, pop the head of the linked list and
    // release it back to the pool
    if p.size == 0 {
        head := nodes
        nodes = nodes.next
        p.Release(head)
    }

    // Insert the new Node with given age in global linked list. The linked list is sorted by age, so this requires iterating through the nodes.
    node := nodes
    var prev *Node = nil
    for true {
        if node == nil || age < node.age {
            newNode := p.Acquire()
            newNode.age = age
            newNode.next = node

            if prev == nil {
                nodes = newNode
            } else {
                prev.next = newNode
            }
            return
        }

        prev = node
        node = node.next
    }
}

// Linked list of nodes, in ascending order by age
var nodes *Node = nil
var p NodePool = NewPool(1000)

func main() {
    x := Node{};
    fmt.Printf("Size of struct: %d\n", unsafe.Sizeof(x)) // 16 bytes

    start := time.Now()
    for i := 0; i < 1000000; i++ {
        processAge(int32(i))
    }

    fmt.Printf("Time elapsed: %s\n", time.Since(start))
}

Output:

clang++ -std=c++11 -stdlib=libc++ minimalPool.cpp -O3; ./a.out
Size of struct: 16
1.7548

go run minimalPool.go
Size of struct: 16
Time elapsed: 1.487930629s

1 answer

  • dpswo40440 3 years ago

    The big difference between your two programs is that your Go code ignores errors (and will panic or segfault, if you're lucky, if you empty the pool), while your C++ code propagates errors via exception. Compare:

    if p.size == 0 {
        fmt.Printf("Attempting to pop from empty pool!\n")
    }
    

    vs.

    if(_size == 0){throw std::out_of_range("");}
    

    There are at least three ways[1] to make the comparison fair:

    1. Change the C++ code to ignore the error, as you do in Go.
    2. Change both versions to panic/abort on error.
    3. Change the Go version to handle errors idiomatically,[2] as you do in C++.
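
The modified sources are not shown in this answer, so the following is only a minimal sketch of what options 1 and 2 could look like on the C++ side. The names `SketchPool` and `OnEmpty` are hypothetical, the pool takes caller-provided storage, and free slots are linked through `Node::next` rather than the `reinterpret_cast` used in the question.

```cpp
#include <cstddef>
#include <cstdlib>

struct Node {
    Node *next;
    int age;
};

// Hypothetical policy switch, introduced only for this sketch.
enum class OnEmpty { Ignore, Abort };

// Simplified free-list pool over caller-provided storage. Free slots are
// chained through Node::next instead of the original reinterpret_cast.
template <OnEmpty policy>
class SketchPool {
public:
    SketchPool(Node *slots, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            release(&slots[i]);
        }
    }

    // Push a free slot onto the stack of available slots.
    void release(Node *ptr) {
        ptr->next = top_;
        top_ = ptr;
        ++size_;
    }

    // Pop a free slot; the behavior on an empty pool depends on the policy.
    Node *acquire() {
        if (policy == OnEmpty::Abort && size_ == 0) {
            std::abort(); // option 2: terminate without unwinding the stack
        }
        // Option 1 (Ignore) performs no check: calling this on an empty pool
        // dereferences a null pointer, which is undefined behavior.
        Node *retval = top_;
        top_ = top_->next;
        --size_;
        return retval;
    }

    std::size_t size() const { return size_; }

private:
    Node *top_ = nullptr;
    std::size_t size_ = 0;
};
```

Under these assumptions, `SketchPool<OnEmpty::Abort>` corresponds to "abort on error" and `SketchPool<OnEmpty::Ignore>` to "ignoring error"; neither contains a `throw`, so the compiler has no exception path to account for in `acquire`.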

    So, let's do all of them and compare the results:[3]

    • C++ ignoring error: 1.059329s wall, 1.050000s user + 0.000000s system = 1.050000s CPU (99.1%)
    • C++ abort on error: 1.081585s wall, 1.060000s user + 0.000000s system = 1.060000s CPU (98.0%)
    • Go panic on error: Time elapsed: 1.152942427s
    • Go ignoring error: Time elapsed: 1.196426068s
    • Go idiomatic error handling: Time elapsed: 1.322005119s
    • C++ exception: 1.373458s wall, 1.360000s user + 0.000000s system = 1.360000s CPU (99.0%)

    So:

    • Without error handling, C++ is faster than Go.
    • With panicking, Go gets faster,[4] but still not as fast as C++.
    • With idiomatic error handling, C++ slows down a lot more than Go.

    Why? This exception never actually happens in your test run, so the actual error-handling code never runs in either language. But clang can't prove that it doesn't happen. And, since you never catch the exception anywhere, that means it has to emit exception handlers and stack unwinders for every non-elided frame all the way up the stack. So it's doing more work on each function call and return—not much more work, but then your function is doing so little real work that the unnecessary extra work adds up.
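
One rough way to check this on your own machine is to time the same loop twice, once with a throwing check and once with an aborting one. The helpers below are hypothetical stand-ins, not code from the question, and `secondsFor` measures wall-clock time only; the actual difference will depend on compiler, flags, and how much of the loop gets inlined.

```cpp
#include <chrono>
#include <cstdlib>
#include <stdexcept>

// Hypothetical stand-in for the pool's emptiness check that throws.
inline void checkOrThrow(unsigned freeSlots) {
    if (freeSlots == 0) {
        throw std::out_of_range("pool is empty");
    }
}

// Same check, but it terminates the process instead of throwing.
inline void checkOrAbort(unsigned freeSlots) noexcept {
    if (freeSlots == 0) {
        std::abort();
    }
}

// Runs `body` once and returns the elapsed wall-clock time in seconds.
template <typename F>
double secondsFor(F body) {
    const auto start = std::chrono::steady_clock::now();
    body();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

For instance, `secondsFor([] { for (int i = 0; i < 1000000; i++) processAge(i); })` would time the loop from the question; swapping which check `acquire` uses and rebuilding gives the two numbers to compare.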


    1. You could also change the C++ version to do C-style error handling, or to use an Option type, and probably other possibilities.

    2. This, of course, requires a lot more changes: you need to import errors, change the return type of Acquire to (*Node, error), change the return type of processAge to error, change all your return statements, and add at least two if err != nil { … } checks. But that's supposed to be a good thing about Go, right?

    3. While I was at it, I replaced your legacy boost::timer with boost::auto_cpu_timer, so we're now seeing wall clock time (as with Go) as well as CPU time.

    4. I won't attempt to explain why, because I don't understand it. From a quick glance at the assembly, it's clearly optimized out some checks, but I can't see why it couldn't optimize out those same checks without the panic.
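
As an illustration of the C-style alternative mentioned in footnote 1, here is a minimal sketch in which exhaustion is reported through the return value and the caller checks it explicitly, much like Go's `if err != nil`. `CStylePool`, `tryAcquire`, and `pushFront` are hypothetical names, and the pool is simplified to link free slots through `Node::next`. An Option-type version could return `std::optional<Node*>` instead, which needs C++17 rather than the `-std=c++11` used in the question.

```cpp
#include <cstddef>

struct Node {
    Node *next;
    int age;
};

// Hypothetical pool variant that reports exhaustion via its return value.
class CStylePool {
public:
    // Push a free slot onto the stack of available slots.
    void release(Node *ptr) {
        ptr->next = top_;
        top_ = ptr;
        ++size_;
    }

    // Returns nullptr when no free slot is left, instead of throwing.
    Node *tryAcquire() {
        if (size_ == 0) {
            return nullptr;
        }
        Node *retval = top_;
        top_ = top_->next;
        --size_;
        return retval;
    }

    std::size_t size() const { return size_; }

private:
    Node *top_ = nullptr;
    std::size_t size_ = 0;
};

// Hypothetical caller: inserts a node at the head of the list and forwards
// a failed acquisition to its own caller as `false`.
bool pushFront(CStylePool &pool, Node *&head, int age) {
    Node *node = pool.tryAcquire();
    if (node == nullptr) {
        return false;
    }
    node->age = age;
    node->next = head;
    head = node;
    return true;
}
```

The cost of this style is one extra branch at every call site, which is the same trade-off the idiomatic Go version makes.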
