doushuichong2589 2019-04-06 18:46
浏览 368
已采纳

在golang中实现全文搜索的有效方法

I am trying to realize a simple full text search in golang but all my implementations turn out to be too slow to overcome the thresholds.

The task is as follows:

  • Documents are non-empty strings of lowercase words divided by spaces

  • Each document has an implicit identifier equal to its index in the input array

  • New() constructs the index

  • Search(): accepts a query, which is also a string of lowercase words divided by spaces, and returns a sorted array of unique identifiers of documents that contains all words from the query regardless of their order

Example:

index := New([]string{
"this is the house that jack built",  //: 0
"this is the rat that ate the malt",  //: 1
})

index.Search("")  // -> []
index.Search("in the house that jack built")  // -> []
index.Search("malt rat")  // -> [1]
index.Search("is this the")  // -> [0, 1]

I have already tried to implement:

  • a binary search tree for each document and for all documents all together

  • a trie (prefix tree) for each document and for all documents all together

  • inverted index search

binary search tree (for all documents):

type Tree struct {
    m           map[int]bool
    word        string
    left        *Tree
    right       *Tree
}

type Index struct {
    tree *Tree
}

binary search tree (a tree for each document):

type Tree struct {
    word  string
    left  *Tree
    right *Tree
}

type Index struct {
    tree  *Tree
    index int
    next  *Index
}

trie (for all documents):

type Trie struct {
    m        map[uint8]*Trie
    end_node map[int]bool
}

type Index struct {
    trie *Trie
}

trie (for each document):

type Trie struct {
    m        map[uint8]*Trie
    end_node bool
}

type Index struct {
    trie  *Trie
    index int
    next  *Index
}

inverted index:

type Index struct {
    m map[string]map[int]bool
}

New and Search implementation for inverted index:

// New creates a fulltext search index for the given documents
func New(docs []string) *Index {
    m := make(map[string]map[int]bool)

    for i := 0; i < len(docs); i++ {
        words := strings.Fields(docs[i])
        for j := 0; j < len(words); j++ {
            if m[words[j]] == nil {
                m[words[j]] = make(map[int]bool)
            }
            m[words[j]][i+1] = true
        }
    }
    return &(Index{m})
}

// Search returns a slice of unique ids of documents that contain all words from the query.
func (idx *Index) Search(query string) []int {
    if query == "" {
        return []int{}
    }
    ret := make(map[int]bool)
    arr := strings.Fields(query)
    fl := 0
    for i := range arr {
        if idx.m[arr[i]] == nil {
            return []int{}
        }
        if fl == 0 {
            for value := range idx.m[arr[i]] {
                ret[value] = true
            }
            fl = 1
        } else {
            tmp := make(map[int]bool)
            for value := range ret {
                if idx.m[arr[i]][value] == true {
                    tmp[value] = true
                }
            }
            ret = tmp
        }
    }
    ret_arr := []int{}
    for value := range ret {
        ret_arr = append(ret_arr, value-1)
    }
    sort.Ints(ret_arr)
    return ret_arr
}

Am I doing something wrong or is there a better algorithm for search in golang?

Any help is appreciated.

  • 写回答

1条回答 默认 最新

  • drci47425 2019-04-08 23:19
    关注

    I can't really help you for the language specific part, but if it's of any help, here is a pseudocode that describes a Trie implementation along with a function to solve your current problem in a decently efficient manner.

    struct TrieNode{
        map[char] children      // maps character to children
        set[int] contains       // set of all ids of documents that contain the word
    }
    
    // classic search function in trie, except it returns a set of document ids instead of a simple boolean
    function get_doc_ids(TrieNode node, string w, int depth){
        if (depth == length(w)){
            return node.contains
        } else {
            if (node.hasChild(w[depth]) {
                return get_doc_ids(node.getChild(w[depth], w, depth+1)
            } else {
                return empty_set()
            }
        }
    }
    
    // the answering query function, as straight forward as it can be
    function answer_query(TrieNode root, list_of_words L){
        n = length(L)
        result = get_docs_ids(root, L[0], 0)
        for i from 1 to n-1 do {
            result = intersection(result, get_docs_ids(root, L[i], 0))  // set intersection 
            if (result.is_empty()){
                break  // no documents contains them all, no need to check further
            }
        }
        return result
    }
    
    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论

报告相同问题?

悬赏问题

  • ¥15 如何删除这个虚拟音频
  • ¥50 hyper默认的default switch
  • ¥15 网站打不开,提示502 Bad Gateway
  • ¥20 基于MATLAB的绝热压缩空气储能系统代码咨询
  • ¥15 R语言建立随机森林模型出现的问题
  • ¥15 中级微观经济学,生产可能性边界问题
  • ¥15 TCP传输时不同网卡传输用时差异过大
  • ¥15 请各位看看我写的属于什么算法,或者有更正确的写法?
  • ¥15 html5 qrcode 扫描器
  • ¥15 爬取网页信息并保存需要完整代码