data.table vs dplyr: can one do something well the other can't or does poorly?

Overview

I'm relatively familiar with data.table, not so much with dplyr. I've read through some dplyr vignettes and examples that have popped up on SO, and so far my conclusions are that:

1. data.table and dplyr are comparable in speed, except when there are many (i.e. >10-100K) groups, and in some other circumstances (see benchmarks below)
2. dplyr has more accessible syntax
3. dplyr abstracts (or will) potential DB interactions
4. There are some minor functionality differences (see "Examples/Usage" below)

In my mind 2. doesn't bear much weight because I am fairly familiar with data.table, though I understand that for users new to both it will be a big factor. I would like to avoid an argument about which is more intuitive, as that is irrelevant for my specific question, which is asked from the perspective of someone already familiar with data.table. I also would like to avoid a discussion about how "more intuitive" leads to faster analysis (certainly true, but again, not what I'm most interested in here).

Question

What I want to know is:

1. Are there analytical tasks that are a lot easier to code with one or the other package for people familiar with the packages (i.e. some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing)?
2. Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. another?

One recent SO question got me thinking about this a bit more, because up until that point I didn't think dplyr would offer much beyond what I can already do in data.table. Here is the dplyr solution (data at end of Q):

dat %>%
  group_by(name, job) %>%
  filter(job != "Boss" | year == min(year)) %>%
  mutate(cumu_job2 = cumsum(job2))

Which was much better than my hack attempt at a data.table solution. That said, good data.table solutions are also pretty good (thanks Jean-Robert, Arun, and note here I favored single statement over the strictly most optimal solution):

setDT(dat)[,
  .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)], 
  by=list(id, job)
]

The syntax for the latter may seem very esoteric, but it actually is pretty straightforward if you're used to data.table (i.e. doesn't use some of the more esoteric tricks).

Ideally what I'd like to see is some good examples where the dplyr or data.table way is substantially more concise or performs substantially better.

Examples

Usage

dplyr does not allow grouped operations that return arbitrary number of rows (from eddi's question, note: this looks like it will be implemented in dplyr 0.5, also, @beginneR shows a potential work-around using do in the answer to @eddi's question).
data.table supports rolling joins (thanks @dholstius) as well as overlap joins
data.table internally optimises expressions of the form DT[col == value] or DT[col %in% values] for speed through automatic indexing which uses binary search while using the same base R syntax. See here for some more details and a tiny benchmark.
dplyr offers standard evaluation versions of functions (e.g. regroup, summarize_each_) that can simplify the programmatic use of dplyr (note programmatic use of data.table is definitely possible, just requires some careful thought, substitution/quoting, etc, at least to my knowledge)
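To make the rolling join item above concrete, here is a minimal sketch. The quotes and trades tables are hypothetical data invented for illustration; roll and on are arguments of data.table's [ method.

```r
library(data.table)

# Hypothetical data: quotes at irregular times, and trades that need a price.
quotes <- data.table(time = c(1, 5, 10), price = c(100, 101, 102))
trades <- data.table(time = c(2, 7, 12))

# roll = TRUE: last observation carried forward (LOCF).
# Each trade picks up the most recent quote at or before its time.
locf <- quotes[trades, on = "time", roll = TRUE]

# roll = -Inf: next observation carried backward (NOCB).
# A trade after the last quote has nothing to roll back from and gets NA.
nocb <- quotes[trades, on = "time", roll = -Inf]

print(locf)
print(nocb)
```

An exact-match join on time would return NA for every trade here, since no trade time coincides with a quote time.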

Benchmarks

I ran my own benchmarks and found both packages to be comparable in "split apply combine" style analysis, except when there are very large numbers of groups (>100K) at which point data.table becomes substantially faster.
@Arun ran some benchmarks on joins, showing that data.table scales better than dplyr as the number of groups increases (updated with recent enhancements in both packages and a recent version of R). Also, a benchmark when trying to get unique values has data.table ~6x faster.
(Unverified) has data.table 75% faster on larger versions of a group/apply/sort while dplyr was 40% faster on the smaller ones (another SO question from comments, thanks danas).
Matt, the main author of data.table, has benchmarked grouping operations on data.table, dplyr and python pandas on up to 2 billion rows (~100GB in RAM).
An older benchmark on 80K groups has data.table ~8x faster

Data

This is for the first example I showed in the question section.

dat <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L), name = c("Jane", "Jane", "Jane", "Jane", 
"Jane", "Jane", "Jane", "Jane", "Bob", "Bob", "Bob", "Bob", "Bob", 
"Bob", "Bob", "Bob"), year = c(1980L, 1981L, 1982L, 1983L, 1984L, 
1985L, 1986L, 1987L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L, 
1991L, 1992L), job = c("Manager", "Manager", "Manager", "Manager", 
"Manager", "Manager", "Boss", "Boss", "Manager", "Manager", "Manager", 
"Boss", "Boss", "Boss", "Boss", "Boss"), job2 = c(1L, 1L, 1L, 
1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("id", 
"name", "year", "job", "job2"), class = "data.frame", row.names = c(NA, 
-16L))

Reposted from: https://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly

笑故挽风 2014-12-31 08:27
We need to cover at least these aspects to provide a comprehensive answer/comparison (in no particular order of importance): Speed, Memory usage, Syntax and Features.

My intent is to cover each one of these as clearly as possible from data.table perspective.

Note: unless explicitly mentioned otherwise, by referring to dplyr, we refer to dplyr's data.frame interface whose internals are in C++ using Rcpp.

The data.table syntax is consistent in its form - DT[i, j, by]. Keeping i, j and by together is by design. Keeping related operations together makes it easy to optimise operations for speed and, more importantly, memory usage, and to provide some powerful features, all while maintaining consistency in syntax.

1. Speed

Quite a few benchmarks (though mostly on grouping operations) have been added to the question already showing data.table gets faster than dplyr as the number of groups and/or rows to group by increase, including benchmarks by Matt on grouping from 10 million to 2 billion rows (100GB in RAM) on 100 - 10 million groups and varying grouping columns, which also compares pandas.

On benchmarks, it would be great to cover these remaining aspects as well:

Grouping operations involving a subset of rows - i.e., DT[x > val, sum(y), by = z] type operations.

Benchmark other operations such as update and joins.

Also benchmark memory footprint for each operation in addition to runtime.
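As a starting point for the first item, a rough sketch of how a subset-then-group operation could be timed in both packages. The data size, the number of groups and the 0.5 cut-off are arbitrary assumptions, and a single system.time() call is only indicative; it does not measure memory.

```r
library(data.table)
library(dplyr)

set.seed(1)
n <- 1e5                       # assumed size, increase for a real benchmark
DT <- data.table(x = runif(n),
                 y = runif(n),
                 z = sample(1e4, n, replace = TRUE))
DF <- data.frame(x = DT$x, y = DT$y, z = DT$z)

# Grouped sum on a subset of rows, timed once in each package.
t_dt <- system.time(
  r_dt <- DT[x > 0.5, .(s = sum(y)), keyby = z]
)
t_df <- system.time(
  r_df <- DF %>% filter(x > 0.5) %>% group_by(z) %>% summarise(s = sum(y))
)

# Check both approaches agree before comparing timings.
print(all.equal(r_dt$s, r_df$s))
print(rbind(data.table = t_dt, dplyr = t_df))
```

Timings depend heavily on package versions, hardware and the number of groups, so any single run should be read with caution.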

2. Memory usage

Operations involving filter() or slice() in dplyr can be memory inefficient (on both data.frames and data.tables). See this post.

Note that Hadley's comment talks about speed (that dplyr is plentiful fast for him), whereas the major concern here is memory.

data.table interface at the moment allows one to modify/update columns by reference (note that we don't need to re-assign the result back to a variable).

# sub-assign by reference, updates 'y' in-place
DT[x >= 1L, y := NA]

But dplyr will never update by reference. The dplyr equivalent would be (note that the result needs to be re-assigned):

# copies the entire 'y' column
ans <- DF %>% mutate(y = replace(y, which(x >= 1L), NA))
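A small sketch of what "by reference" means in practice, using toy data made up for this illustration. address() is exported by data.table; the variable names are hypothetical.

```r
library(data.table)
library(dplyr)

DT <- data.table(x = 0:3, y = 11:14)
DF <- data.frame(x = 0:3, y = 11:14)

alias <- DT                    # a second name for the same object, not a copy
addr_before <- address(DT)

# Sub-assign by reference: no re-assignment of the result is needed.
DT[x >= 1L, y := NA_integer_]

same_object <- identical(address(DT), addr_before)
print(same_object)             # the data.table was not reallocated
print(alias$y)                 # the alias sees the update as well

# dplyr returns a new object; DF itself is left untouched.
ans <- DF %>% mutate(y = replace(y, which(x >= 1L), NA))
print(DF$y)
print(ans$y)
```

The alias picking up the change is exactly the referential transparency concern discussed next.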

A concern for this is referential transparency. Updating a data.table object by reference, especially within a function may not be always desirable. But this is an incredibly useful feature: see this and this posts for interesting cases. And we want to keep it.

Therefore we are working towards exporting shallow() function in data.table that will provide the user with both possibilities. For example, if it is desirable to not modify the input data.table within a function, one can then do:

foo <- function(DT) {
    DT = shallow(DT)          ## shallow copy DT
    DT[, newcol := 1L]        ## does not affect the original DT
    DT[x > 2L, newcol := 2L]  ## no need to copy (internally), as this column exists only in shallow copied DT
    DT[x > 2L, x := 3L]       ## have to copy (like base R / dplyr does always); otherwise original DT will
                              ## also get modified.
}

By not using shallow(), the old functionality is retained:

bar <- function(DT) {
    DT[, newcol := 1L]        ## old behaviour, original DT gets updated by reference
    DT[x > 2L, x := 3L]       ## old behaviour, update column x in original DT.
}

By creating a shallow copy using shallow(), we understand that you don't want to modify the original object. We take care of everything internally to ensure that, while also copying the columns you modify only when it is absolutely necessary. When implemented, this should settle the referential transparency issue altogether while providing the user with both possibilities.
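Since shallow() is described here as not yet exported, a sketch of the same idea with what is available today: copy(), which makes a full (deep) copy and is therefore more expensive than the shallow copy described above. The function names are hypothetical.

```r
library(data.table)

# Modifies its argument by reference: the caller's table gains a column.
add_flag_inplace <- function(DT) {
  DT[, flag := x > 2L]
  invisible(DT)
}

# Works on a deep copy, so the caller's table is left alone.
# copy() duplicates every column, unlike the proposed shallow().
add_flag_copy <- function(DT) {
  DT <- copy(DT)
  DT[, flag := x > 2L]
  DT[]
}

orig <- data.table(x = 1:4)

out <- add_flag_copy(orig)
untouched <- !("flag" %in% names(orig))
print(untouched)

add_flag_inplace(orig)
modified <- "flag" %in% names(orig)
print(modified)
```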

Also, once shallow() is exported dplyr's data.table interface should avoid almost all copies. So those who prefer dplyr's syntax can use it with data.tables.

But it will still lack many features that data.table provides, including (sub)-assignment by reference.

Aggregate while joining:

Suppose you have two data.tables as follows:

DT1 = data.table(x=c(1,1,1,1,2,2,2,2), y=c("a", "a", "b", "b"), z=1:8, key=c("x", "y"))
#    x y z
# 1: 1 a 1
# 2: 1 a 2
# 3: 1 b 3
# 4: 1 b 4
# 5: 2 a 5
# 6: 2 a 6
# 7: 2 b 7
# 8: 2 b 8
DT2 = data.table(x=1:2, y=c("a", "b"), mul=4:3, key=c("x", "y"))
#    x y mul
# 1: 1 a   4
# 2: 2 b   3

And you would like to get sum(z) * mul for each row in DT2 while joining by columns x,y. We can either:

1) aggregate DT1 to get sum(z), 2) perform a join and 3) multiply (or)

# data.table way
DT1[, .(z = sum(z)), keyby = .(x,y)][DT2][, z := z*mul][]

# dplyr equivalent
DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>%
    right_join(DF2) %>% mutate(z = z * mul)

2) do it all in one go (using by = .EACHI feature):

DT1[DT2, list(z=sum(z) * mul), by = .EACHI]

What is the advantage?

We don't have to allocate memory for the intermediate result.

We don't have to group/hash twice (one for aggregation and other for joining).

And more importantly, the operation we wanted to perform is clear by looking at j in (2).

Check this post for a detailed explanation of by = .EACHI. No intermediate results are materialised, and the join+aggregate is performed all in one go.

Have a look at this, this and this posts for real usage scenarios.

In dplyr you would have to join and aggregate or aggregate first and then join, neither of which is as efficient in terms of memory (which in turn translates to speed).
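The two routes can be checked against each other with a short sketch. The tables mirror DT1 and DT2 above, except that x is stored as integer in both and the join columns are given through on= rather than keys; both changes are assumptions made to keep the example self-contained.

```r
library(data.table)

DT1 <- data.table(x = rep(1:2, each = 4),
                  y = rep(c("a", "a", "b", "b"), 2),
                  z = 1:8)
DT2 <- data.table(x = 1:2, y = c("a", "b"), mul = 4:3)

# Route 1: aggregate, then join, then multiply (materialises an intermediate).
two_step <- DT1[, .(z = sum(z)), keyby = .(x, y)][DT2, on = c("x", "y")][, z := z * mul][]

# Route 2: aggregate while joining. j is evaluated once per row of DT2.
one_step <- DT1[DT2, .(z = sum(z) * mul), on = c("x", "y"), by = .EACHI]

print(two_step)
print(one_step)
```

For the row (x = 1, y = "a") the matching z values are 1 and 2, so the result is (1 + 2) * 4; for (x = 2, y = "b") it is (7 + 8) * 3.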

Update and joins:

Consider the data.table code shown below:

DT1[DT2, col := i.mul]

adds/updates DT1's column col with mul from DT2 on those rows where DT2's key columns match DT1. I don't think there is an exact equivalent of this operation in dplyr, i.e., one that avoids a *_join operation, which would have to copy the entire DT1 just to add a new column to it, which is unnecessary.

Check this post for a real usage scenario.
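A sketch of the update-while-joining form next to the closest dplyr formulation I am aware of, a left_join() followed by a rename. The data mirrors the earlier example, with on= used instead of keys as an assumption for self-containment.

```r
library(data.table)
library(dplyr)

DT1 <- data.table(x = rep(1:2, each = 4),
                  y = rep(c("a", "a", "b", "b"), 2),
                  z = 1:8)
DT2 <- data.table(x = 1:2, y = c("a", "b"), mul = 4:3)

DF1 <- data.frame(x = rep(1:2, each = 4),
                  y = rep(c("a", "a", "b", "b"), 2),
                  z = 1:8)
DF2 <- data.frame(x = 1:2, y = c("a", "b"), mul = 4:3)

# data.table: add column 'col' to DT1 in place, only on matching rows.
DT1[DT2, col := i.mul, on = c("x", "y")]

# dplyr: build a whole new data frame through a join.
joined <- left_join(DF1, DF2, by = c("x", "y")) %>% rename(col = mul)

print(DT1)
print(joined)
```

Rows of DT1 without a match in DT2 are left as NA in both versions.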

To summarise, it is important to realise that every bit of optimisation matters. As Grace Hopper would say, Mind your nanoseconds!

3. Syntax

Let's now look at syntax. Hadley commented here:

Data tables are extremely fast but I think their concision makes it harder to learn and code that uses it is harder to read after you have written it ...

I find this remark pointless because it is very subjective. What we can perhaps try is to contrast consistency in syntax. We will compare data.table and dplyr syntax side-by-side.

We will work with the dummy data shown below:

DT = data.table(x=1:10, y=11:20, z=rep(1:2, each=5))
DF = as.data.frame(DT)

Basic aggregation/update operations.

# case (a)
DT[, sum(y), by = z]                       ## data.table syntax
DF %>% group_by(z) %>% summarise(sum(y))   ## dplyr syntax
DT[, y := cumsum(y), by = z]
ans <- DF %>% group_by(z) %>% mutate(y = cumsum(y))

# case (b)
DT[x > 2, sum(y), by = z]
DF %>% filter(x>2) %>% group_by(z) %>% summarise(sum(y))
DT[x > 2, y := cumsum(y), by = z]
## cumsum() is restricted to the selected rows so that it matches the data.table result
ans <- DF %>% group_by(z) %>% mutate(y = replace(y, which(x > 2), cumsum(y[x > 2])))

# case (c)
DT[, if(any(x > 5L)) y[1L]-y[2L] else y[2L], by = z]
DF %>% group_by(z) %>% summarise(if (any(x > 5L)) y[1L] - y[2L] else y[2L])
DT[, if(any(x > 5L)) y[1L] - y[2L], by = z]
DF %>% group_by(z) %>% filter(any(x > 5L)) %>% summarise(y[1L] - y[2L])

data.table syntax is compact and dplyr's quite verbose. Things are more or less equivalent in case (a).

In case (b), we had to use filter() in dplyr while summarising. But while updating, we had to move the logic inside mutate(). In data.table however, we express both operations with the same logic - operate on rows where x > 2, but in first case, get sum(y), whereas in the second case update those rows for y with its cumulative sum.

This is what we mean when we say the DT[i, j, by] form is consistent.
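A runnable sketch of case (b) on the dummy data, to show that the same i expression drives both the aggregation and the update in data.table. In the dplyr update, cumsum() is applied to y[x > 2] so that both sides accumulate over the selected rows only; that restriction is my reading of the intended equivalence.

```r
library(data.table)
library(dplyr)

DT <- data.table(x = 1:10, y = 11:20, z = rep(1:2, each = 5))
DF <- data.frame(x = 1:10, y = 11:20, z = rep(1:2, each = 5))

# Aggregate on the subset x > 2.
agg_dt <- DT[x > 2, .(s = sum(y)), by = z]
agg_df <- DF %>% filter(x > 2) %>% group_by(z) %>% summarise(s = sum(y))

# Update on the same subset: dplyr moves the condition inside mutate().
upd_df <- DF %>%
  group_by(z) %>%
  mutate(y = replace(y, which(x > 2), cumsum(y[x > 2]))) %>%
  ungroup()

# data.table keeps the condition in i and only changes j.
DT[x > 2, y := cumsum(y), by = z]

print(agg_dt)
print(DT$y)
print(upd_df$y)
```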

Similarly in case (c), when we have an if-else condition, we are able to express the logic "as-is" in both data.table and dplyr. However, if we would like to return just those rows where the if condition is satisfied and skip the others, we cannot use summarise() directly (AFAICT). We have to filter() first and then summarise because summarise() always expects a single value.

While it returns the same result, using filter() here makes the actual operation less obvious.

It might very well be possible to use filter() in the first case as well (does not seem obvious to me), but my point is that we should not have to.

Aggregation / update on multiple columns

# case (a)
DT[, lapply(.SD, sum), by = z]                     ## data.table syntax
DF %>% group_by(z) %>% summarise_each(funs(sum))   ## dplyr syntax
DT[, (cols) := lapply(.SD, sum), by = z]
ans <- DF %>% group_by(z) %>% mutate_each(funs(sum))

# case (b)
DT[, c(lapply(.SD, sum), lapply(.SD, mean)), by = z]
DF %>% group_by(z) %>% summarise_each(funs(sum, mean))

# case (c)
DT[, c(.N, lapply(.SD, sum)), by = z]
DF %>% group_by(z) %>% summarise_each(funs(n(), mean))

In case (a), the code is more or less equivalent. data.table uses the familiar base function lapply(), whereas dplyr introduces *_each() along with a bunch of functions to funs().

data.table's := requires column names to be provided, whereas dplyr generates them automatically.

In case (b), dplyr's syntax is relatively straightforward. Improving aggregations/updates on multiple functions is on data.table's list.

In case (c) though, dplyr would return n() once per column, instead of just once. In data.table, all we need to do is to return a list in j. Each element of the list will become a column in the result. So, we can use, once again, the familiar base function c() to concatenate .N to a list, which returns a list.

Note: Once again, in data.table, all we need to do is return a list in j. Each element of the list will become a column in result. You can use c(), as.list(), lapply(), list() etc... base functions to accomplish this, without having to learn any new functions.

You will need to learn just the special variables - .N and .SD at least. The equivalents in dplyr are n() and . (the dot).
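A sketch of the "return a list in j" idea for case (c), with the count named explicitly. For comparison it uses across(), which is how later dplyr versions (1.0 and up) express column-wise summaries; summarise_each() and funs() shown above have since been deprecated. The column names n, x and y are choices made for this example.

```r
library(data.table)
library(dplyr)

DT <- data.table(x = 1:10, y = 11:20, z = rep(1:2, each = 5))
DF <- data.frame(x = 1:10, y = 11:20, z = rep(1:2, each = 5))

# j returns a list: one element for the count, then one per column in .SD.
res_dt <- DT[, c(list(n = .N), lapply(.SD, sum)), by = z]

# dplyr: the count is computed once, the sums only on the named columns.
res_df <- DF %>%
  group_by(z) %>%
  summarise(n = n(), across(c(x, y), sum))

print(res_dt)
print(res_df)
```

Naming the columns inside across() keeps the freshly created n out of the column-wise sum.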

Joins

dplyr provides separate functions for each type of join, whereas data.table allows joins using the same syntax DT[i, j, by] (and with reason). It also provides an equivalent merge.data.table() function as an alternative.

setkey(DT1, x, y)

# 1. normal join
DT1[DT2]            ## data.table syntax
left_join(DT2, DT1) ## dplyr syntax

# 2. select columns while join
DT1[DT2, .(z, i.mul)]
left_join(select(DT2, x, y, mul), select(DT1, x, y, z))

# 3. aggregate while join
DT1[DT2, .(sum(z) * i.mul), by = .EACHI]
DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>%
    inner_join(DF2) %>% mutate(z = z*mul) %>% select(-mul)

# 4. update while join
DT1[DT2, z := cumsum(z) * i.mul, by = .EACHI]
??

# 5. rolling join
DT1[DT2, roll = -Inf]
??

# 6. other arguments to control output
DT1[DT2, mult = "first"]
??

Some might find a separate function for each join much nicer (left, right, inner, anti, semi etc), whereas others might like data.table's DT[i, j, by], or merge() which is similar to base R.

However dplyr joins do just that. Nothing more. Nothing less.

data.tables can select columns while joining (2), and in dplyr you will need to select() first on both data.frames before joining, as shown above. Otherwise you would materialise the join with unnecessary columns only to remove them later, and that is inefficient.

data.tables can aggregate while joining (3) and also update while joining (4), using the by = .EACHI feature. Why materialise the entire join result to add/update just a few columns?

data.table is capable of rolling joins (5) - roll forward, LOCF, roll backward, NOCB, nearest.

data.table also has mult = argument which selects first, last or all matches (6).

data.table has allow.cartesian = TRUE argument to protect from accidental invalid joins.

Once again, the syntax is consistent with DT[i, j, by] with additional arguments allowing for controlling the output further.
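Points (2) and (6) can be tried out with a short sketch. The tables mirror the earlier example; passing the join columns through on= instead of setkey() is an assumption made to keep it self-contained.

```r
library(data.table)

DT1 <- data.table(x = rep(1:2, each = 4),
                  y = rep(c("a", "a", "b", "b"), 2),
                  z = 1:8)
DT2 <- data.table(x = 1:2, y = c("a", "b"), mul = 4:3)

# (2) select columns while joining: only z and DT2's mul are materialised.
sel <- DT1[DT2, .(z, mul = i.mul), on = c("x", "y")]

# (6) mult = "first": keep only the first matching row of DT1 per row of DT2.
first_match <- DT1[DT2, on = c("x", "y"), mult = "first"]

print(sel)
print(first_match)
```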

do()...

dplyr's summarise is specially designed for functions that return a single value. If your function returns multiple/unequal values, you will have to resort to do(). You have to know beforehand whether each of your functions returns a single value.

DT[, list(x[1], y[1]), by = z]                 ## data.table syntax
DF %>% group_by(z) %>% summarise(x[1], y[1])   ## dplyr syntax
DT[, list(x[1:2], y[1]), by = z]
DF %>% group_by(z) %>% do(data.frame(.$x[1:2], .$y[1]))

DT[, quantile(x, 0.25), by = z]
DF %>% group_by(z) %>% summarise(quantile(x, 0.25))
DT[, quantile(x, c(0.25, 0.75)), by = z]
DF %>% group_by(z) %>% do(data.frame(quantile(.$x, c(0.25, 0.75))))

DT[, as.list(summary(x)), by = z]
DF %>% group_by(z) %>% do(data.frame(as.list(summary(.$x))))

.SD's equivalent is . (the dot).

In data.table, you can throw pretty much anything in j - the only thing to remember is for it to return a list so that each element of the list gets converted to a column.

In dplyr, you cannot do that. You have to resort to do(), depending on how sure you are that your function will always return a single value. And it is quite slow.

Once again, data.table's syntax is consistent with DT[i, j, by]. We can just keep throwing expressions in j without having to worry about these things.
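A sketch of the multi-value case on the dummy data: the same j form handles one quantile or two, while the dplyr side follows the do() route described above. The column name q is a choice made for this example; V1 is the name data.table assigns to an unnamed result.

```r
library(data.table)
library(dplyr)

DT <- data.table(x = 1:10, y = 11:20, z = rep(1:2, each = 5))
DF <- data.frame(x = 1:10, y = 11:20, z = rep(1:2, each = 5))

# One value per group.
q1_dt <- DT[, quantile(x, 0.25), by = z]

# Two values per group: same form, the result simply has two rows per group.
q2_dt <- DT[, quantile(x, c(0.25, 0.75)), by = z]

# dplyr: do() with a data.frame per group.
q2_df <- DF %>%
  group_by(z) %>%
  do(data.frame(q = quantile(.$x, c(0.25, 0.75))))

print(q1_dt)
print(q2_dt)
print(q2_df)
```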

Have a look at this SO question and this one. I wonder if it would be possible to express the answer as straightforward using dplyr's syntax...

To summarise, I have particularly highlighted several instances where dplyr's syntax is either inefficient, limited or fails to make operations straightforward. This is particularly because data.table gets quite a bit of backlash about "harder to read/learn" syntax (like the one pasted/linked above). Most posts that cover dplyr talk about the most straightforward operations. And that is great. But it is important to realise its syntax and feature limitations as well, and I have yet to see a post on it.

data.table has its quirks as well (some of which I have pointed out that we are attempting to fix). We are also attempting to improve data.table's joins as I have highlighted here.

But one should also consider the number of features that dplyr lacks in comparison to data.table.

4. Features

I have pointed out most of the features here and also in this post. In addition:

fread - fast file reader has been available for a long time now.

fwrite - NEW in the current devel, v1.9.7, a parallelised fast file writer is now available. See this post for a detailed explanation on the implementation and #1664 for keeping track of further developments.

Automatic indexing - another handy feature to optimise base R syntax as is, internally.

Ad-hoc grouping: dplyr automatically sorts the results by grouping variables during summarise(), which may not be always desirable.

Numerous advantages in data.table joins (for speed / memory efficiency and syntax) mentioned above.

Non-equi joins: is a NEW feature available from v1.9.7+. It allows joins using other operators <=, <, >, >= along with all other advantages of data.table joins.

Overlapping range joins were implemented in data.table recently. Check this post for an overview with benchmarks.

setorder() function in data.table that allows really fast reordering of data.tables by reference.

dplyr provides interface to databases using the same syntax, which data.table does not at the moment.

data.table provides faster equivalents of set operations from v1.9.7+ (written by Jan Gorecki) - fsetdiff, fintersect, funion and fsetequal with additional all argument (as in SQL).

data.table loads cleanly with no masking warnings and has a mechanism described here for [.data.frame compatibility when passed to any R package. dplyr changes base functions filter, lag and [ which can cause problems; e.g. here and here.
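A few of the features listed above in one short sketch: setorder(), a non-equi join and the set operations. All tables and column names are hypothetical; the functions themselves are exported by data.table in released versions from 1.9.8 on.

```r
library(data.table)

# setorder(): reorder by reference, here descending on x.
ord <- data.table(x = c(2L, 3L, 1L))
setorder(ord, -x)
print(ord$x)

# Non-equi join: find which window each event time falls into.
events  <- data.table(t = c(1, 5, 9))
windows <- data.table(lo = c(0, 4), hi = c(2, 6), id = c("w1", "w2"))
hits <- events[windows, .(id, t = x.t), on = .(t >= lo, t <= hi)]
print(hits)

# Set operations on whole rows, with an SQL-like 'all' argument.
a <- data.table(id = c(1L, 2L, 2L, 3L))
b <- data.table(id = c(2L, 3L, 4L))
common   <- fintersect(a, b)
only_a   <- fsetdiff(a, b)
only_a_m <- fsetdiff(a, b, all = TRUE)   # respects duplicate counts
either   <- funion(a, b)
print(common); print(only_a); print(only_a_m); print(either)
```

In the non-equi join, x.t refers to the event's own time; the event at t = 9 falls in no window and so does not appear in the result.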

Finally:

On databases - there is no reason why data.table cannot provide similar interface, but this is not a priority now. It might get bumped up if users would very much like that feature.. not sure.

On parallelism - Everything is difficult, until someone goes ahead and does it. Of course it will take effort (being thread safe).

Progress is being made currently (in v1.9.7 devel) towards parallelising known time consuming parts for incremental performance gains using OpenMP.