7*4 2009-11-13 07:53

Quickly reading very large tables as dataframes

I have very large tables (30 million rows) that I would like to load as dataframes in R. read.table() has a lot of convenient features, but it seems like there is a lot of logic in the implementation that would slow things down. In my case, I am assuming I know the types of the columns ahead of time, the table does not contain any column headers or row names, and does not have any pathological characters that I have to worry about.

I know that reading in a table as a list using scan() can be quite fast, e.g.:

datalist <- scan('myfile', sep='\t', list(url='', popularity=0, mintime=0, maxtime=0))

But some of my attempts to convert this to a dataframe appear to decrease the performance of the above by a factor of 6:

df <- as.data.frame(scan('myfile', sep='\t', list(url='', popularity=0, mintime=0, maxtime=0)))

Is there a better way of doing this? Or, quite possibly, a completely different approach to the problem?

Reposted from: https://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes


8 answers (the accepted answer is shown below)

  • 叼花硬汉 2009-11-13 10:35

    An update, several years later

    This answer is old, and R has moved on. Tweaking read.table to run a bit faster has precious little benefit. Your options are:

    1. Using fread in data.table for importing data from csv/tab-delimited files directly into R. See mnel's answer, and the sketch after this list.

    2. Using read_table in readr (on CRAN from April 2015). This works much like fread above. The readme in the link explains the difference between the two functions (readr currently claims to be "1.5-2x slower" than data.table::fread).

    3. read.csv.raw from iotools provides a third option for quickly reading CSV files.

    4. Trying to store as much data as you can in databases rather than flat files. (As well as being a better permanent storage medium, data is passed to and from R in a binary format, which is faster.) read.csv.sql in the sqldf package, as described in JD Long's answer, imports data into a temporary SQLite database and then reads it into R. See also: the RODBC package, and the reverse depends section of the DBI package page. MonetDB.R gives you a data type that pretends to be a data frame but is really a MonetDB underneath, increasing performance. Import data with its monetdb.read.csv function. dplyr allows you to work directly with data stored in several types of database.

    5. Storing data in binary formats can also be useful for improving performance. Use saveRDS/readRDS (see below), the h5 or rhdf5 packages for HDF5 format, or write_fst/read_fst from the fst package.
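
    For item 1, a minimal sketch of fread, assuming a tab-delimited file named 'myfile' with the four columns from the question and no header row (the column names are supplied here, not read from the file):

    library(data.table)
    dt <- fread('myfile', sep = '\t', header = FALSE,
                col.names  = c('url', 'popularity', 'mintime', 'maxtime'),
                colClasses = c('character', 'numeric', 'numeric', 'numeric'))
    df <- as.data.frame(dt)  # fread returns a data.table; convert only if you really need a plain data frame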


    The original answer

    There are a couple of simple things to try, whether you use read.table or scan; a combined sketch follows this list.

    1. Set nrows=the number of records in your data (nmax in scan).

    2. Make sure that comment.char="" to turn off interpretation of comments.

    3. Explicitly define the classes of each column using colClasses in read.table.

    4. Setting multi.line=FALSE may also improve performance in scan.
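
    Putting tips 1-4 together, a minimal sketch, assuming roughly 30 million rows and the four tab-separated columns from the question (adjust the row count to your data):

    df <- read.table('myfile', sep = '\t', header = FALSE,
                     nrows = 30e6, comment.char = '',
                     colClasses = c('character', 'numeric', 'numeric', 'numeric'))

    datalist <- scan('myfile', sep = '\t', nmax = 30e6, multi.line = FALSE,
                     what = list(url = '', popularity = 0, mintime = 0, maxtime = 0))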

    If none of these things work, then use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut-down version of read.table based on the results.
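
    For the profiling step, one possible sketch using the built-in Rprof from the utils package (the output file name 'read.prof' is just an assumption):

    Rprof('read.prof')                       # start the sampling profiler
    df <- read.table('myfile', sep = '\t')   # the slow call under investigation
    Rprof(NULL)                              # stop profiling
    summaryRprof('read.prof')                # see where the time is spent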

    The other alternative is filtering your data before you read it into R.

    Or, if the problem is that you have to read it in regularly, then use these methods to read the data in once, then save the data frame as a binary blob with saveRDS, then next time you can retrieve it faster with readRDS.
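
    A minimal sketch of that read-once, reload-fast pattern ('mytable.rds' is just an assumed cache file name):

    df <- read.table('myfile', sep = '\t',
                     colClasses = c('character', 'numeric', 'numeric', 'numeric'))
    saveRDS(df, 'mytable.rds')    # the slow read happens only once

    df <- readRDS('mytable.rds')  # later sessions reload the binary blob quickly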

    This answer was selected by the asker as the best answer.
