dongpao1926 2014-01-13 21:24
浏览 28
已采纳

Hadoop在我的项目中派上用场吗? [关闭]

A couple days ago I was asked by my company to find requirements to start a project. The project is creating an e-book store. The term simple, but the total amount of data is about 4TB and the number of files are around 500,000.

As my team members use php and mysql, I tried to look around apache for big data. I obviously faced apache haadoop, and mysql-cluster for big data. But after several days of digging on google, I'm now just completely confused! I now have these questions:

  1. Are even these amount of data (4-5TB) considered as big data? (Some sources said that at least 5TB of data should use hadoop, some other said big data for hadoop mean Zetabytes and Petabytes)

  2. Does hadoop ship with it's own special database, or should be used with mysql or etc.?

  3. Does hadoop works only on a cluster, or it works on a single-nod server as fine?

As I faced these terms very recent, I believe that some or all of my questions maybe really silly... But I'll be really grateful if you have other suggestions for this type project.

  • 写回答

2条回答 默认 最新

  • duanjing2013 2014-01-14 00:49
    关注

    Here are my short answers

    • Are even these amount of data (4-5TB) considered as big data? (Some sources said that at least 5TB of data should use hadoop, some other said big data for hadoop mean Zetabytes and Petabytes)

      • Yes and no. For certain usecases, this is not big enough data while for others, it is. Questions that should be asked and answered

      • Is this data is growing. What is the rate of growth.

      • Are you going to run some analytics on this data from time to time
    • Does hadoop ship with it's own special database, or should be used with mysql or etc.?

      • Yes, Hadoop has HDFS file system, which can store flatfile and can be treated like data repository. But that may not be the best solution. You may want to look at NoSQL DBs like Cassandra, HBase, MongoDB
    • Does hadoop works only on a cluster, or it works on a single-nod server as fine?

      • Technically, yes, hadoop can run on a single nod in Pseudo cluster or standalone mode. But that is used only for learning or testing purpose for development. For any production environment you should think of Hadoop clusters spanning multiple VMs.... Minimum I saw in prod was 6 VM.

    As such 5TB is not very big volume for Relational DB (that supports clustering). But cost of supporting relational DB goes up exponentially with capacity. While with Hadoop and just HDFS, the cost is very low.... add Cassandra or HBase...not much difference. But remember, simply using hadoop, you are looking at a high latency system. If your expectation is that Hadoop will answer your queries in real time ...please look out for other solutions. (eg:queries like list all books checked out to Xyz", then just get it from DB... don't use Hadoop for that query).

    Overall my suggestion will be, take a crash course on Hadoop from youtube, cloudera, try to gain some expertise on what is Hadoop and what is not and then decide. Your questions gives an impression , that you have a long learning curve ahead and it is worth taking that challenge.

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论
查看更多回答(1条)

报告相同问题?

悬赏问题

  • ¥15 乘性高斯噪声在深度学习网络中的应用
  • ¥15 运筹学排序问题中的在线排序
  • ¥15 关于docker部署flink集成hadoop的yarn,请教个问题 flink启动yarn-session.sh连不上hadoop,这个整了好几天一直不行,求帮忙看一下怎么解决
  • ¥30 求一段fortran代码用IVF编译运行的结果
  • ¥15 深度学习根据CNN网络模型,搭建BP模型并训练MNIST数据集
  • ¥15 C++ 头文件/宏冲突问题解决
  • ¥15 用comsol模拟大气湍流通过底部加热(温度不同)的腔体
  • ¥50 安卓adb backup备份子用户应用数据失败
  • ¥20 有人能用聚类分析帮我分析一下文本内容嘛
  • ¥30 python代码,帮调试,帮帮忙吧