dongpao1926
2014-01-13 21:24
Viewed 28
Accepted

Will Hadoop come in handy for my project? [closed]

A couple of days ago I was asked by my company to gather requirements for starting a project. The project is an e-book store. It sounds simple, but the total amount of data is about 4 TB, spread across roughly 500,000 files.

As my team members use PHP and MySQL, I looked around the Apache ecosystem for big-data tools. I quickly ran into Apache Hadoop, and MySQL Cluster for big data. But after several days of digging on Google, I'm now completely confused! I now have these questions:

  1. Is this amount of data (4-5 TB) even considered big data? (Some sources say you need at least 5 TB of data before using Hadoop; others say big data for Hadoop means petabytes and zettabytes.)

  2. Does Hadoop ship with its own special database, or should it be used with MySQL or something similar?

  3. Does Hadoop work only on a cluster, or does it run just as well on a single-node server?

Since I came across these terms only very recently, I suspect some or all of my questions may be really silly... but I'd be grateful for any other suggestions for this type of project.



2 Answers

  • duanjing2013 2014-01-14 00:49
    Accepted

    Here are my short answers:

    • Is this amount of data (4-5 TB) even considered big data? (Some sources say you need at least 5 TB of data before using Hadoop; others say big data for Hadoop means petabytes and zettabytes.)

      • Yes and no. For certain use cases this is not big data, while for others it is. Questions that should be asked and answered:

      • Is this data growing? What is the rate of growth?

      • Are you going to run analytics on this data from time to time?
    • Does Hadoop ship with its own special database, or should it be used with MySQL or something similar?

      • Yes, Hadoop has the HDFS file system, which can store flat files and be treated as a data repository. But that may not be the best solution; you may want to look at NoSQL databases like Cassandra, HBase, or MongoDB.
    • Does Hadoop work only on a cluster, or does it run just as well on a single-node server?

      • Technically, yes: Hadoop can run on a single node in pseudo-distributed or standalone mode. But that is used only for learning or for development testing. For any production environment you should plan on a Hadoop cluster spanning multiple VMs; the smallest I've seen in production was 6 VMs.
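    A pseudo-distributed setup on one machine only needs a couple of configuration changes. A minimal sketch, using Hadoop 2.x property names (the port and hostname are illustrative, not a recommendation):

    ```xml
    <!-- core-site.xml: point clients at a local HDFS namenode -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

    <!-- hdfs-site.xml: a single node can only keep one copy of each block -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>
    ```

    This mode exercises the full HDFS/MapReduce code path on one box, which is why it is good for learning but tells you nothing about production sizing.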

    As such, 5 TB is not a very big volume for a relational DB that supports clustering. But the cost of supporting a relational DB grows steeply with capacity, while with Hadoop and plain HDFS the cost is very low; add Cassandra or HBase and there's not much difference. But remember: with plain Hadoop, you are looking at a high-latency batch system. If your expectation is that Hadoop will answer your queries in real time, please look for other solutions. (E.g., for a query like "list all books checked out to Xyz", just get it from the DB; don't use Hadoop for that query.)
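    To make the latency point concrete, here is a hypothetical sketch of that "books checked out to Xyz" lookup. It uses Python's built-in sqlite3 as a stand-in for the asker's MySQL; the table and names are invented for illustration:

    ```python
    import sqlite3

    # A point query like this belongs in the relational DB, not in a
    # high-latency Hadoop batch job. sqlite3 stands in for MySQL here.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE checkouts (user TEXT, book TEXT)")
    conn.executemany(
        "INSERT INTO checkouts VALUES (?, ?)",
        [("xyz", "Moby-Dick"), ("xyz", "Dune"), ("abc", "Emma")],
    )

    # Millisecond-latency lookup via a simple WHERE clause -- no cluster needed
    rows = conn.execute(
        "SELECT book FROM checkouts WHERE user = ?", ("xyz",)
    ).fetchall()
    print([b for (b,) in rows])  # ['Moby-Dick', 'Dune']
    ```

    A Hadoop job answering the same question would scan files across the cluster and take seconds to minutes; the right tool depends on the access pattern, not the data size alone.
    
    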

    Overall, my suggestion is: take a crash course on Hadoop from YouTube or Cloudera, try to gain some understanding of what Hadoop is and is not, and then decide. Your questions give the impression that you have a long learning curve ahead, and it is worth taking on that challenge.

  • dongyan5641 2014-01-13 21:52

    This should be a comment, but it is too long.

    Hadoop is a framework for writing parallel software, originally written by Yahoo. It is loosely based on a framework developed at Google in the early 2000s, which in turn was a parallel implementation of the map-reduce primitives from the Lisp language. You can think of Hadoop as a bunch of libraries that run either on hardware you own or on hardware in the cloud. These libraries provide a programming interface to Java and to other languages. It allows you to take advantage of a cluster of processors and disks (with HDFS). Its major features are scalability and fault tolerance, both very important for large data problems.

    Hadoop implements a programming methodology built around a parallel implementation of map-reduce. That was the original application. Nowadays, lots of things are built on Hadoop. You should start with the Apache project description and the Wikipedia page to learn more.
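    The map-reduce pattern the paragraph above describes can be sketched in a few lines of plain Python. This is a toy, single-process illustration of the idea only; real Hadoop distributes the map, shuffle, and reduce phases across a cluster:

    ```python
    from itertools import groupby

    def map_phase(docs):
        # map: emit (key, value) pairs -- here (word, 1) for a word count
        return [(word, 1) for doc in docs for word in doc.split()]

    def reduce_phase(pairs):
        # shuffle: group pairs by key; reduce: sum the values in each group
        pairs.sort(key=lambda kv: kv[0])
        return {key: sum(v for _, v in group)
                for key, group in groupby(pairs, key=lambda kv: kv[0])}

    docs = ["big data big books", "books about data"]
    counts = reduce_phase(map_phase(docs))
    print(counts)  # {'about': 1, 'big': 2, 'books': 2, 'data': 2}
    ```

    Because each map call and each per-key reduction is independent, the framework can run them on different machines and rerun any that fail, which is where the scalability and fault tolerance come from.
    
    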

    Several databases support interfaces to Hadoop (Aster Data comes to mind). Often when one thinks of "databases" and "Hadoop", one is thinking of Pig, Hive, or a related open-source project.

    As for your question: if your data conforms naturally to a relational database (tables with columns connected by keys), then use a relational database. If you need fast performance in web applications with hierarchical data, then learn about NoSQL solutions, such as MongoDB. If your data has a complex structure, requires scalability, and you have programming skills on your team, then think about a Hadoop-based component in the solution. And, for a large project, multiple technologies are often needed for different components -- real-time operations using NoSQL, reporting using SQL, ad hoc querying using a combination of SQL and Hadoop (for instance).

