douhong4452 2017-05-03 13:27

Hadoop starting point

Every month I receive a CSV file of around 2 GB. I import this file into a table in a MySQL database, and the import is almost instant.

Then, using PHP, I query this table, filter the data, and write the relevant rows to several other tables. This takes several days, even though all of the queries are optimized.

I want to move this workload to Hadoop but do not understand what the starting point should be. I am studying Hadoop and I know this can be done using Sqoop, but I am still confused about where to start migrating this data to Hadoop.
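
For reference, the Sqoop route mentioned above comes down to a single import command that copies a MySQL table into HDFS. A minimal sketch, in which the host, database, credentials, table, and paths are all placeholders:

    # Pull one MySQL table into HDFS as delimited files
    # (every name below is a placeholder, not a real setup).
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/mydb \
      --username etl \
      --password-file /user/etl/.mysql-password \
      --table monthly_data \
      --target-dir /data/monthly_data \
      --num-mappers 4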


1 answer

  • drra6593 2017-05-04 07:55

    Use Apache Spark, possibly in Python, as it is easy to get started with. Spark may be overkill here, but given its speed and scalability there is no harm in putting in some extra effort.

    You might want to switch to a database that Spark provides direct APIs for (Hive, HBase, etc.). This is optional, though: with a little extra code you can keep writing to MySQL if you don't want to change.

    The overall design would be like this:

    • Your monthly CSV file will land in a known location on HDFS.
    • A Spark application will read this file, apply the transformations, and write the results to MySQL (or any other storage); a minimal sketch follows this list.
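
    A minimal sketch of that design in PySpark (Spark 2.x), assuming the MySQL JDBC driver is on the classpath; the paths, column names, and credentials below are placeholders:

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("monthly-csv-etl").getOrCreate()

        # Read the monthly CSV from its known HDFS location.
        df = spark.read.csv("hdfs:///data/incoming/monthly.csv",
                            header=True, inferSchema=True)

        # Placeholder transformation: filter and project the relevant data.
        relevant = df.filter(df["status"] == "active") \
                     .select("id", "amount", "created_at")

        # Write the results to MySQL over JDBC (or swap in any other sink).
        relevant.write.jdbc(
            url="jdbc:mysql://dbhost:3306/mydb",
            table="relevant_data",
            mode="append",
            properties={"user": "etl", "password": "secret",
                        "driver": "com.mysql.jdbc.Driver"})

        spark.stop()

    The same job runs unchanged on one machine or on a cluster; only the Spark master configuration differs.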

    Systems involved:

    • HDFS
    • Spark
    • MySQL/other storage
    • Optional cluster to make it scalable
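
    To run the job, something like the command below works, assuming the script above is saved as monthly_etl.py; the connector JAR path and version are placeholders, and on a cluster you would add a master option such as --master yarn:

        spark-submit \
          --jars /path/to/mysql-connector-java-5.1.42.jar \
          monthly_etl.py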
    This answer was selected as the best answer by the asker.
