2017-05-03 13:27



Every month I receive a CSV file, around 2 GB in size. I import this file into a table in a MySQL database, and this is almost instant.

Then, using PHP, I query this table, filter the data, and write the relevant data to several other tables. This takes several days, even though all queries are optimized.

I want to move this data to Hadoop, but I do not understand what the starting point should be. I am studying Hadoop and I know this can be done using Sqoop, but I am still confused about where to begin migrating this data to Hadoop.



  • drra6593 · 4 years ago

    Use Apache Spark, maybe in Python, as it is easy to get started with. Spark may be overkill here, but given its speed and scalability there is no harm in putting in some extra effort.

    You might want to switch to one of the databases Spark provides direct APIs for (Hive, HBase, etc.). This is optional, though: with a little extra code you can keep writing to MySQL if you don't want to change.

    The overall design would be like this:

    • Your monthly CSV file will be at a known location on HDFS.
    • A Spark application will read this file, apply any transformations, and write the results to MySQL (or other storage).
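    The two steps above can be sketched in PySpark. This is a minimal, hypothetical outline: the HDFS path, column names, filter condition, and JDBC connection details are all placeholders, and writing to MySQL requires the MySQL JDBC driver on Spark's classpath:

    ```python
    # Hypothetical sketch: read the monthly CSV from HDFS, filter it,
    # and write the relevant rows to MySQL over JDBC.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("monthly-csv-etl")
             .getOrCreate())

    # Read the CSV with a header row, letting Spark infer column types.
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("hdfs:///data/monthly/input.csv"))

    # Placeholder filter and projection; replace with your real logic.
    relevant = (df.filter(df["status"] == "active")
                  .select("id", "amount", "created_at"))

    # Write the filtered rows back to MySQL via JDBC.
    (relevant.write
     .format("jdbc")
     .option("url", "jdbc:mysql://dbhost:3306/mydb")
     .option("dbtable", "relevant_data")
     .option("user", "etl_user")
     .option("password", "...")  # use a secrets mechanism in practice
     .mode("append")
     .save())

    spark.stop()
    ```

    Because Spark parallelizes both the read and the transformations across cores (or cluster nodes), this kind of job typically finishes in minutes rather than the days a row-by-row PHP loop takes.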

    Systems involved:

    • HDFS
    • Spark
    • MySql/other storage
    • Optional cluster to make it scalable