
Sqoop 1.4.4: Importing Incremental Data from Oracle 10g into Hive 0.13.1 and Updating the Full Table in Hive



Requirement
Import the incremental data of a business base table from Oracle into Hive, and merge it with the current full table to produce an up-to-date full table.
Design
Three tables are involved:
  • Full table: holds the complete base data as of the last synchronization
  • Incremental table: a temporary table holding the new increment
  • Updated full table: the full table after the merge
 
Steps:
  • Step 1: Use Sqoop to import the Oracle table into Hive, simulating the full and incremental tables
  • Step 2: Use Hive to merge the full table and the incremental table into the updated full table, overwriting the current full table

Step 1: Use Sqoop to import the Oracle table into Hive, simulating the full and incremental tables
    To simulate the scenario we need a full table and an incremental table. Since the data source is limited, both come from the OMP_SERVICE table in Oracle: the full table contains all of its rows and is named service_all in Hive, while the incremental table contains only the rows of a chosen time window and is named service_tmp in Hive.
    (1) Full import: export all rows, keeping only the needed columns, into the designated Hive table
    Importing into Hive requires the environment variable for HCatalog (a Hive submodule) to be configured first; add the following to /etc/profile:
    export HCAT_HOME=/home/fulong/Hive/apache-hive-0.13.1-bin/hcatalog
     
    Run the following command to import the data:
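    A minimal sketch of such a full import into service_all; the connection string, host, and credentials below are illustrative placeholders, not values from the original setup:

    sqoop import \
      --connect jdbc:oracle:thin:@//oracledb.example.com:1521/ORCL \
      --username SOMEUSER --password 'SOMEPASS' \
      --table OMP_SERVICE \
      --columns "SERVICE_CODE,SERVICE_NAME,SERVICE_PROCESS,CREATE_TIME,ENABLE_ORG,ENABLE_PLATFORM,IF_DEL" \
      --hive-import \
      --hive-table service_all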
     
     
    (2) Incremental import: export only the rows within the required time range, keeping only the needed columns, into the designated Hive table
    Use the following command to import the data:
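    Again a minimal sketch with the same placeholder connection details; the time window in the --where clause is illustrative:

    sqoop import \
      --connect jdbc:oracle:thin:@//oracledb.example.com:1521/ORCL \
      --username SOMEUSER --password 'SOMEPASS' \
      --table OMP_SERVICE \
      --columns "SERVICE_CODE,SERVICE_NAME,SERVICE_PROCESS,CREATE_TIME,ENABLE_ORG,ENABLE_PLATFORM,IF_DEL" \
      --where "CREATE_TIME >= to_date('2014-08-20 00:00:00','yyyy-mm-dd hh24:mi:ss')" \
      --hive-import \
      --hive-overwrite \
      --hive-table service_tmp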
     
  • Because the overwrite parameter is used, the statement can be re-executed at any time; each run replaces the contents of service_tmp with the latest incremental data.
  • Sqoop also supports importing the results of an arbitrary SQL query; see section "7.2.3. Free-form Query Imports" of http://sqoop.apache.org/docs/1.4.4/SqoopUserGuide.html, and the sketch below.
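    A minimal sketch of a free-form query import, again with placeholder connection details and a hypothetical filter; --query requires the literal $CONDITIONS token and an explicit --target-dir, and with more than one mapper a --split-by column is also needed:

    sqoop import \
      --connect jdbc:oracle:thin:@//oracledb.example.com:1521/ORCL \
      --username SOMEUSER --password 'SOMEPASS' \
      --query "select service_code, service_name from OMP_SERVICE where if_del = '0' and \$CONDITIONS" \
      --split-by SERVICE_CODE \
      --target-dir /user/fulong/service_query \
      -m 2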
    (3) Verify the import: list all tables, count the rows, and inspect the table structures
    hive> show tables;
    OK
    searchlog
    searchlog_tmp
    service_all
    service_tmp
    Time taken: 0.04 seconds, Fetched: 4 row(s)
    hive> select count(*) from service_all;
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=
    Starting Job = job_1407233914535_0013, Tracking URL = http://FBI003:8088/proxy/application_1407233914535_0013/
    Kill Command = /home/fulong/Hadoop/hadoop-2.2.0/bin/hadoop job  -kill job_1407233914535_0013
    Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
    2014-08-21 16:51:47,389 Stage-1 map = 0%,  reduce = 0%
    2014-08-21 16:51:59,816 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 1.36 sec
    2014-08-21 16:52:01,996 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 2.45 sec
    2014-08-21 16:52:07,877 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.96 sec
    2014-08-21 16:52:17,639 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.29 sec
    MapReduce Total cumulative CPU time: 5 seconds 290 msec
    Ended Job = job_1407233914535_0013
    MapReduce Jobs Launched:
    Job 0: Map: 3  Reduce: 1   Cumulative CPU: 5.46 sec   HDFS Read: 687141 HDFS Write: 5 SUCCESS
    Total MapReduce CPU Time Spent: 5 seconds 460 msec
    OK
    6803
    Time taken: 59.386 seconds, Fetched: 1 row(s)
    hive> select count(*) from service_tmp;
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=
    Starting Job = job_1407233914535_0014, Tracking URL = http://FBI003:8088/proxy/application_1407233914535_0014/
    Kill Command = /home/fulong/Hadoop/hadoop-2.2.0/bin/hadoop job  -kill job_1407233914535_0014
    Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
    2014-08-21 16:53:03,951 Stage-1 map = 0%,  reduce = 0%
    2014-08-21 16:53:15,189 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 2.17 sec
    2014-08-21 16:53:16,236 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.38 sec
    2014-08-21 16:53:57,935 Stage-1 map = 100%,  reduce = 22%, Cumulative CPU 3.78 sec
    2014-08-21 16:54:01,811 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.34 sec
    MapReduce Total cumulative CPU time: 5 seconds 340 msec
    Ended Job = job_1407233914535_0014
    MapReduce Jobs Launched:
    Job 0: Map: 3  Reduce: 1   Cumulative CPU: 5.66 sec   HDFS Read: 4720 HDFS Write: 3 SUCCESS
    Total MapReduce CPU Time Spent: 5 seconds 660 msec
    OK
    13
    Time taken: 75.856 seconds, Fetched: 1 row(s)
    hive> describe service_all;
    OK
    service_code            string
    service_name            string
    service_process         string
    create_time             string
    enable_org              string
    enable_platform         string
    if_del                  string
    Time taken: 0.169 seconds, Fetched: 7 row(s)
    hive> describe service_tmp;
    OK
    service_code            string
    service_name            string
    service_process         string
    create_time             string
    enable_org              string
    enable_platform         string
    if_del                  string
    Time taken: 0.117 seconds, Fetched: 7 row(s)
Step 2: Use Hive to merge the full table and the incremental table into the updated full table, overwriting the current full table
    The merge logic is:
    • every row of the tmp table goes into the final table;
    • every row of the all table whose service_code does not appear in the tmp table also goes into the final table.
    The following query produces the updated full-table contents (the UNION ALL is wrapped in a subquery, as this Hive version requires):
    hive> select u.*
        > from (select * from service_tmp
        >       union all
        >       select a.* from service_all a
        >       left outer join service_tmp b on a.service_code = b.service_code
        >       where b.service_code is null) u;
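    To actually replace the current full table with this result, wrap the query in an INSERT OVERWRITE; a sketch of the combined statement (Hive materializes the query result before replacing the table data, so service_all can appear as both source and target):

    hive> insert overwrite table service_all
        > select u.*
        > from (select * from service_tmp
        >       union all
        >       select a.* from service_all a
        >       left outer join service_tmp b on a.service_code = b.service_code
        >       where b.service_code is null) u;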
     
     
     
    Note that there are two forms of syntax for inserting query results into a table:
    • INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
    • INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
     
    At this point, the service_all table in Hive holds the latest data.
    In a real-world scenario this process needs to run on a schedule, combining a shell script with cron, as sketched below.
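    A minimal sketch of such a wrapper script; the paths, the one-day time window, and the crontab schedule are illustrative assumptions:

    #!/bin/bash
    # sync_service.sh -- import yesterday's increment into service_tmp,
    # then merge it into service_all (connection details are placeholders)
    sqoop import \
      --connect jdbc:oracle:thin:@//oracledb.example.com:1521/ORCL \
      --username SOMEUSER --password 'SOMEPASS' \
      --table OMP_SERVICE \
      --columns "SERVICE_CODE,SERVICE_NAME,SERVICE_PROCESS,CREATE_TIME,ENABLE_ORG,ENABLE_PLATFORM,IF_DEL" \
      --where "CREATE_TIME >= trunc(sysdate - 1)" \
      --hive-import --hive-overwrite --hive-table service_tmp

    hive -e "insert overwrite table service_all
             select u.* from (select * from service_tmp
                              union all
                              select a.* from service_all a
                              left outer join service_tmp b on a.service_code = b.service_code
                              where b.service_code is null) u;"

    # Example crontab entry: run the script at 01:00 every day
    # 0 1 * * * /home/fulong/scripts/sync_service.sh >> /home/fulong/logs/sync_service.log 2>&1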
