youngwenhao posted on 2017-5-17 15:54:47

Sqoop data transfer issue: from MySQL to Hive

Here is how I am using Sqoop to transfer a database:
1. First use Hive to create a new database on the cluster as the destination of the transfer: hivewattest0
2. After looking up a few basic ways to use Sqoop, I wrote the script below, but it runs very slowly:
#! /bin/sh
source /etc/profile
source /etc/bashrc

# MySQL connection settings and the destination Hive database
CONNECTURL=192.168.1.140
PORTNUM=3306
DBNAME=wattest0
USERNAME=root
PASSWORD=1
HIVEDB=hivewattest0

# write the list of MySQL tables to tmptable.log
echo `sqoop list-tables -connect jdbc:mysql://${CONNECTURL}:${PORTNUM}/${DBNAME} -username ${USERNAME} -password ${PASSWORD}` > tmptable.log

flag=0

# starting from the table named "analysistable", import every table into Hive one at a time
for line in `cat tmptable.log`
do
      if [[ "${line}" == "analysistable" ]]
      then
                flag=1
      fi
      if [[ "${flag}" == "1" ]]
      then
                echo `sqoop import -connect jdbc:mysql://${CONNECTURL}:${PORTNUM}/${DBNAME} -username ${USERNAME} -password ${PASSWORD} -table ${line} -hive-import -hive-table ${HIVEDB}.${line}`
      fi

done
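
A minimal sketch of one way to speed this loop up, assuming the per-table job startup overhead (rather than data volume) is what makes 70 small tables slow: launch a few imports in the background and wait for each batch to finish. MAXJOBS and the batching logic are illustrative only, and whether several concurrent --hive-import runs behave well against a single Hive metastore should be checked first.

#! /bin/bash
source /etc/profile
source /etc/bashrc

CONNECTURL=192.168.1.140
PORTNUM=3306
DBNAME=wattest0
USERNAME=root
PASSWORD=1
HIVEDB=hivewattest0
MAXJOBS=4        # how many imports to keep in flight at once (illustrative value)

# one table name per line
sqoop list-tables --connect jdbc:mysql://${CONNECTURL}:${PORTNUM}/${DBNAME} \
      --username ${USERNAME} --password ${PASSWORD} > tmptable.log

flag=0
count=0
for line in `cat tmptable.log`
do
      if [[ "${line}" == "analysistable" ]]
      then
                flag=1
      fi
      if [[ "${flag}" == "1" ]]
      then
                # run the import in the background instead of waiting for it to finish
                sqoop import --connect jdbc:mysql://${CONNECTURL}:${PORTNUM}/${DBNAME} \
                      --username ${USERNAME} --password ${PASSWORD} \
                      --table ${line} --hive-import --hive-table ${HIVEDB}.${line} &
                count=$((count + 1))
                if [ ${count} -ge ${MAXJOBS} ]
                then
                          wait        # let the current batch finish before starting more
                          count=0
                fi
      fi
done
wait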


3. Part of the log from the run is below. It basically comes from executing the following command, which is where most of the time is spent:
`sqoop import -connect jdbc:mysql://${CONNECTURL}:${PORTNUM}/${DBNAME} -username ${USERNAME} -password ${PASSWORD} -table ${line} -hive-import -hive-table ${HIVEDB}.${line}`


Loading data to table hivewattest0.processsteps    // the newly created DB in Hive and one of the tables in that DB
chgrp: changing ownership of 'hdfs://master:8020/user/hive/warehouse/hivewattest0.db/processsteps/part-m-00000': User does not belong to hive
Table hivewattest0.processsteps stats:
OK
Time taken: 0.652 seconds
Warning: /opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail. Please set $ACCUMULO_HOME to the root of your Accumulo installation.
17/05/17 15:32:57 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.9.2
17/05/17 15:32:57 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
17/05/17 15:32:57 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
17/05/17 15:32:57 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
17/05/17 15:32:58 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
17/05/17 15:32:58 INFO tool.CodeGenTool: Beginning code generation
17/05/17 15:32:58 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `productsettings` AS t LIMIT 1
17/05/17 15:32:58 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `productsettings` AS t LIMIT 1
17/05/17 15:32:58 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
Note: /tmp/sqoop-root/compile/ae110202d24cd7c86be1f098c7352529/productsettings.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
17/05/17 15:33:00 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/ae110202d24cd7c86be1f098c7352529/productsettings.jar
17/05/17 15:33:00 WARN manager.MySQLManager: It looks like you are importing from mysql.
17/05/17 15:33:00 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
17/05/17 15:33:00 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
17/05/17 15:33:00 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
17/05/17 15:33:00 INFO mapreduce.ImportJobBase: Beginning import of productsettings
17/05/17 15:33:00 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
17/05/17 15:33:01 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
17/05/17 15:33:01 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.200.101:8032
17/05/17 15:33:05 INFO db.DBInputFormat: Using read commited transaction isolation
17/05/17 15:33:05 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`ID`), MAX(`ID`) FROM `productsettings`
17/05/17 15:33:05 INFO db.IntegerSplitter: Split size: 16; Num splits: 4 from: 1 to: 65
17/05/17 15:33:05 INFO mapreduce.JobSubmitter: number of splits:4
17/05/17 15:33:06 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1494812274304_0083
17/05/17 15:33:06 INFO impl.YarnClientImpl: Submitted application application_1494812274304_0083
17/05/17 15:33:06 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1494812274304_0083/
17/05/17 15:33:06 INFO mapreduce.Job: Running job: job_1494812274304_0083
17/05/17 15:33:14 INFO mapreduce.Job: Job job_1494812274304_0083 running in uber mode : false
17/05/17 15:33:14 INFO mapreduce.Job:  map 0% reduce 0%
17/05/17 15:33:22 INFO mapreduce.Job:  map 25% reduce 0%
17/05/17 15:33:23 INFO mapreduce.Job:  map 50% reduce 0%
17/05/17 15:33:27 INFO mapreduce.Job:  map 75% reduce 0%
17/05/17 15:33:29 INFO mapreduce.Job:  map 100% reduce 0%
17/05/17 15:33:29 INFO mapreduce.Job: Job job_1494812274304_0083 completed successfully
17/05/17 15:33:29 INFO mapreduce.Job: Counters: 30
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=590064
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=400
                HDFS: Number of bytes written=1214
                HDFS: Number of read operations=16
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=8
        Job Counters
                Launched map tasks=4
                Other local map tasks=4
                Total time spent by all maps in occupied slots (ms)=19874
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=19874
                Total vcore-seconds taken by all map tasks=19874
                Total megabyte-seconds taken by all map tasks=20350976
        Map-Reduce Framework
                Map input records=45
                Map output records=45
                Input split bytes=400
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=294
                CPU time spent (ms)=6530
                Physical memory (bytes) snapshot=788549632
                Virtual memory (bytes) snapshot=11102887936
                Total committed heap usage (bytes)=610271232
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=1214
17/05/17 15:33:29 INFO mapreduce.ImportJobBase: Transferred 1.1855 KB in 28.0971 seconds (43.2074 bytes/sec)
17/05/17 15:33:29 INFO mapreduce.ImportJobBase: Retrieved 45 records.
17/05/17 15:33:29 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `productsettings` AS t LIMIT 1
17/05/17 15:33:29 INFO hive.HiveImport: Loading uploaded data into Hive

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/jars/hive-common-1.1.0-cdh5.9.2.jar!/hive-log4j.properties
OK
Time taken: 3.301 seconds




To sum up: the old MySQL DB has roughly 70 tables.
The whole process takes close to twenty minutes. Is there another way to import a MySQL DB into a Hive DB more quickly?
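
For reference, a sketch of a single-invocation alternative: Sqoop's import-all-tables tool can pull every table of the MySQL database into one Hive database without a per-table script loop (internally it still runs one job per table, so the per-job overhead remains). The --hive-database option is documented for Sqoop 1.4.x but should be verified against the CDH 5.9.2 build, and the variables here are the ones defined in the script above.

sqoop import-all-tables \
      --connect jdbc:mysql://${CONNECTURL}:${PORTNUM}/${DBNAME} \
      --username ${USERNAME} --password ${PASSWORD} \
      --hive-import \
      --hive-database ${HIVEDB} \
      -m 4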

jixianqiuxue posted on 2017-5-17 19:45:08

Try a few more parallel map tasks: sqoop import -connect jdbc:mysql://${CONNECTURL}:${PORTNUM}/${DBNAME} -username ${USERNAME} -password ${PASSWORD} -table ${line} -hive-import -m 10 -hive-table ${HIVEDB}.${line}
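
The same suggestion laid out on separate lines, with the --direct fast path that the import log itself recommends added as well. This assumes the mysqldump client is available on the worker nodes, and -m 10 only helps for tables with a reasonable split key (the log shows Sqoop splitting on the ID column):

sqoop import \
      --connect jdbc:mysql://${CONNECTURL}:${PORTNUM}/${DBNAME} \
      --username ${USERNAME} --password ${PASSWORD} \
      --table ${line} \
      --direct \
      -m 10 \
      --hive-import --hive-table ${HIVEDB}.${line}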

youngwenhao posted on 2017-5-18 09:36:53

jixianqiuxue posted on 2017-5-17 19:45
Try a few more parallel map tasks: sqoop import -connect jdbc:mysql://${CONNECTURL}:${PORTNUM}/${DBNAME} -username ${US ...

Hi, thank you. I'll try a few more map tasks in a bit. I also noticed that while the job runs, memory usage on the cluster's master node is very high, while the other hosts use very little.

nextuser posted on 2017-5-18 10:05:25

youngwenhao posted on 2017-5-18 09:36
Hi, thank you. I'll try a few more map tasks in a bit. I also noticed that while the job runs, memory usage on the cluster's ...

The master has quite a few roles; there are probably more services installed on it than on the other hosts.