立即注册 登录
About云-梭伦科技 返回首页

arsenduan的个人空间 https://aboutyun.com/?1407 [收藏] [复制] [分享] [RSS]

日志

用SQOOP从SQLSERVER导入数据到(HDFS,HIVE,HBASE)

已有 3920 次阅读2015-10-23 14:41

假定: 用户database:InsideTSQL2008, tablename:Employees

注意:如果需要导入的table “No primary key could be found”, 则需要“specify one with --split-by or perform a quential import with '-m 1'.”

准备:从Microsoft官网下载jdbc驱动, 将sqljdbc_3.0/enu/下的sqljdbc4.jar和sqljdbc.jar包导入到$SQOOP_HOME/lib下。

       下载地址:http://www.microsoft.com/zh-cn/download/confirmation.aspx?id=2505

1. import all-tables

${Installation}/bin/sqoop import-all-tables\

--connect 'jdbc:sqlserver://172.16.102.30; username=sa;\ password=1qaz!QAZ;  database=InsideTSQL2008' -m 1 \

--target-dir /user/grid/mydb/

2. import some tables

${Installation}/bin/sqoop import --verbose \

--driver com.microsoft.sqlserver.jdbc.SQLServerDriver\

 --connect 'jdbc:sqlserver://172.16.102.30; username=sa;\

password=1qaz!QAZ;  database=InsideTSQL2008' \

--table Employees --target-dir /user/grid/Employees_db -m 1

3. list-tables

$SQOOP_HOME/bin/sqoop list-tables \

--connect 'jdbc:sqlserver://172.16.102.30; username=sa; \

password=1qaz!QAZ;  database=InsideTSQL2008'

4. create-hive-table

sqoop  import --warehouse-dir "/user/hive/MyWareHouse" \

--hive-overwrite --connect 'jdbc:sqlserver://172.16.102.30;\ 

username=sa;password=1qaz!QAZ;  database=InsideTSQL2008' \

--table Employees --hive-table tblFoo --hive-import -m 1

注意:导入hive以后因为Hive官方默认的配置问题,是看不到已导入的tables的,需要做如下修改:

$HIVE_HOME/conf/hive-site.xml中修改配置如下:

修改一

将:

<property>

  <name>javax.jdo.option.ConnectionURL</name>

  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>

  <description>JDBC connect string for a JDBC metastore</description>

</property>

修改为:

<property>

  <name>javax.jdo.option.ConnectionURL</name>

  <value>jdbc:derby:;databaseName=/home/grid/hive/metastore_db;create=true</value>

  <description>JDBC connect string for a JDBC metastore</description>

</property>

注意:databaseName的地址为metastore_db存放的地方,默认在$HIVE_HOME/下面。

修改二

官方配置文档有错误,做如下修改:

将:

<property>

  <name>hive.server2.thrift.sasl.qop</name>

  <value>auth</auth>

跟正为:

<property>

  <name>hive.server2.thrift.sasl.qop</name>

  <value>auth</value>

修改以上两项后重启hive 即可。

5. list-databases

sqoop list-databases --connect 'jdbc:sqlserver://172.16.102.30;\

username=sa; password=1qaz!QAZ;  database=InsideTSQL2008'

6. list-columns-of-a-table

sqoop import -connect 'jdbc:sqlserver://172.16.102.30; \

username=sa; password=1qaz!QAZ;  database=InsideTSQL2008'\

--table Employees --columns "empid, lastname, firstname, title"\

-m 1

7. Storing data in SequenceFiles, and setting the generated class name to com.foo.Employee

sqoop import -connect 'jdbc:sqlserver://172.16.102.30; \

username=sa; password=1qaz!QAZ;  database=InsideTSQL2008'\

--table Employees --class-name com.foo.Employee \

--as-sequencefile -m 1

8.Specifying the delimiters to use in a text-mode import

sqoop import --connect 'jdbc:sqlserver://172.16.102.30;\

username=sa; password=1qaz!QAZ;  database=InsideTSQL2008' \

--table Employees --fields-terminated-by ',' \

--lines-terminated-by '\n' --optionally-enclosed-by '\"' -m

9. import data like "select ... where..." and with user-defined-delimiters

sqoop import --connect 'jdbc:sqlserver://172.16.102.30;\

username=sa; password=1qaz!QAZ;  database=InsideTSQL2008' \

--table Employees --where "mgrid = 5" --fields-terminated-by ','\

--lines-terminated-by '\n'  --optionally-enclosed-by '\"' -m 1

10. Changing the splitting column from the default(default is the primary key)

sqoop import --connect 'jdbc:sqlserver://172.16.102.30;\

username=sa; password=1qaz!QAZ;  database=InsideTSQL2008'\

--table Employees --where "mgrid = 5" --split-by empi\

--fields-terminated-by ',' --lines-terminated-by '\n' \

--optionally-enclosed-by '\"'

11. append data to an existed table

qoop import --connect 'jdbc:sqlserver://172.16.102.30;\

username=sa; password=1qaz!QAZ;  database=InsideTSQL2008' \

--table Employees --where "mgrid <> 5" --fields-terminated-by ','\

--lines-terminated-by '\n'  --optionally-enclosed-by '\"' \

--split-by empid --append

12 uses validation to validate the import using the table row count and number of rows copied into HDFS

sqoop import --connect 'jdbc:sqlserver://172.16.102.30;\

username=sa; password=1qaz!QAZ;  database=InsideTSQL2008'\

--table Employees --validate -m 1

13. Query Imports

sqoop import --connect 'jdbc:sqlserver://172.16.102.30;\

username=sa; password=1qaz!QAZ;  database=InsideTSQL2008'\

--query "SELECT a.empid, a.lastname FROM Employees AS a WHERE\

a.empid > 2 AND \$CONDITIONS" -split-by a.empid \

--target-dir /user/grid/query_results

注意,--target-dir,-split-by(或m 1)不能少,$CONDITIONS必须有,且为保险起见加反斜杠;

程序执行时会将以上命令展开为:

SELECT a.empid, a.lastname FROM Employees AS a WHERE a.empid > 2 AND  (1 = 0)

14. delete-target-dir

sqoop import --connect 'jdbc:sqlserver://172.16.102.30;\

username=sa; password=1qaz!QAZ;  database=InsideTSQL2008'\

--table Employees -m 1 --delete-target-dir

Delete the import target directory if it exists.

注意,不需要指定要删除哪个directory, “delete-target-dir”删除的就是需要import数据的那个directory

15.  codegen

sqoop codegen --connect 'jdbc:sqlserver://172.16.102.30;\

username=sa; password=1qaz!QAZ;  database=InsideTSQL2008' \

--table Employees

注意 codegen的argument, 不含delete-target-dir, 请参考codegen的common argument, codegen arguments, Output line formatting arguments, Input parsing arguments和Hive arguments

16. import tables with sqoop class in programs

code from : http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.sqoop/sqoop/1.3.0-cdh3u1/com/cloudera/sqoop/manager/SQLServerManager.java

 

package com.cloudera.sqoop.manager;

import java.io.IOException;

import org.apache.commons.logging.Log;

import org.apache.commons.logging.LogFactory;

import com.cloudera.sqoop.SqoopOptions;

import com.cloudera.sqoop.mapreduce.JdbcExportJob;

import com.cloudera.sqoop.mapreduce.SQLServerExportOutputFormat;

import com.cloudera.sqoop.util.ExportException;

                                                            

//Manages connections to SQLServer databases. Requires the SQLServer JDBC driver.

public class  [More ...] SQLServerManager extends GenericJdbcManager {

    public static final Log LOG = LogFactory.getLog(

        SQLServerManager.class.getName());

    // driver class to ensure is loaded when making db connection.

    private static final String DRIVER_CLASS =

    "com.microsoft.sqlserver.jdbc.SQLServerDriver";

    public  [More ...] SQLServerManager(final SqoopOptions opts) {

        super(DRIVER_CLASS, opts);

    }   

 // Export data stored in HDFS into a table in a database.

    @Override

    public void  [More ...] exportTable(ExportJobContext context)

    throws IOException, ExportException {

        context.setConnManager(this);

        JdbcExportJob exportJob = new JdbcExportJob(context, null, null,

            SQLServerExportOutputFormat.class);

        exportJob.runExport();

    }

 // SQLServer does not support the CURRENT_TIMESTAMP() function. Instead it has the notion of keyword CURRENT_TIMESTAMP that resolves to the current time stamp for the database system.

    @Override

    public String  [More ...] getCurTimestampQuery() {

        return "SELECT CURRENT_TIMESTAMP";

    }

}

                                                     


路过

雷人

握手

鲜花

鸡蛋

评论 (0 个评论)

facelist doodle 涂鸦板

您需要登录后才可以评论 登录 | 立即注册

关闭

推荐上一条 /2 下一条