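DistCp: a tool for transferring data between two Hadoop clusters
Official documentation: http://hadoop.apache.org/docs/stable1/distcp2.html
There are currently two versions, DistCp and DistCp2. I tested the tool myself but did not compare how the two perform.
Usage:
Transferring between HDFS clusters of the same version: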
# hadoop distcp hdfs://master1:9000/foo hdfs://master2:9000/foo
16/01/03 17:32:10 INFO tools.DistCp: Input Options: DistCpOptions
{atomicCommit=false, syncFolder=false, deleteMissing=false,
ignoreFailures=false, maxMaps=20, sslConfigurationFile='null',
copyStrategy='uniformsize', sourceFileListing=null,
sourcePaths=[hdfs://master1:9000/foo], targetPath=hdfs://master:9000/foo, targetPathExists=true,
preserveRawXattrs=false}
16/01/03 17:32:10 INFO client.RMProxy: Connecting to ResourceManager at master1/192.168.211.128:8032
16/01/03 17:32:13 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
16/01/03 17:32:13 INFO Configuration.deprecation: io.sort.factor is deprecated. Instead, use mapreduce.task.io.sort.factor
16/01/03 17:32:14 INFO client.RMProxy: Connecting to ResourceManager at master1/192.168.211.128:8032
16/01/03 17:32:15 INFO mapreduce.JobSubmitter: number of splits:2
16/01/03 17:32:16 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1451870356825_0001
16/01/03 17:32:17 INFO impl.YarnClientImpl: Submitted application application_1451870356825_0001
16/01/03 17:32:17 INFO mapreduce.Job: The url to track the job: http://master1:8088/proxy/application_1451870356825_0001/
16/01/03 17:32:17 INFO tools.DistCp: DistCp job-id: job_1451870356825_0001
16/01/03 17:32:17 INFO mapreduce.Job: Running job: job_1451870356825_0001
16/01/03 17:32:38 INFO mapreduce.Job: Job job_1451870356825_0001 running in uber mode : false
16/01/03 17:32:38 INFO mapreduce.Job: map 0% reduce 0%
16/01/03 17:32:58 INFO mapreduce.Job: map 50% reduce 0%
16/01/03 17:33:05 INFO mapreduce.Job: map 100% reduce 0%
16/01/03 17:33:05 INFO mapreduce.Job: Job job_1451870356825_0001 completed successfully
16/01/03 17:33:05 INFO mapreduce.Job: Counters: 33
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=216186
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1196
HDFS: Number of bytes written=66
HDFS: Number of read operations=33
HDFS: Number of large read operations=0
HDFS: Number of write operations=10
Job Counters
Launched map tasks=2
Other local map tasks=2
Total time spent by all maps in occupied slots (ms)=42030
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=42030
Total vcore-seconds taken by all map tasks=42030
Total megabyte-seconds taken by all map tasks=43038720
Map-Reduce Framework
Map input records=4
Map output records=0
Input split bytes=266
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=677
CPU time spent (ms)=2450
Physical memory (bytes) snapshot=180961280
Virtual memory (bytes) snapshot=4115144704
Total committed heap usage (bytes)=33157120
File Input Format Counters
Bytes Read=864
File Output Format Counters
Bytes Written=0
org.apache.hadoop.tools.mapred.CopyMapper$Counter
BYTESCOPIED=66
BYTESEXPECTED=66
COPY=4
The transferred data can now be seen under the foo directory on the other cluster.
Transferring between different HDFS versions:
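As a rough sanity check on a run like the one above, the BYTESCOPIED and BYTESEXPECTED counters in the client log can be compared. The sketch below assumes the log has been saved to a file; check_distcp_log is a hypothetical helper written for this post, not part of Hadoop, and the demo log contains only the two counter lines from the output above.

```shell
#!/bin/sh
# check_distcp_log is a hypothetical helper, not a Hadoop command.
# It succeeds when both counters are present and equal in the given log file.
check_distcp_log() {
    copied=$(sed -n 's/^[[:space:]]*BYTESCOPIED=//p' "$1")
    expected=$(sed -n 's/^[[:space:]]*BYTESEXPECTED=//p' "$1")
    [ -n "$copied" ] && [ "$copied" = "$expected" ]
}

# Demo input: the two counter lines from the run shown above.
LOG=$(mktemp)
printf 'BYTESCOPIED=66\nBYTESEXPECTED=66\n' > "$LOG"

if check_distcp_log "$LOG"; then
    RESULT="complete"
else
    RESULT="incomplete"
fi
echo "$RESULT"
rm -f "$LOG"

# On a real cluster the destination directory can also be listed, e.g.:
#   hadoop fs -ls hdfs://master2:9000/foo
```

Matching counters only show that the files the job attempted were copied in full, so listing the destination directory is still worth doing.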
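To copy between different Hadoop versions, use HftpFileSystem. It is a read-only file system, so distcp must be run on the destination cluster (more precisely, on TaskTrackers that can write to the destination cluster). The source is specified as hftp://<dfs.http.address>/<path>, where dfs.http.address defaults to <namenode>:50070. The port 9000 in my same-version test above is a different port: it belongs to the hdfs:// addresses, not to dfs.http.address.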
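A minimal sketch of what such a cross-version copy might look like, reusing the host names from the same-version example. It assumes master1 is the source NameNode with dfs.http.address left at the default port 50070, and master2 is the destination NameNode listening on 9000. I did not run this exact command, so DRY_RUN defaults to printing the command instead of executing it.

```shell
#!/bin/sh
# Run on the DESTINATION cluster: hftp:// is read-only, so the job must write locally.
# Assumptions (not from a tested run): source HTTP address master1:50070,
# destination hdfs:// address master2:9000, path /foo.
SRC="hftp://master1:50070/foo"
DST="hdfs://master2:9000/foo"
CMD="hadoop distcp $SRC $DST"

# DRY_RUN=1 (the default here) only prints the command; set DRY_RUN=0 to execute it.
DRY_RUN=${DRY_RUN:-1}
if [ "$DRY_RUN" = "1" ]; then
    echo "$CMD"
else
    $CMD
fi
```

The only differences from the same-version command are the hftp:// scheme and the HTTP port on the source side; the destination is still addressed with hdfs://.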