Use MapReduce to implement the following functionality:
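The input file has the format shown below.
Each line is the information for one IP packet.
The leading 17,-1001107245,1176302299,8507,11857 is the packet's 5-tuple.
The trailing 07:11.6 is the packet's timestamp. Records with the same 5-tuple are grouped together and sorted by time in ascending order.
Requirement: in the output file, for each 5-tuple, keep only one record per second, with the timestamp being the earliest one in that second. Any help from experts would be appreciated.
For example, given the input: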
17,-1001107245,1176302299,8507,11857 07:11.6
17,-1001107245,1176302299,8507,11857 07:11.7
17,-1001107245,1176302299,8507,11857 07:11.8
17,-1001107245,1176302299,8507,11857 07:12.6
17,-1001107245,1176302299,8507,11857 07:12.7
17,-1001107245,1176302299,8507,11857 07:12.8
17,-1001107245,1176302299,8507,11857 07:12.9
17,-1001107245,1176302299,8507,11857 07:13.0
17,-1001107245,1176302299,8507,11857 07:13.1
17,-1001107245,1176302299,8507,11857 07:13.7
17,-1001107245,1176302299,8507,11857 07:13.8
17,-1001107245,1176302299,8507,11857 07:13.9
17,-1001107245,1176302299,8507,11857 07:14.0
17,-1001107245,1176302299,8507,11857 07:14.1
17,-1001107245,1176302299,8507,11857 07:14.8
17,-1001107245,1176302299,8507,11857 07:14.9
the correct output should be:
17,-1001107245,1176302299,8507,11857 07:11.6
17,-1001107245,1176302299,8507,11857 07:12.6
17,-1001107245,1176302299,8507,11857 07:13.7
17,-1001107245,1176302299,8507,11857 07:14.8
Timestamps that fall within the same second are marked in the same color.
(Note: the real records contain many 5-tuples, not just 17,-1001107245,1176302299,8507,11857.)
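The sampling rule implied by the expected output above can be sketched in plain Java, outside Hadoop. This is only an illustration under stated assumptions: the class and method names (WindowSampler, sample) are hypothetical, timestamps are already converted to milliseconds, and the list holds a single 5-tuple sorted in ascending order. Note that the expected output keeps 07:13.7 rather than 07:13.0, so a "second" here is read as a one-second window that starts at the last kept record, not a calendar second.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class WindowSampler {
    /**
     * Keeps the first timestamp of each window. A new window opens when a
     * timestamp is at least 1000 ms after the start of the current window.
     * Assumption: input is sorted ascending and belongs to one 5-tuple.
     */
    static List<Long> sample(List<Long> sortedMillis) {
        List<Long> kept = new ArrayList<>();
        if (sortedMillis.isEmpty()) {
            return kept;
        }
        long windowStart = sortedMillis.get(0);
        kept.add(windowStart);
        for (long t : sortedMillis) {
            if (t - windowStart >= 1000) {
                windowStart = t;   // this record opens the next window
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The 16 example timestamps 07:11.6 ... 07:14.9 as milliseconds
        // (7 min = 420000 ms, so 07:11.6 = 431600).
        List<Long> in = Arrays.asList(
            431600L, 431700L, 431800L, 432600L, 432700L, 432800L, 432900L,
            433000L, 433100L, 433700L, 433800L, 433900L, 434000L, 434100L,
            434800L, 434900L);
        System.out.println(sample(in)); // [431600, 432600, 433700, 434800]
    }
}
```

The four values printed correspond to 07:11.6, 07:12.6, 07:13.7 and 07:14.8, matching the expected output listed above.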
Below is the program I wrote (the MapReduce part):

    public static class MapClass2 extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line2 = value.toString();
            String str2[] = line2.split("\t");   // split on the tab character
            String Fele2 = str2[0];              // 5-tuple Fele2
            String TIMETAG2 = str2[1];           // timestamp
            // key: 5-tuple Fele2, value: timestamp TIMETAG2
            context.write(new Text(Fele2), new Text(TIMETAG2));
        }
    }

    public static class Reduce2 extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Iterator<Text> iterator = values.iterator();
            String i = iterator.next().toString();
            String j = "";
            String temp = "";
            long c = 0;
            while (iterator.hasNext()) {
                j = iterator.next().toString();
                try {
                    c = compare(i, j);
                } catch (ParseException e) {
                    // TODO Auto-generated catch block
                    e.printStackTrace();
                }
                // j - i: check whether j is at least 1 s later than i
                temp = i;
                if (c >= 1000) {
                    context.write(key, new Text(temp));
                    i = j;
                }
            }
            context.write(key, new Text(i));
            // output key: the 5-tuple
            // output value: the start time of each one-second window
            // format of each output line: 5-tuple, timestamp
        }

        // Time comparison function compare.
        // Tested separately many times: if date2 is more than 1 second
        // later than date1, it returns a value greater than 1000.
        public static long compare(String date1, String date2) throws ParseException {
            SimpleDateFormat sdf = new SimpleDateFormat("mm:ss.S");
            Date d1 = sdf.parse(date1);
            Date d2 = sdf.parse(date2);
            return d2.getTime() - d1.getTime();
        }
    }
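One detail of the compare function may be worth checking. In SimpleDateFormat the pattern letter S is a millisecond count, not a decimal fraction, so ".6" parses as 6 ms rather than 600 ms. With one-digit fractions the >= 1000 test still appears to reach the same decision, but the returned number is not the true elapsed time. The sketch below contrasts the two readings; TimestampMillis, toMillis and sdfDelta are hypothetical names introduced for illustration, and the "mm:ss.f" layout is assumed from the sample data.

```java
import java.math.BigDecimal;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

final class TimestampMillis {
    /** Converts "mm:ss.f" to milliseconds, treating ".f" as a decimal fraction of a second. */
    static long toMillis(String ts) {
        String[] parts = ts.trim().split(":");
        long minutes = Long.parseLong(parts[0]);
        // BigDecimal avoids floating-point rounding on values such as 11.6
        long secondsAsMillis = new BigDecimal(parts[1]).movePointRight(3).longValue();
        return minutes * 60_000L + secondsAsMillis;
    }

    /** Mirrors the post's compare(): difference as SimpleDateFormat sees it. */
    static long sdfDelta(String a, String b) {
        SimpleDateFormat sdf = new SimpleDateFormat("mm:ss.S", Locale.ROOT);
        try {
            return sdf.parse(b).getTime() - sdf.parse(a).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // True elapsed time between 07:12.6 and 07:13.0 is 400 ms.
        System.out.println(toMillis("07:13.0") - toMillis("07:12.6")); // 400
        // SimpleDateFormat reads ".6" as 6 ms, so it reports 1000 - 6.
        System.out.println(sdfDelta("07:12.6", "07:13.0"));            // 994
    }
}
```

Parsing by hand like this keeps the arithmetic in plain long values, which also avoids creating a SimpleDateFormat for every comparison.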
Test 1
Input:
17,-1001107245,1176302299,8507,11857 07:11.6
17,-1001107245,1176302299,8507,11857 07:11.7
17,-1001107245,1176302299,8507,11857 07:11.8
17,-1001107245,1176302299,8507,11857 07:12.6
17,-1001107245,1176302299,8507,11857 07:12.7
17,-1001107245,1176302299,8507,11857 07:12.8
17,-1001107245,1176302299,8507,11857 07:12.9
17,-1001107245,1176302299,8507,11857 07:13.0
17,-1001107245,1176302299,8507,11857 07:13.1
17,-1001107245,1176302299,8507,11857 07:13.7
17,-1001107245,1176302299,8507,11857 07:13.8
17,-1001107245,1176302299,8507,11857 07:13.9
17,-1001107245,1176302299,8507,11857 07:14.0
17,-1001107245,1176302299,8507,11857 07:14.1
17,-1001107245,1176302299,8507,11857 07:14.8
17,-1001107245,1176302299,8507,11857 07:14.9
Output:
17,-1001107245,1176302299,8507,11857 07:11.6
17,-1001107245,1176302299,8507,11857 07:12.6
17,-1001107245,1176302299,8507,11857 07:13.7
17,-1001107245,1176302299,8507,11857 07:14.8
The result of Test 1 is correct.
Test 2
Input:
17,-1001107245,1176302299,8507,11857 07:11.6
17,-1001107245,1176302299,8507,11857 07:11.7
17,-1001107245,1176302299,8507,11857 07:11.8
17,-1001107245,1176302299,8507,11857 07:12.6
17,-1001107245,1176302299,8507,11857 07:12.7
17,-1001107245,1176302299,8507,11857 07:12.8
17,-1001107245,1176302299,8507,11857 07:12.9
17,-1001107245,1176302299,8507,11857 07:13.0
17,-1001107245,1176302299,8507,11857 07:13.1
17,-1001107245,1176302299,8507,11857 07:13.7
17,-1001107245,1176302299,8507,11857 07:13.8
17,-1001107245,1176302299,8507,11857 07:13.9
17,-1001107245,1176302299,8507,11857 07:14.0
17,-1001107245,1176302299,8507,11857 07:14.1
17,-1001107245,1176302299,8507,11857 07:14.8
17,-1001107245,1176302299,8507,11857 07:14.9
17,-1013016906,1243411163,1148,5313 07:05.9
17,-1013016906,1243411163,1148,5313 07:06.3
17,-1013016906,1243411163,1148,5313 07:09.4
17,-1013016906,1243411163,1148,5313 07:10.1
17,-1013016906,1243411163,1148,5313 07:13.1
17,-1013016906,1243411163,1148,5313 07:13.7
(The records for 5-tuple 17,-1013016906,1243411163,1148,5313, highlighted in yellow in the original post, are the ones added relative to the Test 1 input.)
Output:
17,-1001107245,1176302299,8507,11857 07:11.6
17,-1001107245,1176302299,8507,11857 07:12.6
17,-1001107245,1176302299,8507,11857 07:13.7
17,-1001107245,1176302299,8507,11857 07:14.8
17,-1013016906,1243411163,1148,5313 07:05.9
17,-1013016906,1243411163,1148,5313 07:09.4
17,-1013016906,1243411163,1148,5313 07:13.1
(The records for 5-tuple 17,-1013016906,1243411163,1148,5313, highlighted in yellow in the original post, are the ones added relative to the Test 1 output.)
The result of Test 2 is correct.
Test 3
Input:
17,-1001107245,1176302299,8507,11857 07:11.6
17,-1001107245,1176302299,8507,11857 07:11.7
17,-1001107245,1176302299,8507,11857 07:11.8
17,-1001107245,1176302299,8507,11857 07:12.6
17,-1001107245,1176302299,8507,11857 07:12.7
17,-1001107245,1176302299,8507,11857 07:12.8
17,-1001107245,1176302299,8507,11857 07:12.9
17,-1001107245,1176302299,8507,11857 07:13.0
17,-1001107245,1176302299,8507,11857 07:13.1
17,-1001107245,1176302299,8507,11857 07:13.7
17,-1001107245,1176302299,8507,11857 07:13.8
17,-1001107245,1176302299,8507,11857 07:13.9
17,-1001107245,1176302299,8507,11857 07:14.0
17,-1001107245,1176302299,8507,11857 07:14.1
17,-1001107245,1176302299,8507,11857 07:14.8
17,-1001107245,1176302299,8507,11857 07:14.9
17,-1013016906,1243411163,1148,5313 07:05.9
17,-1013016906,1243411163,1148,5313 07:06.3
17,-1013016906,1243411163,1148,5313 07:09.4
17,-1013016906,1243411163,1148,5313 07:10.1
17,-1013016906,1243411163,1148,5313 07:13.1
17,-1013016906,1243411163,1148,5313 07:13.7
17,-1013468817,1176302299,20557,10023 07:04.2
17,-1013468817,1176302299,20557,10023 07:04.3
17,-1013468817,1176302299,20557,10023 07:04.4
17,-1013468817,1176302299,20557,10023 07:04.7
17,-1013468817,1176302299,20557,10023 07:05.4
17,-1013468817,1176302299,20557,10023 07:05.7
17,-1013468817,1176302299,20557,10023 07:06.0
17,-1013468817,1176302299,20557,10023 07:06.3
17,-1013468817,1176302299,20557,10023 07:06.6
17,-1013468817,1176302299,20557,10023 07:06.9
17,-1013468817,1176302299,20557,10023 07:07.1
17,-1013468817,1176302299,20557,10023 07:07.3
17,-1013468817,1176302299,20557,10023 07:07.5
17,-1013468817,1176302299,20557,10023 07:07.7
17,-1013468817,1176302299,20557,10023 07:07.8
17,-1013468817,1176302299,20557,10023 07:08.1
17,-1013468817,1176302299,20557,10023 07:08.3
17,-1013468817,1176302299,20557,10023 07:08.5
17,-1013468817,1176302299,20557,10023 07:08.7
17,-1013468817,1176302299,20557,10023 07:09.0
17,-1013468817,1176302299,20557,10023 07:09.4
17,-1013468817,1176302299,20557,10023 07:09.8
17,-1013468817,1176302299,20557,10023 07:10.3
17,-1013468817,1176302299,20557,10023 07:10.7
(The records for 5-tuple 17,-1013468817,1176302299,20557,10023, highlighted in green in the original post, are the ones added relative to the Test 2 input.)
Output:
17,-1001107245,1176302299,8507,11857 07:11.7
17,-1001107245,1176302299,8507,11857 07:12.7
17,-1001107245,1176302299,8507,11857 07:13.7
17,-1001107245,1176302299,8507,11857 07:14.8
17,-1013016906,1243411163,1148,5313 07:05.9
17,-1013016906,1243411163,1148,5313 07:09.4
17,-1013016906,1243411163,1148,5313 07:13.1
17,-1013468817,1176302299,20557,10023 07:04.3
17,-1013468817,1176302299,20557,10023 07:05.4
17,-1013468817,1176302299,20557,10023 07:06.6
17,-1013468817,1176302299,20557,10023 07:07.7
17,-1013468817,1176302299,20557,10023 07:08.7
17,-1013468817,1176302299,20557,10023 07:09.8
(The records for 5-tuple 17,-1013468817,1176302299,20557,10023, highlighted in green in the original post, are the ones added relative to the Test 2 output.)
The result is incorrect.
For the 5-tuple 17,-1001107245,1176302299,8507,11857, it looks as if 07:11.7 was taken as the first timestamp.
Test 4
Input:
17,-1001107245,1176302299,8507,11857 07:11.6
17,-1001107245,1176302299,8507,11857 07:11.7
17,-1001107245,1176302299,8507,11857 07:11.8
17,-1001107245,1176302299,8507,11857 07:12.6
17,-1001107245,1176302299,8507,11857 07:12.7
17,-1001107245,1176302299,8507,11857 07:12.8
17,-1001107245,1176302299,8507,11857 07:12.9
17,-1001107245,1176302299,8507,11857 07:13.0
17,-1001107245,1176302299,8507,11857 07:13.1
17,-1001107245,1176302299,8507,11857 07:13.7
17,-1001107245,1176302299,8507,11857 07:13.8
17,-1001107245,1176302299,8507,11857 07:13.9
17,-1001107245,1176302299,8507,11857 07:14.0
17,-1001107245,1176302299,8507,11857 07:14.1
17,-1001107245,1176302299,8507,11857 07:14.8
17,-1001107245,1176302299,8507,11857 07:14.9
17,-1001107245,1176302299,8507,11857 07:15.0
17,-1001107245,1176302299,8507,11857 07:15.1
17,-100212812,1176302299,39749,2357 07:04.8
17,-100212812,1176302299,39749,2357 07:09.3
17,-100212812,1176302299,39749,2357 07:13.2
17,-1003646089,1243411163,8000,4003 07:12.4
17,-1009742275,1243411163,8800,3354 07:11.5
17,-1009742275,1243411163,8880,3757 07:07.8
17,-1013016906,1243411163,1148,5313 07:05.9
17,-1013016906,1243411163,1148,5313 07:06.3
17,-1013016906,1243411163,1148,5313 07:09.4
17,-1013016906,1243411163,1148,5313 07:10.1
17,-1013016906,1243411163,1148,5313 07:13.1
17,-1013016906,1243411163,1148,5313 07:13.7
17,-1013468817,1176302299,20557,10023 07:04.2
17,-1013468817,1176302299,20557,10023 07:04.3
17,-1013468817,1176302299,20557,10023 07:04.4
17,-1013468817,1176302299,20557,10023 07:04.7
17,-1013468817,1176302299,20557,10023 07:05.4
……
(There is much more; this is a 1.5 MB .txt file.)
The colors here only serve to separate the different 5-tuples and make them easier to inspect.
Output:
17,-1001107245,1176302299,8507,11857 07:12.9
17,-1001107245,1176302299,8507,11857 07:13.9
17,-1001107245,1176302299,8507,11857 07:14.9
17,-100212812,1176302299,39749,2357 07:13.2
17,-1003646089,1243411163,8000,4003 07:12.4
17,-1009742275,1243411163,8800,3354 07:11.5
17,-1009742275,1243411163,8880,3757 07:07.8
17,-1013016906,1243411163,1148,5313 07:09.4
17,-1013016906,1243411163,1148,5313 07:13.7
17,-1013468817,1176302299,20557,10023 07:09.4
17,-1013468817,1176302299,20557,10023 07:10.7
17,-1013468817,1176302299,20557,10023 07:11.7
17,-1013468817,1176302299,20557,10023 07:12.7
17,-1013468817,1176302299,20557,10023 07:13.7
17,-1013468817,1176302299,20557,10023 07:14.7
……
The result is incorrect.
I found that the more records there are in the input, the higher the error rate. It also seems that the more input records there are, the later the start time chosen for each 5-tuple's per-second windows (that is, i in the program).
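One possible explanation, which the post does not confirm, is value ordering. MapReduce sorts and groups by key only, and the order in which values for one key reach reduce() is not guaranteed unless a secondary sort is configured. The reducer loop above depends on the first value being the earliest. The plain-Java sketch below applies the same windowing rule after an explicit sort, so the result no longer depends on arrival order. The names OrderIndependentSampler, sample and toMillis are hypothetical, and the sketch assumes all values for one 5-tuple fit in memory.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class OrderIndependentSampler {
    /** "mm:ss.f" to milliseconds, with ".f" read as a decimal fraction. */
    static long toMillis(String ts) {
        String[] parts = ts.trim().split(":");
        return Long.parseLong(parts[0]) * 60_000L
             + new BigDecimal(parts[1]).movePointRight(3).longValue();
    }

    /** Sorts the timestamps of one 5-tuple, then keeps the first of each 1 s window. */
    static List<String> sample(Iterable<String> values) {
        List<String> sorted = new ArrayList<>();
        for (String v : values) {
            sorted.add(v);            // copy first: the values can be read only once
        }
        sorted.sort(Comparator.comparingLong(OrderIndependentSampler::toMillis));

        List<String> kept = new ArrayList<>();
        String windowStart = null;
        for (String t : sorted) {
            if (windowStart == null || toMillis(t) - toMillis(windowStart) >= 1000) {
                windowStart = t;      // this record opens a new window
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Deliberately out of order, as values might arrive at a reducer.
        List<String> shuffled = Arrays.asList(
            "07:11.7", "07:12.6", "07:11.6", "07:11.8", "07:13.7", "07:12.7");
        System.out.println(sample(shuffled)); // [07:11.6, 07:12.6, 07:13.7]
    }
}
```

In an actual reducer each value would need to be copied with toString() before buffering, because Hadoop reuses the same Text object across iterations. For very large groups, a secondary sort on a composite key would avoid holding all values in memory.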