Help: how do I search for a specific word with hadoop-mapreduce-examples-2.7.3.jar?
The previous two blog posts both used this jar when testing Hadoop code, so it is well worth digging into its source.
Before reading the source, it helps to first write a WordCount of our own. The code is as follows:
package mytest;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: split each input line into tokens and emit (word, 1) per token.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer (also used as combiner): sum up the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // args[0]: input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // args[1]: output dir (must not exist yet)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
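For intuition, given a hypothetical input file containing the two lines

hello hadoop
hello world

the job writes one tab-separated (word, count) pair per line to the output directory:

hadoop	1
hello	2
world	1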
Checking the relevant source shows that compiling it needs two jars: hadoop-common-2.7.0.jar and hadoop-mapreduce-client-core-2.7.0.jar.
After exporting the project from MyEclipse as a Runnable JAR, run:

~/hadoop-2.7.0/bin/hadoop jar my.jar mytest.WordCount /user/hadoop/input /user/hadoop/output3

The test succeeds.
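If you would rather not use MyEclipse, a plain command-line build along these lines should also work (the jar paths below assume the stock 2.7.0 layout under /home/hadoop):

mkdir classes
javac -classpath /home/hadoop/hadoop-2.7.0/share/hadoop/common/hadoop-common-2.7.0.jar:/home/hadoop/hadoop-2.7.0/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.7.0.jar -d classes WordCount.java
jar cf my.jar -C classes .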
Because the class lives in package mytest, the fully qualified name mytest.WordCount must be used when running it.
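As a sanity check, running it without the package prefix, e.g.

~/hadoop-2.7.0/bin/hadoop jar my.jar WordCount /user/hadoop/input /user/hadoop/output3

should fail with java.lang.ClassNotFoundException: WordCount, since the main class is looked up by its fully qualified name.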
Recall, however, that the earlier commands ran fine without any mytest.-style package prefix. Let's dig into the source to see why. First locate the examples jar and its sources:

find ~/ -name "*hadoop-mapreduce-examples*"

The output is:

/home/hadoop/hadoop-2.7.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar
/home/hadoop/hadoop-2.7.0/share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.7.0-sources.jar
/home/hadoop/hadoop-2.7.0/share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.7.0-test-sources.jar
/home/hadoop/hadoop-2.7.0/share/doc/hadoop/hadoop-mapreduce-examples
Unpack hadoop-mapreduce-examples-2.7.0-sources.jar and import it into MyEclipse to browse the source.
Searching the source for the string "grep" shows it appears in ExampleDriver.java, which looks like the entry point of the jar.
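This also answers the question in the title: the grep program registered there searches the input for lines matching a regular expression, taking an input dir, an output dir, and the pattern. Assuming the word to look for is hello (a placeholder), the call would look like:

~/hadoop-2.7.0/bin/hadoop jar ~/hadoop-2.7.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep /user/hadoop/input /user/hadoop/grep-out hello

The output directory then lists each match together with its count.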
So how does a Runnable JAR know which class is its entry point? Unpacking the jar reveals the following in META-INF/MANIFEST.MF:

Main-Class: org.apache.hadoop.examples.ExampleDriver
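You can also read the manifest without unpacking the whole jar, for example:

unzip -p /home/hadoop/hadoop-2.7.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar META-INF/MANIFEST.MF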
So a Runnable JAR can carry a default entry point, and MyEclipse lets you set it when exporting the jar.
Import ExampleDriver.java into your own project, tweak it a little, and test. Run:

~/hadoop-2.7.0/bin/hadoop jar my.jar wordcount /user/hadoop/input /user/hadoop/output4
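For reference, a minimal sketch of such a trimmed-down driver might look like this (MyDriver is a hypothetical name; it keeps only the wordcount command, whereas ExampleDriver registers many more programs via org.apache.hadoop.util.ProgramDriver):

package mytest;

import org.apache.hadoop.util.ProgramDriver;

// A stripped-down ExampleDriver: maps a command name to a main class,
// so "hadoop jar my.jar wordcount in out" works without a package prefix.
public class MyDriver {
    public static void main(String[] args) {
        int exitCode = -1;
        ProgramDriver pgd = new ProgramDriver();
        try {
            // Register our WordCount under the command name "wordcount".
            pgd.addClass("wordcount", WordCount.class,
                    "A map/reduce program that counts the words in the input files.");
            exitCode = pgd.run(args);
        } catch (Throwable e) {
            e.printStackTrace();
        }
        System.exit(exitCode);
    }
}

Export the jar with Main-Class set to mytest.MyDriver and it accepts wordcount as a sub-command, just like the stock examples jar.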
Much of the detail is best seen in the source itself; anything unusual can be analyzed in depth later.
Tip: inspecting the logs shows where output goes.
System.out.println output from map and reduce ends up in the task logs.
System.out.println output from main is printed to the console.
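A quick way to confirm this is to drop a probe into each spot:

// Inside TokenizerMapper.map(): appears in the task's stdout log
// (view it via "yarn logs -applicationId <appId>" or the web UI).
System.out.println("map saw: " + value);

// Inside main(): appears directly in the terminal that launched the job.
System.out.println("job starting");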