So far, every MapReduce job we have seen writes its output as a single set of files. In some scenarios, however, it is more convenient to produce several sets of files, or to split one dataset into multiple datasets; for example, separating the log records that belong to different business lines out of a single log file, so that each line of business can be handed its own output.
Anyone who has used the old API will know that it provides org.apache.hadoop.mapred.lib.MultipleOutputFormat and org.apache.hadoop.mapred.lib.MultipleOutputs. The documentation describes MultipleOutputFormat as follows (MultipleOutputs is covered later):
MultipleOutputFormat lets similar records be written to the same dataset. Before writing each record, MultipleOutputFormat calls the generateFileNameForKeyValue method to determine the name of the file the record should be written to. Typically, we subclass MultipleTextOutputFormat and override generateFileNameForKeyValue to return a file name for each output key/value pair. The default implementation of generateFileNameForKeyValue is:
protected String generateFileNameForKeyValue(K key, V value, String name) {
    return name;
}
It simply returns the default name (a part-NNNNN file name such as part-00000). We can override this method in our own class to define our own output path, for example:
public static class PartitionFormat
        extends MultipleTextOutputFormat<NullWritable, Text> {
    @Override
    protected String generateFileNameForKeyValue(
            NullWritable key, Text value, String name) {
        String[] split = value.toString().split(",", -1);
        // split[4] is the quoted country field, e.g. "VN";
        // substring(1, 3) strips the surrounding quotes
        String country = split[4].substring(1, 3);
        return country + "/" + name;
    }
}
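As an aside: the mapper in the complete example below emits NullWritable keys, so TextOutputFormat writes only the value on each output line. If the mapper instead emitted the country code as the key, the old API's MultipleOutputFormat also provides a generateActualKey hook that can drop the routing key from the written record. A minimal sketch (the class name here is illustrative, not from the example):

public static class CountryKeyedFormat
        extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(
            Text key, Text value, String name) {
        // route the record by the country code the mapper emitted as key
        return key.toString() + "/" + name;
    }

    @Override
    protected Text generateActualKey(Text key, Text value) {
        // return null so that only the value is written to the file
        return null;
    }
}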
Back to PartitionFormat: with this override, all records with the same country are written to a file named name under the same directory. The complete example follows:
package com.wyp;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

import java.io.IOException;

/**
 * User: https://www.iteblog.com/
 * Date: 13-11-26
 * Time: 10:02 AM
 */
public class OutputTest {
    public static class MapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        public void map(LongWritable key, Text value,
                        OutputCollector<NullWritable, Text> output,
                        Reporter reporter) throws IOException {
            // pass every input line through unchanged;
            // PartitionFormat decides which file it lands in
            output.collect(NullWritable.get(), value);
        }
    }

    public static class PartitionFormat
            extends MultipleTextOutputFormat<NullWritable, Text> {
        // same as above, omitted
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        JobConf job = new JobConf(conf, OutputTest.class);
        String[] remainingArgs =
                new GenericOptionsParser(conf, args).getRemainingArgs();
        if (remainingArgs.length != 2) {
            System.err.println("Error!");
            System.exit(1);
        }
        Path in = new Path(remainingArgs[0]);
        Path out = new Path(remainingArgs[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);
        job.setJobName("Output");
        job.setMapperClass(MapClass.class);
        job.setInputFormat(TextInputFormat.class);
        job.setOutputFormat(PartitionFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        // map-only job: each map task writes its own part-NNNNN file
        job.setNumReduceTasks(0);
        JobClient.runJob(job);
    }
}
Package the program into a jar file (how to build the jar is not covered here) and run it on Hadoop 2.2.0 (the test data can be downloaded from http://pan.baidu.com/s/1td8xN):
/home/q/hadoop-2.2.0/bin/hadoop jar \
    /export1/tmp/wyp/OutputText.jar com.wyp.OutputTest \
    /home/wyp/apat63_99.txt \
    /home/wyp/out
After the job finishes, look at the results under /home/wyp/out:
[wyp@l-datalog5.data.cn1 ~]$ /home/q/hadoop-2.2.0/bin/hadoop fs \
    -ls /home/wyp/out
.........................many lines omitted here.........................
drwxr-xr-x   - wyp supergroup    0 2013-11-26 14:25 /home/wyp/out/VE
drwxr-xr-x   - wyp supergroup    0 2013-11-26 14:25 /home/wyp/out/VG
drwxr-xr-x   - wyp supergroup    0 2013-11-26 14:25 /home/wyp/out/VN
drwxr-xr-x   - wyp supergroup    0 2013-11-26 14:25 /home/wyp/out/VU
drwxr-xr-x   - wyp supergroup    0 2013-11-26 14:25 /home/wyp/out/YE
.........................many lines omitted here.........................
-rw-r--r--   3 wyp supergroup    0 2013-11-26 14:25 /home/wyp/out/_SUCCESS

[wyp@l-datalog5.data.cn1 ~]$ /home/q/hadoop-2.2.0/bin/hadoop fs \
    -ls /home/wyp/out/VN
Found 2 items
-rw-r--r--   3 wyp supergroup  148 2013-11-26 14:25 /home/wyp/out/VN/part-00000
-rw-r--r--   3 wyp supergroup  566 2013-11-26 14:25 /home/wyp/out/VN/part-00001

[wyp@l-datalog5.data.cn1 ~]$ /home/q/hadoop-2.2.0/bin/hadoop fs \
    -cat /home/wyp/out/VN/part-00001
3430490,1969,3350,1965,"VN","",597185,6,,73,4,43,,0,,,,,,,,,
3630470,1971,4379,1970,"VN","",,1,,244,5,55,,4,,0.375,,22.5,,,,,
3654325,1972,4477,1969,"VN","",,1,,554,1,14,,0,,,,,,,,,
3665081,1972,4526,1970,"VN","",,1,,373,6,66,,1,,0,,3,,,,,
3772710,1973,5072,1972,"VN","",,1,,4,6,65,,1,,0,,8,,,,,
3821853,1974,5296,1971,"VN","",,1,,33,6,69,,1,,0,,23,,,,,
3824277,1974,5310,1970,"VN","",347650,3,,562,1,14,,2,,0.5,,9,,,,0,0
3918104,1975,5793,1972,"VN","",,1,2,4,6,65,5,0,0.4,,0,,18.2,,,,
As the results above show, all records with the same country end up in the same directory (one part file per map task, since the job runs with zero reducers). MultipleOutputFormat is convenient when you want full control over the output file and directory names. Notice, however, that the program above splits the data by rows: each whole record goes to exactly one file. If we instead want to split by columns, writing different fields of a record to different outputs, MultipleOutputFormat cannot help us; this is where MultipleOutputs comes in. MultipleOutputs has been around since very early versions, so let us first see how the official documentation explains it.
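Before turning to the documentation, here is a minimal sketch of what column-based splitting could look like with the old API's MultipleOutputs. It reuses the imports of the example above plus org.apache.hadoop.mapred.lib.MultipleOutputs; the named outputs ("country", "year") and the column indices are illustrative choices, not part of the original example:

// In main(), after creating the JobConf, declare the named outputs:
MultipleOutputs.addNamedOutput(job, "country",
        TextOutputFormat.class, NullWritable.class, Text.class);
MultipleOutputs.addNamedOutput(job, "year",
        TextOutputFormat.class, NullWritable.class, Text.class);

// A mapper that writes individual columns to different named outputs:
public static class ColumnSplitMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, NullWritable, Text> {
    private MultipleOutputs mos;

    @Override
    public void configure(JobConf job) {
        mos = new MultipleOutputs(job);
    }

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<NullWritable, Text> output,
                    Reporter reporter) throws IOException {
        String[] split = value.toString().split(",", -1);
        // column 4 (the country field) goes to the "country" output
        mos.getCollector("country", reporter)
           .collect(NullWritable.get(), new Text(split[4]));
        // column 1 (the grant year) goes to the "year" output
        mos.getCollector("year", reporter)
           .collect(NullWritable.get(), new Text(split[1]));
    }

    @Override
    public void close() throws IOException {
        mos.close();  // flush and close all named outputs
    }
}

Each named output is written to its own files under the job's output directory, separate from the default part files.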