Learning Flume (9): Custom Interceptors

Last edited by 坎蒂丝_Swan on 2015-3-19 22:11
Reading guide

1. How do you add the RegexExtractorExtInterceptor interceptor?
2. How does the modified code add the two new configuration parameters?

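We are still working on the requirement from part 8, but this time we implement it a different way: with an interceptor.

Recall that the spooldir source can write the file name into each event's headers under the key basename. Now suppose an interceptor could intercept the event, read the value of that header, split it into three parts, and put each part back into the headers. That would satisfy the requirement.

Unfortunately, Flume does not provide an interceptor that extracts from headers. It does have one that extracts from the body, RegexExtractorInterceptor, which looks quite powerful. Here is an example from the official documentation: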

If the Flume event body contained 1:2:3.4foobar5 and the following configuration was used

a1.sources.r1.interceptors.i1.regex = (\\d):(\\d):(\\d)
a1.sources.r1.interceptors.i1.serializers = s1 s2 s3
a1.sources.r1.interceptors.i1.serializers.s1.name = one
a1.sources.r1.interceptors.i1.serializers.s2.name = two
a1.sources.r1.interceptors.i1.serializers.s3.name = three
The extracted event will contain the same body but the following headers will have been added one=>1, two=>2, three=>3

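In other words, with this configuration, if the event body contains something like 1:2:3.4foobar5, the parts captured by the regex groups are extracted and written into the event headers.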
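To make the documented example concrete, here is a minimal sketch of the same group-to-header mapping using only java.util.regex. It does not use Flume at all: the class name RegexExtractDemo and the helper extract are made up for illustration, and a plain Map stands in for the event headers.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone demo; not part of Flume.
public class RegexExtractDemo {

    // Maps capture group i (1-based) to names.get(i - 1).
    // Groups without a corresponding name are skipped.
    static Map<String, String> extract(String body, String regex, List<String> names) {
        Map<String, String> headers = new LinkedHashMap<>();
        Matcher matcher = Pattern.compile(regex).matcher(body);
        if (matcher.find()) {
            int groups = Math.min(matcher.groupCount(), names.size());
            for (int i = 1; i <= groups; i++) {
                headers.put(names.get(i - 1), matcher.group(i));
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        Map<String, String> headers = extract(
                "1:2:3.4foobar5", "(\\d):(\\d):(\\d)",
                Arrays.asList("one", "two", "three"));
        System.out.println(headers); // prints {one=1, two=2, three=3}
    }
}
```

Passing only two names, as in the second example of the Flume documentation, would leave the third group unused.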

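So I decided to build on this interceptor: with a small code change, switching from matching the body to matching a specific header key, it should do the job. The source turned out to be tidy and easy to modify. Here is the new interceptor I added, RegexExtractorExtInterceptor: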

package com.besttone.flume;

import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.lang.StringUtils;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import org.apache.flume.interceptor.RegexExtractorInterceptorPassThroughSerializer;
import org.apache.flume.interceptor.RegexExtractorInterceptorSerializer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.google.common.base.Charsets;
import com.google.common.base.Preconditions;
import com.google.common.base.Throwables;
import com.google.common.collect.Lists;

/**
 * Interceptor that extracts matches using a specified regular expression and
 * appends the matches to the event headers using the specified serializers</p>
 * Note that all regular expression matching occurs through Java's built in
 * java.util.regex package</p>. Properties:
 * <p>
 * regex: The regex to use
 * <p>
 * serializers: Specifies the group the serializer will be applied to, and the
 * name of the header that will be added. If no serializer is specified for a
 * group the default {@link RegexExtractorInterceptorPassThroughSerializer} will
 * be used
 * <p>
 * Sample config:
 * <p>
 * agent.sources.r1.channels = c1
 * <p>
 * agent.sources.r1.type = SEQ
 * <p>
 * agent.sources.r1.interceptors = i1
 * <p>
 * agent.sources.r1.interceptors.i1.type = REGEX_EXTRACTOR
 * <p>
 * agent.sources.r1.interceptors.i1.regex = (WARNING)|(ERROR)|(FATAL)
 * <p>
 * agent.sources.r1.interceptors.i1.serializers = s1 s2
 * agent.sources.r1.interceptors.i1.serializers.s1.type =
 * com.blah.SomeSerializer agent.sources.r1.interceptors.i1.serializers.s1.name
 * = warning agent.sources.r1.interceptors.i1.serializers.s2.type =
 * org.apache.flume.interceptor.RegexExtractorInterceptorTimestampSerializer
 * agent.sources.r1.interceptors.i1.serializers.s2.name = error
 * agent.sources.r1.interceptors.i1.serializers.s2.dateFormat = yyyy-MM-dd
 * </code>
 * </p>
 * 
 * <pre>
 * Example 1:
 * </p>
 * EventBody: 1:2:3.4foobar5</p> Configuration:
 * agent.sources.r1.interceptors.i1.regex = (\\d):(\\d):(\\d)
 * </p>
 * agent.sources.r1.interceptors.i1.serializers = s1 s2 s3
 * agent.sources.r1.interceptors.i1.serializers.s1.name = one
 * agent.sources.r1.interceptors.i1.serializers.s2.name = two
 * agent.sources.r1.interceptors.i1.serializers.s3.name = three
 * </p>
 * results in an event with the the following
 * 
 * body: 1:2:3.4foobar5 headers: one=>1, two=>2, three=3
 * 
 * Example 2:
 * 
 * EventBody: 1:2:3.4foobar5
 * 
 * Configuration: agent.sources.r1.interceptors.i1.regex = (\\d):(\\d):(\\d)
 * <p>
 * agent.sources.r1.interceptors.i1.serializers = s1 s2
 * agent.sources.r1.interceptors.i1.serializers.s1.name = one
 * agent.sources.r1.interceptors.i1.serializers.s2.name = two
 * <p>
 * 
 * results in an event with the the following
 * 
 * body: 1:2:3.4foobar5 headers: one=>1, two=>2
 * </pre>
 */
public class RegexExtractorExtInterceptor implements Interceptor {

        static final String REGEX = "regex";
        static final String SERIALIZERS = "serializers";

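        // begin added code

        // Config key: whether to match against a header instead of the body.
        static final String EXTRACTOR_HEADER = "extractorHeader";
        static final boolean DEFAULT_EXTRACTOR_HEADER = false;
        // Config key: name of the header to match against.
        static final String EXTRACTOR_HEADER_KEY = "extractorHeaderKey";

        // end added code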

        private static final Logger logger = LoggerFactory
                        .getLogger(RegexExtractorExtInterceptor.class);

        private final Pattern regex;
        private final List<NameAndSerializer> serializers;

        // begin added code

        private final boolean extractorHeader;
        private final String extractorHeaderKey;

        // end added code

        private RegexExtractorExtInterceptor(Pattern regex,
                        List<NameAndSerializer> serializers, boolean extractorHeader,
                        String extractorHeaderKey) {
                this.regex = regex;
                this.serializers = serializers;
                this.extractorHeader = extractorHeader;
                this.extractorHeaderKey = extractorHeaderKey;
        }

        @Override
        public void initialize() {
                // NO-OP...
        }

        @Override
        public void close() {
                // NO-OP...
        }

        @Override
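        public Event intercept(Event event) {
                String tmpStr;
                if (extractorHeader) {
                        // Match against the configured header value instead of the body.
                        tmpStr = event.getHeaders().get(extractorHeaderKey);
                        if (tmpStr == null) {
                                // Header is absent: nothing to extract, so pass the
                                // event through (regex.matcher(null) would throw).
                                return event;
                        }
                } else {
                        tmpStr = new String(event.getBody(), Charsets.UTF_8);
                }

                Matcher matcher = regex.matcher(tmpStr);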
                Map<String, String> headers = event.getHeaders();
                if (matcher.find()) {
                        for (int group = 0, count = matcher.groupCount(); group < count; group++) {
                                int groupIndex = group + 1;
                                if (groupIndex > serializers.size()) {
                                        if (logger.isDebugEnabled()) {
                                                logger.debug(
                                                                "Skipping group {} to {} due to missing serializer",
                                                                group, count);
                                        }
                                        break;
                                }
                                NameAndSerializer serializer = serializers.get(group);
                                if (logger.isDebugEnabled()) {
                                        logger.debug("Serializing {} using {}",
                                                        serializer.headerName, serializer.serializer);
                                }
                                headers.put(serializer.headerName, serializer.serializer
                                                .serialize(matcher.group(groupIndex)));
                        }
                }
                return event;
        }

        @Override
        public List<Event> intercept(List<Event> events) {
                List<Event> intercepted = Lists.newArrayListWithCapacity(events.size());
                for (Event event : events) {
                        Event interceptedEvent = intercept(event);
                        if (interceptedEvent != null) {
                                intercepted.add(interceptedEvent);
                        }
                }
                return intercepted;
        }

        public static class Builder implements Interceptor.Builder {

                private Pattern regex;
                private List<NameAndSerializer> serializerList;

                // begin added code

                private boolean extractorHeader;
                private String extractorHeaderKey;

                // end added code

                private final RegexExtractorInterceptorSerializer defaultSerializer = new RegexExtractorInterceptorPassThroughSerializer();

                @Override
                public void configure(Context context) {
                        String regexString = context.getString(REGEX);
                        Preconditions.checkArgument(!StringUtils.isEmpty(regexString),
                                        "Must supply a valid regex string");

                        regex = Pattern.compile(regexString);
                        regex.pattern();
                        regex.matcher("").groupCount();
                        configureSerializers(context);

                        // begin added code
                        extractorHeader = context.getBoolean(EXTRACTOR_HEADER,
                                        DEFAULT_EXTRACTOR_HEADER);

                        if (extractorHeader) {
                                extractorHeaderKey = context.getString(EXTRACTOR_HEADER_KEY);
                                Preconditions.checkArgument(
                                                !StringUtils.isEmpty(extractorHeaderKey),
                                                "Must specify the header key to extract from");
                        }
                        // end added code
                }

                private void configureSerializers(Context context) {
                        String serializerListStr = context.getString(SERIALIZERS);
                        Preconditions.checkArgument(
                                        !StringUtils.isEmpty(serializerListStr),
                                        "Must supply at least one name and serializer");

                        String[] serializerNames = serializerListStr.split("\\s+");

                        Context serializerContexts = new Context(
                                        context.getSubProperties(SERIALIZERS + "."));

                        serializerList = Lists
                                        .newArrayListWithCapacity(serializerNames.length);
                        for (String serializerName : serializerNames) {
                                Context serializerContext = new Context(
                                                serializerContexts.getSubProperties(serializerName
                                                                + "."));
                                String type = serializerContext.getString("type", "DEFAULT");
                                String name = serializerContext.getString("name");
                                Preconditions.checkArgument(!StringUtils.isEmpty(name),
                                                "Supplied name cannot be empty.");

                                if ("DEFAULT".equals(type)) {
                                        serializerList.add(new NameAndSerializer(name,
                                                        defaultSerializer));
                                } else {
                                        serializerList.add(new NameAndSerializer(name,
                                                        getCustomSerializer(type, serializerContext)));
                                }
                        }
                }

                private RegexExtractorInterceptorSerializer getCustomSerializer(
                                String clazzName, Context context) {
                        try {
                                RegexExtractorInterceptorSerializer serializer = (RegexExtractorInterceptorSerializer) Class
                                                .forName(clazzName).newInstance();
                                serializer.configure(context);
                                return serializer;
                        } catch (Exception e) {
                                logger.error("Could not instantiate event serializer.", e);
                                Throwables.propagate(e);
                        }
                        return defaultSerializer;
                }

                @Override
                public Interceptor build() {
                        Preconditions.checkArgument(regex != null,
                                        "Regex pattern was misconfigured");
                        Preconditions.checkArgument(serializerList.size() > 0,
                                        "Must supply a valid group match id list");
                        return new RegexExtractorExtInterceptor(regex, serializerList,
                                        extractorHeader, extractorHeaderKey);
                }
        }

        static class NameAndSerializer {
                private final String headerName;
                private final RegexExtractorInterceptorSerializer serializer;

                public NameAndSerializer(String headerName,
                                RegexExtractorInterceptorSerializer serializer) {
                        this.headerName = headerName;
                        this.serializer = serializer;
                }
        }
}

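A brief summary of the changes:

Two configuration parameters were added:
extractorHeader: whether to extract from a header. It defaults to false, in which case the interceptor behaves like the original one and extracts from the event body.
extractorHeaderKey: the header key whose value is extracted. It is required when extractorHeader is true.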
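The effect of the two parameters can be sketched without any Flume classes. In the snippet below, a Map and a byte array stand in for the event's headers and body; the class name ExtractionInputDemo and the helper selectInput are hypothetical and only illustrate which string the regex ends up being applied to.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone demo; not part of Flume.
public class ExtractionInputDemo {

    // Returns the text the regex should run against.
    // Returns null when header extraction is requested but the header is absent,
    // so a caller has to check for null before matching.
    static String selectInput(Map<String, String> headers, byte[] body,
                              boolean extractorHeader, String extractorHeaderKey) {
        if (extractorHeader) {
            return headers.get(extractorHeaderKey);
        }
        return new String(body, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        headers.put("basename", "a.log.2014-07-31");
        byte[] body = "some log line".getBytes(StandardCharsets.UTF_8);

        System.out.println(selectInput(headers, body, true, "basename")); // header value
        System.out.println(selectInput(headers, body, false, null));      // body text
        System.out.println(selectInput(headers, body, true, "missing"));  // null
    }
}
```

With extractorHeader left at its default of false, the header key is never consulted, which is why it only needs to be validated when extractorHeader is true.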

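Following the approach from part 8, we package the class into a jar and place it, as a Flume plugin, under /var/lib/flume-ng/plugins.d/RegexExtractorExtInterceptor/lib, then restart Flume so that the interceptor is loaded onto the classpath.

The final flume.conf is as follows: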

tier1.sources=source1
tier1.channels=channel1
tier1.sinks=sink1
tier1.sources.source1.type=spooldir
tier1.sources.source1.spoolDir=/opt/logs
tier1.sources.source1.fileHeader=true
tier1.sources.source1.basenameHeader=true
tier1.sources.source1.interceptors=i1
tier1.sources.source1.interceptors.i1.type=com.besttone.flume.RegexExtractorExtInterceptor$Builder
tier1.sources.source1.interceptors.i1.regex=(.*)\\.(.*)\\.(.*)
tier1.sources.source1.interceptors.i1.extractorHeader=true
tier1.sources.source1.interceptors.i1.extractorHeaderKey=basename
tier1.sources.source1.interceptors.i1.serializers=s1 s2 s3
tier1.sources.source1.interceptors.i1.serializers.s1.name=one
tier1.sources.source1.interceptors.i1.serializers.s2.name=two
tier1.sources.source1.interceptors.i1.serializers.s3.name=three
tier1.sources.source1.channels=channel1
tier1.sinks.sink1.type=hdfs
tier1.sinks.sink1.channel=channel1
tier1.sinks.sink1.hdfs.path=hdfs://master68:8020/flume/events/%{one}/%{three}
tier1.sinks.sink1.hdfs.round=true
tier1.sinks.sink1.hdfs.roundValue=10
tier1.sinks.sink1.hdfs.roundUnit=minute
tier1.sinks.sink1.hdfs.fileType=DataStream
tier1.sinks.sink1.hdfs.writeFormat=Text
tier1.sinks.sink1.hdfs.rollInterval=0
tier1.sinks.sink1.hdfs.rollSize=10240
tier1.sinks.sink1.hdfs.rollCount=0
tier1.sinks.sink1.hdfs.idleTimeout=60
tier1.channels.channel1.type=memory
tier1.channels.channel1.capacity=10000
tier1.channels.channel1.transactionCapacity=1000
tier1.channels.channel1.keep-alive=30

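I switched the source type back to the built-in spooldir instead of the custom source from the previous part, and added an interceptor i1 whose type is the custom interceptor com.besttone.flume.RegexExtractorExtInterceptor$Builder. The regular expression splits the file name on "." into three parts and stores them in the headers under the keys one, two and three. For example, for the file a.log.2014-07-31 the interceptor adds three headers: one=a, two=log, three=2014-07-31.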

With the sink path set to tier1.sinks.sink1.hdfs.path=hdfs://master68:8020/flume/events/%{one}/%{three}, this meets exactly the same requirement as in part 8.
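As a rough illustration of how the file name ends up in the HDFS path, the sketch below applies the same regular expression to a basename and then substitutes the resulting headers into the path template. The %{key} replacement here is a simplified stand-in written for this example, not Flume's actual escaping code, and the class and method names (HeaderPathDemo, splitBasename, expand) are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone demo; not part of Flume.
public class HeaderPathDemo {

    // Same pattern as in flume.conf, after properties-file unescaping.
    private static final Pattern BASENAME = Pattern.compile("(.*)\\.(.*)\\.(.*)");
    // Matches %{key} placeholders in a path template.
    private static final Pattern PLACEHOLDER = Pattern.compile("%\\{([^}]+)\\}");

    // Splits a file name into the headers one, two and three.
    static Map<String, String> splitBasename(String basename) {
        Map<String, String> headers = new LinkedHashMap<>();
        Matcher m = BASENAME.matcher(basename);
        if (m.find()) {
            headers.put("one", m.group(1));
            headers.put("two", m.group(2));
            headers.put("three", m.group(3));
        }
        return headers;
    }

    // Replaces each %{key} with the header value, or an empty string if absent.
    static String expand(String template, Map<String, String> headers) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = headers.getOrDefault(m.group(1), "");
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> headers = splitBasename("a.log.2014-07-31");
        System.out.println(expand(
                "hdfs://master68:8020/flume/events/%{one}/%{three}", headers));
        // prints hdfs://master68:8020/flume/events/a/2014-07-31
    }
}
```

Because each (.*) is greedy, a file name containing more than two dots would be split at its last two dots, so the pattern assumes names of the form name.ext.date.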

As you can see, a custom interceptor costs far less to write than a custom source: we added a single class and the feature was done.

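Learning Flume (5): Writing log4j log data to HDFS with Flume
Learning Flume (6): Analyzing Flume-collected log data with Hive
Learning Flume (7), (8): Using key values from the event header, and a custom source
Learning Flume (9): Custom Interceptors
Learning Flume (10): Using the Morphline Interceptor
Learning Flume (11): Using the Spooling Directory Source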

上公子 · posted 2015-3-23 00:33:50

Thanks for sharing, I learned a lot!

congra321 · posted 2016-3-15 12:24:11

Thanks for sharing, I learned a lot!
