Real-time log synchronization with Flume's exec source

Last edited by ximenchuixuesun on 2015-7-29 17:47

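A question: if I use Flume's exec source to monitor a log file in real time, it seems I can only monitor a single file. My logs are generated by Tomcat, which creates a new log file every day. How can I do real-time monitoring with exec in that case? Do I have to stop Tomcat from splitting the log daily and have it append to one large log file instead? How do people who use the exec source handle their log files?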

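Alkaloid0515 · Posted on 2015-7-29 17:59:46

If a new file is created every day, you can use a different source: spooldir.

There are two kinds of source for reading files directly:

ExecSource: runs a Linux command that continuously outputs the latest data, such as tail -F [file]. With this approach the file name must be fixed in advance. ExecSource can collect logs in real time, but if Flume is not running or the command fails, log data is not collected, so the completeness of the log data cannot be guaranteed.
SpoolSource (type spooldir): watches a configured directory for newly added files and reads the data out of them.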
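
For the spooldir option, a minimal configuration sketch might look like the following. The agent, source, and channel names (a1, r1, c1) and the directory path are placeholders, not values from this thread; spoolDir and fileSuffix are properties of the Spooling Directory Source. Files dropped into the directory should already be complete and must not be modified afterwards.

```properties
# Sketch only: a1, r1, c1 and the path below are placeholder names
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = spooldir
a1.sources.r1.channels = c1
# Directory that finished (already rotated) log files are moved into
a1.sources.r1.spoolDir = /var/log/tomcat/spool
# Suffix appended to a file once it has been fully ingested (the default value)
a1.sources.r1.fileSuffix = .COMPLETED
```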

A detailed introduction to the design of Apache Flume, a distributed log collection system
http://www.aboutyun.com/thread-7848-1-1.html

ximenchuixuesun · Posted on 2015-7-30 09:46:43

But with the exec source the file is fixed, so it will keep growing. How should that case be handled? Should the file be split periodically?

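Alkaloid0515 · Posted on 2015-7-30 11:05:46

ximenchuixuesun posted on 2015-7-30 09:46
But with the exec source the file is fixed, so it will keep growing. How should that case be handled? Should the file be split periodically ...

That is simply how the exec source is defined. If you don't want the file to grow without bound, use the directory-based source instead.

See the official documentation.

The relevant content is as follows: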
Exec Source

Exec source runs a given Unix command on start-up and expects that process to continuously produce data on standard out (stderr is simply discarded, unless property logStdErr is set to true). If the process exits for any reason, the source also exits and will produce no further data. This means configurations such as cat [named pipe] or tail -F [file] are going to produce the desired results whereas date will probably not: the former two commands produce streams of data whereas the latter produces a single event and exits.
Required properties are in bold.

Property Name	Default	Description
channels	–
type	–	The component type name, needs to be exec
command	–	The command to execute
shell	–	A shell invocation used to run the command. e.g. /bin/sh -c. Required only for commands relying on shell features like wildcards, back ticks, pipes etc.
restartThrottle	10000	Amount of time (in millis) to wait before attempting a restart
restart	false	Whether the executed cmd should be restarted if it dies
logStdErr	false	Whether the command’s stderr should be logged
batchSize	20	The max number of lines to read and send to the channel at a time
batchTimeout	3000	Amount of time (in milliseconds) to wait, if the buffer size was not reached, before data is pushed downstream
selector.type	replicating	replicating or multiplexing
selector.*		Depends on the selector.type value
interceptors	–	Space-separated list of interceptors
interceptors.*

Warning

The problem with ExecSource and other asynchronous sources is that the source can not guarantee that if there is a failure to put the event into the Channel the client knows about it. In such cases, the data will be lost. As a for instance, one of the most commonly requested features is the tail -F [file]-like use case where an application writes to a log file on disk and Flume tails the file, sending each line as an event. While this is possible, there’s an obvious problem; what happens if the channel fills up and Flume can’t send an event? Flume has no way of indicating to the application writing the log file that it needs to retain the log or that the event hasn’t been sent, for some reason. If this doesn’t make sense, you need only know this: Your application can never guarantee data has been received when using a unidirectional asynchronous interface such as ExecSource! As an extension of this warning - and to be completely clear - there is absolutely zero guarantee of event delivery when using this source. For stronger reliability guarantees, consider the Spooling Directory Source or direct integration with Flume via the SDK.

Note

You can use ExecSource to emulate TailSource from Flume 0.9x (flume og). Just use unix command tail -F /full/path/to/your/file. Parameter -F is better in this case than -f as it will also follow file rotation.

Example for agent named a1:
[mw_shl_code=bash,true]a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1[/mw_shl_code]

The ‘shell’ config is used to invoke the ‘command’ through a command shell (such as Bash or Powershell). The ‘command’ is passed as an argument to ‘shell’ for execution. This allows the ‘command’ to use features from the shell such as wildcards, back ticks, pipes, loops, conditionals etc. In the absence of the ‘shell’ config, the ‘command’ will be invoked directly. Common values for ‘shell’ : ‘/bin/sh -c’, ‘/bin/ksh -c’, ‘cmd /c’, ‘powershell -Command’, etc.
[mw_shl_code=bash,true]a1.sources.tailsource-1.type = exec
a1.sources.tailsource-1.shell = /bin/bash -c
a1.sources.tailsource-1.command = for i in /path/*.txt; do cat $i; done[/mw_shl_code]

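ximenchuixuesun · Posted on 2015-7-30 11:35:02

When I collect logs with spooldir, the files copied into the monitored directory are fairly large, and I get an error because a file is read while it is still being written. Is there a solution for this?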

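zxmit · Posted on 2015-8-14 00:15:45

By default, files whose names end in .COMPLETED are not read. Put the file into the monitored directory under a name with that suffix, and remove the suffix once the file is fully in place.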

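zxmit · Posted on 2015-8-14 00:18:29

The exec source is not limited to monitoring a single file; you can also point it at an executable shell script. Take another careful look at the official documentation.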
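
As one possible reading of that suggestion, a script could follow whichever daily file is current and switch when the date changes. This is a rough sketch, not something from the Flume docs: the function names, the 30-second poll interval, and the catalina.YYYY-MM-DD.log naming are assumptions, and lines written around midnight or while Flume is down can still be lost or re-read, as with any exec source.

```shell
# Sketch only. One way to wire it into Flume (paths are placeholders):
#   a1.sources.r1.type = exec
#   a1.sources.r1.shell = /bin/bash -c
#   a1.sources.r1.command = source /opt/flume/follow-daily.sh && follow_daily /var/log/tomcat

# Build the path of the log file for a given day (assumed naming scheme).
log_path_for() {
    printf '%s/catalina.%s.log\n' "$1" "$2"
}

# Tail today's file; when the date changes, stop and tail the new day's file.
follow_daily() {
    dir="$1"
    from="0"    # on first start, emit only lines written from now on
    trap 'kill "$tail_pid" 2>/dev/null; exit 0' TERM INT
    while true; do
        today=$(date +%Y-%m-%d)
        tail -n "$from" -F "$(log_path_for "$dir" "$today")" &
        tail_pid=$!
        # Poll until the calendar day rolls over.
        while [ "$(date +%Y-%m-%d)" = "$today" ]; do
            sleep 30
        done
        kill "$tail_pid" 2>/dev/null
        from="+1"   # after a rollover, read the new file from its first line
    done
}
```

The date check runs inside the script rather than in the Flume command itself, because a date substituted into the command line would be evaluated only once, when the source starts.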

ximenchuixuesun · Posted on 2015-8-15 10:28:12

OK, thanks. I'll take a look.
