influxdata/telegraf

Logparser/tail input on Windows stops sendings logs due to file being locked

Open

#6539 opened on Oct 17, 2019

View on GitHub
 (8 comments) (0 reactions) (0 assignees)Go (9,892 stars) (4,161 forks)batch import
area/tailbughelp wantedplatform/windowssize/lupstream

Description

Relevant telegraf.conf:

[global_tags]

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  hostname = ""
  omit_hostname = false

[[outputs.influxdb]]
  urls = ["http://192.168.50.10:8086"]
  database = "telegraf"

[[inputs.logparser]]
  files = ["C:/Program Files/Apache/logs/All-AccessLog.*"]
  from_beginning = false

  [inputs.logparser.grok]
    patterns = ["%{COMBINED_LOG_FORMAT} %{RESPONSE_TIME}"]
    measurement = "apache_access_log"

System info:

  • OS: Windows 10 (version 1903)
  • Telegraf: telegraf-1.12.3_windows_amd64

Steps to reproduce:

  1. Start a process to repeatedly request pages from the Apache server to ensure log lines keep being written to the access logs (I used watch -n 10 wget -q -r -P wget-out https://<url>)
  2. Start telegraf:
telegraf.exe --console --config <path to above config>
  1. In influx CLI, select the telegraf database and keep executing and watching the results of this query:
select * from apache_access_log order by time desc limit 10;

Expected behavior:

New data points should keep appearing in apache_access_log, shortly after each logline appears in the Apache log.

Actual behavior:

Observe that immediately after starting, fresh data will appear in apache_access_log, but that will stop after a shorter or longer period of time (but usually within a few minutes max), even though new log lines are appearing in the Apache log files.

Additional info:

I think I've reproduced the cause of this issue using the following simple Go program that uses the same code for tailing the log files as telegraf does:

package main

import (
	"log"
	"os"
	"github.com/influxdata/tail"
)

func main() {
	t, err := tail.TailFile("c:\\Program Files\\Apache\\logs\\All-AccessLog.2019-10-17", tail.Config{
		Follow:   true,
		ReOpen:   true,
		Location: &tail.SeekInfo{Offset: -200, Whence: os.SEEK_END}, // start 200 bytes from the end
		Logger:   log.New(os.Stdout, "[TailLog] ", log.LstdFlags),
		Poll:     true,
	})
	if err != nil {
		log.Fatal(err)
	}

	for line := range t.Lines {
		log.Println(line.Text)
	}
}

After some experimenting (printf-debugging, really) in the influxdata/tail code, I noticed that the tail would sooner or later die, as a result of a "non-EOF error". The output of the above program was:

...
[lines from the file that is being tailed]
...
[TailLog] 2019/10/17 16:01:26 non-EOF error: %!(EXTRA *os.PathError=read c:\Program Files\Apache\logs\All-AccessLog.2019-10-17: The process cannot access the file because another process has locked a portion of the file.)

This would happen at this place in the tail.go file in the tailFileSync func:

...
		line, err := tail.readLine()

		// Process `line` even if err is EOF.
		if err == nil {
...
		} else if err == io.EOF {
...
		} else {
			tail.Logger.Printf("non-EOF error: ", err) // Line added by me
			// non-EOF error
			tail.Killf("Error reading %s: %s", tail.Filename, err)
			return
		}

So it appears that the fact that Apache is writing additional lines to the file at the same moment telegraf logparser wants to read new lines is causing this issue.

(Initially I wanted to report this at https://github.com/influxdata/tail but that repository doesn't accept issues.)

Contributor guide