...

Data Race Detector

Introduction

Data races are among the most common and hardest-to-debug types of bugs in concurrent systems. A data race occurs when two goroutines access the same variable concurrently and at least one of the accesses is a write. See The Go Memory Model for details.

Here is an example of a data race that can lead to crashes and memory corruption:

func main() {
	c := make(chan bool)
	m := make(map[string]string)
	go func() {
		m["1"] = "a" // First conflicting access.
		c <- true
	}()
	m["2"] = "b" // Second conflicting access.
	<-c
	for k, v := range m {
		fmt.Println(k, v)
	}
}
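
One way to remove the race in the example above is to order the two writes with the channel that the program already uses. The sketch below is only an illustration: the function name buildMap and the sorting of keys for stable output are assumptions added here, not part of the original example.

```go
package main

import (
	"fmt"
	"sort"
)

// buildMap (a hypothetical name) fills a map from two goroutines without
// racing: the main goroutine writes only after it has received from c,
// and the other goroutine sends on c only after its own write.
func buildMap() map[string]string {
	c := make(chan bool)
	m := make(map[string]string)
	go func() {
		m["1"] = "a" // Completed before the send below.
		c <- true
	}()
	<-c          // Wait until the goroutine's write has finished.
	m["2"] = "b" // No longer concurrent with the write above.
	return m
}

func main() {
	m := buildMap()
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys) // Map iteration order is unspecified, so sort first.
	for _, k := range keys {
		fmt.Println(k, m[k])
	}
}
```

This serializes the two writes, so it gives up the concurrency of the original. When the writes really must overlap, a mutex around the map accesses is the usual alternative.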

Usage

To help diagnose such bugs, Go includes a built-in data race detector. To use it, add the -race flag to the go command:

$ go test -race mypkg    # test the package
$ go run -race mysrc.go  # run the source file
$ go build -race mycmd   # build the command
$ go install -race mypkg # install the package

Report Format

When the race detector finds a data race in the program, it prints a report. The report contains stack traces for conflicting accesses, as well as stacks where the involved goroutines were created. Here is an example:

WARNING: DATA RACE
Read by goroutine 185:
  net.(*pollServer).AddFD()
      src/net/fd_unix.go:89 +0x398
  net.(*pollServer).WaitWrite()
      src/net/fd_unix.go:247 +0x45
  net.(*netFD).Write()
      src/net/fd_unix.go:540 +0x4d4
  net.(*conn).Write()
      src/net/net.go:129 +0x101
  net.func·060()
      src/net/timeout_test.go:603 +0xaf

Previous write by goroutine 184:
  net.setWriteDeadline()
      src/net/sockopt_posix.go:135 +0xdf
  net.setDeadline()
      src/net/sockopt_posix.go:144 +0x9c
  net.(*conn).SetDeadline()
      src/net/net.go:161 +0xe3
  net.func·061()
      src/net/timeout_test.go:616 +0x3ed

Goroutine 185 (running) created at:
  net.func·061()
      src/net/timeout_test.go:609 +0x288

Goroutine 184 (running) created at:
  net.TestProlongTimeout()
      src/net/timeout_test.go:618 +0x298
  testing.tRunner()
      src/testing/testing.go:301 +0xe8

Options

The GORACE environment variable sets race detector options. The format is:

GORACE="option1=val1 option2=val2"

The options are:

  • log_path (default stderr): The race detector writes its report to a file named log_path.pid. The special names stdout and stderr cause reports to be written to standard output and standard error, respectively.
  • exitcode (default 66): The exit status to use when exiting after a detected race.
  • strip_path_prefix (default ""): Strip this prefix from all reported file paths, to make reports more concise.
  • history_size (default 1): The per-goroutine memory access history is 32K * 2**history_size elements. Increasing this value can avoid a "failed to restore the stack" error in reports, at the cost of increased memory usage.
  • halt_on_error (default 0): Controls whether the program exits after reporting the first data race.

Example:

$ GORACE="log_path=/tmp/race/report strip_path_prefix=/my/go/sources/" go test -race

Excluding Tests

When you build with the -race flag, the go command defines an additional build tag, race. You can use this tag to exclude some code and tests when running under the race detector. Some examples:

// +build !race

package foo

// The test contains a data race. See issue 123.
func TestFoo(t *testing.T) {
	// ...
}

// The test fails under the race detector due to timeouts.
func TestBar(t *testing.T) {
	// ...
}

// The test takes too long under the race detector.
func TestBaz(t *testing.T) {
	// ...
}
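
Since Go 1.17 the same constraint can also be written in the //go:build form, which gofmt keeps in sync with the older // +build line. The sketch below assumes a common two-file pattern: this file is compiled only without -race, and a sibling file guarded by //go:build race (not shown) would define the same constant as true. The constant name raceEnabled is hypothetical.

```go
//go:build !race

package main

import "fmt"

// raceEnabled is a hypothetical constant name. A sibling file guarded by
// "//go:build race" would declare it as true, letting ordinary code and
// tests branch on whether the race detector is active.
const raceEnabled = false

func main() {
	fmt.Println("race detector enabled:", raceEnabled)
}
```

A test could then call t.Skip when raceEnabled is true, which skips a single test instead of excluding a whole file.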

How To Use

To start, run your tests using the race detector (go test -race). The race detector only finds races that happen at runtime, so it can't find races in code paths that are not executed. If your tests have incomplete coverage, you may find more races by running a binary built with -race under a realistic workload.

Typical Data Races

Here are some typical data races. All of them can be detected with the race detector.

Race on loop counter

func main() {
	var wg sync.WaitGroup
	wg.Add(5)
	for i := 0; i < 5; i++ {
		go func() {
			fmt.Println(i) // Not the 'i' you are looking for.
			wg.Done()
		}()
	}
	wg.Wait()
}

The variable i in the function literal is the same variable used by the loop, so the read in the goroutine races with the loop increment. (This program typically prints 55555, not 01234.) This describes the loop-variable semantics of Go releases before 1.22; from Go 1.22 on, each iteration gets its own i. The program can be fixed by making a copy of the variable:

func main() {
	var wg sync.WaitGroup
	wg.Add(5)
	for i := 0; i < 5; i++ {
		go func(j int) {
			fmt.Println(j) // Good. Read local copy of the loop counter.
			wg.Done()
		}(i)
	}
	wg.Wait()
}

Accidentally shared variable

// ParallelWrite writes data to file1 and file2, returns the errors.
func ParallelWrite(data []byte) chan error {
	res := make(chan error, 2)
	f1, err := os.Create("file1")
	if err != nil {
		res <- err
	} else {
		go func() {
			// This err is shared with the main goroutine,
			// so the write races with the write below.
			_, err = f1.Write(data)
			res <- err
			f1.Close()
		}()
	}
	f2, err := os.Create("file2") // The second conflicting write to err.
	if err != nil {
		res <- err
	} else {
		go func() {
			_, err = f2.Write(data)
			res <- err
			f2.Close()
		}()
	}
	return res
}

The fix is to introduce new variables in the goroutines (note the use of :=):

			...
			_, err := f1.Write(data)
			...
			_, err := f2.Write(data)
			...

Unprotected global variable

If the following code is called from several goroutines, it leads to races on the service map. Concurrent reads and writes of the same map are not safe:

var service map[string]net.Addr

func RegisterService(name string, addr net.Addr) {
	service[name] = addr
}

func LookupService(name string) net.Addr {
	return service[name]
}

To make the code safe, protect the accesses with a mutex:

var (
	service   map[string]net.Addr
	serviceMu sync.Mutex
)

func RegisterService(name string, addr net.Addr) {
	serviceMu.Lock()
	defer serviceMu.Unlock()
	service[name] = addr
}

func LookupService(name string) net.Addr {
	serviceMu.Lock()
	defer serviceMu.Unlock()
	return service[name]
}
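
If lookups greatly outnumber registrations, a sync.RWMutex is a possible variation that lets readers proceed in parallel. The following is a runnable sketch, not the original's method: the map is initialized with make here so that writes do not panic on a nil map, and the service names and addresses in main are made up for illustration.

```go
package main

import (
	"fmt"
	"net"
	"sync"
)

var (
	service   = make(map[string]net.Addr) // Initialized so writes do not hit a nil map.
	serviceMu sync.RWMutex
)

// RegisterService takes the write lock, excluding all other accesses.
func RegisterService(name string, addr net.Addr) {
	serviceMu.Lock()
	defer serviceMu.Unlock()
	service[name] = addr
}

// LookupService takes the read lock, so concurrent lookups do not block
// each other.
func LookupService(name string) net.Addr {
	serviceMu.RLock()
	defer serviceMu.RUnlock()
	return service[name]
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			name := fmt.Sprintf("svc%d", i) // Illustrative names only.
			RegisterService(name, &net.TCPAddr{IP: net.IPv4(127, 0, 0, 1), Port: 8000 + i})
			_ = LookupService(name)
		}(i)
	}
	wg.Wait()
	fmt.Println(LookupService("svc2"))
}
```

Whether the read/write lock is actually faster than a plain mutex depends on the workload, so it is worth measuring before switching.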

Primitive unprotected variable

Data races can happen on variables of primitive types as well (bool, int, int64, etc.), as in this example:

type Watchdog struct{ last int64 }

func (w *Watchdog) KeepAlive() {
	w.last = time.Now().UnixNano() // First conflicting access.
}

func (w *Watchdog) Start() {
	go func() {
		for {
			time.Sleep(time.Second)
			// Second conflicting access.
			if w.last < time.Now().Add(-10*time.Second).UnixNano() {
				fmt.Println("No keepalives for 10 seconds. Dying.")
				os.Exit(1)
			}
		}
	}()
}

Even such "innocent" data races can lead to hard-to-debug problems caused by non-atomicity of the memory accesses, interference with compiler optimizations, or reordering of memory accesses by the processor.

A typical fix for this race is to use a channel or a mutex. To preserve the lock-free behavior, one can also use the sync/atomic package.

type Watchdog struct{ last int64 }

func (w *Watchdog) KeepAlive() {
	atomic.StoreInt64(&w.last, time.Now().UnixNano())
}

func (w *Watchdog) Start() {
	go func() {
		for {
			time.Sleep(time.Second)
			if atomic.LoadInt64(&w.last) < time.Now().Add(-10*time.Second).UnixNano() {
				fmt.Println("No keepalives for 10 seconds. Dying.")
				os.Exit(1)
			}
		}
	}()
}
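
On Go 1.19 or later, the same idea can be expressed with the atomic.Int64 type, which removes the risk of accidentally reading or writing the field non-atomically. The sketch below is a variation on the code above, not part of the original: the Expired method is a hypothetical helper that stands in for the check inside Start, so the example can run without sleeping or exiting.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// Watchdog stores the time of the last keepalive in an atomic.Int64
// (available since Go 1.19), so every access goes through Load or Store.
type Watchdog struct{ last atomic.Int64 }

// KeepAlive records the current time atomically.
func (w *Watchdog) KeepAlive() {
	w.last.Store(time.Now().UnixNano())
}

// Expired reports whether no keepalive arrived within timeout.
// It is a hypothetical helper added for this sketch.
func (w *Watchdog) Expired(timeout time.Duration) bool {
	return w.last.Load() < time.Now().Add(-timeout).UnixNano()
}

func main() {
	w := &Watchdog{}
	w.KeepAlive()
	fmt.Println("expired:", w.Expired(10*time.Second))
}
```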

Supported Systems

The race detector runs on darwin/amd64, linux/amd64, and windows/amd64.

Runtime Overhead

The cost of race detection varies by program, but for a typical program, memory usage may increase by 5-10x and execution time by 2-20x.
