false sharing 研究

2015-01-07

引子

我们还是从实验开始，看下面这两端段程序：

public final class FalseSharing implements Runnable {
	public static int NUM_THREADS = 4; // change
	public final static long ITERATIONS = 500L * 1000L * 1000L;
	private final int arrayIndex;
	private static VolatileLong[] longs;

	public FalseSharing(final int arrayIndex) {
		this.arrayIndex = arrayIndex;
	}

	public static void main(final String[] args) throws Exception {
		Thread.sleep(10000);
		System.out.println("starting....");
		if (args.length == 1) {
			NUM_THREADS = Integer.parseInt(args[0]);
		}

		longs = new VolatileLong[NUM_THREADS];
		for (int i = 0; i < longs.length; i++) {
			longs[i] = new VolatileLong();
		}
		final long start = System.nanoTime();
		runTest();
		System.out.println("duration = " + (System.nanoTime() - start));
	}

	private static void runTest() throws InterruptedException {
		Thread[] threads = new Thread[NUM_THREADS];
		for (int i = 0; i < threads.length; i++) {
			threads[i] = new Thread(new FalseSharing(i));
		}
		for (Thread t : threads) {
			t.start();
		}
		for (Thread t : threads) {
			t.join();
		}
	}

	public void run() {
		long i = ITERATIONS + 1;
		while (0 != --i) {
			longs[arrayIndex].value = i;
		}
	}

	public final static class VolatileLong {
		public volatile long value = 0L;
//		public long p1, p2, p3, p4, p5, p6; //
//
//				public long sum() {
//					return p1 + p2 + p3 + p4 + p5 + p6;
//				}
	}
}

public final class NoFalseSharing implements Runnable {
	public static int NUM_THREADS = 4; // change
	public final static long ITERATIONS = 500L * 1000L * 1000L;
	private final int arrayIndex;
	private static VolatileLong[] longs;

	public NoFalseSharing(final int arrayIndex) {
		this.arrayIndex = arrayIndex;
	}

	public static void main(final String[] args) throws Exception {
		Thread.sleep(10000);
		System.out.println("starting....");
		if (args.length == 1) {
			NUM_THREADS = Integer.parseInt(args[0]);
		}

		longs = new VolatileLong[NUM_THREADS];
		for (int i = 0; i < longs.length; i++) {
			longs[i] = new VolatileLong();
		}
		final long start = System.nanoTime();
		runTest();
		System.out.println("duration = " + (System.nanoTime() - start));
	}

	private static void runTest() throws InterruptedException {
		Thread[] threads = new Thread[NUM_THREADS];
		for (int i = 0; i < threads.length; i++) {
			threads[i] = new Thread(new NoFalseSharing(i));
		}
		for (Thread t : threads) {
			t.start();
		}
		for (Thread t : threads) {
			t.join();
		}
	}

	public void run() {
		long i = ITERATIONS + 1;
		while (0 != --i) {
			longs[arrayIndex].value = i;
		}
	}

	public final static class VolatileLong {
		public volatile long value = 0L;
		public long p1, p2, p3, p4, p5, p6;
		public long sum() {
			return p1 + p2 + p3 + p4 + p5 + p6;
		}
	}
}

这两段程序只是 NoFalseSharing 填充了 7 个 long 型的数据，其他没有什么不同。我们来看看这俩段程序运行的时间

[root@centos101 hushi]# time java FalseSharing
starting....
duration = 34082260485

real	0m44.266s
user	2m12.387s
sys		0m0.052s
[root@centos101 hushi]# time java NoFalseSharing
starting....
duration = 6296462316

real	0m16.400s
user	0m25.026s
sys		0m0.039s
[root@centos101 hushi]#

很明显 NoFalseSharing 的运行时间比 FalseSharing 小了将近 6 倍。why？

问题分析

从上一篇文章我们知道现代多核 CPU 是存在多级 cache 的，那么存在两个问题

cpu 如何组织和管理这些缓存的？
cpu 是如何确保这么多 cache 中数据的一致性的？

对于第一个问题，我简单的描述下，cpu 对于 cache 的管理不可能是一个 byte 一个 byte 的管理的，因为这样效率就太低了。 cpu 将多个 byte 作为一个单员来管理，这个单员就叫做 cacheline，我们可以在 linux 下通过如下命令来查看一个 cacheline 的大小

[root@centos101 ~]# cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
64
[root@centos101 ~]#

可以看到一个 cacheline 的大小是 64 个字节。

对于第二个问题，CPU 各个核之间的通过一致性协议来访问 cache 中的数据，其中 MESI 协议是使用的最多的一种，如图：

M,E,S和I代表使用MESI协议时缓存行所处的四个状态：

M(修改, Modified)：本地处理器已经修改缓存行，即是脏行，它的内容与内存中的内容不一样. 并且此cache只有本地一个拷贝(专有)。
E(专有, Exclusive)：缓存行内容和内存中的一样，而且其它处理器都没有这行数据。
S(共享, Shared)：缓存行内容和内存中的一样,有可能其它处理器也存在此缓存行的拷贝
I(无效, Invalid)：缓存行失效，不能使用。

通过上面的介绍我们知道了 CPU 内部是如何组织和管理 cache 的，一个 cacheline 有 64 字节之多，那么当有两个线程都修改了一个 cacheline 中的两个不同的数据，根据 MESI 一致性协议，这个 cacheline 应该是失效的，应该和主存同步数据，这个如图所示：

那么这造成 L1 级 cache 不断的失效。

验证问题

由于 NoFalseSharing 填充了 7 个 long 型数据，正好能保证 value 属性在一个 cacheline中，效果如图所示

所以猜测应该是 NoFalseSharing L1 级 cache 的 miss 事件应该更少。既然是 L1 级 cache 会失效，那么我们来看看实验的结果：

[root@centos101 hushi]# perf stat -e L1-dcache-load-misses java FalseSharing
starting....
duration = 35808081578

 Performance counter stats for 'java FalseSharing':

       158,654,180 L1-dcache-misses

      45.920416219 seconds time elapsed

[root@centos101 hushi]# perf stat -e L1-dcache-load-misses java NoFalseSharing
starting....
duration = 6262425464

 Performance counter stats for 'java NoFalseSharing':

         3,403,174 L1-dcache-misses

      16.375648903 seconds time elapsed

[root@centos101 hushi]#

从上面的实验的数据中可以看到，确实是 NoFalseSharing 更少，大概相差 5 倍左右，这根实验的运行时间相差大概一致。

改进

我们知道了 false sharing 的产生的原因，由于 java 语言的特殊性，java 对象在内存中的布局，对象是要占一定的内存的，上面 NoFalseSharing 的填充方法有待改进。

import java.lang.reflect.Field;
import java.security.AccessController;
import java.security.PrivilegedExceptionAction;

import sun.misc.Unsafe;

public final class SuperNoFalseSharing implements Runnable {
	public static int NUM_THREADS = 4; // change
	public final static long ITERATIONS = 500L * 1000L * 1000L;
	private final int arrayIndex;
	private static VolatileLong[] longs;

	public SuperNoFalseSharing(final int arrayIndex) {
		this.arrayIndex = arrayIndex;
	}

	public static void main(final String[] args) throws Exception {
		Thread.sleep(10000);
		System.out.println("starting....");
		if (args.length == 1) {
			NUM_THREADS = Integer.parseInt(args[0]);
		}

		longs = new VolatileLong[NUM_THREADS];
		for (int i = 0; i < longs.length; i++) {
			longs[i] = new VolatileLong();
		}
		final long start = System.nanoTime();
		runTest();
		System.out.println("duration = " + (System.nanoTime() - start));
	}

	private static void runTest() throws InterruptedException {
		Thread[] threads = new Thread[NUM_THREADS];
		for (int i = 0; i < threads.length; i++) {
			threads[i] = new Thread(new SuperNoFalseSharing(i));
		}
		for (Thread t : threads) {
			t.start();
		}
		for (Thread t : threads) {
			t.join();
		}
	}

	public void run() {
		long i = ITERATIONS + 1;
		while (0 != --i) {
			longs[arrayIndex].set(i);
		}
	}

	public final static class VolatileLong {

		static final long INITIAL_VALUE = -1L;
	    private static final Unsafe UNSAFE;
	    private static final long VALUE_OFFSET;

	    static
	    {
	        UNSAFE = getUnsafe();
	        final int base = UNSAFE.arrayBaseOffset(long[].class);
	        final int scale = UNSAFE.arrayIndexScale(long[].class);
	        VALUE_OFFSET = base + (scale * 7);
	    }

	    private final long[] paddedValue = new long[15];

	    /**
	     * Create a sequence initialised to -1.
	     */
	    public VolatileLong()
	    {
	        this(INITIAL_VALUE);
	    }

	    private static Unsafe getUnsafe() {

	    	final Unsafe THE_UNSAFE;

	        {
	            try
	            {
	                final PrivilegedExceptionAction<Unsafe> action = new PrivilegedExceptionAction<Unsafe>()
	                {
	                    public Unsafe run() throws Exception
	                    {
	                        Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
	                        theUnsafe.setAccessible(true);
	                        return (Unsafe) theUnsafe.get(null);
	                    }
	                };

	                THE_UNSAFE = AccessController.doPrivileged(action);
	            }
	            catch (Exception e)
	            {
	                throw new RuntimeException("Unable to load unsafe", e);
	            }
	        }
			return THE_UNSAFE;

		}

		/**
	     * Create a sequence with a specified initial value.
	     *
	     * @param initialValue The initial value for this sequence.
	     */
	    public VolatileLong(final long initialValue)
	    {
	        UNSAFE.putOrderedLong(paddedValue, VALUE_OFFSET, initialValue);
	    }

	    /**
	     * Perform a volatile read of this sequence's value.
	     *
	     * @return The current value of the sequence.
	     */
	    public long get()
	    {
	        return UNSAFE.getLongVolatile(paddedValue, VALUE_OFFSET);
	    }

	    /**
	     * Perform an ordered write of this sequence.  The intent is
	     * a Store/Store barrier between this write and any previous
	     * store.
	     *
	     * @param value The new value for the sequence.
	     */
	    public void set(final long value)
	    {
	        UNSAFE.putOrderedLong(paddedValue, VALUE_OFFSET, value);
	    }

	    /**
	     * Performs a volatile write of this sequence.  The intent is
	     * a Store/Store barrier between this write and any previous
	     * write and a Store/Load barrier between this write and any
	     * subsequent volatile read.
	     *
	     * @param value The new value for the sequence.
	     */
	    public void setVolatile(final long value)
	    {
	        UNSAFE.putLongVolatile(paddedValue, VALUE_OFFSET, value);
	    }

	    /**
	     * Perform a compare and set operation on the sequence.
	     *
	     * @param expectedValue The expected current value.
	     * @param newValue The value to update to.
	     * @return true if the operation succeeds, false otherwise.
	     */
	    public boolean compareAndSet(final long expectedValue, final long newValue)
	    {
	        return UNSAFE.compareAndSwapLong(paddedValue, VALUE_OFFSET, expectedValue, newValue);
	    }

	    /**
	     * Atomically increment the sequence by one.
	     *
	     * @return The value after the increment
	     */
	    public long incrementAndGet() {
	        return addAndGet(1L);
	    }

	    /**
	     * Atomically add the supplied value.
	     *
	     * @param increment The value to add to the sequence.
	     * @return The value after the increment.
	     */
	    public long addAndGet(final long increment) {
	        long currentValue;
	        long newValue;

	        do
	        {
	            currentValue = get();
	            newValue = currentValue + increment;
	        }
	        while (!compareAndSet(currentValue, newValue));

	        return newValue;
	    }

	    public String toString() {
	        return Long.toString(get());
	    }
	}
}

将 value 存放在 long 型数组中，左右两边各有 7 个元素，这样保证一定加载到一个 cacheline 中。这个技巧就是 disruptor 无锁队列使用的技巧。

关于我

85 后程序员, 比较熟悉 Java，JVM，Golang 相关技术栈，关注 Liunx kernel，目前痴迷于分布式系统的设计和实践。研究包括但不限于 Docker Kubernetes eBPF 等相关技术。

2015 新年快乐

false sharing 研究

引子

问题分析

验证问题

改进

关于我

Blog

Code

Life

Archive