false sharing 研究
2015-01-07
引子
我们还是从实验开始,看下面这两端段程序:
public final class FalseSharing implements Runnable {
	public static int NUM_THREADS = 4; // change
	public final static long ITERATIONS = 500L * 1000L * 1000L;
	private final int arrayIndex;
	private static VolatileLong[] longs;
	public FalseSharing(final int arrayIndex) {
		this.arrayIndex = arrayIndex;
	}
	public static void main(final String[] args) throws Exception {
		Thread.sleep(10000);
		System.out.println("starting....");
		if (args.length == 1) {
			NUM_THREADS = Integer.parseInt(args[0]);
		}
		longs = new VolatileLong[NUM_THREADS];
		for (int i = 0; i < longs.length; i++) {
			longs[i] = new VolatileLong();
		}
		final long start = System.nanoTime();
		runTest();
		System.out.println("duration = " + (System.nanoTime() - start));
	}
	private static void runTest() throws InterruptedException {
		Thread[] threads = new Thread[NUM_THREADS];
		for (int i = 0; i < threads.length; i++) {
			threads[i] = new Thread(new FalseSharing(i));
		}
		for (Thread t : threads) {
			t.start();
		}
		for (Thread t : threads) {
			t.join();
		}
	}
	public void run() {
		long i = ITERATIONS + 1;
		while (0 != --i) {
			longs[arrayIndex].value = i;
		}
	}
	public final static class VolatileLong {
		public volatile long value = 0L;
//		public long p1, p2, p3, p4, p5, p6; //
//
//				public long sum() {
//					return p1 + p2 + p3 + p4 + p5 + p6;
//				}
	}
}
public final class NoFalseSharing implements Runnable {
	public static int NUM_THREADS = 4; // change
	public final static long ITERATIONS = 500L * 1000L * 1000L;
	private final int arrayIndex;
	private static VolatileLong[] longs;
	public NoFalseSharing(final int arrayIndex) {
		this.arrayIndex = arrayIndex;
	}
	public static void main(final String[] args) throws Exception {
		Thread.sleep(10000);
		System.out.println("starting....");
		if (args.length == 1) {
			NUM_THREADS = Integer.parseInt(args[0]);
		}
		longs = new VolatileLong[NUM_THREADS];
		for (int i = 0; i < longs.length; i++) {
			longs[i] = new VolatileLong();
		}
		final long start = System.nanoTime();
		runTest();
		System.out.println("duration = " + (System.nanoTime() - start));
	}
	private static void runTest() throws InterruptedException {
		Thread[] threads = new Thread[NUM_THREADS];
		for (int i = 0; i < threads.length; i++) {
			threads[i] = new Thread(new NoFalseSharing(i));
		}
		for (Thread t : threads) {
			t.start();
		}
		for (Thread t : threads) {
			t.join();
		}
	}
	public void run() {
		long i = ITERATIONS + 1;
		while (0 != --i) {
			longs[arrayIndex].value = i;
		}
	}
	public final static class VolatileLong {
		public volatile long value = 0L;
		public long p1, p2, p3, p4, p5, p6;
		public long sum() {
			return p1 + p2 + p3 + p4 + p5 + p6;
		}
	}
}
这两段程序只是 NoFalseSharing 填充了 7 个 long 型的数据,其他没有什么不同。 我们来看看这俩段程序运行的时间
[root@centos101 hushi]# time java FalseSharing
starting....
duration = 34082260485
real	0m44.266s
user	2m12.387s
sys		0m0.052s
[root@centos101 hushi]# time java NoFalseSharing
starting....
duration = 6296462316
real	0m16.400s
user	0m25.026s
sys		0m0.039s
[root@centos101 hushi]#
很明显 NoFalseSharing 的运行时间比 FalseSharing 小了将近 6 倍。why?
问题分析
从上一篇文章我们知道现代多核 CPU 是存在多级 cache 的,那么存在两个问题
- cpu 如何组织和管理这些缓存的?
- cpu 是如何确保这么多 cache 中数据的一致性的?
对于第一个问题,我简单的描述下,cpu 对于 cache 的管理不可能是一个 byte 一个 byte 的管理的,因为这样效率就太低了。 cpu 将多个 byte 作为一个单员来管理,这个单员就叫做 cacheline,我们可以在 linux 下通过如下命令来查看一个 cacheline 的大小
[root@centos101 ~]# cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
64
[root@centos101 ~]#
可以看到一个 cacheline 的大小是 64 个字节。
对于第二个问题,CPU 各个核之间的通过一致性协议来访问 cache 中的数据,其中 MESI 协议是使用的最多的一种,如图:

M,E,S和I代表使用MESI协议时缓存行所处的四个状态:
- M(修改, Modified):本地处理器已经修改缓存行,即是脏行,它的内容与内存中的内容不一样. 并且此cache只有本地一个拷贝(专有)。
- E(专有, Exclusive):缓存行内容和内存中的一样,而且其它处理器都没有这行数据 。
- S(共享, Shared):缓存行内容和内存中的一样,有可能其它处理器也存在此缓存行的拷贝
- I(无效, Invalid):缓存行失效,不能使用 。
通过上面的介绍我们知道了 CPU 内部是如何组织和管理 cache 的,一个 cacheline 有 64 字节之多, 那么当有两个线程都修改了一个 cacheline 中的两个不同的数据,根据 MESI 一致性协议,这个 cacheline 应该是失效的,应该和主存同步数据,这个如图所示:

那么这造成 L1 级 cache 不断的失效。
验证问题
由于 NoFalseSharing 填充了 7 个 long 型数据,正好能保证 value 属性在一个 cacheline中,效果如图所示

所以猜测应该是 NoFalseSharing L1 级 cache 的 miss 事件应该更少。既然是 L1 级 cache 会失效,那么我们来看看实验的结果:
[root@centos101 hushi]# perf stat -e L1-dcache-load-misses java FalseSharing
starting....
duration = 35808081578
 Performance counter stats for 'java FalseSharing':
       158,654,180 L1-dcache-misses
      45.920416219 seconds time elapsed
[root@centos101 hushi]# perf stat -e L1-dcache-load-misses java NoFalseSharing
starting....
duration = 6262425464
 Performance counter stats for 'java NoFalseSharing':
         3,403,174 L1-dcache-misses
      16.375648903 seconds time elapsed
[root@centos101 hushi]#
从上面的实验的数据中可以看到,确实是 NoFalseSharing 更少,大概相差 5 倍左右,这根实验的运行时间相差大概一致。
改进
我们知道了 false sharing 的产生的原因,由于 java 语言的特殊性,java 对象在内存中的布局,对象是要占一定的内存的,上面 NoFalseSharing 的填充方法有待改进。
import java.lang.reflect.Field;
import java.security.AccessController;
import java.security.PrivilegedExceptionAction;
import sun.misc.Unsafe;
public final class SuperNoFalseSharing implements Runnable {
	public static int NUM_THREADS = 4; // change
	public final static long ITERATIONS = 500L * 1000L * 1000L;
	private final int arrayIndex;
	private static VolatileLong[] longs;
	public SuperNoFalseSharing(final int arrayIndex) {
		this.arrayIndex = arrayIndex;
	}
	public static void main(final String[] args) throws Exception {
		Thread.sleep(10000);
		System.out.println("starting....");
		if (args.length == 1) {
			NUM_THREADS = Integer.parseInt(args[0]);
		}
		longs = new VolatileLong[NUM_THREADS];
		for (int i = 0; i < longs.length; i++) {
			longs[i] = new VolatileLong();
		}
		final long start = System.nanoTime();
		runTest();
		System.out.println("duration = " + (System.nanoTime() - start));
	}
	private static void runTest() throws InterruptedException {
		Thread[] threads = new Thread[NUM_THREADS];
		for (int i = 0; i < threads.length; i++) {
			threads[i] = new Thread(new SuperNoFalseSharing(i));
		}
		for (Thread t : threads) {
			t.start();
		}
		for (Thread t : threads) {
			t.join();
		}
	}
	public void run() {
		long i = ITERATIONS + 1;
		while (0 != --i) {
			longs[arrayIndex].set(i);
		}
	}
	public final static class VolatileLong {
		static final long INITIAL_VALUE = -1L;
	    private static final Unsafe UNSAFE;
	    private static final long VALUE_OFFSET;
	    static
	    {
	        UNSAFE = getUnsafe();
	        final int base = UNSAFE.arrayBaseOffset(long[].class);
	        final int scale = UNSAFE.arrayIndexScale(long[].class);
	        VALUE_OFFSET = base + (scale * 7);
	    }
	    private final long[] paddedValue = new long[15];
	    /**
	     * Create a sequence initialised to -1.
	     */
	    public VolatileLong()
	    {
	        this(INITIAL_VALUE);
	    }
	    private static Unsafe getUnsafe() {
	    	final Unsafe THE_UNSAFE;
	        {
	            try
	            {
	                final PrivilegedExceptionAction<Unsafe> action = new PrivilegedExceptionAction<Unsafe>()
	                {
	                    public Unsafe run() throws Exception
	                    {
	                        Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
	                        theUnsafe.setAccessible(true);
	                        return (Unsafe) theUnsafe.get(null);
	                    }
	                };
	                THE_UNSAFE = AccessController.doPrivileged(action);
	            }
	            catch (Exception e)
	            {
	                throw new RuntimeException("Unable to load unsafe", e);
	            }
	        }
			return THE_UNSAFE;
		}
		/**
	     * Create a sequence with a specified initial value.
	     *
	     * @param initialValue The initial value for this sequence.
	     */
	    public VolatileLong(final long initialValue)
	    {
	        UNSAFE.putOrderedLong(paddedValue, VALUE_OFFSET, initialValue);
	    }
	    /**
	     * Perform a volatile read of this sequence's value.
	     *
	     * @return The current value of the sequence.
	     */
	    public long get()
	    {
	        return UNSAFE.getLongVolatile(paddedValue, VALUE_OFFSET);
	    }
	    /**
	     * Perform an ordered write of this sequence.  The intent is
	     * a Store/Store barrier between this write and any previous
	     * store.
	     *
	     * @param value The new value for the sequence.
	     */
	    public void set(final long value)
	    {
	        UNSAFE.putOrderedLong(paddedValue, VALUE_OFFSET, value);
	    }
	    /**
	     * Performs a volatile write of this sequence.  The intent is
	     * a Store/Store barrier between this write and any previous
	     * write and a Store/Load barrier between this write and any
	     * subsequent volatile read.
	     *
	     * @param value The new value for the sequence.
	     */
	    public void setVolatile(final long value)
	    {
	        UNSAFE.putLongVolatile(paddedValue, VALUE_OFFSET, value);
	    }
	    /**
	     * Perform a compare and set operation on the sequence.
	     *
	     * @param expectedValue The expected current value.
	     * @param newValue The value to update to.
	     * @return true if the operation succeeds, false otherwise.
	     */
	    public boolean compareAndSet(final long expectedValue, final long newValue)
	    {
	        return UNSAFE.compareAndSwapLong(paddedValue, VALUE_OFFSET, expectedValue, newValue);
	    }
	    /**
	     * Atomically increment the sequence by one.
	     *
	     * @return The value after the increment
	     */
	    public long incrementAndGet() {
	        return addAndGet(1L);
	    }
	    /**
	     * Atomically add the supplied value.
	     *
	     * @param increment The value to add to the sequence.
	     * @return The value after the increment.
	     */
	    public long addAndGet(final long increment) {
	        long currentValue;
	        long newValue;
	        do
	        {
	            currentValue = get();
	            newValue = currentValue + increment;
	        }
	        while (!compareAndSet(currentValue, newValue));
	        return newValue;
	    }
	    public String toString() {
	        return Long.toString(get());
	    }
	}
}
将 value 存放在 long 型数组中,左右两边各有 7 个元素,这样保证一定加载到一个 cacheline 中。 这个技巧就是 disruptor 无锁队列使用的技巧。
