Category Archives: LAMP

PHP MySQL Linux Apache

[Repost] GC Study Notes

These are GC study notes written by a colleague at my company. They are quite detailed, progressing step by step from the basics, and easy to follow, so I am reposting them here.

1. GC characteristics and choosing a collector
1.1 Characteristics of garbage collectors
1.2 Serial vs. parallel
1.3 Concurrent vs. stop-the-world
1.4 Compacting vs. non-compacting vs. copying
2. GC performance metrics
3. Generational collection
4. GC in the HotSpot JVM of J2SE 5.0 – generations, GC types, fast allocation
5. GC in the HotSpot JVM of J2SE 5.0 – Serial GC
6. GC in the HotSpot JVM of J2SE 5.0 – Parallel GC
7. GC in the HotSpot JVM of J2SE 5.0 – Parallel Compacting GC
8. GC in the HotSpot JVM of J2SE 5.0 – CMS GC
9. Learning from real startup parameters

1. GC characteristics and choosing a collector

1.1 Characteristics of garbage collectors

Objects that should be reclaimed must be reclaimed, and objects that should not be reclaimed must never be reclaimed.
It must be effective, and fast: the application should be paused as little as possible.
It must balance three factors: time, space, and collection frequency.
Memory fragmentation is a concern (one way to resolve fragmentation is compaction).
Scalability (memory allocation and reclamation should not become the bottleneck for multithreaded applications running on multicore machines).

The trade-offs below guide the choice among collectors.

1.2 Serial vs. parallel

A serial collector uses only one core while collecting, even on a multicore machine.
A parallel collector, however, uses multiple cores: the GC work is split into subtasks that run in parallel on the available CPUs.
The benefit of parallel GC is shorter GC time; the drawbacks are added complexity and the possibility of memory fragmentation.

1.3 Concurrent vs. stop-the-world

While a stop-the-world collector is running, the entire application is paused.
A concurrent collector, by contrast, can run alongside the application without stopping the world.
Generally speaking, a concurrent collector does most of its work concurrently with the application, but a few small tasks still require brief stop-the-world pauses.
A stop-the-world collector is relatively simple, because the heap is frozen and objects are not changing. Its drawback is that it may not satisfy applications with tight latency requirements.
Correspondingly, a concurrent collector keeps its stop-the-world pauses very short, but it has extra work to do, because object references can change while it runs concurrently with the application.
As a result, concurrent GC costs more total time and needs a larger heap.

1.4 Compacting vs. non-compacting vs. copying

Once the GC has determined which objects in memory are live and which are garbage, it can compact memory, sliding the live objects together and clearing out the remaining space.
After compaction, allocating objects is much faster: a pointer can simply be bumped to the next free memory address.
A non-compacting GC reclaims unreferenced objects in place and does not compact memory. The benefit is faster collection; the drawback is fragmentation.
In general, allocating an object from fragmented memory costs far more than allocating from unfragmented memory.
Another option is a copying GC, which copies all live objects into a separate memory region.
The benefit of a copying GC is that the original region can then be wiped clean without further ado. The drawbacks are equally obvious: more memory is needed, plus extra time for the copying.

2. GC performance metrics

Several metrics for evaluating GC performance:
Throughput: the percentage of time the application spends not doing GC.
GC overhead: the opposite of throughput, the percentage of time the application spends doing GC.
Pause time: the time the application spends stopped during stop-the-world GC.
GC frequency: just what it says.
Footprint: a measure of resource size, such as the size of the heap.
Promptness: the time from when an object becomes garbage until that object is reclaimed.
For example, an application that runs for 100 seconds and spends 2 of them in GC has a throughput of 98% and a GC overhead of 2%.
An interactive application wants pause times as short as possible, whereas a non-interactive one naturally wants GC overhead as low as possible.
A real-time system wants both pause time and GC overhead as low as possible.
An embedded system, of course, wants the footprint as small as possible.

3. Generational collection

3.1 What are generations?

With generational collection, memory is divided into several generations: objects are placed in different generations according to how long they have lived.
The most widely used generations are the young generation and the old generation.
Each generation can then be matched with the GC algorithm whose characteristics suit it.
Research has found that:
most objects stop being referenced shortly after they are allocated; in other words, they die young;
only a few objects survive longer.
Young-generation GCs are quite frequent, efficient, and fast, because the young generation is usually small and most of its objects are no longer referenced.
Objects that survive young-generation GCs are promoted to the old generation, as shown in the figure:

[Figure: surviving objects being promoted from the young generation to the old generation]

The old generation is usually larger than the young generation and grows slowly, so GCs happen very rarely in the old generation, but take considerably longer when they do.

3.2 Summary

The young generation typically uses a time-optimized GC, because young-generation GCs are very frequent.
The old generation typically uses a GC that handles large spaces well, because the old generation is large and its GC frequency is low.

4. GC in the HotSpot JVM of J2SE 5.0 – generations, GC types, fast allocation

The HotSpot VM in J2SE 5.0 update 6 offers four garbage collectors.

4.1 Generations in HotSpot

The heap is divided into three parts: the young generation, the old generation, and the permanent generation.
Most objects are initially allocated in the young generation and enter the old generation after surviving a certain number of young GCs. Some large objects, however, may be allocated directly in the old generation to begin with (the young generation being comparatively small).

4.2 The young generation

The young generation is itself divided into one Eden space and two survivor spaces, as shown below:

[Figure: young generation layout — an Eden space plus two survivor spaces]

Most objects are allocated directly in the young generation's Eden space (as mentioned earlier, very large objects may be allocated directly in the old generation).
The survivor spaces hold objects that have survived at least one young GC; only objects in a survivor space are considered for promotion to the old generation.
At any given time, only one of the two survivor spaces is in use; the other stays empty and serves as the copy target during the next copying GC.

4.3 GC types

A GC of the young generation is called a young GC, sometimes also a minor GC. A GC of the old or permanent generation is called a full GC, also known as a major GC.
In a full GC, all generations are collected.
Generally the young generation is collected first (with the collector configured for it), and then the old and permanent generations are collected with the same collector. If compaction is required (to resolve fragmentation), each generation is compacted separately.
Sometimes the old generation is already so full that it cannot take the objects that would be promoted out of the survivor space; in that case the young GC is not triggered, and a full GC is run over the entire heap instead (except under CMS, since CMS cannot collect the young generation).

4.4 Fast allocation

When multiple threads create objects, memory allocation has to be thread-safe, which naively means taking a global lock. But a global lock is very expensive.
HotSpot therefore introduced Thread-Local Allocation Buffers (TLABs): each thread is given its own buffer from which it allocates its own objects.
Since each thread uses only its own TLAB, no global lock is needed. Only when a thread's TLAB is exhausted does it need the global lock, and by then the locking frequency is already quite low.
To limit the space TLABs consume, the allocator uses a number of tricks; on average, TLABs occupy less than 1% of Eden.

5. GC in the HotSpot JVM of J2SE 5.0 – Serial GC

5.1 The serial GC

The serial GC uses only a single CPU and stops the world while it runs.

5.1.1 Serial young GC

As shown below:

[Figure: serial young GC — live objects in Eden and the From survivor space are copied into the To survivor space]

When a young GC happens, live objects in Eden and the From survivor space are copied into the To survivor space.
Objects that do not fit into the To survivor space are promoted directly to the old generation.
When the young GC completes, memory looks like this:

[Figure: heap layout after the young GC completes]

5.1.2 Serial old-generation GC

The old and permanent generations use a mark-sweep-compact algorithm:
the mark phase marks the objects that are still referenced;
the sweep phase reclaims the garbage;
the compact phase slides the surviving objects together, resolving memory fragmentation.
As shown below:

[Figure: serial old-generation GC — mark-sweep-compact]

5.2 When to use the serial collector

The serial GC is suitable for client-side applications that have no strict pause-time requirements.

5.3 Selecting the serial collector

On J2SE 5.0, the JVM automatically selects the serial collector when not running in server mode.
It can also be selected explicitly by adding -XX:+UseSerialGC to the java startup parameters.

6. GC in the HotSpot JVM of J2SE 5.0 – Parallel GC

6.1 The parallel GC

Many Java applications now run on multicore machines.
The parallel GC, also called the throughput collector, puts all those CPUs to work instead of leaving them idle.

6.2 Parallel young GC

For young GCs it still uses a stop-the-world copying collector,
only no longer serially: it makes full use of multiple CPUs to reduce GC overhead and increase throughput.
The figure below compares serial and parallel young GC:

[Figure: serial young GC vs. parallel young GC]

6.3 Parallel old-generation GC

As with the serial GC, the old and permanent generations use mark-sweep-compact, but multiple cores are exploited to increase throughput and reduce GC overhead.

6.4 When to use the parallel GC

For applications that run on multicore machines and have no strict pause-time requirements, since infrequent but long full GC pauses will still occur.

6.5 Selecting the parallel GC

In server mode, the parallel GC is selected automatically.
It can also be selected explicitly by adding the parameter -XX:+UseParallelGC when starting the JVM.

7. GC in the HotSpot JVM of J2SE 5.0 – Parallel Compacting GC

7.1 The parallel compacting GC

The parallel compacting GC was introduced in J2SE 5.0 update 6.
It differs from the parallel GC in using a new algorithm for old-generation collection; eventually, the parallel compacting GC is expected to replace the parallel GC.

7.2 Parallel compacting young GC

The same algorithm as the parallel GC: stop-the-world plus copying.

7.3 Parallel compacting old-generation GC

It divides the old and permanent generations logically into regions of equal size.
Collection proceeds in three phases:
In the marking phase, multiple threads mark the live (referenced) objects in parallel.
In the summary phase, the GC examines the regions. After previous collections, regions toward the left end tend to have a far higher density of live objects than regions toward the right, so the GC works from left to right; once a region's live-object density reaches a threshold value, that region is treated as the boundary: the regions to its left are not compacted, while the regions to its right are.
In the compaction phase, multiple threads compact the regions that need it in parallel.

7.4 When to use the parallel compacting GC

Like the parallel GC, it suits multicore machines; in addition, its full GC pauses are shorter.
The number of parallel GC threads can be set with the parameter -XX:ParallelGCThreads=n.

7.5 Enabling the parallel compacting GC

Use the parameter -XX:+UseParallelOldGC.

8. GC in the HotSpot JVM of J2SE 5.0 – CMS GC

8.1 The concurrent mark-sweep GC

Many applications care more about response time than about throughput.
Young GCs do not pause for long, but full GCs take a long time, particularly when the old generation holds a lot of data.
HotSpot therefore introduced the CMS collector, also called the low-latency collector.

8.2 CMS young GC

The same approach as the parallel GC: stop-the-world plus copying.

8.3 CMS full GC

A CMS full GC runs mostly concurrently with the application!
CMS begins a full GC with a short pause called the initial mark, which identifies the initial set of live objects (those directly reachable from the roots).
The concurrent marking phase that follows runs together with the application.
Because concurrent marking runs while the program keeps mutating objects, it cannot guarantee that every live object has been marked by the time it finishes.
To solve this, the GC pauses again in a phase called the remark,
during which it re-marks the objects that were modified during concurrent marking.
Because the remark pause is longer than the initial mark, multiple threads are used to mark in parallel during it.
After these steps, every live object is guaranteed to be marked,
so the concurrent sweeping phase can then reclaim the garbage concurrently.
The figure below compares the serial GC with the CMS GC:

[Figure: serial GC vs. CMS GC pauses]

All the extra work, such as re-marking modified objects, increases what CMS has to do, so its GC overhead rises accordingly.
CMS is the only collector that does not compact, as shown below:

[Figure: CMS sweeps in place without compacting the old generation]

Skipping compaction saves time during the collection itself, but the resulting fragmentation slows down allocation of new objects in the old generation,
which in turn also affects young GC times (promotion into a fragmented old generation is slower).
Another drawback of CMS is that it needs a larger heap: during concurrent marking and concurrent sweeping, the application may well keep creating new objects while the garbage has not yet been reclaimed.
Also, during the concurrent marking that follows the initial mark, objects that were live may become garbage; they are not reclaimed in the current collection. Such garbage is called floating garbage.
Finally, the lack of compaction leads to memory fragmentation.
To cope with this, CMS keeps statistics on the sizes of frequently allocated objects, estimates future demand, and splits or coalesces free blocks to satisfy it.
Unlike the other collectors, CMS does not wait for the old generation to fill up before collecting; it has to start early, because extra memory is needed while the collection is in progress.
If memory runs out while the collection is running, the concurrent collection degenerates into a parallel or serial one.
To prevent this, CMS uses statistics from the previous collections to decide when to start.
A CMS collection is also started when old-generation occupancy exceeds the initiating occupancy.
The initiating occupancy can be set with the JVM startup parameter -XX:CMSInitiatingOccupancyFraction=n,
where n is a percentage, with a default value of 68.
In short, CMS delivers shorter pause times than the parallel GC at the cost of throughput and a larger heap.

8.4 Incremental mode

To keep the GC threads from hogging the CPU during concurrent marking, CMS can pause its concurrent marking work and yield the CPU back to the application.
In this incremental mode, the collector breaks the concurrent marking into small slices of time that are scheduled between young GCs.
This feature is very useful for applications that run on machines with few CPUs but still want low pause times.

8.5 When to use CMS

Choose CMS when CPU resources are relatively idle and very low pause times are needed, for example in web servers.

8.6 Selecting CMS

To select the CMS GC: add the parameter -XX:+UseConcMarkSweepGC.
To enable incremental mode: add the parameter -XX:+CMSIncrementalMode.

9. Learning from real startup parameters

The startup parameters in production:
-Dprogram.name=run.sh -Xmx2g -Xms2g -Xmn256m -XX:PermSize=128m -Xss256k -XX:+DisableExplicitGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+UseCMSCompactAtFullCollection -XX:LargePageSizeInBytes=128m -XX:+UseFastAccessorMethods -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 -Djava.awt.headless=true -Djava.net.preferIPv4Stack=true -Dcom.sun.management.config.file=/home/admin/web-deploy/conf/jmx/jmx_monitor_management.properties -Djboss.server.home.dir=/home/admin/web-deploy/jboss_server -Djboss.server.home.url=file\:/home/admin/web-deploy/jboss_server -Dapplication.codeset=GBK -Ddatabase.codeset=ISO-8859-1 -Ddatabase.logging=false -Djava.endorsed.dirs=/usr/alibaba/jboss/lib/endorsed

Of these:
-Xmx2g -Xms2g: the heap is 2 GB.
-Xmn256m: the young generation is 256 MB.
-Xss256k: the stack size of each thread. Since JDK 5.0 the default per-thread stack size is 1 MB; before that it was 256 KB. Adjust it to the memory the application's threads actually need: with the same amount of physical memory, a smaller value allows more threads, though the operating system still limits the number of threads per process; empirical values are around 3000-5000.
-XX:PermSize=128m: the permanent generation is 128 MB.
-XX:+DisableExplicitGC: disables explicit GC. If the program calls System.gc(), this parameter makes the JVM turn the call into a no-op.
-XX:+UseConcMarkSweepGC: use CMS.
-XX:+CMSParallelRemarkEnabled: perform the remark phase in parallel.
-XX:+UseCMSCompactAtFullCollection: compact after a full collection, because CMS does not compact by default.
-XX:+UseCMSInitiatingOccupancyOnly: start a CMS GC only when the occupancy threshold is reached.
-XX:CMSInitiatingOccupancyFraction=70: sets the threshold to 70% (the default is 68%).
-XX:+UseCompressedOops: compressed ordinary object pointers, a JVM optimization. A 64-bit JVM typically uses about 1.5 times the memory of a 32-bit one, because object pointers double in width under the wider 64-bit addressing. For applications being ported from 32-bit to 64-bit platforms, that extra memory cost is unwelcome. Fortunately, starting with JDK 1.6 update 14, 64-bit JVMs officially support -XX:+UseCompressedOops, a parameter that compresses pointers and saves memory.
For more detail on -XX:+UseCMSInitiatingOccupancyOnly and -XX:CMSInitiatingOccupancyFraction, see the explanation below:
The concurrent collection generally cannot be sped up but it can be started earlier.
A concurrent collection starts running when the percentage of allocated space in the old generation crosses a threshold. This threshold is calculated based on general experience with the concurrent collector. If full collections are occurring, the concurrent collections may need to be started earlier. The command line flag CMSInitiatingOccupancyFraction can be used to set the level at which the collection is started. Its default value is approximately 68%. The command line to adjust the value is -XX:CMSInitiatingOccupancyFraction=<nn>.
The concurrent collector also keeps statistics on the promotion rate into the old generation for the application and makes a prediction on when to start a concurrent collection based on that promotion rate and the available free space in the old generation. Whereas the use of CMSInitiatingOccupancyFraction must be conservative to avoid full collections over the life of the application, the start of a concurrent collection based on the anticipated promotion adapts to the changing requirements of the application. The statistics that are used to calculate the promotion rate are based on the recent concurrent collections. The promotion rate is not calculated until at least one concurrent collection has completed, so at least the first concurrent collection has to be initiated because the occupancy has reached CMSInitiatingOccupancyFraction. Setting CMSInitiatingOccupancyFraction to 100 would not cause only the anticipated promotion to be used to start a concurrent collection, but would rather cause only non-concurrent collections to occur, since a concurrent collection would not start until it was already too late. To eliminate the use of the anticipated promotions to start a concurrent collection, set UseCMSInitiatingOccupancyOnly to true:
-XX:+UseCMSInitiatingOccupancyOnly
For complete details on memory management, see this document: http://www.oracle.com/technetwork/java/javase/memorymanagement-whitepaper-150215.pdf

Original: http://blog.csdn.net/fenglibing/article/details/6321453

[Repost] The Go scheduler

One of the big features for Go 1.1 is the new scheduler, contributed by Dmitry Vyukov. The new scheduler has given a dramatic increase in performance for parallel Go programs and with nothing better to do, I figured I’d write something about it.

Most of what’s written in this blog post is already described in the original design doc. It’s a fairly comprehensive document, but pretty technical.

All you need to know about the new scheduler is in that design document but this post has pictures, so it’s clearly superior.

What does the Go runtime need with a scheduler?

But before we look at the new scheduler, we need to understand why it’s needed. Why create a userspace scheduler when the operating system can schedule threads for you?

The POSIX thread API is very much a logical extension to the existing Unix process model and as such, threads get a lot of the same controls as processes. Threads have their own signal mask, can be assigned CPU affinity, can be put into cgroups and can be queried for which resources they use. All these controls add overhead for features that are simply not needed for how Go programs use goroutines and they quickly add up when you have 100,000 threads in your program.

Another problem is that the OS can’t make informed scheduling decisions, based on the Go model. For example, the Go garbage collector requires that all threads are stopped when running a collection and that memory must be in a consistent state. This involves waiting for running threads to reach a point where we know that the memory is consistent.

When you have many threads scheduled out at random points, chances are that you’re going to have to wait for a lot of them to reach a consistent state. The Go scheduler can make the decision of only scheduling at points where it knows that memory is consistent. This means that when we stop for garbage collection, we only have to wait for the threads that are being actively run on a CPU core.

Our Cast of Characters

There are 3 usual models for threading. One is N:1 where several userspace threads are run on one OS thread. This has the advantage of being very quick to context switch but cannot take advantage of multi-core systems. Another is 1:1 where one thread of execution matches one OS thread. It takes advantage of all of the cores on the machine, but context switching is slow because it has to trap through the OS.

Go tries to get the best of both worlds by using an M:N scheduler. It schedules an arbitrary number of goroutines onto an arbitrary number of OS threads. You get quick context switches and you take advantage of all the cores in your system. The main disadvantage of this approach is the complexity it adds to the scheduler.

To accomplish the task of scheduling, the Go scheduler uses 3 main entities:

[Figure: our cast of characters — a triangle (M, OS thread), a circle (G, goroutine), and a rectangle (P, scheduling context)]

The triangle represents an OS thread. It’s the thread of execution managed by the OS and works pretty much like your standard POSIX thread. In the runtime code, it’s called M for machine.

The circle represents a goroutine. It includes the stack, the instruction pointer and other information important for scheduling goroutines, like any channel it might be blocked on. In the runtime code, it’s called a G.

The rectangle represents a context for scheduling. You can look at it as a localized version of the scheduler which runs Go code on a single thread. It’s the important part that lets us go from an N:1 scheduler to an M:N scheduler. In the runtime code, it’s called P for processor. More on this part in a bit.

[Figure: two threads (M), each holding a context (P) and running a goroutine (G), with runqueues of grayed-out goroutines waiting to run]

Here we see 2 threads (M), each holding a context (P), each running a goroutine (G). In order to run goroutines, a thread must hold a context.

The number of contexts is set on startup to the value of the GOMAXPROCS environment variable or through the runtime function GOMAXPROCS(). Normally this doesn’t change during execution of your program. The fact that the number of contexts is fixed means that only GOMAXPROCS threads are running Go code at any point. We can use that to tune the invocation of the Go process to the individual computer, such that a 4-core PC runs Go code on 4 threads.
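
As a quick, minimal sketch (this uses the real runtime.GOMAXPROCS API; the values printed depend on your machine and how the process was started):

package main

import (
    "fmt"
    "runtime"
)

func main() {
    // An argument below 1 queries the current number of contexts
    // without changing it.
    fmt.Println("contexts (P):", runtime.GOMAXPROCS(0))

    // Setting a new value returns the previous one, overriding the
    // GOMAXPROCS environment variable for this process.
    prev := runtime.GOMAXPROCS(4)
    fmt.Println("previous setting:", prev)
}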

The greyed out goroutines are not running, but ready to be scheduled. They’re arranged in lists called runqueues. Goroutines are added to the end of a runqueue whenever a goroutine executes a go statement. Once a context has run a goroutine until a scheduling point, it pops a goroutine off its runqueue, sets stack and instruction pointer and begins running the goroutine.

To bring down mutex contention, each context has its own local runqueue. A previous version of the Go scheduler only had a global runqueue with a mutex protecting it. Threads were often blocked waiting for the mutex to be unlocked. This got really bad when you had 32 core machines that you wanted to squeeze as much performance out of as possible.
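
To make the enqueueing concrete, here is a minimal sketch: each go statement creates a G that lands on a runqueue, and the scheduler pops Gs off to run them on the available contexts (the sync.WaitGroup is only there so main waits for the others):

package main

import (
    "fmt"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    for i := 0; i < 8; i++ {
        wg.Add(1)
        // Each go statement creates a G and adds it to a runqueue.
        go func(id int) {
            defer wg.Done()
            fmt.Println("running goroutine", id)
        }(i)
    }
    // Blocking here is a scheduling point: the main G steps aside
    // so the queued Gs can be run.
    wg.Wait()
}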

The scheduler keeps on scheduling in this steady state as long as all contexts have goroutines to run. However, there are a couple of scenarios that can change that.

Who you gonna (sys)call?

You might wonder now, why have contexts at all? Can’t we just put the runqueues on the threads and get rid of contexts? Not really. The reason we have contexts is so that we can hand them off to other threads if the running thread needs to block for some reason.

An example of when we need to block, is when we call into a syscall. Since a thread cannot both be executing code and be blocked on a syscall, we need to hand off the context so it can keep scheduling.

[Figure: a thread blocking in a syscall hands its context (P) and runqueue off to another thread]

Here we see a thread giving up its context so that another thread can run it. The scheduler makes sure there are enough threads to run all contexts. M1 in the illustration above might be created just for the purpose of handling this syscall or it could come from a thread cache. The syscalling thread will hold on to the goroutine that made the syscall since it’s technically still executing, albeit blocked in the OS.

When the syscall returns, the thread must try and get a context in order to run the returning goroutine. The normal mode of operation is to steal a context from one of the other threads. If it can’t steal one, it will put the goroutine on a global runqueue, put itself on the thread cache and go to sleep.

The global runqueue is a runqueue that contexts pull from when they run out of their local runqueue. Contexts also periodically check the global runqueue for goroutines. Otherwise the goroutines on the global runqueue could end up never running because of starvation.

This handling of syscalls is why Go programs run with multiple threads, even when GOMAXPROCS is 1. The runtime uses goroutines that call syscalls, leaving threads behind.
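
A rough way to observe this is the runtime’s "threadcreate" profile, which counts the OS threads the runtime has created. The sketch below is my own illustration, not from the original post, and the exact count depends on OS, timing, and Go version:

package main

import (
    "fmt"
    "io/ioutil"
    "runtime"
    "runtime/pprof"
    "sync"
)

func main() {
    runtime.GOMAXPROCS(1)
    var wg sync.WaitGroup
    for i := 0; i < 16; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            // A file read is a plain blocking syscall (unlike network I/O,
            // which goes through the integrated poller), so a thread that
            // blocks here may hand its context off to another thread.
            ioutil.ReadFile("/etc/hosts")
        }()
    }
    wg.Wait()
    // Often prints a number larger than GOMAXPROCS.
    fmt.Println("OS threads created:", pprof.Lookup("threadcreate").Count())
}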

Stealing work

Another way that the steady state of the system can change is when a context runs out of goroutines to schedule. This can happen if the amount of work on the contexts’ runqueues is unbalanced. This can cause a context to end up exhausting its runqueue while there is still work to be done in the system. To keep running Go code, a context can take goroutines out of the global runqueue but if there are no goroutines in it, it’ll have to get them from somewhere else.

[Figure: a context with an empty runqueue steals half of another context’s runqueue]

That somewhere is the other contexts. When a context runs out, it will try to steal about half of the runqueue from another context. This makes sure there is always work to do on each of the contexts, which in turn makes sure that all threads are working at their maximum capacity.

Where to go?

There are many more details to the scheduler, like cgo threads, the LockOSThread() function and integration with the network poller. These are outside the scope of this post, but still merit study. I might write about these later. There are certainly plenty of interesting constructions to be found in the Go runtime library.

By Daniel Morsing

Original: http://morsmachine.dk/go-scheduler

[Repost] How to write benchmarks in Go

This post continues a series on the testing package I started a few weeks back. You can read the previous article on writing table driven tests here. You can find the code mentioned below in the https://github.com/davecheney/fib repository.

Introduction

The Go testing package contains a benchmarking facility that can be used to examine the performance of your Go code. This post explains how to use the testing package to write a simple benchmark.

You should also review the introductory paragraphs of Profiling Go programs, specifically the section on configuring power management on your machine. For better or worse, modern CPUs rely heavily on active thermal management which can add noise to benchmark results.

Writing a benchmark

We’ll reuse the Fib function from the previous article.

func Fib(n int) int {
    if n < 2 {
        return n
    }
    return Fib(n-1) + Fib(n-2)
}

Benchmarks are placed inside _test.go files and follow the rules of their Test counterparts. In this first example we’re going to benchmark the speed of computing the 10th number in the Fibonacci series.

// from fib_test.go
func BenchmarkFib10(b *testing.B) {
    // run the Fib function b.N times
    for n := 0; n < b.N; n++ {
        Fib(10)
    }
}

Writing a benchmark is very similar to writing a test as they share the infrastructure from the testing package. Some of the key differences are

Benchmark functions start with Benchmark not Test.
Benchmark functions are run several times by the testing package. The value of b.N will increase each time until the benchmark runner is satisfied with the stability of the benchmark. This has some important ramifications which we’ll investigate later in this article.
Each benchmark must execute the code under test b.N times. The for loop in BenchmarkFib10 will be present in every benchmark function.

Running benchmarks

Now that we have a benchmark function defined in the tests for the fib package, we can invoke it with go test -bench=.

% go test -bench=.
PASS
BenchmarkFib10 5000000 509 ns/op
ok github.com/davecheney/fib 3.084s

Breaking down the text above, we pass the -bench flag to go test supplying a regular expression matching everything. You must pass a valid regex to -bench, just passing -bench is a syntax error. You can use this property to run a subset of benchmarks.

The first line of the result, PASS, comes from the testing portion of the test driver; asking go test to run your benchmarks does not disable the tests in the package. If you want to skip the tests, you can do so by passing a regex to the -run flag that will not match anything. I usually use

go test -run=XXX -bench=.

The second line is the average run time of the function under test for the final value of b.N iterations. In this case, my laptop can execute Fib(10) in 509 nanoseconds. If there were additional Benchmark functions that matched the -bench filter, they would be listed here.

Benchmarking various inputs

As the original Fib function is the classic recursive implementation, we’d expect it to exhibit exponential behavior as the input grows. We can explore this by rewriting our benchmark slightly using a pattern that is very common in the Go standard library.

func benchmarkFib(i int, b *testing.B) {
    for n := 0; n < b.N; n++ {
        Fib(i)
    }
}

func BenchmarkFib1(b *testing.B)  { benchmarkFib(1, b) }
func BenchmarkFib2(b *testing.B)  { benchmarkFib(2, b) }
func BenchmarkFib3(b *testing.B)  { benchmarkFib(3, b) }
func BenchmarkFib10(b *testing.B) { benchmarkFib(10, b) }
func BenchmarkFib20(b *testing.B) { benchmarkFib(20, b) }
func BenchmarkFib40(b *testing.B) { benchmarkFib(40, b) }

Making benchmarkFib private avoids the testing driver trying to invoke it directly, which would fail as its signature does not match func(*testing.B). Running this new set of benchmarks gives these results on my machine.

BenchmarkFib1 1000000000 2.84 ns/op
BenchmarkFib2 500000000 7.92 ns/op
BenchmarkFib3 100000000 13.0 ns/op
BenchmarkFib10 5000000 447 ns/op
BenchmarkFib20 50000 55668 ns/op
BenchmarkFib40 2 942888676 ns/op

Apart from confirming the exponential behavior of our simplistic Fib function, there are some other things to observe in this benchmark run.

Each benchmark is run for a minimum of 1 second by default. If the second has not elapsed when the Benchmark function returns, the value of b.N is increased in the sequence 1, 2, 5, 10, 20, 50, … and the function run again.
The final BenchmarkFib40 only ran two times, with an average of just under a second per run. As the testing package uses a simple average (total time to run the benchmark function divided by b.N), this result is statistically weak. You can increase the minimum benchmark time using the -benchtime flag to produce a more accurate result.

% go test -bench=Fib40 -benchtime=20s
PASS
BenchmarkFib40 50 944501481 ns/op

Traps for young players

Above I mentioned the for loop is crucial to the operation of the benchmark driver. Here are two examples of a faulty Fib benchmark.

func BenchmarkFibWrong(b *testing.B) {
    for n := 0; n < b.N; n++ {
        Fib(n)
    }
}

func BenchmarkFibWrong2(b *testing.B) {
    Fib(b.N)
}

On my system BenchmarkFibWrong never completes. This is because the run time of the benchmark will increase as b.N grows, never converging on a stable value. BenchmarkFibWrong2 is similarly affected and never completes.

A note on compiler optimisations

Before concluding I wanted to highlight that to be completely accurate, any benchmark should be careful to avoid compiler optimisations eliminating the function under test and artificially lowering the run time of the benchmark.

var result int

func BenchmarkFibComplete(b *testing.B) {
    var r int
    for n := 0; n < b.N; n++ {
        // always record the result of Fib to prevent
        // the compiler eliminating the function call.
        r = Fib(10)
    }
    // always store the result to a package level variable
    // so the compiler cannot eliminate the Benchmark itself.
    result = r
}

Conclusion

The benchmarking facility in Go works well, and is widely accepted as a reliable standard for measuring the performance of Go code. Writing benchmarks in this manner is an excellent way of communicating a performance improvement, or a regression, in a reproducible way.

Original: http://dave.cheney.net/2013/06/30/how-to-write-benchmarks-in-go

Updating SVN from the Browser

A simple script that updates an SVN working copy through the browser.

<?php
header("Cache-Control: no-cache, must-revalidate");
$handle = popen('export LC_CTYPE=en_US.UTF-8 && /usr/bin/svn up --username user_test --password pass_test /data1/svn_repo 2>&1', 'r');
$read = stream_get_contents($handle); // requires PHP 5 or later
echo "<pre>";
echo $read; // echo rather than printf: svn output may contain literal % characters
echo "</pre>";
pclose($handle);

I hit a few pitfalls along the way:

1. The repository contains Chinese filenames.
export LC_CTYPE=en_US.UTF-8 solves this.

2. The command returned 1, but no error message was shown.

2>&1 solves this.

3. The user account running the PHP server process must have read and write permissions on the working copy.

The easy way to guarantee this is to let this same program do the initial checkout.

Reference: http://zyan.cc/post/371/

PHP Pitfalls: ==

In PHP, == automatically converts types when testing two variables for equality. Today Tom Hessman mentioned on Twitter an extreme (and very nasty) corner case of comparing with == in PHP.


// ok, it returns true
var_dump(md5('240610708') == md5('QNKCDZO'));
var_dump(md5('240610708'));
// string(32) "0e462097431906509019562988736854"
var_dump(md5('QNKCDZO'));
// string(32) "0e830400451993494058024219903391"

PHP converts the two strings to numbers in scientific notation and compares those: both hashes have the form 0e followed by digits, which parses as 0 × 10^…, i.e. zero, so the comparison is true.

For a rigorous comparison, use === instead.

A Small Gotcha in PHP's mysql_ping Function

The official documentation describes it like this:

mysql_ping() checks whether the connection to the server is working. If it has gone down, an automatic reconnection is attempted. Scripts that stay idle for a long time can use this function to check whether the server has closed the connection and reconnect if necessary. mysql_ping() returns TRUE if the connection to the server is available, FALSE otherwise.

From a glance at this you might conclude that mysql_ping has just two return values (TRUE | FALSE), but in reality there is a third. Someone pointed this out in a comment under the official documentation, and that comment is seven years old :(

// When checking if a $resource works...
// be prepared that mysql_ping() returns NULL as long as $resource is not a valid MySQL resource.
$resource = NULL;
var_dump(@mysql_ping($resource)); // shows NULL
// This can be used to decide whether a given $resource is a mysql or a mysqli connection
// when nothing else is available to tell them apart...

More and more, PHP feels like a toy...