Monthly Archive: June 2015

[Repost] Inside NGINX: How We Designed for Performance & Scale

NGINX leads the pack in web performance, and it’s all due to the way the software is designed. Whereas many web servers and application servers use a simple threaded or process-based architecture, NGINX stands out with a sophisticated event-driven architecture that enables it to scale to hundreds of thousands of concurrent connections on modern hardware.

The Inside NGINX infographic drills down from the high-level process architecture to illustrate how NGINX handles multiple connections within a single process. This blog explains how it all works in further detail.

Setting the Scene – the NGINX Process Model


To better understand this design, you need to understand how NGINX runs. NGINX has a master process (which performs the privileged operations such as reading configuration and binding to ports) and a number of worker and helper processes.


# service nginx restart
* Restarting nginx
# ps -ef --forest | grep nginx
root 32475 1 0 13:36 ? 00:00:00 nginx: master process /usr/sbin/nginx \
-c /etc/nginx/nginx.conf
nginx 32476 32475 0 13:36 ? 00:00:00 \_ nginx: worker process
nginx 32477 32475 0 13:36 ? 00:00:00 \_ nginx: worker process
nginx 32479 32475 0 13:36 ? 00:00:00 \_ nginx: worker process
nginx 32480 32475 0 13:36 ? 00:00:00 \_ nginx: worker process
nginx 32481 32475 0 13:36 ? 00:00:00 \_ nginx: cache manager process
nginx 32482 32475 0 13:36 ? 00:00:00 \_ nginx: cache loader process

On this 4-core server, the NGINX master process creates 4 worker processes and a couple of cache helper processes which manage the on-disk content cache.

Why Is Architecture Important?

The fundamental basis of any Unix application is the thread or process. (From the Linux OS perspective, threads and processes are mostly identical; the major difference is the degree to which they share memory.) A thread or process is a self-contained set of instructions that the operating system can schedule to run on a CPU core. Most complex applications run multiple threads or processes in parallel for two reasons:

  • They can use more compute cores at the same time.
  • Threads and processes make it very easy to do operations in parallel (for example, to handle multiple connections at the same time).

Processes and threads consume resources. They each use memory and other OS resources, and they need to be swapped on and off the cores (an operation called a context switch). Most modern servers can handle hundreds of small, active threads or processes simultaneously, but performance degrades seriously once memory is exhausted or when high I/O load causes a large volume of context switches.

The common way to design network applications is to assign a thread or process to each connection. This architecture is simple and easy to implement, but it does not scale when the application needs to handle thousands of simultaneous connections.

How Does NGINX Work?

NGINX uses a predictable process model that is tuned to the available hardware resources:

  • The master process performs the privileged operations such as reading configuration and binding to ports, and then creates a small number of child processes (the next three types).
  • The cache loader process runs at startup to load the disk-based cache into memory, and then exits. It is scheduled conservatively, so its resource demands are low.
  • The cache manager process runs periodically and prunes entries from the disk caches to keep them within the configured sizes.
  • The worker processes do all of the work! They handle network connections, read and write content to disk, and communicate with upstream servers.

The NGINX configuration recommended in most cases – running one worker process per CPU core – makes the most efficient use of hardware resources. You configure it by including the worker_processes auto directive in the configuration:

worker_processes auto;

When an NGINX server is active, only the worker processes are busy. Each worker process handles multiple connections in a non-blocking fashion, reducing the number of context switches.

Each worker process is single-threaded and runs independently, grabbing new connections and processing them. The processes can communicate using shared memory for shared cache data, session persistence data, and other shared resources.

Inside the NGINX Worker Process


Each NGINX worker process is initialized with the NGINX configuration and is provided with a set of listen sockets by the master process.

The NGINX worker processes begin by waiting for events on the listen sockets, using accept_mutex or kernel socket sharding to distribute new connections among the workers. Events are initiated by new incoming connections. These connections are assigned to a state machine – the HTTP state machine is the most commonly used, but NGINX also implements state machines for stream (raw TCP) traffic and for a number of mail protocols (SMTP, IMAP, and POP3).


The state machine is essentially the set of instructions that tell NGINX how to process a request. Most web servers that perform the same functions as NGINX use a similar state machine – the difference lies in the implementation.

Scheduling the State Machine

Think of the state machine like the rules for chess. Each HTTP transaction is a chess game. On one side of the chessboard is the web server – a grandmaster who can make decisions very quickly. On the other side is the remote client – the web browser that is accessing the site or application over a relatively slow network.

However, the rules of the game can be very complicated. For example, the web server might need to communicate with other parties (proxying to an upstream application) or talk to an authentication server. Third-party modules in the web server can even extend the rules of the game.

A Blocking State Machine

Recall our description of a process or thread as a self-contained set of instructions that the operating system can schedule to run on a CPU core. Most web servers and web applications use a process-per-connection or thread-per-connection model to play the chess game. Each process or thread contains the instructions to play one game through to the end. During the time the process is run by the server, it spends most of its time ‘blocked’ – waiting for the client to complete its next move.


  1. The web server process listens for new connections (new games initiated by clients) on the listen sockets.
  2. When it gets a new game, it plays that game, blocking after each move to wait for the client’s response.
  3. Once the game completes, the web server process might wait to see if the client wants to start a new game (this corresponds to a keepalive connection). If the connection is closed (the client goes away or a timeout occurs), the web server process returns to listening for new games.

The important point to remember is that every active HTTP connection (every chess game) requires a dedicated process or thread (a grandmaster). This architecture is simple and easy to extend with third-party modules (‘new rules’). However, there’s a huge imbalance: the rather lightweight HTTP connection, represented by a file descriptor and a small amount of memory, maps to a separate thread or process, a very heavyweight operating system object. It’s a programming convenience, but it’s massively wasteful.

NGINX is a True Grandmaster

Perhaps you’ve heard of simultaneous exhibition games, where one chess grandmaster plays dozens of opponents at the same time?


Kiril Georgiev played 360 people simultaneously in Sofia, Bulgaria. His final score was 284 wins, 70 draws and 6 losses.

That’s how an NGINX worker process plays “chess.” Each worker (remember – there’s usually one worker for each CPU core) is a grandmaster that can play hundreds (in fact, hundreds of thousands) of games simultaneously.


  1. The worker waits for events on the listen and connection sockets.
  2. Events occur on the sockets and the worker handles them:
    • An event on the listen socket means that a client has started a new chess game. The worker creates a new connection socket.
    • An event on a connection socket means that the client has made a new move. The worker responds promptly.

A worker never blocks on network traffic, waiting for its “opponent” (the client) to respond. When it has made its move, the worker immediately proceeds to other games where moves are waiting to be processed, or welcomes new players in the door.

Why Is This Faster than a Blocking, Multi-Process Architecture?

NGINX scales very well to support hundreds of thousands of connections per worker process. Each new connection creates another file descriptor and consumes a small amount of additional memory in the worker process. There is very little additional overhead per connection. NGINX processes can remain pinned to CPUs. Context switches are relatively infrequent and occur when there is no work to be done.

In the blocking, connection-per-process approach, each connection requires a large amount of additional resources and overhead, and context switches (swapping from one process to another) are very frequent.

For a more detailed explanation, check out this article about NGINX architecture, by Andrew Alexeev, VP of Corporate Development and Co-Founder at NGINX, Inc.

With appropriate system tuning, NGINX can scale to handle hundreds of thousands of concurrent HTTP connections per worker process, and can absorb traffic spikes (an influx of new games) without missing a beat.

Updating Configuration and Upgrading NGINX

NGINX’s process architecture, with a small number of worker processes, makes for very efficient updating of the configuration and even the NGINX binary itself.


Updating NGINX configuration is a very simple, lightweight, and reliable operation. It typically just means running the nginx -s reload command, which checks the configuration on disk and sends the master process a SIGHUP signal.

When the master process receives a SIGHUP, it does two things:

  1. Reloads the configuration and forks a new set of worker processes. These new worker processes immediately begin accepting connections and processing traffic (using the new configuration settings).
  2. Signals the old worker processes to gracefully exit. The worker processes stop accepting new connections. As soon as each current HTTP request completes, the worker process cleanly shuts down the connection (that is, there are no lingering keepalives). Once all connections are closed, the worker processes exit.

This reload process can cause a small spike in CPU and memory usage, but it’s generally imperceptible compared to the resource load from active connections. You can reload the configuration multiple times per second (and many NGINX users do exactly that). Very rarely, issues arise when there are many generations of NGINX worker processes waiting for connections to close, but even those are quickly resolved.

NGINX’s binary upgrade process achieves the holy grail of high-availability – you can upgrade the software on the fly, without any dropped connections, downtime, or interruption in service.


The binary upgrade process is similar in approach to the graceful reload of configuration. A new NGINX master process runs in parallel with the original master process, and they share the listening sockets. Both processes are active, and their respective worker processes handle traffic. You can then signal the old master and its workers to gracefully exit.

The entire process is described in more detail in Controlling NGINX.

Conclusion

The Inside NGINX infographic provides a high-level overview of how NGINX functions, but behind this simple explanation is over ten years of innovation and optimization that enable NGINX to deliver the best possible performance on a wide range of hardware while maintaining the security and reliability that modern web applications require.

If you’d like to read more about the optimizations in NGINX, check out these great resources:

Original: http://nginx.com/blog/inside-nginx-how-we-designed-for-performance-scale/

A Scheme for Running Multiple Golang Versions Side by Side

If you're an avid Golang enthusiast, you may well use two or more Golang versions at the same time. How do you make that work?

Here is a scheme from Russ Cox. I saw it in a Google Group thread, but I can no longer find the exact link.


#!/bin/bash
# file: /usr/bin/go1.4
export GOROOT=$HOME/go1.4
export PATH=$HOME/go1.4/bin:$PATH
exec $GOROOT/bin/go "$@"

Handle older stable versions this way (running go1.4 build instead of go build, for example), and leave the development version as the default.

All done.

P.S. Russ Cox has become my idol.

Sharing a Web Bench Tool I Wrote in Golang

I've been learning Golang lately, and in my spare time I started a web bench tool called awb (short for "another web bench"). The basic functionality is complete, and it is compatible with ab's command-line options.

Preliminary tests show its concurrency is somewhat better than ab's. Beyond its big advantages in server-side development, Golang also feels quite handy for writing all kinds of tools. I expect many ops tools will be written in Golang in the future.

Overall, Go really suits my taste. It has PHP's simplicity and efficiency, along with solid system-level support for performance, concurrency, and asynchrony. On the whole it gives me the same feeling as when I first saw PHP... my heart is racing 🙂

Project: https://github.com/tomheng/awb

[Repost] How does the Go runtime work?

(This answer is for the Go compiler from golang.org. It is not about gccgo.)

The Go runtime is a Go package, which appears (in the documentation, the build process, the things it exports) to be like any other Go package. It is written in a combination of Go, C, and assembly. In order for it to do all the low-level things it needs to do, there are special adaptations (hacks?) in the build system and in the compiler that are activated when compiling it. The runtime has architecture- and OS-specific things in it, which are compiled in according to the target platform.

At linking time, the linker creates a statically linked ELF (or PE or MachO) file which includes the runtime, your program, and every package that your program references. The starting address in the ELF header points into the runtime. The runtime arranges for its own data structures to be initialized, then calls the initializers of all the packages (ordering of init is probably figured out at link time by the linker). It then transfers control to main.main, and your program is running. (There are situations where Go creates dynamically linked executables, but the majority of the code is still statically linked.)

When your program does things that could cause a crash, like a cast, or accessing an item in an array, the compiler emits calls into the runtime, which checks the data type with run time type info, or checks the bounds of the array. Likewise for memory allocation and for creating new goroutines, the runtime gets control. The runtime has a user-space scheduler in it, which implements cooperative multitasking; if a goroutine goes into a tight loop without calling any routines that would block (thereby giving the scheduler control) it can starve all the other goroutines. The runtime spawns a new system thread when needed to prevent the system from blocking on system calls. There can be fewer system threads in a Go system than the number of goroutines that are active.

The final aspect of the Go runtime that is very interesting is the per-goroutine stack. The runtime, together with the linker, arranges for the hardware stack to be non-contiguous, able to grow and shrink according to demand. As the stack shrinks after growing, it is freed and is available to be reallocated as other types of objects by the memory allocator. This allows Go stacks to start very small (8 KB), meaning that a Go program can launch hundreds of thousands of goroutines without running out of address space. (The stack is becoming contiguous in Go 1.5, but it can still be reallocated and moved when it runs out.)

When programming Go, the runtime is not in the front of your mind. You interact with the system library, and the runtime supports your code more or less silently. This is why the majority of information you’ll see about Go is how to use the libraries and how to use the channels to implement concurrent programming, and little about the runtime itself.

Original: http://www.quora.com/How-does-the-Go-runtime-work

Golang Usage Notes: Correctly Closing a Channel

I'm sure many of us switched to Golang precisely because of its remarkably simple concurrency model, which consists of three main parts: goroutines, channels, and select.

Channels are the bridge for communication between goroutines, solving the synchronization problems of concurrent programming simply and elegantly.

But beginners (like me) often run into the "send on closed channel" error, which is caused by sending data into a channel that has already been closed. It typically occurs in a producer-consumer model, where a channel carries the tasks to be executed: producers sit on one end of the channel, consumers on the other.

Consider the following code:

package main

import (
    "fmt"
    "sync"
)

func main() {
    jobs := make(chan int)
    var wg sync.WaitGroup
    wg.Add(10)
    go func() {
        for i := 0; i < 10; i++ {
            jobs <- i
            fmt.Println("produce:", i)
        }
    }()
    go func() {
        for i := range jobs {
            fmt.Println("consume:", i)
            wg.Done()
        }
    }()
    wg.Wait()
}

Note that the code above never explicitly closes the channel: since we know the number of tasks from the start, we can simply wait for them all to complete, and there is no need to close the channel.

Now suppose that in our scenario the producer has no limit on how many items it produces, only a time limit. The code then evolves into something like this:

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    jobs := make(chan int)
    var wg sync.WaitGroup
    go func() {
        time.Sleep(time.Second * 3)
        close(jobs)
    }()
    go func() {
        for i := 0; ; i++ {
            jobs <- i
            fmt.Println("produce:", i)
        }
    }()
    wg.Add(1)
    go func() {
        defer wg.Done()
        for i := range jobs {
            fmt.Println("consume:", i)
        }
    }()
    wg.Wait()
}

Now the program crashes with a "send on closed channel" panic. The first fix that comes to mind might be to check whether jobs has been closed before pushing into it. Golang does provide a way to detect a closed channel, but it is quite awkward here: the comma-ok idiom detects a closed channel on receive, like this:

i, ok := <-jobs

If the channel has been closed, ok is false. But if jobs is not closed, this receive consumes a job, and that job would be lost. You could of course hack your way around the lost job, but the resulting code would be neither elegant nor intuitive. In practice I use the following code to handle the problem.
package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    jobs := make(chan int)
    var wg sync.WaitGroup
    timeout := make(chan bool)
    go func() {
        time.Sleep(time.Second * 3)
        timeout <- true
    }()
    go func() {
        for i := 0; ; i++ {
            select {
            case <-timeout:
                close(jobs)
                return
            default:
                jobs <- i
                fmt.Println("produce:", i)
            }
        }
    }()
    wg.Add(1)
    go func() {
        defer wg.Done()
        for i := range jobs {
            fmt.Println("consume:", i)
        }
    }()
    wg.Wait()
}

Real situations can be much more complicated. For example, what if there are multiple producers? How should the code handle that? It's worth thinking about...