3/25に行われたGoConで"How Communicating Sequential Goroutines Work"という発表をしてきました。当初僕はCommunicating Sequential Processesについての話しをする予定だったのですが、時間内にとても発表できそうな内容ではなかったため、Concurrency全般についての話をしました。そのため、ここではその際触れられなかったgoroutineの実装の話しやCSPの話しなどを含めてGoのruntimeについて何回かに分けてまとめていきたいと思います。今回は主にgoroutineについてです。

GoのConcurrency

goroutineの説明に入る前にざっくりGoのConcurrencyについて説明します。

以下、GoConでの発表スライドにざっくりと沿いながら書いていきます。 speakerdeck.com

Goの有名な格言として

"Do not communicate by sharing memory; instead, share memory by communicating."

というものがあるのはみなさんご存じでしょう。これはgo-proverbsというGoらしいコードを書くためのことわざ集にも書いてあり、とても有名な格言かと思います。

この格言はConcurrencyを実現する上で一般的に用いられるShared-memory CommunicationとMessaging-passing Communicationの2つのモデルを並べて、Messaging-passing Communicationしようねと言っています。ただ、この格言を理解するためにはShared-memory CommunicationとMessaging-passing Communicationそれぞれの特徴とGoにおいてそれぞれの手法がどのように実現されているのかを理解しなければいけません。

Shared-memory CommunicationとMessaging-passing Communicationの違いなどは発表資料を参考にしてください。

Goの実装

Messaging-passing Communication

では、GoではMessaging-passing Communicationをどのように実現しているんでしょうか。ポイントとなるのは以下の機能です。

goroutine 2048byteの軽量なスレッドのようなもの
channel goroutine 間でのメッセージパッシングを行う
select 複数の channel の受信を同時に行う場合などに用いる

これらの機能を用いることによってGoではMessaging-passing Communicationをより簡単に使えるようになっています。今回の記事ではこの中のgoroutineに焦点を当てます。(続編でchannelについても扱う予定です)

なお、発表でも触れましたがselectの実践的な使い方に関しては牧さんの以下の発表が詳しいです。

いまさら聞けないselectあれこれ from lestrrat

www.slideshare.net

Shared-memory Communication

GoではShared-memory Communicationを実現するための方法としてsync.Mutexなどが提供されています。ただ、今回はMessaging-passingに焦点を当てるためここではあまり深くは扱いません。詳しくはドキュメントを読んでください、と言いたいところではありますが英語に抵抗がある方などは、少し前の記事にはなりますがmattnさんによるsyncパッケージについての記事などもあるのでそちらを読むといいと思います。 mattn.kaoriya.net

goroutineの実装

さて、やっと本題のgoroutineについてです。 goroutineを知るためにはsrc/runtime以下を読まなければいけません。runtimeのドキュメントは以下です。 runtime - The Go Programming Language

goroutineとはなんぞや？

軽量なスレッドのようなものですが、goroutineは最小で2048byteなのでWindows だと1MB、Linux だと2MB であるスレッドのデフォルトスタックサイズにくらべてとても軽量です。

また、OSスレッドはOSカーネルでスケジュールされており、スレッド間の制御を変更するには完全なコンテキストスイッチが必要なので遅くなってしまうのですが、goroutineは以下で説明するM:Nスレッド(LWP方式)を用いているため、スレッドの再スケジュールより低コストにスケジューリング可能という特徴があります。

スレッドの種類と特徴

スレッドには大きく分けてN:1と1:1、そしてM:N方式があります。

N：1

複数のユーザー空間スレッドが1つのOSスレッドで実行されます

長所コンテキストスイッチが非常に迅速である
短所マルチコアシステムを利用することができない

1：1

1つの実行スレッドが1つのOSスレッドと一致します

長所マシン上のすべてのコアを利用できる
短所トラップ(強制的割り込み)する必要があるため、コンテキストスイッチが遅い

M：N

任意の数のOSスレッドに任意の数のゴルーチンをスケジューリングします。

長所コンテキストスイッチをすばやく実行し、システム内のすべてのコアを活用できる
短所いいとこ取りだけどスケジューラーへの追加が煩雑でつらい

それぞれこれらのような特徴があるのですが、GoではM:N方式を採用しつつ、欠点であるスケジューラーへの追加の煩雑さをユーザーが意識せずに使えるようにしています。

登場人物説明

以下runtime以下の説明をしていくにあたって主要な登場人物が3人(!?)います。

G

type g struct {
    // Stack parameters.
    // stack describes the actual stack memory: [stack.lo, stack.hi).
    // stackguard0 is the stack pointer compared in the Go stack growth prologue.
    // It is stack.lo+StackGuard normally, but can be StackPreempt to trigger a preemption.
    // stackguard1 is the stack pointer compared in the C stack growth prologue.
    // It is stack.lo+StackGuard on g0 and gsignal stacks.
    // It is ~0 on other goroutine stacks, to trigger a call to morestackc (and crash).
    stack       stack   // offset known to runtime/cgo
    stackguard0 uintptr // offset known to liblink
    stackguard1 uintptr // offset known to liblink

    _panic         *_panic // innermost panic - offset known to liblink
    _defer         *_defer // innermost defer
    m              *m      // current m; offset known to arm liblink
    sched          gobuf
    syscallsp      uintptr        // if status==Gsyscall, syscallsp = sched.sp to use during gc
    syscallpc      uintptr        // if status==Gsyscall, syscallpc = sched.pc to use during gc
    stktopsp       uintptr        // expected sp at top of stack, to check in traceback
    param          unsafe.Pointer // passed parameter on wakeup
    atomicstatus   uint32
    stackLock      uint32 // sigprof/scang lock; TODO: fold in to atomicstatus
    goid           int64
    waitsince      int64  // approx time when the g become blocked
    waitreason     string // if status==Gwaiting
    schedlink      guintptr
    preempt        bool     // preemption signal, duplicates stackguard0 = stackpreempt
    paniconfault   bool     // panic (instead of crash) on unexpected fault address
    preemptscan    bool     // preempted g does scan for gc
    gcscandone     bool     // g has scanned stack; protected by _Gscan bit in status
    gcscanvalid    bool     // false at start of gc cycle, true if G has not run since last scan; TODO: remove?
    throwsplit     bool     // must not split stack
    raceignore     int8     // ignore race detection events
    sysblocktraced bool     // StartTrace has emitted EvGoInSyscall about this goroutine
    sysexitticks   int64    // cputicks when syscall has returned (for tracing)
    traceseq       uint64   // trace event sequencer
    tracelastp     puintptr // last P emitted an event for this goroutine
    lockedm        *m
    sig            uint32
    writebuf       []byte
    sigcode0       uintptr
    sigcode1       uintptr
    sigpc          uintptr
    gopc           uintptr // pc of go statement that created this goroutine
    startpc        uintptr // pc of goroutine function
    racectx        uintptr
    waiting        *sudog         // sudog structures this g is waiting on (that have a valid elem ptr); in lock order
    cgoCtxt        []uintptr      // cgo traceback context
    labels         unsafe.Pointer // profiler labels
    timer          *timer         // cached timer for time.Sleep

    // Per-G GC state

    // gcAssistBytes is this G's GC assist credit in terms of
    // bytes allocated. If this is positive, then the G has credit
    // to allocate gcAssistBytes bytes without assisting. If this
    // is negative, then the G must correct this by performing
    // scan work. We track this in bytes to make it fast to update
    // and check for debt in the malloc hot path. The assist ratio
    // determines how this corresponds to scan work debt.
    gcAssistBytes int64
}

はい。この子がG(Goroutine)です。 g型で表されます。 goroutineが終了すると、その gは空いているgのpoolに戻されて、後で他のゴルーチンのために再利用されます。スタック内には、命令ポインタおよびゴルーチンのスケジューリングに重要な情報が含まれます。(例：ブロックされている可能性があるチャネルの情報等)

ちなみにこれらはsrc/runtime/runtime2.goというファイルに記述されています。 runtime以下には他にruntime.goとruntime1.goというファイルが存在します。歴史を感じますね…

M

type m struct {
    g0      *g     // goroutine with scheduling stack
    morebuf gobuf  // gobuf arg to morestack
    divmod  uint32 // div/mod denominator for arm - known to liblink

    // Fields not known to debuggers.
    procid        uint64     // for debuggers, but offset not hard-coded
    gsignal       *g         // signal-handling g
    sigmask       sigset     // storage for saved signal mask
    tls           [6]uintptr // thread-local storage (for x86 extern register)
    mstartfn      func()
    curg          *g       // current running goroutine
    caughtsig     guintptr // goroutine running during fatal signal
    p             puintptr // attached p for executing go code (nil if not executing go code)
    nextp         puintptr
    id            int32
    mallocing     int32
    throwing      int32
    preemptoff    string // if != "", keep curg running on this m
    locks         int32
    softfloat     int32
    dying         int32
    profilehz     int32
    helpgc        int32
    spinning      bool // m is out of work and is actively looking for work
    blocked       bool // m is blocked on a note
    inwb          bool // m is executing a write barrier
    newSigstack   bool // minit on C thread called sigaltstack
    printlock     int8
    incgo         bool // m is executing a cgo call
    fastrand      uint32
    ncgocall      uint64      // number of cgo calls in total
    ncgo          int32       // number of cgo calls currently in progress
    cgoCallersUse uint32      // if non-zero, cgoCallers in use temporarily
    cgoCallers    *cgoCallers // cgo traceback if crashing in cgo call
    park          note
    alllink       *m // on allm
    schedlink     muintptr
    mcache        *mcache
    lockedg       *g
    createstack   [32]uintptr // stack that created this thread.
    freglo        [16]uint32  // d[i] lsb and f[i]
    freghi        [16]uint32  // d[i] msb and f[i+16]
    fflag         uint32      // floating point compare flags
    locked        uint32      // tracking for lockosthread
    nextwaitm     uintptr     // next m waiting for lock
    needextram    bool
    traceback     uint8
    waitunlockf   unsafe.Pointer // todo go func(*g, unsafe.pointer) bool
    waitlock      unsafe.Pointer
    waittraceev   byte
    waittraceskip int
    startingtrace bool
    syscalltick   uint32
    thread        uintptr // thread handle

    // these are here because they are too large to be on the stack
    // of low-level NOSPLIT functions.
    libcall   libcall
    libcallpc uintptr // for cpu profiler
    libcallsp uintptr
    libcallg  guintptr
    syscall   libcall // stores syscall parameters on windows

    mOS
}

M(Machine)です。 OSスレッドを表します。これはOSによって管理される実行スレッドであり、標準のPOSIXスレッドのようなもので mで表されます。ユーザーのGoコード、runtimeのコード、syscallを実行しているか、idle状態になっています。複数のスレッドがシステムコールでブロックされる可能性があるため、一度に複数のMが存在する可能性があります。

P

type p struct {
    lock mutex

    id          int32
    status      uint32 // one of pidle/prunning/...
    link        puintptr
    schedtick   uint32   // incremented on every scheduler call
    syscalltick uint32   // incremented on every system call
    m           muintptr // back-link to associated m (nil if idle)
    mcache      *mcache
    racectx     uintptr

    deferpool    [5][]*_defer // pool of available defer structs of different sizes (see panic.go)
    deferpoolbuf [5][32]*_defer

    // Cache of goroutine ids, amortizes accesses to runtime·sched.goidgen.
    goidcache    uint64
    goidcacheend uint64

    // Queue of runnable goroutines. Accessed without lock.
    runqhead uint32
    runqtail uint32
    runq     [256]guintptr
    // runnext, if non-nil, is a runnable G that was ready'd by
    // the current G and should be run next instead of what's in
    // runq if there's time remaining in the running G's time
    // slice. It will inherit the time left in the current time
    // slice. If a set of goroutines is locked in a
    // communicate-and-wait pattern, this schedules that set as a
    // unit and eliminates the (potentially large) scheduling
    // latency that otherwise arises from adding the ready'd
    // goroutines to the end of the run queue.
    runnext guintptr

    // Available G's (status == Gdead)
    gfree    *g
    gfreecnt int32

    sudogcache []*sudog
    sudogbuf   [128]*sudog

    tracebuf traceBufPtr

    palloc persistentAlloc // per-P to avoid mutex

    // Per-P GC state
    gcAssistTime     int64 // Nanoseconds in assistAlloc
    gcBgMarkWorker   guintptr
    gcMarkWorkerMode gcMarkWorkerMode

    // gcw is this P's GC work buffer cache. The work buffer is
    // filled by write barriers, drained by mutator assists, and
    // disposed on certain GC state transitions.
    gcw gcWork

    runSafePointFn uint32 // if 1, run sched.safePointFn at next safe point

    pad [sys.CacheLineSize]byte
}

P(processor)です。スケジューラおよびメモリアロケータの状態など、ユーザGoコードを実行するために必要なリソースを表し、pと書かれます。

単一のスレッドでGoコードを実行するスケジューラのローカライズ版のようなものだとイメージしてください。つまり、pの内容はCPUごとの状態のように考えることができ、スレッド単位またはゴルーチン単位である必要はない状態を管理するのに適しています。 N：1スケジューラからM：Nスケジューラに移行することができる重要な部分です。

Pによって、Goプロセスの呼び出しを個々のコンピュータに合わせることができます。(4コアPCであれば、4つのスレッドでGoコードを実行するようになります) すごい子なんです。　

Pの数は起動時にGOMAXPROCS環境変数の値または実行時関数GOMAXPROCS（）によって設定されます。あとでも説明しますがGOMAXPROCSはあくまでもPの値であり、Mではありません。つまり、GOMAXPROCSが1でも複数のOSスレッドで実行されることはあります

スケジューラー

GとMとPがただ存在するだけでは我々の使っている並行処理機構は成立しえません。実行するコードであるG、実行する場所であるM、それを実行する権利とリソースであるPをうまく組み合わせてあげる必要があります。そこでスケジューラーの出番です。

ユーザースタックとシステムスタック

activeなGにはGoコードが実行する最小で2048byteのユーザースタックが関連付けられていて、動的に増減します。

全てのMにはそのユーザースタックにに関連するシステムスタックがstub Gとして実装されており、Mの "g0"スタックと呼ばれており、UnixではシグナルスタックがMの "gsignal"スタックとして実装されています。システムスタックとシグナルスタックは大きく拡大することはできませんが、ランタイムとcgoコードを実行するのに十分な大きさがあります。

runtimeのコードはsystemstack mcall asmcgocallなどを使用して、一時的にシステムスタックに切り替えてユーザーのゴルーチンを切り替えるタスクを実行します。システムスタック上で処理が実行されている間はユーザースタックは実行に使用されません。

goroutineが切り替わるタイミング

上記のタイミングでgoroutineの切り替え作業が行われるのですが、ざっくりとまとめると以下のようなタイミングで切り替えが行われます。

アンバッファなチャネルへの読み書きが行われる
システムコールが呼ばれるディスクI/Oとか待ちが入る余地がない即座に帰ってくる系のものだとスイッチしません
メモリの割り当てが行われる
time.Sleep()が呼ばれる
runtime.Gosched()が呼ばれる

GAEのようなGOMAXPROCS(=P)が1の時にこの切り替えが呼ばれない処理を並行処理しようとしても逐次処理と同じ動作をしてしまうので注意が必要です。

同期方法

runtimeには複数の同期メカニズムがあります。セマンティクス、特にゴルーチンスケジューラまたはOSスケジューラと相互作用するかどうかが異なります。

mutex

ロックとアンロックを使う単純な方法で、共有部分を短期間保護するために使用されます。 mutexを用いるとGoスケジューラとやりとりすることなくMが直接ブロックされるのでruntimeの最下位レベルから使用する分には安全ですが、関連付けられたGおよびPの再スケジューリングもブロックされてしまいます。

note

notesleepとnotewakeupというメソッドを持つnoteを使用するone-shot notificationsという手法があります。notesleepは、関連付けられたGとPの再スケジュールをブロックしてしまいますが、notetsleepgはブロックシステムコールのように動作し、別のGを実行するためにPを再利用できるようにします。こちらの手法はMを消費するのでGを直接ブロックする方法よりは効率的ではありません。

つまり、以下の様にまとめられます

Interface	G	M	P
mutex	Y	Y	Y
note	Y	Y	\|Y/N
park	Y	N	N

動作例

イメージが湧きやすいように例を示します。以下の図はThe Go scheduler - Morsing's blogから引用したものです。 G,M,Pはそれぞれ以下の様なアイコンで表します。 f:id:niconegoto:20170411085818p:plain

通常状態(GOMAXPROCS=2)

f:id:niconegoto:20170411085813p:plain

青色のGが実行中のGで、PはGo文が実行されるたびにrunqueuesというキューのリストからGをポップします。GOMAXPROCS=2なのでPは2つ存在します。 Pがスケジューリングポイント(メモリが一貫性をもつポイント)までゴルーチンを実行すると、その実行キューからGがポップされ、スタックと命令ポインタが設定され、ゴルーチンの実行が開始されます。このPの持つGのリストをローカルの実行キューと呼び、これ以外にグローバルの実行キューが存在します。 Pはローカル実行キューを使い果たしたときにグローバルキューからGを引き出します。旧バージョンのGoスケジューラではグローバル実行キューしかありませんでしたが、1.1からローカル実行キューが追加されました。

Pによるsteal

以下の図の左側では、片方のPからGが無くなって手持ちぶさたになってしまっています。

f:id:niconegoto:20170411085823p:plain

待機しているGが無くなるとPは別のPから実行キューの約半分を奪います。これによって、コンテキストごとに常に作業が行われるようになり、すべてのスレッドが最大容量で動作することができます。

syscall

以下はG0でsyscallが呼ばれた場合です。

f:id:niconegoto:20170411085754p:plain

Mはコードを実行していてシステムコールでブロックすることができないため、スケジューリングを維持できるようにPをハンドオフする必要があります。 M0はシステムコールを作成したgoroutineを保持したままPをリストにプッシュして、それをM1がポップして使用します。その状況が右図です。この左図から、GOMAXPROCS(=P)が1であってもGoプログラムは複数のスレッド(M)で実行されることがわかります。

最後に

いかがだったでしょうか。今回、メモリ割り当てやGC、コンパイラ命令の話しはあまり需要がなさそうなので触れませんでしたが、runtimeがGoの奥深さが詰まった世界であることが少しでも伝わったら幸いです。 runtimeを読んでGoにコミットをしてみるのも楽しいかなと思います。ちなみに、ランタイムエラーのデバッグでは、GOTRACEBACK = systemかGOTRACEBACK = crashで実行すると便利です。

現在続編としてchannelとCSPに関する記事を書いています。ゆっくりにはなりますが心が折れない限り書くのでお待ちください… This Week in Go commitsの方も二週間空いてしまいそうなので、気が向いたら更新したいです。

株式会社FlattではGoを書く仲間を募集しています！ランチからでも大丈夫なのでぜひ興味あるかたはTwitter Koki Ide (@niconegoto) | Twitterでもお問い合わせフォームにでもいいので連絡ください。

参考資料

The Go scheduler - Morsing's blog
Scalable Go Scheduler Design Doc - Google ドキュメント
WEB+DB vol.95 牧さんの連載
Hoare, C. A. R. (1978). Communicating sequential processes. In The origin of concurrent programming (pp. 413-443). Springer New York.

niconegoto Blog

niconegoto's journal

GoConで発表してきたのでついでにruntime以下の知識をまとめていく #golang #goroutine