19. ANALYTICS
• Hadoop/Hive based
• Latency is reportedly poor
• Risk of relying on daily batch processing
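[Figure: Analytics based on Hadoop/Hive. Components: HTTP, Scribe, NFS, Hive (on Hadoop), MySQL. Stages: Copier/Loader, Pipeline Jobs. Latency labels: seconds, seconds, Hourly, Daily.]
• What if processing does not finish within 12 hours?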
• 3000-node Hadoop cluster
• Copier/Loader: Map-Reduce hides machine failures
• Pipeline Jobs: Hive allows SQL-like syntax
• Good scalability, but poor latency! 24 – 48 hours.
20. ANALYTICS
• Scribe + PTail + Puma + HBase
• Real-time processing: partly a way to avoid risk
• Will it be open sourced?
[Figure: Puma3 Architecture. Components: PTail, Puma3, HBase, Serving.]
• Read workflow
▪ Read uncommitted: serve directly from the in-memory hashmap; load from HBase on a miss.
▪ Read committed: read from HBase and serve.
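
The two read modes above can be sketched roughly as follows. This is a minimal illustration, not Puma3's actual code: the class name `AggregateReader`, the `increment` and `commit` methods, and the plain `dict` standing in for HBase are all assumptions made for the example.

```python
class AggregateReader:
    """Hypothetical sketch of the two read modes; not the real Puma3 API."""

    def __init__(self, store):
        self.store = store   # dict standing in for HBase (committed values)
        self.memory = {}     # in-memory hashmap (may hold uncommitted values)

    def _load(self, key):
        # On a miss, pull the committed value from the backing store.
        if key not in self.memory:
            self.memory[key] = self.store.get(key, 0)

    def increment(self, key, delta=1):
        # Assumed write path: update only the in-memory value.
        self._load(key)
        self.memory[key] += delta

    def commit(self):
        # Assumed checkpoint: flush in-memory values to the backing store.
        self.store.update(self.memory)

    def read_uncommitted(self, key):
        # Serve from memory; load from the store on a miss.
        self._load(key)
        return self.memory[key]

    def read_committed(self, key):
        # Bypass memory and serve only what has been persisted.
        return self.store.get(key, 0)


reader = AggregateReader({"url:a": 5})
reader.increment("url:a", 2)
print(reader.read_uncommitted("url:a"))  # 7
print(reader.read_committed("url:a"))    # 5
```

Read uncommitted trades consistency for freshness; read committed returns only values that have already been persisted.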
21. ANALYTICS
• Yet...
Currently HBase resharding is done manually.
• Automatic hot spot detection and resharding is on the roadmap for HBase, but it's not there yet.
• Every Tuesday someone looks at the keys and decides what changes to make in the sharding plan.
Tailer Hot Spots
• In a distributed system there's a chance one part of the system can be hotter than another.
• One example is a region server that becomes hot because more keys are being directed to it.
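
As a rough illustration of what the weekly manual key review is looking for, the sketch below counts keys per shard and flags shards whose load is well above the average. The function name `find_hot_shards`, the `shard_of` mapping, and the `factor` threshold are hypothetical; this is not how HBase or the team described here detects hot spots.

```python
from collections import Counter


def find_hot_shards(keys, shard_of, num_shards, factor=2.0):
    """Return shard ids whose key count exceeds factor * mean load.

    keys       -- iterable of observed keys (e.g. one sample of traffic)
    shard_of   -- function mapping a key to its shard id (assumed)
    num_shards -- total number of shards, including idle ones
    """
    load = Counter(shard_of(k) for k in keys)
    mean = sum(load.values()) / num_shards
    return sorted(s for s, n in load.items() if n > factor * mean)


# Toy traffic sample: key "a" dominates and lands on shard 0.
sample = ["a"] * 8 + ["b", "c"]
placement = {"a": 0, "b": 1, "c": 2}
print(find_hot_shards(sample, placement.get, num_shards=4))  # [0]
```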
Top Lists
• Very hard to find the top URLs, the URLs with the most likes, for domains like YouTube which have millions
of URLs shared very quickly.
• Need more creative solutions to keep an in-memory sort and keep it up to date as data changes.
• And arbitrary queries? / MySQL still wins?
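
To make the Top Lists problem concrete, here is a naive baseline that keeps exact per-domain like counts in memory and computes the top URLs on demand. The class name `TopUrls` and its methods are hypothetical. It works for small inputs, but every query scans all URLs of a domain, which is the cost the slide says becomes a problem for domains with millions of URLs.

```python
from collections import Counter, defaultdict


class TopUrls:
    """Naive exact top-list baseline; an illustration, not a scalable design."""

    def __init__(self):
        # domain -> Counter mapping url -> like count
        self.counts = defaultdict(Counter)

    def like(self, domain, url):
        self.counts[domain][url] += 1

    def top(self, domain, k):
        # most_common(k) selects the k largest counts; cost grows with
        # the number of distinct URLs seen for the domain.
        counter = self.counts.get(domain)
        return counter.most_common(k) if counter else []


tops = TopUrls()
for url in ["x", "y", "x", "z", "x", "y"]:
    tops.like("youtube.com", url)
print(tops.top("youtube.com", 2))  # [('x', 3), ('y', 2)]
```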
22. OTHER PARTS
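• Redundancy/latency: by its nature, HDFS offers only weak guarantees on response time (both during failure recovery and for normal I/O)
• AvatarNode, etc. (for details, we were referred to the paper)
• That said, write throughput is top class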
23. STILL MYSQL WINS (IN MANY CASES)
• MySQL is still by far the most important basic data store
• Continued technical investment in MySQL will likely remain necessary