
Distributed Large-Scale Graph Processing
Data Mining (CS6720)
John Augustine
Jan 16, 2020

Parallel & Distributed Computing Models

• Shared Memory: PRAM
• Programming Models: MapReduce, Think like a vertex
• Message Passing: Massively Parallel Computation, k-machine model

Massively Parallel Computation (MPC) Model

• Input data size N words; each word = O(log N) bits.
• The number of machines k. (Machines identified by {1, 2, …, k}.)
• Memory size per machine S words.
  • S ≥ N is uninteresting. Assume: S = O(N^ε) for some ε ∈ (0,1].
  • Also, require Sk ≥ N.
• Synchronous communication rounds:
  • Local computation within each machine.
  • Create messages for other machines. Sum of message sizes ≤ S.
  • Send… Receive. Ensure no machine requires more than S memory.
• Goal: Solve the problem in as few rounds as possible.
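Not on the slides: a minimal Python sketch of one synchronous round under the constraints above. The Machine class and run_round driver are illustrative names; the two assertions encode the sender- and receiver-side S-word limits.

    # Hypothetical sketch (names illustrative): one synchronous MPC round.
    class Machine:
        def __init__(self, mid, memory_words):
            self.mid = mid             # machine id in {1, ..., k}
            self.S = memory_words      # memory budget S (in words)
            self.data = []             # local input words
            self.inbox = []            # words received this round

        def local_step(self):
            # Local computation; return {destination_id: [words]}.
            return {}                  # overridden per algorithm

    def run_round(machines):
        outboxes = {}
        for m in machines:
            msgs = m.local_step()
            # Sender-side constraint: sum of message sizes <= S.
            assert sum(len(ws) for ws in msgs.values()) <= m.S
            outboxes[m.mid] = msgs
        for m in machines:
            m.inbox = [w for msgs in outboxes.values()
                         for w in msgs.get(m.mid, [])]
            # Receiver-side constraint: no machine needs more than S memory.
            assert len(m.inbox) <= m.S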


Initial Data Distribution

• Typically, data is split into words (often as ⟨key, value⟩ pairs).
• The words could be either randomly distributed or arbitrarily distributed.
• Load balanced, so that no machine has much more than the other machines.
• Output: usually distributed & depends on the problem.
• Questions:
  • How to achieve a random, load-balanced distribution?
  • How to remove duplicates?

On Graphs

• Input size: N = O(n + m) = O(m).
• Memory size S regimes:
  • (Strongly) Superlinear: S = n^(1+ε) for some constant ε > 0.
  • Near Linear: S = O(n).
  • (Strongly) Sublinear: S = n^α for α ∈ (0,1).

Broadcasting

• Let S = n^(1+ε) for some constant ε > 0.
• One machine src needs to broadcast n words.
• Approach 1: the machine sends k messages of size n. If k > n^ε, the total message size kn exceeds S, violating the bandwidth constraint.
• Approach 2: Build an n^ε-ary tree with src as root.
  • Broadcast takes O(height) rounds.
  • height = O(log_(n^ε) k) = O(1/ε), since N = poly(S) (N = O(n²) for graphs) and k ≤ N.
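A small sketch of Approach 2 (the function name and example numbers are mine, not the deck's): the broadcast proceeds level by level down an n^ε-ary tree, so the round count is the tree height.

    def broadcast_rounds(k, n, eps):
        """Height of an (n^eps)-ary broadcast tree over k machines,
        i.e., the number of communication rounds."""
        fanout = max(2, round(n ** eps))  # sending n words to each of n^eps
                                          # children uses n^(1+eps) = S per round
        rounds = 0
        while fanout ** rounds < k:       # machines reachable within `rounds` levels
            rounds += 1
        return rounds

    # Example (hypothetical numbers): k = 10**6 machines, n = 10**6 words,
    # eps = 0.5 gives fanout 1000 and 2 rounds.
    print(broadcast_rounds(10**6, 10**6, 0.5))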
Maximal Matching

• A matching in a graph G = (V, E) is a set of edges that don't share common vertices.
• A maximum matching is a matching of maximum possible cardinality.
• A maximal matching is a matching that ceases to be one when any edge is added to it. (On the path a-b-c-d, the single edge {b, c} is a maximal matching, while the maximum matching {{a, b}, {c, d}} has size 2.)
• A maximal matching has cardinality at least half that of a maximum matching. Homework: Prove this.


Sequential Algorithm for finding a maximal matching

1. Let X = ∅.
2. For each e = {u, v} ∈ E:
   1. If neither u nor v is an endpoint of any edge in X, then X = X ∪ {e}.
3. Output X.

Correctness:
• Invariant: X is a matching at all times.
• Suppose X is not maximal at the end. Then some edge e can be added to it, and it will remain a matching. But then why was e rejected when it was scanned? Contradiction.
Filtering: Idea to find a maximal matching in the superlinear memory regime

Preprocessing. Let ℓ be a designated “leader” machine (say, machine 0). Assume it doesn’t hold any edge at the beginning. (Why is this OK?) During the course of the algorithm, ℓ maintains a matching (initially empty). Other machines are called regular machines. G_r = (V_r, E_r) denotes the graph during phase r. We use m_r for the number of edges in G_r. G_0 ← G.

Steps in each phase r = 0, 1, … (until G_r becomes empty):
1. Each regular machine marks each local edge independently with probability p = S/(2 m_r) and sends the marked edges to the leader ℓ.
2. The leader ℓ recomputes the maximal matching with the edges it received, but without losing any edge from the previous matching. (How?)
3. The leader ℓ broadcasts the matching so computed (≤ n/2 edges) to all machines.
4. Each regular machine removes edges that have at least one common vertex with the received matching. Isolated vertices are also removed.
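To make the phases concrete, here is a single-process simulation, a sketch rather than a distributed implementation: the sampling probability p = S/(2 m_r) is the reconstruction used above, and the leader's "recompute without losing any edge" (step 2) is realized by greedily extending the matching kept so far.

    import random

    def filtering_matching(edges, n, eps, seed=0):
        """Simulate the filtering phases; returns a maximal matching."""
        rng = random.Random(seed)
        S = n ** (1 + eps)                   # superlinear memory budget
        matched, M = set(), []
        E = list(edges)
        while E:                             # one iteration = one phase
            p = min(1.0, S / (2 * len(E)))
            sampled = [e for e in E if rng.random() < p]
            for u, v in sampled:             # leader extends M greedily,
                if u not in matched and v not in matched:  # never dropping
                    M.append((u, v))                       # earlier edges
                    matched.update((u, v))
            # Step 4: drop edges touching the broadcast matching.
            E = [(u, v) for (u, v) in E
                 if u not in matched and v not in matched]
        return M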


Outline of the Analysis

• Correctness is obvious (similar to the sequential algorithm), provided the bandwidth limitation is not violated.
• Claims:
  • The leader ℓ receives at most n^(1+ε) edges (whp) in step 1. (Homework)
  • If a phase r starts with m_r edges, then the number of edges at the end of round r is at most 2n/p = 4 m_r / n^ε with high probability.
  • The total number of rounds is log_(n^ε) m ∈ O(1/ε). Why?

Claim: At most 2n/p edges whp at end of round r

• Let G_(r+1) = (V_(r+1), E_(r+1)) be the leftover graph at the end of round r.
• For some pair of vertices u, v ∈ V_(r+1), can e = {u, v} have been sent to the leader? No! (Why? If sent, at least one of u or v would have been matched, and the edge therefore discarded.)
• Consider any set of vertices J with more than 2n/p edges with both endpoints in J.
• What is the chance that V_(r+1) = J?
  Pr[all induced edges not sent] ≤ (1 − p)^(2n/p) ≤ e^(−2n).
• There are at most 2^n subsets of V, so by a union bound, the result holds.
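Written out with the reconstructed p = S/(2 m_r), the two calculations behind these claims are:

    \[
      \Pr[\text{no induced edge of } J \text{ sent}]
        \le (1-p)^{2n/p} \le e^{-2n},
      \qquad
      2^n \cdot e^{-2n} = (2/e^2)^n \to 0 .
    \]
    \[
      m_{r+1} \le \frac{2n}{p} = \frac{4n\, m_r}{S} = \frac{4 m_r}{n^{\varepsilon}}
      \;\Longrightarrow\;
      \text{rounds} = O\!\left(\log_{n^{\varepsilon}} m\right) = O(1/\varepsilon).
    \]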


The k-machine Model

• Input data size N words; each word = O(log N) bits.
• The number of machines k. (Machines identified by {1, 2, …, k}.)
• Memory size is unbounded (but usually not abused).
• Synchronous communication rounds:
  • Local computation within each machine.
  • Each machine creates one message of O(log n) bits for every other machine.
  • Send… Receive.
• Goal: Solve the problem in as few rounds as possible.

Data Distribution: The Random Vertex Partitioning (RVP)

• Typically, data is split into words (often as ⟨key, value⟩ pairs).
• The words could be either randomly distributed or arbitrarily distributed.
• Typically used in processing large graphs.
• RVP: The most common approach is to randomly partition the vertices into k parts and place each part on one of the machines. Then, a copy of each edge is placed in the (≤ 2) machines that contain either of its endpoints.
• Other partitionings of the graph data are also conceivable (e.g., random edge partitioning, arbitrary edge partitioning).
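A minimal Python sketch of RVP (function and field names are mine):

    import random

    def random_vertex_partition(vertices, edges, k, seed=0):
        """Place each vertex on a machine u.a.r.; copy each edge to the
        (at most 2) machines holding its endpoints."""
        rng = random.Random(seed)
        vertices = list(vertices)
        home = {v: rng.randrange(k) for v in vertices}
        parts = [{"vertices": [], "edges": []} for _ in range(k)]
        for v in vertices:
            parts[home[v]]["vertices"].append(v)
        for u, v in edges:
            for mid in {home[u], home[v]}:   # one machine if homes coincide
                parts[mid]["edges"].append((u, v))
        return parts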


RVP is Load Balanced

Claim: Under RVP of a graph G = (V, E) with n vertices and m edges, whp, every machine has
1. at most Õ(n/k) vertices, and
2. at most Õ(m/k + Δ) edges,
where Δ is the maximum degree in G.

Proof of part 1 is easy: just use a Chernoff bound.
Proof of part 2 is more complicated and therefore skipped.
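A sketch of part 1, filling in the standard Chernoff step: the number of vertices X_i on machine i is binomial, and the tail bound Pr[X ≥ R] ≤ 2^(−R) for R ≥ 6 E[X] applies.

    \[
      X_i \sim \mathrm{Bin}(n, 1/k), \quad \mathbb{E}[X_i] = n/k, \qquad
      \Pr\big[X_i \ge 6(n/k + \ln n)\big] \le 2^{-6(n/k + \ln n)} \le n^{-4},
    \]

so a union bound over the k ≤ n machines leaves every machine with O(n/k + log n) = Õ(n/k) vertices whp.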
