8. C Socket for Windows
Client.c
#include <winsock2.h>
#include <stdio.h>
#include <stdlib.h>
int main() {
    SOCKET sockfd;
    int len;
    struct sockaddr_in address;
    char ch = 'A';
    WSADATA wsadata;
    /* Request Winsock version 2.2 */
    if (WSAStartup(0x202, &wsadata) != 0) return 1;
    sockfd = socket(AF_INET, SOCK_STREAM, 0);
    address.sin_family = AF_INET;
    address.sin_addr.s_addr = inet_addr("127.0.0.1");
    address.sin_port = htons(1234); /* port must be in network byte order */
    len = sizeof(address);
    if (connect(sockfd, (struct sockaddr *)&address, len) == SOCKET_ERROR) {
        printf("connect failed: %d\n", WSAGetLastError());
        closesocket(sockfd);
        WSACleanup();
        return 1;
    }
    send(sockfd, &ch, 1, 0);  /* send one character to the server */
    recv(sockfd, &ch, 1, 0);  /* read the one-character reply */
    printf("char from server = %c\n", ch);
    closesocket(sockfd);
    WSACleanup();
    system("pause");
    return 0;
}
10. Client and server with threads
[Figure: Client and server with threads. Client: thread 1 generates results, thread 2 makes requests to the server. Server: an input-output thread handles receipt and queuing of requests, which are served by N threads.]
Distributed Systems: Concepts and Design
11. Alternative server threading architectures
[Figure: three server threading architectures, each showing I/O threads and remote objects.]
a. Thread-per-request (an I/O thread and workers)
b. Thread-per-connection (per-connection threads)
c. Thread-per-object (an I/O thread and per-object threads)
13. C Thread
pthread.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
void *thread_func(void *arg);
char message[] = "Hello World";
int main() {
    pthread_t thread;
    void *thread_result;
    /* Start the thread and pass it the message */
    pthread_create(&thread, NULL, thread_func, (void *)message);
    printf("Waiting for thread to finish...\n");
    /* Block until the thread exits and collect its return value */
    pthread_join(thread, &thread_result);
    printf("Thread joined, it returned %s\n", (char *)thread_result);
    system("pause");
    return 0;
}
void *thread_func(void *arg) {
    printf("thread %s is running\n", (char *)arg);
    sleep(3);
    pthread_exit("Thank you for the CPU time");
}
15. Java TCP Socket (per-connection threads)
Client.java
import java.net.*;
import java.io.*;
public class Client {
    public static void main (String args[]) {
        Socket s = null;
        try{
            int serverPort = 1234;
            s = new Socket("localhost", serverPort);
            DataInputStream in = new DataInputStream( s.getInputStream());
            DataOutputStream out = new DataOutputStream( s.getOutputStream());
            out.writeUTF("Hello");
            String data = in.readUTF();
            System.out.println("Received: "+ data) ;
            s.close();
        }catch (IOException e){
            System.out.println(e.getMessage());
        }finally {
            if(s!=null)
                try {s.close();}
                catch (IOException e){}
        }
    }
}
16. Java TCP Socket (per-connection threads)
Server.java
import java.net.*;
import java.io.*;
public class Server {
public static void main(String args[]) {
try{
int serverPort = 1234;
ServerSocket listenSocket = new ServerSocket(serverPort);
while(true) {
Socket clientSocket = listenSocket.accept();
Connection c = new Connection(clientSocket);
}
} catch(IOException e) {
System.out.println(e.getMessage());
}
}
}
17. Java TCP Socket (per-connection threads)
Connection.java
import java.net.*;
import java.io.*;
class Connection extends Thread {
    DataInputStream in;
    DataOutputStream out;
    Socket clientSocket;
    public Connection (Socket ClientSocket) {
        try {
            clientSocket = ClientSocket;
            in = new DataInputStream( clientSocket.getInputStream());
            out = new DataOutputStream( clientSocket.getOutputStream());
            this.start();
        } catch(IOException e){
            System.out.println(e.getMessage());}
    }
    public void run(){
        try {
            String data = in.readUTF();
            out.writeUTF("client data is " + data);
        } catch(IOException e) {
            System.out.println(e.getMessage());
        } finally {
            try {
                clientSocket.close();
            } catch (IOException e) {}
        }
    }
}
18. Types of Clock Synchronization
External
Synchronize all clocks against a single one, usually
the one with external, accurate time information
Internal
Synchronize all clocks among themselves
At least time monotonicity must be preserved
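19. Types of Clock Synchronization
External (accuracy):
Synchronize to an authoritative time source
Each system clock Ci differs by at most Dext at every point in the
synchronization interval from an external UTC source S:
|S - Ci| < Dext for all i
[Figure: clocks C1, C2 and C3, each synchronized to the external source S]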
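20. Types of Clock Synchronization
Internal (agreement):
Clocks cooperate to synchronize with one another
Any two system clocks Ci and Cj differ by at most Dint at every point
in the synchronization interval from each other:
|Cj - Ci| < Dint for all i and j
[Figure: clocks C1, C2 and C3 synchronized with one another]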
21. Types of Clock Synchronization
Dext and Dint are synchronization bounds
Dint <= 2Dext
Max-Synch-interval = Dint / 2Dext
It means:
If two events have single-value timestamps which
differ by less than some value, we CAN'T SAY in
which order the events occurred.
With interval timestamps, when intervals overlap, we
CAN'T SAY in which order the events occurred.
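22. Clock Synchronization in a Synchronous System
[Figure: A sends its clock time TA to B; the message takes Ttrans, so A's clock reads TA + Ttrans when B, whose clock reads TB, receives it. Axes: A's clock time, B's clock time, real time.]
Tmin < Ttrans < Tmax
The estimate Ttrans = (Tmin + Tmax)/2 is wrong by at most (Tmax - Tmin)/2
If A sends its clock time TA to B
→ B can set its clock to TA + (Tmin + Tmax)/2
→ then A and B are synchronized with bound (Tmax - Tmin)/2
[Figure: Ttrans lies between Tmin and Tmax; the midpoint (Tmin + Tmax)/2 is at most (Tmax - Tmin)/2 away from either end.]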
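As a worked instance of the midpoint rule on this slide, the sketch below computes the value B sets its clock to and the resulting bound, taking the bound as the non-negative half-width (Tmax - Tmin)/2. The class name `SyncBound` and the sample numbers are made up for illustration; units are arbitrary.

```java
final class SyncBound {
    /** Midpoint estimate of the transmission time: (Tmin + Tmax) / 2. */
    static double estimate(double tMin, double tMax) {
        return (tMin + tMax) / 2.0;
    }

    /** Worst-case error of the midpoint estimate: (Tmax - Tmin) / 2. */
    static double bound(double tMin, double tMax) {
        return (tMax - tMin) / 2.0;
    }

    /** Value B sets its clock to after receiving A's clock time tA. */
    static double receiverClock(double tA, double tMin, double tMax) {
        return tA + estimate(tMin, tMax);
    }

    public static void main(String[] args) {
        // Hypothetical values: TA = 1000, Tmin = 2, Tmax = 10.
        System.out.println(receiverClock(1000.0, 2.0, 10.0)); // prints 1006.0
        System.out.println(bound(2.0, 10.0));                 // prints 4.0
    }
}
```

With Tmin = 2 and Tmax = 10, the true transmission time can be anywhere in (2, 10), so the estimate 6 is never off by more than 4.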
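23. Clock Synchronization in an Asynchronous System
[Figure: A sends a request at TA; B replies with its clock time TB; the reply reaches A at T'A, by which time B's clock reads about TB + Tround/2. Axes: A's clock time, B's clock time.]
In an asynchronous system, we have no Tmax
How can A synchronize with B?
By using the round-trip time Tround = T'A - TA in Cristian's algorithm:
A sets its clock to TB + Tround/2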
25. Java RMI (External Clock Synchronization)
Clock.java
import java.rmi.*;
public interface Clock extends Remote{
String getTime() throws RemoteException;
}
ClockImpl.java
import java.rmi.*;
import java.rmi.server.*;
import java.util.*;
public class ClockImpl extends UnicastRemoteObject implements Clock {
public ClockImpl() throws RemoteException {
super();
}
public String getTime() {
Date d = new Date();
return d.toString();
}
}
26. Java RMI (External Clock Synchronization)
ClockServer.java
import java.rmi.*;
public class ClockServer {
public ClockServer() {
try {
Clock c = new ClockImpl();
Naming.rebind("//localhost/ClockService",c);
} catch (Exception e) {
System.out.print(e.getMessage());
}
}
public static void main(String args[]) {
new ClockServer();
}
}
27. Java RMI (External Clock Synchronization)
ClockClient.java
import java.rmi.*;
import java.net.*;
public class ClockClient {
public static void main(String args[]) {
try {
Clock c = (Clock)Naming.lookup("//localhost/ClockService");
System.out.println(c.getTime());
} catch (Exception e) {
System.out.print(e.getMessage());
}
}
}
28. Logical time
One aspect of clock synchronization is to provide a mechanism
whereby systems can assign sequence numbers (“timestamps”) to
messages upon which all cooperating processes can agree.
Leslie Lamport (1978) showed that clock synchronization need
not be absolute; Lamport's two important points lead to
“causality”
First point:
If two processes do not interact, it is not necessary that their
clocks be synchronized
they can operate concurrently without fear of interfering with each
other
Second (critical) point:
It is not important that all processes agree on time, but
rather, that they agree on the order in which events occur
Such “clocks” are referred to as Logical Clocks
Logical time is based on happens-before relationship
29. Event Ordering
Happens-before and concurrent events illustrated
No causal path from e1 to e2 or from e2 to e1:
e1 and e2 are concurrent
No causal path from e1 to e6 or from e6 to e1:
e1 and e6 are concurrent
No causal path from e2 to e6 or from e6 to e2:
e2 and e6 are concurrent
Types of events
Send
Receive
Internal (change of state)
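The three event types above map directly onto the usual Lamport logical clock rules: increment on an internal or send event, and on receive take the maximum of the local and received timestamps plus one. The slides do not spell these rules out, so the following is a minimal sketch of the standard scheme; the class and method names are illustrative.

```java
final class LamportClock {
    private long time = 0;

    /** Internal or send event: advance the clock and return the new timestamp. */
    long tick() {
        return ++time;
    }

    /** Receive event: move past the sender's timestamp, counting the receive itself. */
    long receive(long senderTimestamp) {
        time = Math.max(time, senderTimestamp) + 1;
        return time;
    }

    long now() {
        return time;
    }

    public static void main(String[] args) {
        LamportClock p = new LamportClock();
        LamportClock q = new LamportClock();
        long sent = p.tick();            // p sends a message stamped 1
        q.tick();                        // q: internal event, clock 1
        q.tick();                        // q: internal event, clock 2
        long received = q.receive(sent); // max(2, 1) + 1 = 3
        System.out.println(sent + " happens before " + received);
    }
}
```

A send always carries a smaller timestamp than the matching receive, which is the happens-before property the clocks are meant to preserve. The converse does not hold: a smaller timestamp alone does not prove a causal path.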
30. Co-ordination
Difficulties in distributed systems
Centralised solutions not appropriate
communications bottleneck
Fixed master-slave arrangements not appropriate
process crashes
Varying network topologies
ring, tree, arbitrary; connectivity problems
Failures must be tolerated if possible
link failures
process crashes
Impossibility results
in presence of failures, esp asynchronous model
31. Mutual Exclusion
Requirements
Safety
At most one process may execute in CS at any time
Liveness
Every request to enter and exit a CS is eventually granted
Ordering (desirable)
Requests to enter are granted according to causality order (FIFO)
Synchronization scheme     | Centralized                 | Distributed
Based on mutual exclusion  | Central process             | Circulating token
No mutual exclusion        | Physical clock, event count | Physical clocks, logical clocks
45. Algorithm
RingLeader(id):
Input: The unique identifier, id, for the processor running
Output: The smallest identifier of a processor in the ring
M ← [Candidate is id]
Send message M to the successor processor in the ring
done ← false
repeat
    Get message M from the predecessor processor in the ring.
    if M = [Candidate is i] then
        if i = id then
            M ← [Leader is id]
            done ← true
        else
            m ← min{i, id}
            M ← [Candidate is m]
    else
        {M is a “Leader is” message}
        done ← true
    Send message M to the next processor in the ring
until done
return M
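The RingLeader pseudocode can be tried out with a small sequential simulation. The sketch below assumes unique identifiers and lock-step rounds in which every processor reads the message its predecessor sent in the previous round; the class name, the integer message encoding, and the round loop are illustrative choices, not part of the original algorithm text.

```java
final class RingLeaderSim {
    static final int CANDIDATE = 0, LEADER = 1;

    /** ids[p] is the unique identifier of the processor at ring position p. */
    static int elect(int[] ids) {
        int n = ids.length;
        int[][] outbox = new int[n][];     // message each processor sent last round
        boolean[] done = new boolean[n];
        for (int p = 0; p < n; p++) {
            outbox[p] = new int[] {CANDIDATE, ids[p]};
        }
        int leader = -1, finished = 0;
        while (finished < n) {
            int[][] next = new int[n][];
            for (int p = 0; p < n; p++) {
                int[] m = outbox[(p + n - 1) % n]; // message from the predecessor
                if (done[p] || m == null) continue;
                if (m[0] == CANDIDATE && m[1] != ids[p]) {
                    // Forward the smaller of the received and own identifier.
                    m = new int[] {CANDIDATE, Math.min(m[1], ids[p])};
                } else {
                    // Own id came back, or a "Leader is" message arrived.
                    m = new int[] {LEADER, m[1]};
                    done[p] = true;
                    leader = m[1];
                    finished++;
                }
                next[p] = m;
            }
            outbox = next;
        }
        return leader;
    }

    public static void main(String[] args) {
        System.out.println(elect(new int[] {3, 1, 2})); // prints 1
    }
}
```

Only the processor holding the smallest identifier ever sees its own id come back as a candidate, because every other id is replaced by a smaller one somewhere along the ring.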
47. Analysis
Computational Rounds
O(N) (at most 2N rounds)
Local Running Time
O(N)
Local Space
O(1)
Message Complexity
O(N^2)
49. Algorithm
TreeLeader(id):
Input: The unique identifier, id, for the processor running
Output: The smallest identifier of a processor in the tree
{Accumulation Phase}
Let d be the number of neighbors of processor id
m ← 0 {counter for messages received}
ℓ ← id {tentative leader}
repeat
    {begin a new round}
    for each neighbor j do
        check if a message from processor j has arrived
        if a message M = [Candidate is i] from j has arrived then
            ℓ ← min{i, ℓ}
            m ← m+1
until m ≥ d-1
if m = d then
    M ← [Leader is ℓ]
    for each neighbor j do
        send message M to processor j
    return M {M is a “Leader is” message}
else
    M ← [Candidate is ℓ]
    send M to the neighbor k that has not sent a message yet
{Broadcast Phase}
repeat
    {begin a new round}
    check if a message from processor k has arrived
    if a message M from k has arrived then
        m ← m+1
        if M = [Candidate is i] then
            ℓ ← min{i, ℓ}
            M ← [Leader is ℓ]
            for each neighbor j do
                send message M to processor j
        else
            {M is a “Leader is” message}
            for each neighbor j ≠ k do
                send message M to processor j
until m = d
return M {M is a “Leader is” message}
53. Analysis
• di is the number of neighbors of processor i; D is the diameter of the tree
Computational Rounds
O(D)
Local Running Time
O(di D)
Local Space
O(di)
Message Complexity
O(N)
56. Algorithm
SynchronousBFS(v,s):
Input: The identifier v of the node (processor) executing this algorithm and
the identifier s of the start node of the BFS traversal
Output: For each node v, its parent in a BFS tree rooted at s
repeat
    {begin a new round}
    if v=s or v has received a message from one of its neighbors then
        set parent(v) to be a node requesting v to become its child
        (or null, if v=s)
        for each node w adjacent to v that has not contacted v yet do
            send a message to w asking w to become a child of v
until v=s or v has received a message
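To see which parent pointers SynchronousBFS produces, the rounds can be replayed sequentially. The sketch below is a single-process simulation, not a distributed implementation: in each round the nodes that joined the tree in the previous round "send" a request to every neighbor that has no parent yet, and each such neighbor accepts the first request it sees. The class name and the sentinel values -1 and -2 are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class SyncBfsSim {
    /** Returns parent[v] for every node: -1 for the start node s, -2 if unreachable. */
    static int[] bfsParents(List<List<Integer>> adj, int s) {
        int[] parent = new int[adj.size()];
        Arrays.fill(parent, -2);
        parent[s] = -1;
        List<Integer> active = Collections.singletonList(s); // senders in this round
        while (!active.isEmpty()) {
            List<Integer> next = new ArrayList<>();
            for (int v : active) {
                for (int w : adj.get(v)) {
                    if (parent[w] == -2) {   // w has not been contacted yet
                        parent[w] = v;       // w accepts the first request it receives
                        next.add(w);
                    }
                }
            }
            active = next;                   // begin a new round
        }
        return parent;
    }

    public static void main(String[] args) {
        // Square graph with edges 0-1, 0-2, 1-3, 2-3, rooted at node 0.
        List<List<Integer>> adj = Arrays.asList(
            Arrays.asList(1, 2), Arrays.asList(0, 3),
            Arrays.asList(0, 3), Arrays.asList(1, 2));
        System.out.println(Arrays.toString(bfsParents(adj, 0))); // prints [-1, 0, 0, 1]
    }
}
```

Because all nodes at distance k are contacted in round k, every parent pointer leads one level closer to s, which is what makes the result a BFS tree.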
57. Analysis
n nodes, m edges
Computational Rounds
Local Running Time
Local Space
Message complexity
O(n+m)
59. Algorithm
AsynchronousBFS(v,s):
Input: The identifier v of the node (processor) executing this
algorithm and the identifier s of the start node of the BFS
traversal
Output: For each node v, its parent in a BFS tree rooted at s
C ← ø {verified BFS children for v}
set A to be the set of neighbors of v
repeat
    {begin a new round}
    if parent(v) is defined or v=s then
        if parent(v) is defined then
            wait for a pulse-down message from parent(v)
        if C is not empty then
            {v is an internal node in the BFS tree}
            send a pulse-down message to all nodes in C
            wait for a pulse-up message from all nodes in C
        else
            {v is an external node in the BFS tree}
            for each node u in A do
                send a make-child message to u
            for each node u in A do
                get a message M from u and remove u from A
                if M is an accept-child message then
                    add u to C
        send a pulse-up message to parent(v)
    else
        {v ≠ s has no parent yet}
        for each node w in A do
            if w has sent v a make-child message then
                remove w from A
                {w is no longer a candidate child for v}
                if parent(v) is undefined then
                    parent(v) ← w
                    send an accept-child message to w
                else
                    send a reject-child message to w
until (v has received message done)
      or (v=s and has pulsed-down n-1 times)
send a done message to all the nodes in C
63. Analysis
• n nodes, m edges
Computational Rounds
Local Running Time
Local Space
Message complexity
O(n^2 + m)
65. Baruskal Algorithm
KruskalMST(G):
Input: A simple connected weighted graph G
with n vertices and m edges
Output: A minimum spanning tree T for G
for each vertex v in G do
    define an elementary cluster C(v) ← {v}
initialize a priority queue Q to contain all edges in G,
using the weights as keys
T ← ø
while T has fewer than n-1 edges do
    (u,v) ← Q.removeMin()
    Let C(v) be the cluster containing v,
    Let C(u) be the cluster containing u.
    if C(v) ≠ C(u) then
        Add edge (v,u) to T.
        Merge C(v) and C(u) into one cluster,
        that is, union C(v) and C(u).
return tree T
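The KruskalMST pseudocode translates almost line for line into Java once the clusters are represented with a union-find array. The sketch below is a sequential version under that assumption; the edge encoding `{u, v, weight}`, the class name, and the choice to return only the total weight are illustrative simplifications.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

final class KruskalSketch {
    /** Finds the representative of x's cluster, halving paths along the way. */
    static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    /** edges[i] = {u, v, weight}; returns the total weight of a minimum spanning tree. */
    static int mstWeight(int n, int[][] edges) {
        int[] parent = new int[n];
        for (int v = 0; v < n; v++) parent[v] = v;           // elementary clusters C(v) = {v}
        PriorityQueue<int[]> q =
            new PriorityQueue<>((a, b) -> Integer.compare(a[2], b[2]));
        q.addAll(Arrays.asList(edges));                      // weights are the keys
        int total = 0, used = 0;
        while (used < n - 1 && !q.isEmpty()) {
            int[] e = q.poll();                              // Q.removeMin()
            int cu = find(parent, e[0]), cv = find(parent, e[1]);
            if (cu != cv) {                                  // different clusters: keep the edge
                parent[cu] = cv;                             // merge C(u) into C(v)
                total += e[2];
                used++;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int[][] edges = {{0, 1, 1}, {1, 2, 2}, {2, 3, 3}, {0, 3, 4}, {0, 2, 5}};
        System.out.println(mstWeight(4, edges)); // prints 6
    }
}
```

In the example the edges of weight 1, 2 and 3 connect all four vertices, so the heavier edges 4 and 5 are never needed.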
67. Analysis
• n nodes, m edges
Computational Rounds
O(log n)
Local Running Time
Local Space
O(m)
Message complexity
O(m log n)
69. Synchronization Algorithms
Multicast
Uses a central time server to synchronize clocks
Cristian's algorithm (centralised)
Berkeley algorithm (centralised)
The Network Time Protocol (decentralised)
70. Cristian’s Algorithm(1989)
Uses a time server to synchronize clocks; the server holds the reference time
Clients ask the time server for time
period depends on maximum clock drift and accuracy required
Clients receive the value and may:
use it as it is
add the known minimum network delay
add half the time between this send and receive
For links with symmetrical latency:
RTT = resp.-received-time – req.-sent-time
adjusted-local-time =
server-timestamp + minimum network delay or
server-timestamp + (RTT / 2) or
server-timestamp + (RTT – server-latency) /2
local-clock-error = adjusted-local-time – local-time
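The formulas on this slide can be checked with a few lines of arithmetic. The sketch below assumes symmetric link latency, as the slide does, and that all times are integer milliseconds on the client's clock except the server timestamp; the class and parameter names are illustrative.

```java
final class CristianSketch {
    /** adjusted-local-time = server-timestamp + RTT / 2 */
    static long adjustedTime(long requestSent, long responseReceived, long serverTimestamp) {
        long rtt = responseReceived - requestSent;
        return serverTimestamp + rtt / 2;
    }

    /** adjusted-local-time = server-timestamp + (RTT - server-latency) / 2 */
    static long adjustedTime(long requestSent, long responseReceived,
                             long serverTimestamp, long serverLatency) {
        long rtt = responseReceived - requestSent;
        return serverTimestamp + (rtt - serverLatency) / 2;
    }

    /** local-clock-error = adjusted-local-time - local-time */
    static long clockError(long adjustedLocalTime, long localTime) {
        return adjustedLocalTime - localTime;
    }

    public static void main(String[] args) {
        // Hypothetical exchange: request sent at 1000, reply received at 1020,
        // server timestamp 5000, server spent 4 ms handling the request.
        long adjusted = adjustedTime(1000, 1020, 5000);
        System.out.println(adjusted);                          // prints 5010
        System.out.println(adjustedTime(1000, 1020, 5000, 4)); // prints 5008
        System.out.println(clockError(adjusted, 1020));        // prints 3990
    }
}
```

Subtracting the server latency matters when the server is slow to answer: only the time actually spent on the wire should be split between the two directions.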
71. Berkeley algorithm (Gusella & Zatti, 1989)
if no machines have receivers, …
Berkeley algorithm uses a designated server to
synchronize
The designated server polls or broadcasts
to all machines for their time,
adjusts times received for RTT & latency,
averages times, and tells each machine how to adjust.
Polling is done using Cristian's algorithm
Avg. time is more accurate, but still drifts
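The averaging step of the Berkeley algorithm can be sketched as follows. The example assumes the designated server has already polled every machine and corrected each reading for round-trip time, and it uses a plain arithmetic mean with integer division; the class name and the convention that index 0 is the server's own clock are illustrative.

```java
final class BerkeleySketch {
    /**
     * clocks[0] is the designated server's own reading; the other entries are
     * the polled readings, already adjusted for RTT. Returns, for each machine,
     * the amount it should add to its clock to reach the average.
     */
    static long[] adjustments(long[] clocks) {
        long sum = 0;
        for (long c : clocks) sum += c;
        long average = sum / clocks.length;
        long[] delta = new long[clocks.length];
        for (int i = 0; i < clocks.length; i++) {
            delta[i] = average - clocks[i];
        }
        return delta;
    }

    public static void main(String[] args) {
        long[] delta = adjustments(new long[] {3000, 3030, 2970});
        // The average is 3000, so the fast machine is told -30 and the slow one +30.
        System.out.println(java.util.Arrays.toString(delta)); // prints [0, -30, 30]
    }
}
```

Sending each machine an offset rather than the absolute average avoids adding a further transmission delay to the value it applies.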
72. Network Time Protocol
NTP is the best-known and most widely implemented
decentralised algorithm
Used for time synchronization on the Internet
[Figure: NTP server hierarchy (www.ntp.org)]
1: Primary server, direct synchronization
2: Secondary servers, synchronized by the primary server
3: Tertiary servers, synchronized by the secondary servers
74. Assumptions
Each pair of processes is connected by reliable
channels (such as TCP).
Messages are eventually delivered to the recipient's input
buffer.
Processes will not fail.
There is agreement on how a resource is identified
Pass identifier with requests
76. Centralized Algorithm
Operations
1. Request resource
   Send request to coordinator to enter CS
2. Wait for response
3. Receive grant
   The coordinator grants permission to enter the CS
   and keeps a queue of requests to enter the CS.
4. Access resource
5. Release resource
   Send release message to inform coordinator
Safety, liveness and order are guaranteed
Delay
Client and synchronization: one round trip time (release + grant)
[Figure: process P sends Request(R) to coordinator C, receives Grant(R), and later sends Release(R). A second diagram shows processes P1 to P4 exchanging Request, Grant and Release messages with the coordinator, which holds a queue of requests (4, 2).]
77. Token Ring Algorithm
Operations
For each CS a token is used.
Only the process holding the token can enter the CS.
To exit the CS, the process sends the token onto its neighbor.
If a process does not require to enter the CS when it receives the
token, it forwards the token to the next neighbor.
Only one process holds the token at any time, which guarantees mutual exclusion
The order is well defined, so starvation cannot occur
If the token is lost (e.g. process died), it must be regenerated
Safety & liveness are guaranteed, but ordering is not.
Delay
Client: 0 to N message transmissions.
Synchronization: the delay between one process's exit from the CS and the next
process's entry is between 1 and N message transmissions.
78. Lamport Algorithm
A total ordering of requests is established by logical
timestamps.
Each process maintains request Queue (mutual exclusion requests)
Requesting CS, Pi
multicasts “request” (i, Ti) to all processes (Ti is local Lamport time).
Places request on its own queue
waits until all processes “reply”
Entering CS, Pi
receives message (ack or release) from every other process with a
timestamp larger than Ti
Releasing CS , Pi
Remove request from its queue
Send a timestamped release message
This may cause another process's entry to have the earliest timestamp in its
queue, enabling that process to access the critical section
79. Ricart & Agrawala Algorithm
Using reliable multicast and logical clocks
Process wants to enter critical section
Compose message containing
Identifier (machine ID, process ID)
Name of resource
Current time
Send request to all processes ,wait until everyone gives permission
When process receives request
If receiver not interested →Send OK to sender
If receiver is in critical section →Do not reply; add request to queue
If receiver just sent a request as well:
Compare timestamps: received & sent msgs→Earliest wins
If receiver is loser then send OK else receiver is winner, do not reply, queue
When done with critical section→Send OK to all queued requests
80. Ricart & Agrawala Algorithm
On initialization
state := RELEASED;
To enter the critical section
state := WANTED;
Multicast request to all processes; request processing deferred
here
T := request's timestamp;
Wait until (number of replies received = (N – 1));
state := HELD;
On receipt of a request <Ti, pi> at pj (i ≠ j)
if (state = HELD) or ((state = WANTED) and ((T, pj) < (Ti, pi)))
then queue request from pi without replying;
else reply immediately to pi;
To exit the critical section
state := RELEASED;
reply to any queued requests;
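The heart of the pseudocode above is the decision made on receipt of a request. The sketch below isolates that test, comparing (timestamp, process id) pairs lexicographically so that ties on the timestamp are broken by process id; the class, enum and method names are illustrative, and the messaging layer is left out entirely.

```java
final class RicartAgrawalaSketch {
    enum State { RELEASED, WANTED, HELD }

    /**
     * Decision at process pj on receipt of request <ti, pi>.
     * t is the timestamp of pj's own pending request (ignored unless WANTED).
     * Returns true if the request must be queued, false if pj replies at once.
     */
    static boolean shouldDefer(State state, long t, int pj, long ti, int pi) {
        // (T, pj) < (Ti, pi): earlier timestamp wins, ties broken by process id.
        boolean ownRequestIsEarlier = t < ti || (t == ti && pj < pi);
        return state == State.HELD
            || (state == State.WANTED && ownRequestIsEarlier);
    }

    public static void main(String[] args) {
        // P2 (timestamp 78) receives P1's request (timestamp 87): P2 defers.
        System.out.println(shouldDefer(State.WANTED, 78, 2, 87, 1)); // prints true
        // P1 (timestamp 87) receives P2's request (timestamp 78): P1 replies.
        System.out.println(shouldDefer(State.WANTED, 87, 1, 78, 2)); // prints false
    }
}
```

Because the comparison is a total order, two competing processes can never both defer or both reply to each other, which is what rules out deadlock and double entry.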
81. Ricart & Agrawala Algorithm
Safety, liveness, and ordering are guaranteed.
It takes 2(N-1) messages per entry operation (N-1 multicast
requests + N-1 replies); N messages if the underlying network
supports multicast. [3(N-1) in Lamport's algorithm]
Delay
Client: one round-trip time
Synchronization: one message transmission time.
[Figure: processes P1, P2 and P3. P1's message has timestamp 87 and P2's message has timestamp 78. P2 cannot send a reply to P1 because P1's timestamp is greater than P2's, so P2 changes to "held" and P1 remains in "wanted" until P2 sends its reply.]
82. Leader Election Algorithms
The problem
N processes, may or may not have unique IDs (UIDs)
for simplicity assume no crashes
must choose unique master coordinator amongst processes
Requirements
Every process knows P, identity of leader, where P is unique
process id (usually maximum) or is yet undefined.
All processes participate and eventually discover the identity
of the leader (cannot be undefined).
When a coordinator fails, the algorithm must elect that active
process with the largest priority number
Two types of algorithm
Bully: “the biggest guy in town wins”
Ring: a logical, cyclic grouping
83. Bully Algorithm
Assumptions
Synchronous system
All messages arrive within Ttrans units of time.
A reply is dispatched within Tprocess units of time of the receipt of a message.
If no response is received in 2Ttrans + Tprocess, the node is assumed to be dead.
If a process knows it has the highest id, it elects itself coordinator
and sends a coordinator message to all other processes with lower ids
When process P notices that the coordinator has not responded to requests for too long, it initiates an election
When process P holds an election, it sends an election message to the processes with higher ids
If nobody responds, P becomes coordinator
If a higher-numbered process answers, P's job is done
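The outcome of the bully election rules can be sketched with a short recursive simulation. It is a deliberate simplification: process ids are taken to be array indices, a timeout is modelled as `alive[i] == false`, and only the first higher process that answers is followed, whereas in the real algorithm every higher process that answers starts its own election. The elected coordinator is the same either way. All names are illustrative.

```java
final class BullySketch {
    /**
     * alive[i] says whether the process with id i answers election messages.
     * Returns the id of the process that ends up as coordinator when the
     * (live) process `initiator` starts an election.
     */
    static int startElection(boolean[] alive, int initiator) {
        for (int higher = initiator + 1; higher < alive.length; higher++) {
            if (alive[higher]) {
                // A higher-numbered process answered: the initiator's job is
                // done and that process runs its own election.
                return startElection(alive, higher);
            }
        }
        // Nobody with a higher id answered: the initiator becomes coordinator.
        return initiator;
    }

    public static void main(String[] args) {
        // Process 3 (the old coordinator) has crashed; process 0 notices.
        boolean[] alive = {true, true, true, false};
        System.out.println(startElection(alive, 0)); // prints 2
    }
}
```

The result is always the highest-numbered live process, which is where the name "bully" comes from.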
84. Bully Algorithm
Performance
Best case scenario: The process with the second highest id
notices the failure of the coordinator and elects itself.
N-2 coordinator messages are sent.
Turnaround time is one message transmission time.
Worst case scenario: When the process with the least id
detects the failure.
N-1 processes altogether begin elections, each sending messages to
processes with higher ids.
The message overhead is O(N^2).
Turnaround time is approximately 5 message transmission times.
85. Ring Algorithm
No token is used in this algorithm
When the algorithm ends, every process has an active list (consisting of all the
priority numbers of all active processes in the system)
If process Pi detects a coordinator failure, it creates a new, initially empty active
list, then sends the message elect(i) to its right neighbor and adds the number i
to its active list
If Pi receives a message elect(j) from the process on its left, it must respond in one of three ways:
If this is the first elect message it has seen or sent, Pi creates a new
active list with the numbers i and j and sends the message elect(i), followed by elect(j)
If i ≠ j, Pi adds j to its active list and forwards the message to its
right neighbor
If i = j, then Pi has received the message elect(i); the active list for Pi
now contains the numbers of all the active processes in the system, and Pi can
determine the largest number in the active list to identify the new coordinator process.
86. Chang&Roberts Algorithm
Assume
Unidirectional ring
Asynchronous system
Each Process has UID
Election
initially each process non-participant
determine leader (election message):
initiator becomes participant and passes own UID on to neighbour
when non-participant receives election message, forwards maximum
of own and the received UID and becomes participant
participant does not forward the election message
announce winner (elected message):
when participant receives election message with own UID, becomes
leader and non-participant, and forwards UID in elected message
otherwise, records the leader's UID, becomes non-participant and
forwards it
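The election phase described above can be traced with a small simulation. The sketch assumes unique UIDs and a single initiator, in which case the participant flags never change the outcome and are left out; it counts only election messages, not the elected messages of the announcement phase. The class and method names are illustrative.

```java
final class ChangRobertsSketch {
    /**
     * uids[p] is the UID of the process at ring position p; messages travel
     * from position p to position (p + 1) % n.
     * Returns {leader UID, number of election messages sent}.
     */
    static int[] elect(int[] uids, int initiator) {
        int n = uids.length;
        int pos = initiator;
        int carried = uids[initiator];   // initiator passes on its own UID
        int messages = 0;
        while (true) {
            pos = (pos + 1) % n;         // deliver to the next neighbour
            messages++;
            if (carried == uids[pos]) {
                // The process received its own UID back: it becomes leader.
                return new int[] {carried, messages};
            }
            // Otherwise forward the maximum of own and received UID.
            carried = Math.max(carried, uids[pos]);
        }
    }

    public static void main(String[] args) {
        int[] result = elect(new int[] {3, 7, 5, 2}, 0);
        System.out.println("leader " + result[0] + " after " + result[1] + " messages");
        // prints "leader 7 after 5 messages"
    }
}
```

In the example the message needs one hop to reach the process with UID 7 and then a full circuit of four hops to return to it, which gives the five election messages.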
87. Itai&Rodeh Algorithm
Assume
Unidirectional ring
Synchronous system
Processes do not have UIDs
Election
each process selects ID at random from set {1,..K}
non-unique! but fast
processes pass all IDs around the ring
after one round, if there exists a unique ID then
elect maximum unique ID
otherwise, repeat
How do we know the algorithm terminates?
from probabilities: if you keep flipping a fair coin, then after
several heads you will eventually get tails (with probability 1)