Conductor

This page explains what the op-conductor service is and how it works at a high level, and helps you set it up in your own environment.

Enhancing Sequencer Reliability and Availability

The op-conductor is an auxiliary service designed to enhance the reliability and availability of a sequencer within high-availability setups. By minimizing the risks associated with a single point of failure, the op-conductor ensures that the sequencer remains operational and responsive.

Assumptions

It is important to note that the op-conductor does not incorporate Byzantine fault tolerance (BFT). This means the system operates under the assumption that all participating nodes are honest and act correctly.

Summary of Guarantees

The design of the op-conductor provides the following guarantees:

No Unsafe Reorgs
No Unsafe Head Stall During Network Partition
100% Uptime with No More Than 1 Node Failure
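
The last guarantee follows from Raft's majority rule: a cluster needs more than half of its voters reachable to elect a leader and commit entries. A minimal sketch of that arithmetic, with illustrative function names that are not part of op-conductor:

```python
def quorum(voters: int) -> int:
    """Smallest majority of a Raft cluster with `voters` voting members."""
    if voters < 1:
        raise ValueError("a cluster needs at least one voter")
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """Voters that can fail while a majority is still reachable."""
    return voters - quorum(voters)


for n in (1, 3, 5):
    print(n, quorum(n), tolerated_failures(n))
```

With three conductors, the quorum is two, so one node can fail; losing two leaves no majority.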

Design

Figure: op-conductor.

At a high level, op-conductor serves the following functions:

Raft Consensus Layer Participation

Leader Determination: Participates in the Raft consensus algorithm to determine the leader among sequencers.
State Management: Stores the latest unsafe block, ensuring consistency across the system.

RPC Request Handling

Admin RPC: Provides administrative RPCs for manual recovery scenarios, such as stopping the leadership vote and removing the node from the cluster.
Health RPC: Offers health RPCs for the op-node to determine whether it should allow the publishing of transactions and unsafe blocks.

Sequencer Health Monitoring

Continuously monitors the health of the sequencer (op-node) to ensure optimal performance and reliability.

Control Loop Management

Implements a control loop to manage the status of the sequencer (op-node), including starting and stopping operations based on different scenarios and health checks.
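
One way to picture the control loop is as a decision over three observed booleans: whether this conductor is the raft leader, whether its sequencer is healthy, and whether its sequencer is active. The sketch below is a simplified illustration under the assumption that only the leader may sequence and that an unhealthy leader hands leadership away. It is not the actual op-conductor code, and the `Action` names are hypothetical.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    START_SEQUENCER = "start sequencer"
    STOP_SEQUENCER = "stop sequencer"
    TRANSFER_LEADERSHIP = "transfer leadership"
    STOP_AND_TRANSFER = "stop sequencer, then transfer leadership"


def next_action(leader: bool, healthy: bool, active: bool) -> Action:
    """Pick one control-loop step from the three observed booleans."""
    if not leader:
        # A follower must never be sequencing.
        return Action.STOP_SEQUENCER if active else Action.NONE
    if not healthy:
        # An unhealthy leader gives up leadership, stopping first if needed.
        return Action.STOP_AND_TRANSFER if active else Action.TRANSFER_LEADERSHIP
    # Healthy leader: make sure the sequencer is running.
    return Action.NONE if active else Action.START_SEQUENCER


print(next_action(leader=True, healthy=True, active=False).value)
```

The real service also accounts for states this sketch leaves out, such as a paused conductor.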

Conductor State Transition

The following state machine diagram shows how op-conductor manages the sequencers' Raft consensus.

Figure: op-conductor state transition.

Helpful tip: To better understand the graph, focus on one node at a time. Work out which states can transition into it and which states it can transition to. This makes it easier to follow how the state transitions are handled.

Setup

At OP Labs, op-conductor is deployed as a Kubernetes StatefulSet because it requires a persistent volume to store the raft log. This guide describes setting up conductor on an existing network without incurring downtime.

Assumptions

This setup guide has the following assumptions:

3 deployed sequencers (sequencer-0, sequencer-1, sequencer-2) that are all in sync and in the same VPC network
sequencer-0 is currently the active sequencer
You can execute a blue/green style sequencer deployment workflow that involves no downtime (described below)
conductor and sequencers are running in k8s or some other container orchestrator (vm-based deployment may be slightly different and not covered here)

Spin up op-conductor

Deploy conductor

Deploy a conductor instance per sequencer with sequencer-1 as the raft cluster bootstrap node:

suggested conductor configs:

OP_CONDUCTOR_CONSENSUS_ADDR: '<raft url or ip>'
OP_CONDUCTOR_CONSENSUS_PORT: '50050'
OP_CONDUCTOR_EXECUTION_RPC: '<op-geth url or ip>:8545'
OP_CONDUCTOR_HEALTHCHECK_INTERVAL: '1'
OP_CONDUCTOR_HEALTHCHECK_MIN_PEER_COUNT: '2'  # set based on your internal p2p network peer count 
OP_CONDUCTOR_HEALTHCHECK_UNSAFE_INTERVAL: '5' # recommend a 2-3x multiple of your network block time to account for temporary performance issues
OP_CONDUCTOR_LOG_FORMAT: logfmt
OP_CONDUCTOR_LOG_LEVEL: info
OP_CONDUCTOR_METRICS_ADDR: 0.0.0.0
OP_CONDUCTOR_METRICS_ENABLED: 'true'
OP_CONDUCTOR_METRICS_PORT: '7300'
OP_CONDUCTOR_NETWORK: '<network>'
OP_CONDUCTOR_NODE_RPC: '<op-node url or ip>:8545'
OP_CONDUCTOR_RAFT_SERVER_ID: 'unique raft server id'
OP_CONDUCTOR_RAFT_STORAGE_DIR: /conductor/raft
OP_CONDUCTOR_RPC_ADDR: 0.0.0.0
OP_CONDUCTOR_RPC_ENABLE_ADMIN: 'true'
OP_CONDUCTOR_RPC_ENABLE_PROXY: 'true'
OP_CONDUCTOR_RPC_PORT: '8547'

sequencer-1 op-conductor extra config:

OP_CONDUCTOR_PAUSED: "true"
OP_CONDUCTOR_RAFT_BOOTSTRAP: "true"
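
The comment on OP_CONDUCTOR_HEALTHCHECK_UNSAFE_INTERVAL recommends a 2-3x multiple of the network block time. A small sketch of that arithmetic, assuming a 2-second block time purely for illustration (the helper name is hypothetical):

```python
def unsafe_interval_range(block_time_s: int) -> tuple:
    """Return the (low, high) bounds of the 2-3x block-time recommendation."""
    if block_time_s <= 0:
        raise ValueError("block time must be positive")
    return 2 * block_time_s, 3 * block_time_s


low, high = unsafe_interval_range(2)  # assumption: 2-second block time
print(f"OP_CONDUCTOR_HEALTHCHECK_UNSAFE_INTERVAL between {low} and {high}")
```

Under that assumption the range is 4 to 6 seconds, so the suggested value of 5 falls inside it.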

Pause two conductors

Pause the sequencer-0 and sequencer-2 conductors with a conductor_pause RPC request. The sequencer-1 conductor already starts paused because of OP_CONDUCTOR_PAUSED.
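
The pause request can be scripted instead of sent by hand. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical, and the endpoint you pass in depends on your own deployment.

```python
import json
import urllib.request
from typing import Optional


def build_request(method: str, params: Optional[list] = None, request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 request body."""
    return {"jsonrpc": "2.0", "method": method, "params": params or [], "id": request_id}


def call(url: str, method: str, params: Optional[list] = None) -> dict:
    """POST the request to `url` and return the decoded JSON response."""
    body = json.dumps(build_request(method, params)).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
```

For example, `call("http://127.0.0.1:8547", "conductor_pause")` would pause a conductor whose RPC server listens on the port from the suggested config. That URL is an assumption, so substitute the address of each conductor you need to pause.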

Update op-node configuration and switch the active sequencer

Deploy an op-node config update to all sequencers that enables conductor. Use a blue/green style deployment workflow that switches the active sequencer to sequencer-1:

all sequencer op-node configs:

OP_NODE_CONDUCTOR_ENABLED: "true"
OP_NODE_RPC_ADMIN_STATE: "" # this flag can't be used with conductor

Confirm sequencer switch was successful

Confirm sequencer-1 is active and successfully producing unsafe blocks. Because sequencer-1 was the raft cluster bootstrap node, it is now committing unsafe payloads to the raft log.

Add voting nodes

Add voting nodes to the cluster by sending a conductor_addServerAsVoter RPC request to the leader conductor (sequencer-1).
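
As a sketch, the request bodies for the two followers could be built as below, using the parameter order `[id, addr, version]` from the conductor_addServerAsVoter example in the RPCs section. The server IDs, the addresses, and the `version` value are placeholders for illustration, not values taken from a real cluster.

```python
def add_voter_request(server_id: str, addr: str, version: int, request_id: int = 1) -> dict:
    """JSON-RPC body for conductor_addServerAsVoter with [id, addr, version]."""
    return {
        "jsonrpc": "2.0",
        "method": "conductor_addServerAsVoter",
        "params": [server_id, addr, version],
        "id": request_id,
    }


# Hypothetical raft server IDs and addresses for the two followers.
followers = {
    "sequencer-0": "conductor-0.example.internal:50050",
    "sequencer-2": "conductor-2.example.internal:50050",
}
# Assumption: `version` is a placeholder; use the value your cluster expects.
bodies = [
    add_voter_request(sid, addr, version=0, request_id=i)
    for i, (sid, addr) in enumerate(followers.items(), start=1)
]
for body in bodies:
    print(body)
```

Each body would then be posted to the leader conductor's RPC endpoint.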

Confirm state

Confirm cluster membership and sequencer state:

sequencer-0 and sequencer-2:
1. raft cluster follower
2. sequencer is stopped
3. conductor is paused
4. conductor enabled in op-node config
sequencer-1:
1. raft cluster leader
2. sequencer is active
3. conductor is paused
4. conductor enabled in op-node config
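
The checklist above can be expressed as data and compared against whatever you observe. The sketch below assumes you have already collected the observed values into a dictionary, for example from conductor_leader and conductor_paused. The field names are hypothetical.

```python
EXPECTED = {
    "sequencer-0": {"raft_role": "follower", "sequencer_active": False, "conductor_paused": True},
    "sequencer-1": {"raft_role": "leader", "sequencer_active": True, "conductor_paused": True},
    "sequencer-2": {"raft_role": "follower", "sequencer_active": False, "conductor_paused": True},
}


def mismatches(observed: dict, expected: dict = EXPECTED) -> list:
    """List every field where the observed state differs from the expected one."""
    problems = []
    for node, want in expected.items():
        got = observed.get(node, {})
        for field, value in want.items():
            if got.get(field) != value:
                problems.append(f"{node}: {field} is {got.get(field)!r}, expected {value!r}")
    return problems


# Example: sequencer-2 has not been paused yet.
observed = {name: dict(state) for name, state in EXPECTED.items()}
observed["sequencer-2"]["conductor_paused"] = False
print(mismatches(observed))
```

An empty list means the cluster matches the checklist and it is safe to continue.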

Resume conductors

Resume all conductors with conductor_resume RPC request to each conductor instance.

Confirm state

Confirm all conductors successfully resumed with conductor_paused
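
A conductor that has resumed answers conductor_paused with `false`. The sketch below checks a set of JSON-RPC 2.0 responses for that result. The conductor names and the sample responses are made up for illustration.

```python
def rpc_result(response: dict):
    """Return the `result` of a JSON-RPC 2.0 response, or raise on `error`."""
    if "error" in response:
        err = response["error"]
        raise RuntimeError(f"RPC error {err.get('code')}: {err.get('message')}")
    return response["result"]


def all_resumed(responses: dict) -> bool:
    """True when every conductor answered conductor_paused with false."""
    return all(rpc_result(r) is False for r in responses.values())


# Example responses keyed by a hypothetical conductor name.
sample = {
    "conductor-0": {"jsonrpc": "2.0", "id": 1, "result": False},
    "conductor-1": {"jsonrpc": "2.0", "id": 1, "result": False},
    "conductor-2": {"jsonrpc": "2.0", "id": 1, "result": True},
}
print(all_resumed(sample))  # False: conductor-2 is still paused
```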

Transfer leadership

Trigger leadership transfer to sequencer-0 using conductor_transferLeaderToServer

Confirm state

sequencer-1 and sequencer-2:
1. raft cluster follower
2. sequencer is stopped
3. conductor is active
4. conductor enabled in op-node config
sequencer-0:
1. raft cluster leader
2. sequencer is active
3. conductor is active
4. conductor enabled in op-node config

Update configuration

Deploy a config change to the sequencer-1 conductor that removes the OP_CONDUCTOR_PAUSED and OP_CONDUCTOR_RAFT_BOOTSTRAP flags.

Blue/Green Deployment

To ensure there is no downtime when setting up conductor, you need a deployment script that can update sequencers without network downtime.

An example of this workflow might look like:

Query current state of the network and determine which sequencer is currently active (referred to as "original" sequencer below). From the other available sequencers, choose a candidate sequencer.
Deploy the change to the candidate sequencer and then wait for it to sync up to the original sequencer's unsafe head. You may want to check peer counts and other important health metrics.
Stop the original sequencer using admin_stopSequencer, which returns the last inserted unsafe block hash. Wait for the candidate sequencer to sync to this returned hash in case there is a delta.
Start the candidate sequencer at the original's last inserted unsafe block hash.
1. Here you can also run additional checks for unsafe head progression and decide whether to roll back the change (stop the candidate sequencer, start the original, roll back the deployment of the candidate, etc.)
Deploy the change to the original sequencer, wait for it to sync to the chain head. Execute health checks.
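
The ordering of this workflow can be sketched with in-memory stand-ins. `FakeSequencer` is a hypothetical test double, the hashes are placeholders, and a real script would replace the methods with RPC calls to op-node and with your own deployment tooling.

```python
class FakeSequencer:
    """In-memory stand-in for an op-node; a real client would issue RPC calls."""

    def __init__(self, name: str, unsafe_head: str, active: bool = False):
        self.name = name
        self.unsafe_head = unsafe_head
        self.active = active

    def stop_sequencer(self) -> str:
        # Mirrors admin_stopSequencer: returns the last inserted unsafe block hash.
        self.active = False
        return self.unsafe_head

    def start_sequencer(self, block_hash: str) -> None:
        # Refuse to start from a head this node has not synced to.
        if block_hash != self.unsafe_head:
            raise RuntimeError(f"{self.name} is not synced to {block_hash}")
        self.active = True


def switch_active(original, candidate, deploy, wait_for_sync) -> str:
    """Move sequencing from `original` to `candidate`, then update `original`."""
    deploy(candidate)                               # update the candidate first
    wait_for_sync(candidate, original.unsafe_head)
    last_hash = original.stop_sequencer()           # stop the original
    wait_for_sync(candidate, last_hash)             # close any remaining delta
    candidate.start_sequencer(last_hash)            # start the candidate
    deploy(original)                                # finally update the original
    return last_hash


original = FakeSequencer("sequencer-0", unsafe_head="0xabc", active=True)
candidate = FakeSequencer("sequencer-1", unsafe_head="0xaaa")
deployed = []


def fake_wait_for_sync(node, block_hash):
    node.unsafe_head = block_hash  # pretend the node caught up


last_hash = switch_active(
    original, candidate, lambda n: deployed.append(n.name), fake_wait_for_sync
)
print(last_hash, candidate.active, original.active, deployed)
```

Health checks and the rollback path are left out of this sketch to keep the ordering visible.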

Post-Conductor Launch Deployments

After conductor is live, a similar canary style workflow is used to ensure minimal downtime in case there is an issue with deployment:

Choose a candidate sequencer from the raft-cluster followers
Deploy to the candidate sequencer. Run health checks on the candidate.
Transfer leadership to the candidate sequencer using conductor_transferLeaderToServer. Run health checks on the candidate.
Test whether the candidate is still the leader using conductor_leader after some grace period (e.g., 30 seconds)
1. If not, then there is likely an issue with the deployment. Roll back.
Upgrade the remaining sequencers and run health checks.
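
A sketch of the candidate portion of this workflow, with every external action injected as a callable so the logic can be exercised without a cluster. The callables are hypothetical, and rolling back on a failed health check is an assumption layered on top of the steps above.

```python
import time


def canary_deploy(candidate, deploy, healthy, transfer_leadership, is_leader,
                  rollback, grace_seconds=30.0, sleep=time.sleep) -> bool:
    """Return True if the candidate is upgraded and still leader after the grace period."""
    deploy(candidate)
    if not healthy(candidate):
        rollback(candidate)
        return False
    transfer_leadership(candidate)  # e.g. conductor_transferLeaderToServer
    if not healthy(candidate):
        rollback(candidate)
        return False
    sleep(grace_seconds)
    if not is_leader(candidate):  # e.g. conductor_leader
        rollback(candidate)
        return False
    return True


# Dry run with stand-in callables; no RPC is sent and no time is spent sleeping.
rolled_back = []
ok = canary_deploy(
    "sequencer-2",
    deploy=lambda s: None,
    healthy=lambda s: True,
    transfer_leadership=lambda s: None,
    is_leader=lambda s: False,  # simulate the candidate losing leadership
    rollback=rolled_back.append,
    sleep=lambda seconds: None,
)
print(ok, rolled_back)  # False ['sequencer-2']
```

Only when this returns True would the script go on to upgrade the remaining sequencers.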

Configuration Options

op-conductor is configured via its flags / environment variables.

--consensus.addr (`CONSENSUS_ADDR`)

Usage: Address to listen for consensus connections
Default Value: 127.0.0.1
Required: yes

--consensus.port (`CONSENSUS_PORT`)

Usage: Port to listen for consensus connections
Default Value: 50050
Required: yes

--raft.bootstrap (`RAFT_BOOTSTRAP`)

For bootstrapping a new cluster. Use this only on the sequencer that is currently active. A node can be started with this flag only once; after that, either remove the flag or delete the raft log before re-bootstrapping the cluster.

Usage: If this node should bootstrap a new raft cluster
Default Value: false
Required: no

--raft.server.id (`RAFT_SERVER_ID`)

Usage: Unique ID for this server used by raft consensus
Default Value: None specified
Required: yes

--raft.storage.dir (`RAFT_STORAGE_DIR`)

Usage: Directory to store raft data
Default Value: None specified
Required: yes

--node.rpc (`NODE_RPC`)

Usage: HTTP provider URL for op-node
Default Value: None specified
Required: yes

--execution.rpc (`EXECUTION_RPC`)

Usage: HTTP provider URL for execution layer
Default Value: None specified
Required: yes

--healthcheck.interval (`HEALTHCHECK_INTERVAL`)

Usage: Interval between health checks
Default Value: None specified
Required: yes

--healthcheck.unsafe-interval (`HEALTHCHECK_UNSAFE_INTERVAL`)

Usage: Interval allowed between unsafe head and now measured in seconds
Default Value: None specified
Required: yes

--healthcheck.safe-enabled (`HEALTHCHECK_SAFE_ENABLED`)

Usage: Whether to enable safe head progression checks
Default Value: false
Required: no

--healthcheck.safe-interval (`HEALTHCHECK_SAFE_INTERVAL`)

Usage: Interval between safe head progression measured in seconds
Default Value: 1200
Required: no

--healthcheck.min-peer-count (`HEALTHCHECK_MIN_PEER_COUNT`)

Usage: Minimum number of peers required to be considered healthy
Default Value: None specified
Required: yes
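
Read together, the healthcheck flags describe a predicate over the age of the unsafe head, the age of the last safe head progression, and the peer count. The sketch below is one simplified interpretation of the flag descriptions above, not the actual op-conductor health monitor. The field and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HealthConfig:
    unsafe_interval: int        # --healthcheck.unsafe-interval, seconds
    min_peer_count: int         # --healthcheck.min-peer-count
    safe_enabled: bool = False  # --healthcheck.safe-enabled
    safe_interval: int = 1200   # --healthcheck.safe-interval, seconds


def is_healthy(cfg: HealthConfig, now: int, unsafe_head_time: int,
               safe_progress_time: int, peer_count: int) -> bool:
    """Apply the three checks; all timestamps are Unix seconds."""
    if now - unsafe_head_time > cfg.unsafe_interval:
        return False  # unsafe head is too old
    if peer_count < cfg.min_peer_count:
        return False  # not enough peers
    if cfg.safe_enabled and now - safe_progress_time > cfg.safe_interval:
        return False  # safe head has not progressed recently
    return True


cfg = HealthConfig(unsafe_interval=5, min_peer_count=2)
print(is_healthy(cfg, now=100, unsafe_head_time=97, safe_progress_time=0, peer_count=3))
```

Here the unsafe head is 3 seconds old and there are 3 peers, so the check passes. The safe head check is skipped because it is disabled by default.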

--paused (`PAUSED`)

The paused state is not persisted: if you unpause via RPC and then restart with this flag still set, the conductor starts paused again.

Usage: Whether the conductor is paused
Default Value: false
Required: no

--rpc.enable-proxy (`RPC_ENABLE_PROXY`)

Usage: Enable the RPC proxy to underlying sequencer services
Default Value: true
Required: no

RPCs

Conductor exposes admin RPCs on the conductor namespace.

conductor_overrideLeader

OverrideLeader overrides the leader status. It is only used to return true for Leader() and LeaderWithID() calls and does not affect the actual raft consensus leadership status. It is intended for situations where the cluster is unhealthy and this node is the only one up, so that the batcher can connect to the node and download blocks from the manually started sequencer.

curl -X POST -H "Content-Type: application/json" --data \
    '{"jsonrpc":"2.0","method":"conductor_overrideLeader","params":[],"id":1}'  \
    http://127.0.0.1:50050

conductor_pause

Pause pauses op-conductor.

curl -X POST -H "Content-Type: application/json" --data \
    '{"jsonrpc":"2.0","method":"conductor_pause","params":[],"id":1}'  \
    http://127.0.0.1:50050

conductor_resume

Resume resumes op-conductor.

curl -X POST -H "Content-Type: application/json" --data \
    '{"jsonrpc":"2.0","method":"conductor_resume","params":[],"id":1}'  \
    http://127.0.0.1:50050

conductor_paused

Paused returns true if the op-conductor is paused.

curl -X POST -H "Content-Type: application/json" --data \
    '{"jsonrpc":"2.0","method":"conductor_paused","params":[],"id":1}'  \
    http://127.0.0.1:50050

conductor_stopped

Stopped returns true if the op-conductor is stopped.

curl -X POST -H "Content-Type: application/json" --data \
    '{"jsonrpc":"2.0","method":"conductor_stopped","params":[],"id":1}'  \
    http://127.0.0.1:50050

conductor_sequencerHealthy

SequencerHealthy returns true if the sequencer is healthy.

curl -X POST -H "Content-Type: application/json" --data \
    '{"jsonrpc":"2.0","method":"conductor_sequencerHealthy","params":[],"id":1}'  \
    http://127.0.0.1:50050

conductor_leader

API related to consensus.

Leader returns true if the server is the leader.

curl -X POST -H "Content-Type: application/json" --data \
    '{"jsonrpc":"2.0","method":"conductor_leader","params":[],"id":1}'  \
    http://127.0.0.1:50050

conductor_leaderWithID

API related to consensus.

LeaderWithID returns the current leader's server info.

curl -X POST -H "Content-Type: application/json" --data \
    '{"jsonrpc":"2.0","method":"conductor_leaderWithID","params":[],"id":1}'  \
    http://127.0.0.1:50050

conductor_addServerAsVoter

API related to consensus.

AddServerAsVoter adds a server as a voter to the cluster.

curl -X POST -H "Content-Type: application/json" --data \
    '{"jsonrpc":"2.0","method":"conductor_addServerAsVoter","params":[<id>, <addr>, <version>],"id":1}'  \
    http://127.0.0.1:50050

conductor_addServerAsNonvoter

API related to consensus.

AddServerAsNonvoter adds a server as a non-voter to the cluster. A non-voter does not participate in leader election.

curl -X POST -H "Content-Type: application/json" --data \
    '{"jsonrpc":"2.0","method":"conductor_addServerAsNonvoter","params":[],"id":1}'  \
    http://127.0.0.1:50050

conductor_removeServer

API related to consensus.

RemoveServer removes a server from the cluster.

curl -X POST -H "Content-Type: application/json" --data \
    '{"jsonrpc":"2.0","method":"conductor_removeServer","params":[],"id":1}'  \
    http://127.0.0.1:50050

conductor_transferLeader

API related to consensus.

TransferLeader transfers leadership to another server.

curl -X POST -H "Content-Type: application/json" --data \
    '{"jsonrpc":"2.0","method":"conductor_transferLeader","params":[],"id":1}'  \
    http://127.0.0.1:50050

conductor_transferLeaderToServer

API related to consensus.

TransferLeaderToServer transfers leadership to a specific server.

curl -X POST -H "Content-Type: application/json" --data \
    '{"jsonrpc":"2.0","method":"conductor_transferLeaderToServer","params":[],"id":1}'  \
    http://127.0.0.1:50050

conductor_clusterMembership

ClusterMembership returns the current cluster membership configuration.

curl -X POST -H "Content-Type: application/json" --data \
    '{"jsonrpc":"2.0","method":"conductor_clusterMembership","params":[],"id":1}'  \
    http://127.0.0.1:50050

conductor_active

API called by op-node.

Active returns true if the op-conductor is active (not paused or stopped).

curl -X POST -H "Content-Type: application/json" --data \
    '{"jsonrpc":"2.0","method":"conductor_active","params":[],"id":1}'  \
    http://127.0.0.1:50050

conductor_commitUnsafePayload

API called by op-node.

CommitUnsafePayload commits an unsafe payload (latest head) to the consensus layer.

curl -X POST -H "Content-Type: application/json" --data \
    '{"jsonrpc":"2.0","method":"conductor_commitUnsafePayload","params":[],"id":1}'  \
    http://127.0.0.1:50050

Next Steps

Check out op-conductor-mon, which monitors multiple op-conductor instances and provides a unified interface for reporting metrics.
