Post on 07-Jan-2017
Ali Kafel, VP of Business Development
Ensuring High Availability and Resiliency for NFV
Monday 15th February, 2016,
3.00 - 6.00pm
Croke Park, Dublin 3, Ireland
MOVING IT TO THE FIELD (CO-LOCATED WITH ETSI NFV#13)
The details of this presentation are covered in this White Paper:
http://www.slideshare.net/akafel/nfv-resiliency-whitepaper-ali-kafel-stratus-technologies
Agenda
• Why We Need Resiliency vs High Availability
• Achieving Resiliency Management for NFV
• Proof point: ETSI PoC#35
Stratus Technologies
Trusted Name in Fault Tolerant Computing for 35 years
• Hardware Fault Tolerance (1980 - present): Proprietary Platforms; Intel Platforms (ftServer)
• Software Fault Tolerance (2008 - present): everRun Enterprise, 12,000+ installed
• Stratus Fault Tolerant Cloud (2015 - present): Resilient Cloud Technologies, based on proven SW infrastructure
Why the need for Resiliency in NFV
• It is no longer about voice services. Certain data and video services need HA and Resiliency more than voice
• Even “mature” cloud technologies still lack HA and Resiliency
Downtime per year
Uptime      Hours   Mins     Secs
99.9%       8.76    525.6    31536
99.99%      -       52.56    3154
99.999%     -       5.256    315.4
99.9999%    -       0.526    31.54
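The downtime figures follow directly from the uptime percentage. A minimal sketch, assuming a 365-day year (the function name `downtime_per_year` is illustrative and not from the presentation):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 s, assuming a non-leap year


def downtime_per_year(uptime_percent):
    """Return the allowed downtime per year as (hours, minutes, seconds)."""
    fraction_down = 1.0 - uptime_percent / 100.0
    seconds = fraction_down * SECONDS_PER_YEAR
    return seconds / 3600.0, seconds / 60.0, seconds


for uptime in (99.9, 99.99, 99.999, 99.9999):
    hours, minutes, seconds = downtime_per_year(uptime)
    print(f"{uptime}%: {hours:.3f} h = {minutes:.3f} min = {seconds:.1f} s")
```

Each extra "nine" divides the yearly downtime budget by ten, which is why five nines leaves only about five minutes per year.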
Defining Reliability, Availability and Resiliency
Reliability
• How long a system performs its intended function.
• MTBF = total time in service / number of failures
Availability
• % of time that equipment is in an operable state, i.e. service accessibility and service continuity
• Availability (A) = Uptime / (Uptime + Downtime)
• A = MTBF / (MTBF + MTTR)
Resiliency
• The ability to recover quickly from failures and return to the original form / state, maintaining an operable state plus QoS
• Resiliency (R) = Availability (A) + QoS
What you need is R, not just A, because, for example:
• A 99.999% application that fails once a week for just 1 second and disrupts active services is not resilient and not acceptable
• A 99.9999% application that causes increased latency during a fault is not acceptable
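The availability formulas, and the point that an availability number alone hides disruptive failures, can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the function names are illustrative, and the MTBF/MTTR values in the usage lines are made-up inputs, not figures from the presentation.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assuming a non-leap year


def availability_from_mtbf(mtbf_hours, mttr_hours):
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def availability_from_outages(outages_per_year, seconds_per_outage):
    """A = Uptime / (Uptime + Downtime), measured over one year."""
    downtime = outages_per_year * seconds_per_outage
    return (SECONDS_PER_YEAR - downtime) / SECONDS_PER_YEAR


# Hypothetical system: fails every 999 hours and takes 1 hour to repair.
print(f"{availability_from_mtbf(999, 1):.4%}")

# The slide's example: one 1-second failure per week (52 per year).
weekly_blip = availability_from_outages(52, 1.0)
print(f"{weekly_blip:.6%}")
print("meets five nines:", weekly_blip >= 0.99999)
```

The weekly one-second failure adds up to only 52 seconds of downtime per year, well inside the five-nines budget, yet it disrupts active services every week. That gap is what the QoS term in R = A + QoS is meant to capture.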
Resiliency Management cannot be done in the VNFs, because you cannot manage what you cannot see
[Figure: VNFs run on Virtualized Resources. They depend on Access Networks, Core Networks, VNFM, VIM, SDNC-OL, SDNC-UL, Shared Storage, Shared Network, NFVI Fabric, Node HW (C/N/S), Node SW (C/N/S), Virtualization SW (vC, vN, vS), Facility Infra and DCIM. They are exposed to HW Faults, SW Faults, Config Faults, Migrations, Upgrades, Performance Faults, Resource Depletion, Fault Impacts and External Dependencies.]
• Over 80% of system failure modes are not directly visible by the VNFs
• Infrastructure decoupling hides the information required to take actions on faults from VNFs
Resiliency management can be “designed in” in multiple ways, but it’s best done in the Software Infrastructure
[Figure: three approaches compared by Costs & Resources, Pros and Cons]
In the Hardware
Stack: Applications / VNFs, Operating Environment, Hardware
Pros:
• Transparent: no code change
• Fast & simple deployment
• No special app software
Cons:
• Very expensive
• Inefficient utilization
• Special hardware
• Rigid
In the Applications
Stack: Applications / VNFs, Middleware, Operating Environment, Hardware
Pros:
• App-specific state can be customized
Cons:
• Can’t detect & manage all infrastructure faults
• Code written for resiliency increased by ~40%
• Most developers don’t have resiliency experience
• More complex & longer time to develop
In the Software Infrastructure
Stack: Applications / VNFs, Operating Environment with Resilience Layer, Hardware
Pros:
• Broader & faster fault detection and correlation
• Faster and simpler application development
• Transparent: no code changes
• Multiple levels of resiliency
Cons:
• Needs to be adaptable to a wide range of application architectures
Benefits:
• Reduces Development & Verification time
• Lower Risks
• Faster time to market
Resiliency Management
It’s Complex, Multi-Dimensional and more than just Fault Management
Areas: Availability Management, Configuration Management, Fault Management
Phases:
• Detection (Prediction)
• Localization
• Isolation
• Remediation (Service restoration)
• Recovery (Redundancy restoration)
Resiliency depends on multiple factors:
• Speed of service restoration & redundancy restoration
• State Management: Service continuity
• “Key state” versus “All state”
• Redundancy mode: Resource consumption / cost
• Application performance impact
Multi-dimensional aspects of Resiliency
Two Key Elements: Service Restoration and State Protection
State Protection
• Remembering the preceding events in a given sequence of interactions within the application
• All or partial?
Service Restoration (or Failover)
• Ensuring that service is restored either through a fast restart or failover to an active secondary or hot standby
• The speed of Service Restoration depends on the type of application
Some applications need State Protection; most applications need fast Service Restoration
Multi-dimensional aspects of Resiliency
State Protection versus Service Restoration
Types of State Protection:
• Full state protection
• Key state protection
• No state protection (Stateless)
State Management has implications on:
• Transparency
• Performance
• Resources
[Figure: State Management (State Protection / No State Protection) plotted against Service Restoration Speed]
Slow (mins), “Cold Standby”. Re-instantiation after failure: No Standby
• State Protection: key state stored on disk (“OSS, Billing”)
• No State Protection: start from reset (“Web server”)
Multi-dimensional aspects of Resiliency
State Protection versus Service Restoration
Types of State Protection:
• Full state protection
• Key state protection
• No state protection (Stateless)
State Management has implications on:
• Transparency
• Performance
• Resources
[Figure: State Management (State Protection / No State Protection) plotted against Service Restoration Speed]
Slow (mins), “Cold Standby”. Re-instantiation after failure: No Standby
• State Protection: key state stored on disk (“OSS, Billing”)
• No State Protection: start from reset (“Web server”)
Medium (secs), “Warm Standby”. Pre-instantiated before failure: Failover to running Standby
• State Protection: key state stored in RAM or disk (“email, SMS”)
• No State Protection: failover (“vCE Router Forwarder”)
Multi-dimensional aspects of Resiliency
State Protection versus Service Restoration
Types of State Protection:
• Full state protection
• Key state protection
• No state protection (Stateless)
State Management has implications on:
• Transparency
• Performance
• Resources
[Figure: State Management (State Protection / No State Protection) plotted against Service Restoration Speed, with outcome labels Service Accessibility and Service Continuity]
Slow (mins), “Cold Standby”. Re-instantiation after failure: No Standby
• State Protection: key state stored on disk (“OSS, Billing”)
• No State Protection: start from reset (“Web server”)
Medium (secs), “Warm Standby”. Pre-instantiated before failure: Failover to running Standby
• State Protection: failover + key state reload, key state stored in RAM or disk (“email, SMS”)
• No State Protection: failover (“vCE Router Forwarder”)
Fast (msecs), “Hot Standby or Active-Active”. Pre-instantiated before failure: Failover to running Standby
• State Protection: failover, full VM state in RAM (“Voice control, Router Control”)
• No State Protection: failover (“vPE Router Forwarder”)
To do Fast Remediation you need:
• Pre-instantiation
• State management
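The grid of standby modes can be read as a lookup: the tolerable outage selects the column, and the need for state selects the row. The sketch below encodes that reading. The numeric thresholds (60 s and 1 s) and all names are assumptions chosen to match the "mins / secs / msecs" labels; they are not values from the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandbyMode:
    name: str
    restoration_speed: str
    pre_instantiated: bool
    with_state: str      # handling when state protection is needed
    without_state: str   # handling for stateless applications


COLD = StandbyMode("Cold Standby", "slow (mins)", False,
                   "key state stored on disk", "start from reset")
WARM = StandbyMode("Warm Standby", "medium (secs)", True,
                   "failover + key state reload", "failover")
HOT = StandbyMode("Hot Standby or Active-Active", "fast (msecs)", True,
                  "failover, full VM state in RAM", "failover")


def choose_standby(max_outage_seconds, needs_state):
    """Pick the cheapest mode whose restoration speed fits the outage budget."""
    if max_outage_seconds >= 60:      # assumed boundary for "mins"
        mode = COLD
    elif max_outage_seconds >= 1:     # assumed boundary for "secs"
        mode = WARM
    else:
        mode = HOT
    handling = mode.with_state if needs_state else mode.without_state
    return mode, handling


mode, handling = choose_standby(max_outage_seconds=0.05, needs_state=True)
print(mode.name, "->", handling)
```

Only the modes with `pre_instantiated=True` can deliver fast remediation, which matches the slide's closing point that fast remediation needs pre-instantiation and state management.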
Cold Restart versus Hot Standby or Active-Active: it’s like surviving a heart attack versus preventing one
Fault Tolerant Systems Provide Service Continuity, Even During Failures
[Figure: timeline after a Failure, on a scale of msecs, secs, mins, hours, days]
• Cold Restart (Instant HA). Re-instantiation after failure: No Standby. All state is lost. Customer-affecting application outage until app restart returns service to normal. Analogy: Immense Pain, Loss of Consciousness, Loss of Bodily Control, Temporary Brain Loss.
• Hot Standby or Active-Active (Fault Tolerant). Pre-instantiated before failure: Failover to running Standby. All state is preserved. Stages: Fully Protected, Backup Activated (Unprotected), Restored to Fully Protected Redundancy.
State protection
Guaranteeing Globally Consistent State
Different ways to describe StatePointing
• Active-Standby synchronous VM replication
• Also known as checkpointing with I/O barrier, I/O lock-stepping or buffering
What does it guarantee
• Application transparency
• The I/O barrier prevents all external communications from the speculative execution prior to state replication
• Consistent VM memory replica between active and hot standby, at the confirmed statepoint
We call it StatePointing (VM replication)
Providing Service Continuity with fast Service Restoration
• VM instances paired between primary and secondary hosts in the cloud infrastructure
• State of primary (active) captured regularly and applied to secondary (hot standby)
• StatePoint™ = VM Checkpoint + I/O StateStepping
  • Provides globally consistent state
• Fast service restoration from the most recent StatePoint upon primary failover to secondary
• Automatic redundancy restoration through third host instantiation
[Figure: the Active Host runs guest epochs N-1, N, N+1, N+2 and sends statepoints SP N-1, SP N, SP N+1 to the Hot Standby Host. If the primary host fails, service automatically switches to the secondary host. A Third Host (created post primary failure) starts the guest from image and receives statepoints SP N+1 to SP N+X.]
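The epoch and statepoint sequence can be illustrated with a toy simulation. This is only a sketch of the general checkpoint-and-failover idea, not the Stratus implementation: the `Host` class, the dictionary used as "VM state" and the workload of 10 requests per epoch are all hypothetical.

```python
import copy


class Host:
    """A host that holds the last confirmed statepoint it received."""

    def __init__(self, name):
        self.name = name
        self.statepoint = None  # (epoch, state) of the last applied statepoint

    def apply(self, epoch, state):
        # Store an independent copy so later changes on the sender do not leak.
        self.statepoint = (epoch, copy.deepcopy(state))


def run_epoch(state, epoch):
    """Hypothetical guest workload: serve 10 requests per epoch."""
    state["requests_served"] += 10
    state["epoch"] = epoch


def simulate(fail_during_epoch, total_epochs):
    state = {"requests_served": 0, "epoch": 0}
    standby = Host("hot-standby")
    standby.apply(0, state)
    for epoch in range(1, total_epochs + 1):
        run_epoch(state, epoch)
        if epoch == fail_during_epoch:
            # Primary fails before this epoch's statepoint is captured:
            # service resumes on the standby from the last statepoint.
            restored_epoch, restored = standby.statepoint
            # Redundancy restoration: a third host becomes the new standby.
            third = Host("third-host")
            third.apply(restored_epoch, restored)
            return restored, third
        standby.apply(epoch, state)  # statepoint captured and confirmed
    return state, standby


restored, new_standby = simulate(fail_during_epoch=3, total_epochs=5)
print(restored, new_standby.name)
```

Work done in the failed epoch is discarded, which is only safe if none of its output has left the host; that is the job of the I/O barrier.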
[Figure: StatePoint timeline across statepoints n-1, n and n+1, showing the Active host, the Hot Standby host, and Active-Standby and egress network traffic. The Guest VM (Active) emits packets P1 to P5, which QEMU (Active) enqueues in the Network Egress Queue; snapshots are sent to QEMU (Standby). An I/O barrier is inserted at each statepoint (n-1, n, n+1); each barrier stays on until it is removed.]
Legend:
• PCR: Pause, Capture, Resume; phases of the StatePoint process when VM execution is suspended
• Egress Network Queue Barrier: prevents transmission of queued egress packet(s) until the barrier is removed
Note: For simplicity, n-2 interactions are not shown.
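The egress barrier can be sketched as a queue that tags each packet with the epoch that produced it and releases packets only once that epoch's statepoint is confirmed. This is an illustrative model only; the class and method names are hypothetical and do not correspond to QEMU or Stratus APIs.

```python
from collections import deque


class EgressBarrier:
    """Holds outgoing packets until the statepoint for their epoch is confirmed."""

    def __init__(self):
        self._queue = deque()  # (epoch, packet) in transmission order

    def enqueue(self, epoch, packet):
        self._queue.append((epoch, packet))

    def release_through(self, confirmed_epoch):
        """Remove the barrier for all epochs up to confirmed_epoch."""
        released = []
        while self._queue and self._queue[0][0] <= confirmed_epoch:
            released.append(self._queue.popleft()[1])
        return released

    def discard_unconfirmed(self):
        """On primary failure, drop output of speculative (unconfirmed) epochs."""
        dropped = len(self._queue)
        self._queue.clear()
        return dropped


barrier = EgressBarrier()
barrier.enqueue(1, "P1")
barrier.enqueue(1, "P2")
barrier.enqueue(2, "P3")
print(barrier.release_through(1))     # packets from the confirmed epoch
print(barrier.discard_unconfirmed())  # speculative packets dropped on failure
```

Because nothing from an unconfirmed epoch is ever transmitted, the outside world never observes state that the standby does not also hold.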
Multiple levels of resiliency
Ensures flexibility and resource optimization based on applications
• Deliver Availability as an infrastructure service to virtual and cloud ecosystems
• While every VNF needs Fault Management, not all need state protection
Modes of protection:
• Fault Tolerant (includes State protection)
• High Availability (no State protection)
• Unprotected
[Figure: Monolithic VNFs (Firewall, MME, IMS, Web Server) compared with de-composed VNFs (separate control and forwarding elements): three stateless fast-path VNF-C Forwarding Elements and one stateful VNF-C Control Element]
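One way to picture per-component protection is a small policy function applied to each element of a de-composed VNF. This is a hedged sketch: the mapping of stateful elements to Fault Tolerant and stateless ones to High Availability is inferred from the mode definitions, and the `service_critical` flag and all names are hypothetical.

```python
from enum import Enum


class Protection(Enum):
    FAULT_TOLERANT = "Fault Tolerant (includes state protection)"
    HIGH_AVAILABILITY = "High Availability (no state protection)"
    UNPROTECTED = "Unprotected"


def pick_protection(stateful, service_critical=True):
    """Assumed policy: protect state only where there is state to protect."""
    if not service_critical:
        return Protection.UNPROTECTED
    return Protection.FAULT_TOLERANT if stateful else Protection.HIGH_AVAILABILITY


# A de-composed VNF: one stateful control element, three stateless forwarders.
components = {
    "control-element": True,
    "forwarding-element-1": False,
    "forwarding-element-2": False,
    "forwarding-element-3": False,
}
plan = {name: pick_protection(stateful) for name, stateful in components.items()}
for name, mode in plan.items():
    print(f"{name}: {mode.value}")
```

Under this policy only the control element pays the cost of state replication, which is the resource optimization the slide refers to.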
Protection with Application transparency, no code changes
Resiliency Functionality in the NFVI nodes & managed in the MANO
[Figure: NFV stack. Linux-based virtualized functions run on Virtualization over Commodity Hyper Scale COTS Computing, Commodity High Volume Networking and Commodity High Volume Storage, with Orchestration. NFV: EPC, PCRF, HSS, IMS… Virtualized SDN: Optical Transport Control Plane, L3 Routing Control Plane, L2 Switching Control Plane. Virtualized OSS/BSS: Billing, Customer Care, NOC. Stratus Node Resiliency Services (NRS) sit in the NFVI nodes; Stratus Resiliency Management Services (RMS) sit in the MANO, in an OpenStack environment.]
The Stratus approach has implemented enhancements in KVM and plug-ins in OpenStack to make it seamless for the VNFs
Benefits of Resiliency Management
that includes Fault Management, Availability Management and Configuration Management
SW Infrastructure Resiliency Management
• Fault protection for all applications, no required code changes for most apps
• State Protection, offering globally consistent state
• Multiple levels of Resiliency: Software Defined Availability (SDA), Control vs. Forwarding element, Stateful vs. stateless, etc.
Benefits:
• Reduces Development & Verification time
• Lower Risks
• Faster time to market
The Stratus-led PoC (ETSI PoC#35)
Participants of PoC#35
Availability Management with Stateful Fault Tolerance
• Demonstrated at:
  • NFV World Congress, May 6-8, San Jose, CA
  • OpenStack Summit, May 2015, Vancouver, Canada
  • SDN World Congress, Oct 2015, Dusseldorf, Germany
• Completed 7/31/2015, final report submitted
http://nfvwiki.etsi.org/index.php?title=Availability_Management_with_Stateful_Fault_Tolerance
What we proved with PoC#35
• OpenStack-based VIM mechanisms alone are insufficient for supporting carrier-grade resiliency, but Stratus Cloud Technology solves that and provides stateful failover, enabling service continuity with acceptable QoS
  • Service Restoration in milliseconds
  • Redundancy Restoration in seconds
• Any non-resilient VNF can be made instantaneously resilient with no code change (as long as it is OpenStack ready and there is no standard way to package VNFs)
• Multiple levels of Resiliency can be easily provided using Software Defined Resiliency in the infrastructure, based on application requirements for state and service restoration speed