IOT meetup presentation

Post on 09-Jan-2017

Handling IOT Data with a Modern Data Architecture

Cliff Gilmore - Data Practice Director @ 1904labs

Capture, Process and Serve (All the Things)

Challenges of IOT Data

Scale
❏ Frequency of events
❏ Size of Data
❏ Number of Devices
❏ Number of Users
❏ Latency Demands
❏ Geo Distribution

Processing
❏ Batch Analytics
❏ Realtime Analytics
❏ Aggregations
❏ Machine Learning
❏ Reporting

Applications
❏ Realtime Access
❏ Report Visualization
❏ Production Analytics
❏ ML Driven Decisions
❏ Microservices

IOT Data Architecture: Leading to Lambda and Kappa

Architectural Requirements

❏ Must scale out linearly on ingestion, processing, storage, and access.

❏ Need to be able to store huge amounts of data organized for different access patterns.

❏ Must be able to process data in flight for real-time decision making, alerting, and pattern matching

❏ Need to serve the data to the rest of the organization through a common API/Service

❏ The architecture must be agile enough to accommodate changes to business logic and processing algorithms

Stack Components

❏ Distributed Log / Queue
    ❏ A pub/sub partitioned queue
    ❏ Kafka is the de facto choice due to its wide use in production

❏ Stream Processing
    ❏ Ability to process events as they arrive
    ❏ Event at a Time
        ❏ Samza, Flink, Storm
    ❏ Micro Batch
        ❏ Spark Streaming

❏ Batch Processing
    ❏ Process event history in bulk
    ❏ Spark, MapReduce on top of HDFS or Wide Column Stores

❏ Serving
    ❏ Expose data to the rest of the organization and serve application requests
    ❏ Wide Column Store
        ❏ Cassandra, HBase, BigTable
    ❏ Can also be RDBMS for some data sets (Reports, BI Rollups, etc.)
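To make the "pub/sub partitioned queue" idea concrete, here is a minimal in-memory sketch. It is not Kafka's API: the `PartitionedLog` class, its method names, and the CRC32 partitioner are all assumptions made for illustration. It only shows that events with the same key land in the same partition, in order, and that readers choose their own offset.

```python
import zlib


class PartitionedLog:
    """Toy in-memory stand-in for a partitioned pub/sub log (hypothetical, not Kafka's API)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stable hash, so the same device key always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def publish(self, key, event):
        p = self.partition_for(key)
        self.partitions[p].append((key, event))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition, offset=0):
        # Readers track their own offset and can re-read from any point.
        return self.partitions[partition][offset:]


log = PartitionedLog(num_partitions=4)
for i in range(3):
    log.publish("device-1", {"seq": i})
log.publish("device-2", {"seq": 0})

p = log.partition_for("device-1")
device1_events = [event for key, event in log.read(p) if key == "device-1"]
```

Because ordering is only kept within a partition, keying by device id keeps each device's events in arrival order while still spreading load across partitions.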

Lambda Architecture

[Diagram: Events, Distributed Log, Batch Layer, Speed Layer, Serving Layer. Annotations: "Raw Data, Pattern Matching and Aggregates"; "Patterns, Rollups, Recommendations"]
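A rough sketch of how the three Lambda layers fit together, assuming a simple per-device event count as the computed view. The function names, the event shape, and the `batch_cutoff` value are hypothetical and chosen only for illustration.

```python
from collections import Counter


def batch_view(events):
    """Batch layer: recompute counts per device from the stored history."""
    return Counter(e["device_id"] for e in events)


def speed_view(recent_events):
    """Speed layer: counts for events that arrived after the last batch run."""
    return Counter(e["device_id"] for e in recent_events)


def serve(batch, speed, device_id):
    """Serving layer: merge both views at query time."""
    return batch[device_id] + speed[device_id]


log = [{"device_id": "a"}, {"device_id": "b"}, {"device_id": "a"}, {"device_id": "a"}]
batch_cutoff = 3  # hypothetical: the last batch run covered the first 3 events

batch = batch_view(log[:batch_cutoff])
speed = speed_view(log[batch_cutoff:])
total_a = serve(batch, speed, "a")
```

The cost of this design is that the same logic lives in two code paths (batch and speed), which is the main motivation usually given for Kappa.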

Kappa Architecture

[Diagram: Events, Distributed Log, Batch, Streaming, Serving, Stream Results. Versions shown: Stream V1, Stream V2; Table V1, Table V2. Annotation: Raw Data]
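One way to read the "Stream V1 / Stream V2" and "Table V1 / Table V2" labels is that new processing logic is rolled out by replaying the log through a new job into a new table, then switching readers over. The sketch below assumes that reading; the job functions, the event fields, and the `live` switch are hypothetical.

```python
def stream_v1(event):
    # Version 1 of the processing logic: keep the reading in Celsius.
    return {"device_id": event["device_id"], "temp_c": event["temp_c"]}


def stream_v2(event):
    # Version 2 adds a Fahrenheit column.
    out = stream_v1(event)
    out["temp_f"] = event["temp_c"] * 9 / 5 + 32
    return out


def replay(log, job):
    """Rebuild a serving table by running a stream job over the whole log."""
    return [job(e) for e in log]


log = [{"device_id": "a", "temp_c": 20.0}, {"device_id": "b", "temp_c": 100.0}]

tables = {"v1": replay(log, stream_v1)}
tables["v2"] = replay(log, stream_v2)  # reprocess from the start of the log
live = "v2"                            # switch readers once v2 has caught up
```

There is a single code path here: "batch" is just the stream job run over retained history.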

Cassandra to Serve IOT Data

The Art of Time Series

Why Cassandra?

❏ Proven linear scale up to 1000s of nodes in a single cluster

❏ Geo redundancy to collect data where it is created and replicate across the globe

❏ High capacity to ingest parallel individual writes

❏ Low latency and high throughput reads

❏ Wide-column store data model allows for data to be structured around query patterns

❏ Continuous availability, suited to and used for the most mission-critical systems

❏ An AP system in CAP theorem terms; consistency is tunable, so it can be traded off against availability
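The arithmetic behind tunable consistency can be sketched as follows. With a replication factor RF, a read that contacts R replicas is guaranteed to overlap a write acknowledged by W replicas when R + W > RF. The helper names are hypothetical and the snippet does not talk to a cluster.

```python
def quorum(replication_factor):
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def overlaps(read_replicas, write_replicas, replication_factor):
    """True when every read set must intersect every write set."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)
strong = overlaps(q, q, rf)   # quorum reads plus quorum writes
fast = overlaps(1, 1, rf)     # one replica for each: more available, may read stale data
```

With RF = 3, quorum operations tolerate one replica being down, while single-replica operations tolerate two at the price of possibly stale reads.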

Cassandra 101

[Diagram labels: DC1, DC2]

Physical Data Model

Partition Key
    Clustering Key Val1: Col1:Val, Col2:Val, Col3:Val, Col4:Val, Col5:Val, Col6:Val, ...
    Clustering Key Val2: Col1:Val, Col6:Val, ...
    Clustering Key Val3: Col6:Val, ...
    Clustering Key Val4: Col1:Val, Col2:Val, Col3:Val, Col4:Val, Col5:Val, Col6:Val, ...
    Clustering Key Val5: Col1:Val, Col2:Val, ...
    ...
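As a mental model only, a partition can be pictured as a map from clustering key to a sparse set of columns, kept sorted by clustering key so that range scans are cheap. The `Partition` class below is a hypothetical sketch of that idea and says nothing about Cassandra's actual storage engine.

```python
import bisect


class Partition:
    """Rows in one partition, kept sorted by clustering key (illustrative only)."""

    def __init__(self):
        self._keys = []   # sorted clustering keys
        self._rows = {}   # clustering key -> {column: value}

    def upsert(self, clustering_key, **columns):
        if clustering_key not in self._rows:
            bisect.insort(self._keys, clustering_key)
            self._rows[clustering_key] = {}
        # Rows are sparse: only the columns that were written are stored.
        self._rows[clustering_key].update(columns)

    def slice(self, start, end):
        """Rows with start < clustering_key <= end, in key order."""
        lo = bisect.bisect_right(self._keys, start)
        hi = bisect.bisect_right(self._keys, end)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


table = {}  # partition key -> Partition
p = table.setdefault("pk-1", Partition())
p.upsert(3, col6="z")
p.upsert(1, col1="a", col2="b")
p.upsert(2, col1="c", col6="d")
rows = p.slice(1, 3)
```

The `slice` call mirrors a range predicate on the clustering key within a single partition.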

CQL - Cassandra Query Language

❏ Simple-to-use language that looks like SQL

❏ No joins, group by, etc.

❏ Example Queries
    ❏ SELECT * FROM readings WHERE device_id = ? AND event_time > ? AND event_time <= ?;
    ❏ INSERT INTO readings (device_id, event_time, temperature) VALUES (?,?,?);

I’ve got this!

TimeSeries Table Example

CREATE TABLE readings (
    sensor_id text,          -- Partition Key
    event_time TimeUUID,     -- Clustering Key; Time Ordered Sortable UUID
    temperature decimal,
    PRIMARY KEY (sensor_id, event_time)
);
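A TimeUUID is a version 1 UUID, which embeds a timestamp. The sketch below uses Python's standard `uuid` module to show how that timestamp can be recovered and used for ordering. The helper name `timeuuid_to_datetime` is hypothetical, and the snippet does not connect to Cassandra.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Version 1 UUID timestamps count 100 ns intervals since 1582-10-15 (RFC 4122).
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def timeuuid_to_datetime(u):
    """Recover the embedded creation time from a version 1 UUID."""
    if u.version != 1:
        raise ValueError("not a time-based (version 1) UUID")
    return GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)


first = uuid.uuid1()
second = uuid.uuid1()

# Sort on the embedded timestamp. Python's default UUID ordering compares the
# raw 128-bit value, which does not follow time order for version 1 UUIDs.
ordered = sorted([second, first], key=lambda u: u.time)
created_at = timeuuid_to_datetime(first)
```

Using a TimeUUID rather than a plain timestamp as the clustering key also means two readings taken at the same instant do not overwrite each other.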

Physical Data Model

Station #1 | 12:05.15 = 15.9 C | 12:05.16 = 15.9 C | 12:05.17 = 16.0 C | 12:05.18 = 16.1 C | 12:05.19 = 16.0 C | ...

Station #2 | 12:05.15 = 22.0 C | 12:05.20 = 22.1 C | 12:05.25 = 27.9 C | 12:05.30 = 27.7 C | 12:05.35 = 30.2 C | ...

Advanced Data Model Topics

❏ Consider bucketing time in the Partition Key if the sample rate is high
    ❏ Primary Key ((device_id,year,week),event_time)

❏ If per-event granularity is not needed, events can be batched or rolled up
    ❏ Primary Key (device_id, event_minute)

❏ If we batch events
    ❏ JSON blob of sensor readings within the minute
    ❏ Can’t update sensor readings without read-before-write
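The two ideas above, time-bucketed partition keys and per-minute batches, can be sketched on the client side as follows. This assumes ISO week numbering for the `week` component and a simple list-of-objects JSON layout for the blob; neither is specified in the slides, and all function names are hypothetical.

```python
import json
from datetime import datetime, timezone


def bucketed_partition_key(device_id, event_time):
    """(device_id, year, week) using ISO year and ISO week numbers."""
    iso_year, iso_week, _ = event_time.isocalendar()
    return (device_id, iso_year, iso_week)


def minute_bucket(event_time):
    """Truncate a timestamp to the start of its minute."""
    return event_time.replace(second=0, microsecond=0)


def rollup_by_minute(readings):
    """Group (event_time, value) pairs into one JSON blob per minute."""
    buckets = {}
    for event_time, value in readings:
        buckets.setdefault(minute_bucket(event_time), []).append(
            {"t": event_time.isoformat(), "v": value}
        )
    return {minute: json.dumps(items) for minute, items in buckets.items()}


t0 = datetime(2017, 1, 9, 12, 5, 15, tzinfo=timezone.utc)
t1 = datetime(2017, 1, 9, 12, 5, 45, tzinfo=timezone.utc)
t2 = datetime(2017, 1, 9, 12, 6, 1, tzinfo=timezone.utc)

key = bucketed_partition_key("device-1", t0)
blobs = rollup_by_minute([(t0, 15.9), (t1, 16.0), (t2, 16.1)])
```

The read-before-write cost is visible here: changing one reading inside a minute means loading that minute's blob, editing it, and writing the whole blob back.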

Questions? cliff.gilmore@1904labs.com