Learning Apache Spark 2 by Muhammad Asif Abbasi
Author: Muhammad Asif Abbasi
Language: eng
Format: azw3
Publisher: Packt Publishing
Published: 2017-06-06T04:00:00+00:00
The complexities of the DStream API resulted in the vision of a single API that works for both batch and streaming: the user simply declares the intention to ingest data in streaming or batch fashion, and lets the Spark framework worry about the implementation details.
Just think for a moment... what is a stream of data? It is typically similarly structured data arriving in multiple batches, where sometimes a batch is short (say, 5 minutes) and sometimes long (say, 1 hour or 4 hours). I remember that while working at Teradata (my old employer), we built applications where data would arrive in mini-batches and get appended to the same table. The table was treated as an infinite table (despite always being of finite size). The same approach has now been brought to Spark Streaming, essentially creating an infinite table. You might want to read the whole table, or just the freshly arrived data.
The Spark framework has adopted a similar approach by removing the differences between the batch and streaming APIs. The only difference is that in streaming the table is unbounded, while in batch applications the table is considered static.
Let's compare and contrast the streaming and batch APIs for a moment, and understand the intricacies of converting a batch application into a streaming application. We'll return to our old CDR dataset, and continuously report on calls originating from London with a charge of over 200 cents. Let's look at the code for the batch application before looking at the streaming application.
The following is a batch application:
import org.apache.spark.sql.types.StructType

// Define the schema of the CDR records.
val cdrSchema = new StructType()
  .add("OriginNum", "String")
  .add("TermNum", "String")
  .add("Origin", "String")
  .add("Term", "String")
  .add("CallTs", "String")
  .add("CallCharge", "integer")

// Read the CDR data as a static (batch) DataFrame.
val cdrs = spark.read
  .schema(cdrSchema)
  .json("/home/spark/sampledata/json/cdrs.json")

// The schema names the destination column "Term", so we select "Term"
// here (selecting a non-existent "Dest" column would fail at analysis time).
val callsFromLondon = cdrs
  .select("Origin", "Term", "CallCharge")
  .where("CallCharge > 200")
  .where("Origin = 'London'")

callsFromLondon.write
  .format("parquet")
  .save("/home/spark/sampledata/streaming/output")
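Once the batch job has finished, the output can be sanity-checked by reading the Parquet files back. The following is a minimal sketch using the output path from the example above:

// Read the Parquet output back and inspect a few rows.
val saved = spark.read.parquet("/home/spark/sampledata/streaming/output")
saved.show(5)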
I am using the preceding batch code in an application, and have recently been asked by management to build a report from the same information that refreshes every minute. In the old days, I would have had to go through a lot of hassle and rethink the architecture of the application before I could convert it into a streaming job. Let's attempt it with Spark's Structured Streaming and see the beauty and simplicity of it:
import org.apache.spark.sql.types.StructType

// The schema is identical to the batch version.
val cdrSchema = new StructType()
  .add("OriginNum", "String")
  .add("TermNum", "String")
  .add("Origin", "String")
  .add("Term", "String")
  .add("CallTs", "String")
  .add("CallCharge", "integer")

// readStream instead of read: the file source monitors the path for new data.
val cdrs = spark.readStream
  .schema(cdrSchema)
  .json("/home/spark/sampledata/json/cdrs.json")

// The query itself is unchanged from the batch application.
val callsFromLondon = cdrs
  .select("Origin", "Term", "CallCharge")
  .where("CallCharge > 200")
  .where("Origin = 'London'")

// The file sink requires a checkpoint location to track progress;
// the checkpoint path here is an illustrative choice.
callsFromLondon.writeStream
  .format("parquet")
  .option("checkpointLocation", "/home/spark/sampledata/streaming/checkpoint")
  .start("/home/spark/sampledata/streaming/output")
At the time of writing, the preceding code works with Spark 2.0.0; however, the API is experimental and subject to change, so please do not use it in production workloads until it has been declared generally available. As you can see, converting a batch application into a streaming one is relatively simple.
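Since the requirement was a report that refreshes every minute, a processing-time trigger can be attached to the streaming query. The following is a minimal sketch, assuming Spark 2.2 or later where the Trigger.ProcessingTime helper is available (earlier 2.x releases expose an equivalent ProcessingTime object); the checkpoint path is again an illustrative choice:

import org.apache.spark.sql.streaming.Trigger

// Re-uses the callsFromLondon streaming DataFrame defined above.
// Trigger.ProcessingTime asks Spark to attempt a new micro-batch every minute.
val query = callsFromLondon.writeStream
  .format("parquet")
  .option("checkpointLocation", "/home/spark/sampledata/streaming/checkpoint") // illustrative path
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start("/home/spark/sampledata/streaming/output")

query.awaitTermination() // block until the query is stopped or fails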
Let's look at this from an architectural perspective.
Spark Streaming now treats the input data as an infinite, unbounded table, with every new item in the stream appended to the bottom of the table as a new row. From a developer's perspective, they are still working with a static table to obtain a final result set, which is written to an output sink. However, any query that the developer writes on this input table is executed by Spark incrementally: as new rows arrive in the unbounded table, Spark updates the result and writes the changes to the sink.
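To make the incremental model concrete, here is a small sketch (not from the original example) that counts calls per origin city on the same cdrs streaming DataFrame. With outputMode("complete"), Spark re-emits the full, incrementally maintained result table after each micro-batch; the console sink is convenient for experimenting:

// Counts calls per origin over the unbounded input table.
// Spark maintains this aggregate incrementally as new rows arrive.
val callsPerOrigin = cdrs.groupBy("Origin").count()

// "complete" mode re-emits the whole result table after every micro-batch.
val aggQuery = callsPerOrigin.writeStream
  .outputMode("complete")
  .format("console")
  .start()

aggQuery.awaitTermination()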