Learning Apache Spark 2 by Muhammad Asif Abbasi

Author: Muhammad Asif Abbasi
Language: eng
Format: azw3
Publisher: Packt Publishing
Published: 2017-06-06T04:00:00+00:00


The complexities of the DStream API led to the vision of a single API that works for both batch and streaming: the user simply declares the intention to ingest data in a streaming or batch fashion, and lets the Spark framework worry about the execution details.

Just think for a moment: what is a stream of data? It is typically similarly structured data arriving in multiple batches, where sometimes the batch is short (say 5 minutes) and sometimes long (say 1 hour, 4 hours, and so on). I remember while working at Teradata (my old employer), we built applications where data would arrive in mini-batches and get appended to the same table. The table was considered an infinite table (despite always being of a finite size). The same approach has now been brought to Spark streaming, essentially creating an infinite table. You might want to read the whole table, or just the freshly arriving data.

The Spark framework has adopted a similar approach by removing the differences between the batch and streaming APIs. The only difference is that in streaming the table is unbounded, while in batch applications the table is considered static.

Let's compare and contrast the streaming and batch APIs for a moment, and understand the intricacies of converting a batch application into a streaming one. We'll return to our old CDR dataset, and continuously report on calls originating from London with a revenue of over 200 cents. Let's look at the code of the batch application before looking at its streaming counterpart.

The following is a batch application:

import org.apache.spark.sql.types.StructType

val cdrSchema = new StructType()
  .add("OriginNum", "string")
  .add("TermNum", "string")
  .add("Origin", "string")
  .add("Term", "string")
  .add("CallTs", "string")
  .add("CallCharge", "integer")

val cdrs = spark.read
  .schema(cdrSchema)
  .json("/home/spark/sampledata/json/cdrs.json")

// The schema has no "Dest" column; "Term" holds the terminating city.
val callsFromLondon = cdrs
  .select("Origin", "Term", "CallCharge")
  .where("CallCharge > 200")
  .where("Origin = 'London'")

callsFromLondon.write
  .format("parquet")
  .save("/home/spark/sampledata/streaming/output")

I am using the preceding code in a batch application, and have recently been asked by management to build a report from the same information that refreshes every minute. In the old days, I would have had to go through a lot of hassle and rethink the architecture of the application before I could convert it into a streaming job. Let's attempt it with Spark's Structured Streaming and see the beauty and simplicity of it:

import org.apache.spark.sql.types.StructType

val cdrSchema = new StructType()
  .add("OriginNum", "string")
  .add("TermNum", "string")
  .add("Origin", "string")
  .add("Term", "string")
  .add("CallTs", "string")
  .add("CallCharge", "integer")

// readStream monitors the path for newly arriving JSON data.
val cdrs = spark.readStream
  .schema(cdrSchema)
  .json("/home/spark/sampledata/json/cdrs.json")

val callsFromLondon = cdrs
  .select("Origin", "Term", "CallCharge")
  .where("CallCharge > 200")
  .where("Origin = 'London'")

// File sinks require a checkpoint directory for fault tolerance.
callsFromLondon.writeStream
  .format("parquet")
  .option("checkpointLocation", "/home/spark/sampledata/streaming/checkpoint")
  .start("/home/spark/sampledata/streaming/output")

At the time of writing, the preceding code works in Spark 2.0.0; however, the API is experimental and subject to change, so please do not use it in production workloads until it has been released for general availability. As you can see, converting a batch application into a streaming one is relatively simple.
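For quick local experiments, a console sink is often easier to inspect than Parquet files, since each micro-batch is printed to stdout and no checkpoint directory is needed. The following is only a sketch of that idea; the input directory is a hypothetical stand-in, not a path from the book:

```scala
import org.apache.spark.sql.types.StructType

// Same CDR schema as before, trimmed to the columns the query needs.
val cdrSchema = new StructType()
  .add("Origin", "string")
  .add("Term", "string")
  .add("CallCharge", "integer")

// readStream watches the directory for newly arriving JSON files
// (hypothetical directory for illustration).
val cdrs = spark.readStream
  .schema(cdrSchema)
  .json("/home/spark/sampledata/json/")

val callsFromLondon = cdrs
  .select("Origin", "Term", "CallCharge")
  .where("CallCharge > 200")
  .where("Origin = 'London'")

// The console sink prints every micro-batch to stdout -- convenient
// for experimentation, unsuitable for production.
val query = callsFromLondon.writeStream
  .format("console")
  .start()

query.awaitTermination()
```

Dropping a new JSON file into the watched directory should cause the matching rows to appear in the console within the next trigger interval.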

Let's look at this from an architectural perspective.

Structured Streaming treats the input data as an infinite, unbounded table, with every new item in the stream appended to the bottom of the table as a new row. From the developer's perspective, they are still writing queries against a static table to obtain a final result set, which is written to an output sink. However, any query the developer writes on the input table is executed incrementally by Spark, producing a result table that is updated as new rows arrive.
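To make the unbounded-table model concrete, here is a sketch of a streaming aggregation, reusing the cdrs stream defined earlier (the grouping column and console sink are my own illustration, not the book's): each micro-batch of new rows updates the counts in the result table rather than recomputing the stream from scratch.

```scala
// Count calls per origin city over the unbounded input table.
// As new CDR rows arrive, Spark updates the counts incrementally.
val callsPerOrigin = cdrs.groupBy("Origin").count()

// "complete" output mode emits the entire result table on each trigger,
// which is what a streaming aggregation needs when written to the console.
val query = callsPerOrigin.writeStream
  .outputMode("complete")
  .format("console")
  .start()
```

The result table here never forgets a city it has seen; its rows are revised in place as the input table grows, which is exactly the "infinite table" intuition described above.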



