Processing Data from MQ with Spark Streaming: Part 2 - Brief Discussion on Apache Spark Streaming and Use-cases

Stacey Ronaghan
3 min read · Apr 25, 2018


This is a multi-part series that provides information on messaging, including fault-tolerance techniques, and offers instructions and code to implement a connection between IBM MQ (formerly WebSphere MQ) and Spark Streaming.

Part 1 — Introduction to Messaging, JMS & MQ

Part 2 — Brief Discussion on Apache Spark Streaming and Use-cases

Part 3 — Reliable Delivery & Recovery Techniques with Spark Streaming

Part 4 — Implementation details for Spark MQ Connector

Part 2 — Brief Discussion on Apache Spark Streaming and Use-cases

Apache Spark

Apache Spark is a general-purpose, open-source, in-memory parallel compute framework for batch processing. It is widely acknowledged as the successor to MapReduce and can run standalone, on Apache Mesos, or on Apache Hadoop YARN. Data can be accessed from various sources including HDFS, S3, Cassandra, and HBase. Spark supports a variety of programming languages with a set of useful APIs:

- Spark SQL can be used to build SQL queries on non-relational distributed datasets; this is very useful for exploration as well as data transformation (a short sketch follows this list).

- MLlib has a range of machine learning capabilities, allowing users to train models on large datasets and provide predictive and prescriptive analytics.

- GraphX represents data as nodes and edges, such as people and their relationships, and can perform graph analysis on data in this form.

- Spark Streaming allows for near real-time analytics, processing live streams of data.
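
To make the first of these concrete, here is a minimal Spark SQL sketch in Scala: it loads semi-structured JSON into a DataFrame, registers it as a view, and queries it with ordinary SQL. The file path, the eventType field, and the view name are hypothetical, chosen purely for illustration.

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkSqlExample")
      .getOrCreate()

    // Load semi-structured JSON into a DataFrame; the path is a placeholder
    val events = spark.read.json("hdfs:///data/events.json")

    // Register the DataFrame as a temporary view and query it with SQL
    events.createOrReplaceTempView("events")
    val counts = spark.sql(
      "SELECT eventType, COUNT(*) AS total FROM events GROUP BY eventType")

    counts.show()
    spark.stop()
  }
}
```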

Apache Spark Streaming

As defined in the documentation, Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Out of the box, Spark Streaming can recover lost work using checkpointing and supports stateful operations such as sliding windows. It allows developers to write streaming applications much as they would write batch Spark applications; other APIs such as Spark SQL can be utilized within a Spark Streaming application, and output can be written to filesystems or databases just as with normal Spark applications.
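
As a minimal sketch of what this looks like in practice, the Scala application below counts words over a 60-second sliding window and enables checkpointing so lost work can be recovered. The socket source, host/port, and checkpoint path are placeholders for illustration, not part of the MQ connector.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WindowedCounts")
    // Batches are formed every 10 seconds (the micro-batch interval)
    val ssc = new StreamingContext(conf, Seconds(10))

    // Checkpointing lets Spark recover lost work after a failure;
    // the HDFS path is a placeholder
    ssc.checkpoint("hdfs:///checkpoints/windowed-counts")

    // A simple text source for illustration; host and port are placeholders
    val lines = ssc.socketTextStream("localhost", 9999)

    // Count words over a 60-second window that slides every 10 seconds
    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(10))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```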

There is built-in support for ingesting data from many sources including Kafka, Flume, HDFS/S3, and Twitter. In addition, Spark Streaming allows a developer to implement a custom receiver by extending Spark's Receiver class; this is the approach used to create the Spark MQ connector discussed in this blog series.
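
The skeleton below follows the custom receiver pattern from the Spark Streaming documentation: extend Receiver, start a background thread in onStart(), and hand each record to Spark with store(). It reads lines from a plain socket purely for illustration; the actual MQ connector (covered in Part 4) would open a JMS connection here instead.

```scala
import java.net.Socket
import scala.io.Source
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// A skeletal custom receiver; a real MQ receiver would open a JMS
// connection in onStart() instead of the plain socket used here
class LineReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    // Receive on a dedicated thread so onStart() returns immediately
    new Thread("Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  override def onStop(): Unit = {
    // Nothing to clean up here; the receive thread stops itself
    // once isStopped returns true
  }

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val lines = Source.fromInputStream(socket.getInputStream).getLines()
      // store() hands each record to Spark for micro-batching
      while (!isStopped && lines.hasNext) {
        store(lines.next())
      }
      socket.close()
      // Ask Spark to restart the receiver when the stream ends
      restart("Connection closed, restarting")
    } catch {
      case t: Throwable => restart("Error receiving data", t)
    }
  }
}
```

A receiver like this is plugged into an application with ssc.receiverStream(new LineReceiver("localhost", 9999)), which yields a DStream of the stored records.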

Spark Streaming does not support event-based (record-at-a-time) streaming. Instead, when an input stream is received by a Spark Streaming application, it is broken up into batches of data based on a specified increment of time; these are referred to as micro-batches. The micro-batches are fed into the Spark engine to be processed, and any final results are likewise returned as a stream of batches.
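
The sketch below makes the micro-batch model visible: with a 5-second batch interval, each micro-batch surfaces as an ordinary RDD inside foreachRDD. The socket source is again only a placeholder.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MicroBatchDemo")
    // Every 5 seconds, the received data is cut into one micro-batch
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999) // placeholder source

    // Each micro-batch arrives as an ordinary RDD for the Spark engine to process
    lines.foreachRDD { (rdd, time) =>
      println(s"Micro-batch at $time contains ${rdd.count()} records")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```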

Spark Streaming Use-cases

A combination of Spark Streaming and the other APIs can be very powerful. Transforming unstructured or semi-structured data with Spark SQL and Spark Streaming enables dashboards to show transport timetables or package tracking. Similarly, once a model has been built using MLlib, new data can be scored as it arrives to predict whether a transaction is fraudulent or whether to recommend a particular product.

The Spark MQ connector code that inspired this blog series was developed for a client who wanted to extract and transform data from XML messages and save it in a tabular format to enable better analytics and reporting. They had previously been storing the raw XML and were finding it difficult and time-consuming to get the required data for ad-hoc analysis.

Similarly, another client is using this solution to parse and transform data to provide a dashboard informing the user of various statuses and metrics relating to their products.

There are many open-source packages available for Spark to parse documents in various formats (JSON, XML, etc.) as well as to save to external storage (HDFS, Hive, HBase, etc.). This enables customers to process streaming data in a variety of formats in near real-time.
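
As a sketch of that end-to-end flow, the application below assumes the incoming messages are JSON strings: each micro-batch is parsed into a tabular DataFrame with Spark SQL and appended to Parquet files on HDFS. The source, batch interval, and output path are all placeholders; an XML feed would instead use a parser package such as spark-xml.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object JsonToParquet {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("JsonToParquet")
    val ssc = new StreamingContext(conf, Seconds(30))

    val messages = ssc.socketTextStream("localhost", 9999) // placeholder source

    messages.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        val spark = SparkSession.builder()
          .config(rdd.sparkContext.getConf)
          .getOrCreate()
        import spark.implicits._

        // Parse the raw JSON strings in this micro-batch into a tabular DataFrame
        val table = spark.read.json(spark.createDataset(rdd))

        // Append the batch to Parquet files on HDFS; the path is a placeholder
        table.write.mode(SaveMode.Append).parquet("hdfs:///analytics/messages")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```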

Next Up: Part 3: Reliable Delivery & Recovery Techniques with Spark Streaming
