Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
Approach 1: Flume-style Push-based Approach
agent.sinks = avroSink
agent.sinks.avroSink.type = avro
agent.sinks.avroSink.channel = memoryChannel
agent.sinks.avroSink.hostname = [chosen machine's hostname]
agent.sinks.avroSink.port = [chosen port]
Configuring Spark Streaming Application Linking: In your SBT/Maven project definition, link your streaming application against the following artifact:
groupId = org.apache.spark
artifactId = spark-streaming-flume_2.10
version = 1.1.0
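For SBT users, the coordinates above translate to a single dependency line; this is a minimal sketch, assuming your project already targets Scala 2.10 to match the artifact suffix:

```scala
// build.sbt (sketch): link the Flume connector for Spark Streaming 1.1.0
libraryDependencies += "org.apache.spark" % "spark-streaming-flume_2.10" % "1.1.0"
```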
Programming: In the streaming application code, import FlumeUtils and create an input DStream as follows.
import org.apache.spark.streaming.flume.*;

JavaReceiverInputDStream<SparkFlumeEvent> flumeStream =
    FlumeUtils.createStream(streamingContext, [chosen machine's hostname], [chosen port]);

Note that the hostname should be the same as the one used by the resource manager in the cluster, so that resource allocation can match the names and launch the receiver on the right machine.
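Once created, the DStream can be transformed like any other. As a minimal sketch, the received events could be counted per batch and printed; this assumes the `flumeStream` and `streamingContext` from the snippet above, and uses the pre-Java-8 anonymous `Function` style idiomatic for Spark 1.1:

```java
import org.apache.spark.api.java.function.Function;

// Count the SparkFlumeEvents received in each batch and print the count.
flumeStream.count().map(new Function<Long, String>() {
    @Override
    public String call(Long count) {
        return "Received " + count + " Flume events.";
    }
}).print();

// Start receiving and processing; blocks until the streaming job is stopped.
streamingContext.start();
streamingContext.awaitTermination();
```

The stream yields SparkFlumeEvent objects, each wrapping the Avro event's headers and body.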
Deploying: Package spark-streaming-flume_2.10 and its dependencies (except spark-core_2.10 and spark-streaming_2.10, which are provided by spark-submit) into the application JAR. Then use spark-submit to launch your application.
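A launch command might look like the following sketch; the application class name and JAR path are hypothetical placeholders for your own project, and the trailing arguments assume your main class reads the receiver's hostname and port from the command line:

```shell
# Sketch: submit the assembled application JAR to the cluster.
# com.example.FlumeEventCount and the JAR path are placeholders.
./bin/spark-submit \
  --class com.example.FlumeEventCount \
  --master yarn-cluster \
  target/my-streaming-app-assembly.jar \
  [chosen machine's hostname] [chosen port]
```

Remember that the chosen machine must be one where a Spark executor can run, since the Flume sink pushes data to a receiver launched there.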