Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
Approach 1: Flume-style Push-based Approach
Configuring Flume
    agent.sinks = avroSink
    agent.sinks.avroSink.type = avro
    agent.sinks.avroSink.channel = memoryChannel
    agent.sinks.avroSink.hostname = [chosen machine's hostname]
    agent.sinks.avroSink.port = [chosen port]
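To make the placeholders concrete, the sketch below renders the same sink settings for a given hostname and port, using only the JDK. The class name FlumeSinkConfig, the method render, the hostname worker-1.example.com, and the port 41414 are hypothetical values chosen for illustration; substitute the names and endpoint your own agent file uses.

```java
class FlumeSinkConfig {
    // Renders the Avro sink section of a Flume agent configuration.
    // agentName, sinkName and channelName are whatever names the agent file already uses.
    static String render(String agentName, String sinkName, String channelName,
                         String hostname, int port) {
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("port out of range: " + port);
        }
        String prefix = agentName + ".sinks." + sinkName;
        StringBuilder sb = new StringBuilder();
        sb.append(agentName).append(".sinks = ").append(sinkName).append('\n');
        sb.append(prefix).append(".type = avro\n");
        sb.append(prefix).append(".channel = ").append(channelName).append('\n');
        sb.append(prefix).append(".hostname = ").append(hostname).append('\n');
        sb.append(prefix).append(".port = ").append(port).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical endpoint: replace with the chosen machine's hostname and port.
        System.out.print(render("agent", "avroSink", "memoryChannel",
                                "worker-1.example.com", 41414));
    }
}
```

The hostname and port written here must be the same pair that is later passed to the streaming application, since Flume pushes events to exactly that endpoint.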
Configuring Spark Streaming Application
Linking: In your SBT/Maven project definition, link your streaming application against the following artifact.
    groupId = org.apache.spark
    artifactId = spark-streaming-flume_2.10
    version = 1.1.0
Programming: In the streaming application code, import FlumeUtils and create an input DStream as follows.
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.flume.*;

    JavaReceiverInputDStream<SparkFlumeEvent> flumeStream =
        FlumeUtils.createStream(streamingContext, [chosen machine's hostname], [chosen port]);

Note that the hostname should be the same as the one used by the resource manager in the cluster, so that resource allocation can match the names and launch the receiver on the right machine.
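Each element of the stream wraps a Flume event whose payload arrives as raw bytes. The sketch below shows one way to turn such a payload into text, assuming the body is UTF-8 encoded; it uses only the JDK so it can be tried without a cluster. The names FlumeBodyDecoder and decodeUtf8 are hypothetical. In Spark 1.x the buffer should be obtainable from a SparkFlumeEvent via event().getBody(), but check the API documentation for your version before relying on that.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class FlumeBodyDecoder {
    // Decodes an event body as UTF-8 text.
    // duplicate() gives an independent position, so the caller's buffer is left untouched.
    static String decodeUtf8(ByteBuffer body) {
        return StandardCharsets.UTF_8.decode(body.duplicate()).toString();
    }

    public static void main(String[] args) {
        // Stand-in for a real event body; the log line is made up for illustration.
        ByteBuffer body = ByteBuffer.wrap(
            "GET /index.html 200".getBytes(StandardCharsets.UTF_8));
        System.out.println(decodeUtf8(body));
    }
}
```

A helper like this could be applied inside a map over the stream; whether UTF-8 is the right charset depends on what the Flume source actually emits.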
Deploying: Package spark-streaming-flume_2.10 and its dependencies (except spark-core_2.10 and spark-streaming_2.10, which are provided by spark-submit) into the application JAR. Then use spark-submit to launch your application.