Wednesday, January 14, 2015

Flume and Spark Integration

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and
moving large amounts of log data.

Approach 1: Flume-style Push-based Approach
  Choose a machine in your cluster that meets these requirements:
  • When your Flume + Spark Streaming application is launched, one of the Spark workers must run on that machine.
  • Flume can be configured to push data to a port on that machine.
  • Because of the push model, the streaming application must already be up, with the receiver scheduled and listening on the chosen port, before Flume can push data.
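Since Flume can only push once something is listening on the chosen port, it can help to check the port before starting the Flume agent. The following is a minimal sketch using only the Java standard library. The class ReceiverPortCheck and its methods are hypothetical helpers written for this post, not part of Flume or Spark, and a successful TCP connect only shows that some process is listening, not that it is the Spark receiver.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper (not a Flume or Spark API): probes a TCP port. */
final class ReceiverPortCheck {

    private ReceiverPortCheck() {}

    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host not resolvable: not ready.
            return false;
        }
    }

    /** Probes the port up to 'attempts' times, sleeping delayMillis between tries. */
    static boolean waitUntilListening(String host, int port, int attempts, long delayMillis)
            throws InterruptedException {
        for (int i = 0; i < attempts; i++) {
            if (isListening(host, port, 1000)) {
                return true;
            }
            if (i < attempts - 1) {
                Thread.sleep(delayMillis);
            }
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        if (args.length != 2) {
            System.err.println("usage: ReceiverPortCheck <host> <port>");
            return;
        }
        String host = args[0];
        int port = Integer.parseInt(args[1]);
        // Assumed policy: try 30 times, 2 seconds apart, before giving up.
        boolean ready = waitUntilListening(host, port, 30, 2000L);
        System.out.println(ready ? "Port is accepting connections" : "Port is not reachable");
    }
}
```

Run it with the chosen machine's hostname and the chosen port as arguments, and start the Flume agent only after it reports that the port is accepting connections.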

  • Configuring Flume
    agent.sinks = avroSink
    agent.sinks.avroSink.type = avro
    agent.sinks.avroSink.channel = memoryChannel
    agent.sinks.avroSink.hostname = [chosen machine's hostname]
    agent.sinks.avroSink.port = [chosen port]
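The sink settings above refer to a channel named memoryChannel, which has to be defined elsewhere in the same agent configuration. As a rough sketch of how a complete agent file might look, the fragment below adds a source and a channel. The netcat source, its name netcatSource, its bind address and port, and the channel capacity are assumptions chosen for illustration; only the avroSink lines come from this post.

```properties
# Components of the agent named "agent"
agent.sources = netcatSource
agent.channels = memoryChannel
agent.sinks = avroSink

# Assumed test source: reads lines of text sent to localhost:44444
agent.sources.netcatSource.type = netcat
agent.sources.netcatSource.bind = localhost
agent.sources.netcatSource.port = 44444
agent.sources.netcatSource.channels = memoryChannel

# In-memory channel that buffers events between source and sink (capacity is an assumed value)
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 1000

# Avro sink that pushes events to the Spark Streaming receiver
agent.sinks.avroSink.type = avro
agent.sinks.avroSink.channel = memoryChannel
agent.sinks.avroSink.hostname = [chosen machine's hostname]
agent.sinks.avroSink.port = [chosen port]
```

Note that a source lists its channels with the plural key "channels", while a sink is bound to exactly one channel with the singular key "channel".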

  • Configuring Spark Streaming Application
    Linking: In your SBT/Maven project definition, link your streaming application against the following artifact:
    groupId = org.apache.spark
    artifactId = spark-streaming-flume_2.10
    version = 1.1.0
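For a Maven build, these coordinates would typically be declared in the dependencies section of pom.xml. The fragment below is a sketch that uses only the coordinates listed above; where it sits within your pom.xml depends on your project.

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-flume_2.10</artifactId>
  <version>1.1.0</version>
</dependency>
```

In an SBT build, the equivalent would be a libraryDependencies entry with the same group, artifact, and version.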

    Programming: In the streaming application code, import FlumeUtils and create an input DStream as follows:
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.flume.*;
    JavaReceiverInputDStream<SparkFlumeEvent> flumeStream = FlumeUtils.createStream(streamingContext, [chosen machine's hostname], [chosen port]);
    Note that the hostname should be the same as the one used by the resource manager in the cluster, so that resource allocation can match the names and launch the receiver on the right machine.

    Deploying: Package spark-streaming-flume_2.10 and its dependencies (except spark-core_2.10 and spark-streaming_2.10, which are provided by spark-submit) into the application JAR. Then use spark-submit to launch your application.