

Wednesday, January 14, 2015

Flume and Spark Integration



Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and
moving large amounts of log data.


 Approach 1: Flume-style Push-based Approach 
  In this approach, Spark Streaming sets up a receiver that acts as an Avro agent for Flume, to which Flume can push data. Choose a machine such that:
  • When your Flume + Spark Streaming application is launched, one of the Spark workers runs on that machine.
  • Flume can be configured to push data to a port on that machine.
  Due to the push model, the streaming application needs to be up, with the receiver scheduled and listening on the chosen port, for Flume to be able to push data.

  • Configuring Flume: Configure the Flume agent to send data to an Avro sink by adding the following to its configuration file.
    agent.sinks = avroSink
    agent.sinks.avroSink.type = avro
    agent.sinks.avroSink.channel = memoryChannel
    agent.sinks.avroSink.hostname = <chosen machine's hostname>
    agent.sinks.avroSink.port = <chosen port>
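
    The snippet above configures only the Avro sink. For context, a minimal complete agent definition might look like the sketch below; the agent name, the netcat source, and the channel capacity are illustrative assumptions, not part of the original guide.

    # Name the components of this agent (names are illustrative)
    agent.sources = netcatSource
    agent.channels = memoryChannel
    agent.sinks = avroSink

    # Example source: reads lines of text from a local TCP port
    agent.sources.netcatSource.type = netcat
    agent.sources.netcatSource.bind = localhost
    agent.sources.netcatSource.port = 44444
    agent.sources.netcatSource.channels = memoryChannel

    # In-memory channel that buffers events between source and sink
    agent.channels.memoryChannel.type = memory
    agent.channels.memoryChannel.capacity = 10000

    # Avro sink that pushes events to the Spark Streaming receiver
    agent.sinks.avroSink.type = avro
    agent.sinks.avroSink.channel = memoryChannel
    agent.sinks.avroSink.hostname = <chosen machine's hostname>
    agent.sinks.avroSink.port = <chosen port>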
    

    
    Configuring Spark Streaming Application
    Linking: In your SBT/Maven project definition, link your streaming application against the following artifact:
    
    
    groupId = org.apache.spark
    artifactId = spark-streaming-flume_2.10
    version = 1.1.0
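
    For example, in a Maven pom.xml these coordinates translate to the following dependency (the SBT equivalent is libraryDependencies += "org.apache.spark" % "spark-streaming-flume_2.10" % "1.1.0"):

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-flume_2.10</artifactId>
      <version>1.1.0</version>
    </dependency>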
    

    Programming: In the streaming application code, import FlumeUtils and create an input DStream as follows.
    import org.apache.spark.streaming.flume.*;
    JavaReceiverInputDStream<SparkFlumeEvent> flumeStream =
        FlumeUtils.createStream(streamingContext, [chosen machine's hostname], [chosen port]);
    Note that the hostname should be the same as the one used by the resource manager in the cluster, so that resource allocation can match the names and launch the receiver on the right machine.
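
    Putting the pieces together, a minimal driver class might look like the sketch below. It mirrors the createStream call above; the application name, the 2-second batch interval, and the localhost/9999 receiver address are illustrative assumptions.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.streaming.Duration;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.flume.FlumeUtils;
    import org.apache.spark.streaming.flume.SparkFlumeEvent;

    public class JavaFlumeEventCount {
      public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("JavaFlumeEventCount");
        // 2-second batch interval (illustrative)
        JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(2000));

        // The receiver listens on this host/port; Flume's Avro sink pushes events here
        JavaReceiverInputDStream<SparkFlumeEvent> flumeStream =
            FlumeUtils.createStream(ssc, "localhost", 9999);

        // Count the events received in each batch and print the result
        flumeStream.count().map(new Function<Long, String>() {
          @Override
          public String call(Long in) {
            return "Received " + in + " flume events.";
          }
        }).print();

        ssc.start();
        ssc.awaitTermination();
      }
    }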
    

    
    Deploying: Package spark-streaming-flume_2.10 and its dependencies (except spark-core_2.10 and spark-streaming_2.10, which are provided by spark-submit) into the application JAR. Then use spark-submit to launch your application.
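
    For example (the class, master URL, and JAR name are illustrative):

    spark-submit --class JavaFlumeEventCount \
      --master spark://<master-host>:7077 \
      flume-spark-example.jar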