Tuesday, April 4, 2017

Fetch Twitter data using R


  1. twitteR Package:
    1. One of the available packages in R for fetching Twitter data. The package can be obtained from CRAN (cran.r-project.org).
    2. This package allows us to make REST API calls to Twitter using the consumer key & consumer secret codes. The code below illustrates how to extract Twitter data.
    3. This package offers the following functionality:
      1. Authenticate with the Twitter API
      2. Fetch a user's timeline
      3. Fetch a user's followers
      4. Fetch a user's mentions
      5. Search Twitter
      6. Fetch user information
      7. Fetch trends
      8. Convert JSON objects to data frames
  2. REST API calls using R - twitteR package:
    1. Register your application with Twitter.
    2. After registration, you will receive a consumer key & consumer secret code, which need to be used when calling the Twitter API.
    3. Load the twitteR library in the R environment.
    4. Call the Twitter API using the OAuthFactory$new() method with the consumer key & consumer secret as input parameters.
    5. The above step returns an authorization link, which needs to be copied & pasted into an internet browser.
    6. You will be redirected to the Twitter application authentication page, where you need to authenticate yourself by providing your Twitter credentials.
    7. After authenticating, you will be given an authorization code, which needs to be pasted into the R console.
    8. Call registerTwitterOAuth().
    9. You can then fetch friends' information, location-based tweets, and more.
  3. Source Code:
    library(twitteR)
    requestURL     <- "https://api.twitter.com/oauth/request_token"
    accessURL      <- "https://api.twitter.com/oauth/access_token"
    authURL        <- "https://api.twitter.com/oauth/authorize"
    consumerKey    <- "XXXXXXXXXXXX"
    consumerSecret <- "XXXXXXXXXXXXXXXX"
    twitCred <- OAuthFactory$new(consumerKey=consumerKey,
                                 consumerSecret=consumerSecret,
                                 requestURL=requestURL,
                                 accessURL=accessURL,
                                 authURL=authURL)
    # Download the CA certificate bundle used to verify Twitter's SSL certificate
    download.file(url="http://curl.haxx.se/ca/cacert.pem",
                  destfile="cacert.pem")
    twitCred$handshake(cainfo="cacert.pem")  # opens the authorization URL; paste the PIN back into the console
    save(list="twitCred", file="twitteR_credentials")  # save credentials for later reuse
    load("twitteR_credentials")
    registerTwitterOAuth(twitCred)  # register your app with Twitter
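
    Once the handshake has completed and the credentials are registered, the helper functions listed above can be called directly. A minimal sketch (the hashtag, screen name, and counts here are placeholders):

    ```r
    library(twitteR)

    # Search recent tweets containing a keyword; n is the number of tweets to fetch
    tweets <- searchTwitter("#bigdata", n = 100)

    # Fetch a user's timeline (screen name is a placeholder)
    timeline <- userTimeline("some_user", n = 50)

    # Convert the list of status objects into a data frame for analysis
    tweets.df <- twListToDF(tweets)
    head(tweets.df)
    ```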
    
    
  4. streamR Package:
    1. This package allows users to fetch Twitter data in real time by connecting to the Twitter Streaming API.
    2. We can obtain the package from CRAN (cran.r-project.org).
    3. It allows R users to access Twitter's search streams and user streams, and to parse the output into data frames.
    4. filterStream() - opens a connection to Twitter's Streaming API that returns public statuses matching one or more filter predicates, such as search keywords.
    5. Tweets can be filtered by keywords, users, language, and location.
    6. The output can be saved as an object in memory or written to a text file.
    7. parseTweets() - parses tweets downloaded using filterStream(), sampleStream(), or userStream() and returns a data frame.
  5. The code example below shows how to fetch data in real time using streamR:
    library(streamR)
    library(twitteR)
    load("twitteR_credentials")  # loads the credentials saved by the previous code
    registerTwitterOAuth(twitCred)
    filterStream(file.name = "tweets.json", track = "#bigdata", timeout = 0, locations = c(-74,40,-73,41), oauth = twitCred)
    Executing the above captures tweets mentioning "#bigdata" or posted from the New York area (the locations bounding box); note that the Streaming API combines keyword and location predicates with OR, not AND. Setting timeout = 0 makes the stream fetch continuously; to fetch records for a fixed period, use, e.g., timeout = 300 (fetches data for 300 seconds).
    To parse the fetched tweets, use the code below:
    tweets.df <- parseTweets("tweets.json")
    
    

Tuesday, March 21, 2017

Technology stack of BigData

Docker versus Kubernetes - comparison

  1. Introduction
    1. Containers have become popular thanks to their focus on consistency across platforms from development to production.
    2. The rise in interest in containers has in turn brought higher demands for their deployment and management.
    3. The need for better control attracted a number of software solutions for container orchestration, which allow individual containers to be abstracted into services with a number of instances or replicas.
    4. Two of the major players developing container orchestration are Docker and Kubernetes.
  2. Kubernetes
    1. Kubernetes is an open-source platform for container deployment automation, scaling, and operations across clusters of hosts. The production-ready orchestrator draws on Google's extensive experience of years of working with Linux containers.
    2. Kubernetes aims to provide the components and tools to relieve the burden of running applications in public and private clouds by grouping containers into logical units. Its strengths lie in flexible growth, environment-agnostic portability, and easy scaling.
  3. Docker Swarm
    1. Swarm is the native clustering for Docker. Originally, Docker Swarm did not provide much in the way of container automation, but with the update to Docker Engine 1.12, container orchestration is now built into its core with first-party support.
    2. Docker Swarm is designed around four core principles:
      1. Simple yet powerful with a “just works” user experience,
      2. Resilient zero single-point-of-failure architecture,
      3. Secure by default with automatically generated certificates, and
      4. Backwards compatibility with existing components.
    3. The promise of backwards compatibility is especially important to the existing users. Any tools or containers that work with Docker run equally well in Docker Swarm.
  4. Comparisons
    Although both orchestrators provide much of the same functionality, there are fundamental differences in how the two operate. The most notable points where these rivals diverge are summarized below.

  5. Summary
    1. Throughout the comparison, it is possible to note how Kubernetes and Docker Swarm fundamentally differ.
    2. Swarm focuses on ease of use with integration with Docker core components while Kubernetes remains open and modular.
    3. The same difference can be noticed while installing and configuring each of the orchestrators.
    4. Docker provides a simple solution that is fast to get started with while Kubernetes aims to support higher demands with higher complexity.
    5. For much of the same reasons, Docker has been popular among developers who prefer simplicity and fast deployments.
    6. At the same time, Kubernetes is used in production environments by many high profile internet companies running popular services
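
    As a rough illustration of the difference in getting started, a single-node setup in each system looks something like the sketch below. This is a hedged example, not a full guide: the service name and image are placeholders, Swarm assumes Docker Engine 1.12+, and the Kubernetes side assumes a recent kubectl pointed at an existing cluster.

    ```shell
    # Docker Swarm: orchestration is built into the Docker Engine (1.12+)
    docker swarm init                      # turn this node into a swarm manager
    docker service create --name web --replicas 3 nginx   # run a replicated service

    # Kubernetes: a separate control plane, driven through kubectl
    kubectl create deployment web --image=nginx   # create a deployment
    kubectl scale deployment web --replicas=3     # scale it to 3 replicas
    ```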

Friday, March 3, 2017

Kafka Multiple Topics

Producer and multiple Topics

    Download a recent stable version of Apache Kafka.
  1. Untar the package.
  2. Enter the Kafka directory.
  3. Start the ZooKeeper server:
  4. bin/zookeeper-server-start.sh config/zookeeper.properties
  5. In a different terminal, start the Kafka server:
  6. bin/kafka-server-start.sh config/server.properties
  7. Create the topic example2 (if it does not exist):
  8. bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic example2
  9. Create the topic example3 (if it does not exist):
  10. bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic example3
  11. Start a consumer on topic example2:
  12. bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic example2 --from-beginning
  13. Start a consumer on topic example3:
  14. bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic example3 --from-beginning
  15. Run mvn clean compile exec:exec
  16. You should see the messages in the consumer terminals.
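    Before running the producer, you can confirm that both topics were created by listing them (assuming the same ZooKeeper address as in the steps above):

    ```shell
    bin/kafka-topics.sh --list --zookeeper localhost:2181
    ```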
    Here are the project sources:
  1. Project POM.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>

      <parent>
        <relativePath>../pom.xml</relativePath>
        <version>1.0.0-SNAPSHOT</version>
        <groupId>com.example.kafka</groupId>
        <artifactId>kafka-examples</artifactId>
      </parent>

      <artifactId>kafka-producer-multiple-topics</artifactId>
      <name>kafka-producer-multiple-topics</name>

      <build>
        <plugins>
          <plugin>
            <groupId>org.codehaus.mojo</groupId>
            <artifactId>exec-maven-plugin</artifactId>
            <executions>
              <execution>
                <goals>
                  <goal>exec</goal>
                </goals>
              </execution>
            </executions>
            <configuration>
              <executable>java</executable>
              <arguments>
                <argument>-classpath</argument>
                <classpath/>
                <argument>com.example.kafka.ProducerMultipleTopic</argument>
              </arguments>
            </configuration>
          </plugin>
        </plugins>
      </build>
    </project>


  2. Producer Java File
    package com.example.kafka;
    
    import java.util.Properties;
    
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;
    
    public class ProducerMultipleTopic {
    
        public static void main(String[] args) {
    
            // Producer configuration: broker address and key/value serializers
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());
    
            KafkaProducer<String, String> prod = new KafkaProducer<String, String>(props);
    
            // One record per topic: a single producer can write to any number of topics
            ProducerRecord<String, String> data1 = new ProducerRecord<String, String>("example2", "example2");
            ProducerRecord<String, String> data2 = new ProducerRecord<String, String>("example3", "example3");
    
            prod.send(data1);
            prod.send(data2);
    
            prod.close();
        }
    }
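    Note that send() is asynchronous and returns immediately. If you want confirmation of where each record landed, send() also accepts a callback. The fragment below is a sketch meant to replace a plain send() call inside main() above (it reuses the prod variable and the placeholder topic from the example, and needs a running broker to do anything):

    ```java
    import org.apache.kafka.clients.producer.Callback;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    // Inside main(), after the producer has been created:
    prod.send(new ProducerRecord<String, String>("example2", "example2"), new Callback() {
        public void onCompletion(RecordMetadata metadata, Exception e) {
            if (e != null) {
                e.printStackTrace();  // delivery failed
            } else {
                // metadata reports where the record landed: topic, partition, offset
                System.out.printf("sent to %s-%d@%d%n",
                        metadata.topic(), metadata.partition(), metadata.offset());
            }
        }
    });
    ```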
    
  3. Logger file (log4j.properties)
    # Root logger option
    log4j.rootLogger=INFO, stdout
    
    # Direct log messages to stdout
    log4j.appender.stdout=org.apache.log4j.ConsoleAppender
    log4j.appender.stdout.Target=System.out
    log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
    log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
    
    


  4. This is the parent POM
    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>
      <groupId>com.example.kafka</groupId>
      <artifactId>kafka-examples</artifactId>
      <version>1.0.0-SNAPSHOT</version>
      <packaging>pom</packaging>

      <dependencies>
        <dependency>
          <groupId>org.apache.kafka</groupId>
          <artifactId>kafka_2.10</artifactId>
          <version>0.9.0.1</version>
        </dependency>
        <dependency>
          <groupId>org.apache.kafka</groupId>
          <artifactId>kafka-clients</artifactId>
          <version>0.9.0.1</version>
        </dependency>
        <dependency>
          <groupId>log4j</groupId>
          <artifactId>log4j</artifactId>
          <version>1.2.17</version>
        </dependency>
      </dependencies>

      <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
      </properties>

      <build>
        <pluginManagement>
          <plugins>
            <plugin>
              <groupId>org.apache.maven.plugins</groupId>
              <artifactId>maven-compiler-plugin</artifactId>
              <version>3.2</version>
              <configuration>
                <source>1.7</source>
                <target>1.7</target>
              </configuration>
            </plugin>
            <plugin>
              <groupId>org.codehaus.mojo</groupId>
              <artifactId>exec-maven-plugin</artifactId>
              <version>1.3.2</version>
            </plugin>
          </plugins>
        </pluginManagement>
      </build>

      <modules>
        <module>kafka-producer-multiple-topics</module>
      </modules>
    </project>