Tuesday, April 4, 2017

Fetch Twitter data using R


  1. twitteR Package:
    1. One of the available packages in R for fetching Twitter data. The package can be obtained from CRAN (cran.r-project.org).
    2. This package allows us to make REST API calls to Twitter using the consumer key & consumer secret codes. The code below illustrates how to extract Twitter data.
    3. This package offers the following functionality:
      1. Authenticate with the Twitter API
      2. Fetch a user's timeline
      3. Fetch a user's followers
      4. Fetch a user's mentions
      5. Search Twitter
      6. Fetch user information
      7. Fetch trends
      8. Convert JSON objects to data frames
  2. REST API calls using R - twitteR package:
    1. Register your application with Twitter.
    2. After registration, you will receive a consumer key & consumer secret code, which need to be used when calling the Twitter API.
    3. Load the twitteR library in the R environment.
    4. Call the Twitter API using the OAuthFactory$new() method with the consumer key & consumer secret as input parameters.
    5. The above step returns an authorization link, which needs to be copied & pasted into an internet browser.
    6. You will be redirected to the Twitter application authentication page, where you need to authenticate yourself by providing your Twitter credentials.
    7. After authenticating, you will be given an authorization code, which needs to be pasted into the R console.
    8. Call registerTwitterOAuth().
    9. You can then fetch friends' information, location-based tweets, and more.
  3. Source Code:
    library(twitteR)
    requestURL     <- "https://api.twitter.com/oauth/request_token"
    accessURL      <- "https://api.twitter.com/oauth/access_token"
    authURL        <- "https://api.twitter.com/oauth/authorize"
    consumerKey    <- "XXXXXXXXXXXX"
    consumerSecret <- "XXXXXXXXXXXXXXXX"
    twitCred <- OAuthFactory$new(consumerKey=consumerKey,
                                 consumerSecret=consumerSecret,
                                 requestURL=requestURL,
                                 accessURL=accessURL,
                                 authURL=authURL)
    # Download the CA certificate bundle used to verify Twitter's SSL certificate
    download.file(url="http://curl.haxx.se/ca/cacert.pem",
                  destfile="cacert.pem")
    twitCred$handshake(cainfo="cacert.pem")  # opens the authorization URL; paste the PIN back into the console
    save(list="twitCred", file="twitteR_credentials")  # save credentials for later reuse
    load("twitteR_credentials")
    registerTwitterOAuth(twitCred)  # register your app with Twitter
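
    Once the handshake has completed and the credentials are registered, the helper functions listed above can be called directly. A minimal sketch (the hashtag, screen name, and counts here are placeholders):

    ```r
    library(twitteR)

    # Search recent tweets containing a keyword; n is the number of tweets to fetch
    tweets <- searchTwitter("#bigdata", n = 100)

    # Fetch a user's timeline (screen name is a placeholder)
    timeline <- userTimeline("some_user", n = 50)

    # Convert the list of status objects into a data frame for analysis
    tweets.df <- twListToDF(tweets)
    head(tweets.df)
    ```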
    
    
  4. streamR Package:
    1. This package allows users to fetch Twitter data in real time by connecting to the Twitter Streaming API.
    2. We can obtain the package from CRAN (cran.r-project.org).
    3. It allows R users to access Twitter's search streams and user streams, and to parse the output into data frames.
    4. filterStream() - opens a connection to Twitter's Streaming API that returns public statuses matching one or more filter predicates, such as search keywords.
    5. Tweets can be filtered by keywords, users, language, and location.
    6. The output can be saved as an object in memory or written to a text file.
    7. parseTweets() - parses tweets downloaded using filterStream(), sampleStream(), or userStream() and returns a data frame.
  5. The code example below shows how to fetch data in real time using streamR:
    library(streamR)
    library(twitteR)
    load("twitteR_credentials")  # loads the credentials saved by the previous code
    registerTwitterOAuth(twitCred)
    filterStream(file.name = "tweets.json", track = "#bigdata", timeout = 0, locations = c(-74,40,-73,41), oauth = twitCred)
    Executing the above captures tweets mentioning "#bigdata" or posted from the New York area (the locations bounding box); note that the Streaming API combines keyword and location predicates with OR, not AND. Setting timeout = 0 makes the stream fetch continuously; to fetch records for a fixed period, use, e.g., timeout = 300 (fetches data for 300 seconds).
    To parse the fetched tweets, use the code below:
    tweets.df <- parseTweets("tweets.json")
    
    

Tuesday, March 21, 2017

Technology stack of BigData

Docker versus Kubernetes - comparison

  1. Introduction
    1. Containers have become popular thanks to their focus on consistency across platforms from development to production.
    2. The rise in interest in containers has in turn brought higher demands for their deployment and management.
    3. The need for better control attracted a number of software solutions for container orchestration, which allow individual containers to be abstracted into services with a number of instances or replicas.
    4. Two of the major players developing container orchestration are Docker and Kubernetes.
  2. Kubernetes
    1. Kubernetes is an open-source platform for container deployment automation, scaling, and operations across clusters of hosts. The production-ready orchestrator draws on Google's extensive experience of years of working with Linux containers.
    2. Kubernetes aims to provide the components and tools to relieve the burden of running applications in public and private clouds by grouping containers into logical units. Its strengths lie in flexible growth, environment-agnostic portability, and easy scaling.
  3. Docker Swarm
    1. Swarm is the native clustering for Docker. Originally, Docker Swarm did not provide much in the way of container automation, but with the update to Docker Engine 1.12, container orchestration is now built into its core with first-party support.
    2. Docker Swarm is designed around four core principles:
      1. Simple yet powerful with a “just works” user experience,
      2. Resilient zero single-point-of-failure architecture,
      3. Secure by default with automatically generated certificates, and
      4. Backwards compatibility with existing components.
    3. The promise of backwards compatibility is especially important to the existing users. Any tools or containers that work with Docker run equally well in Docker Swarm.
  4. Comparisons
    Although both orchestrators provide much of the same functionality, there are fundamental differences in how the two operate. The most notable points where these rivals diverge are summarized below.

  5. Summary
    1. Throughout the comparison, it is possible to note how Kubernetes and Docker Swarm fundamentally differ.
    2. Swarm focuses on ease of use with integration with Docker core components while Kubernetes remains open and modular.
    3. The same difference can be noticed while installing and configuring each of the orchestrators.
    4. Docker provides a simple solution that is fast to get started with while Kubernetes aims to support higher demands with higher complexity.
    5. For much of the same reasons, Docker has been popular among developers who prefer simplicity and fast deployments.
    6. At the same time, Kubernetes is used in production environments by many high profile internet companies running popular services
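
    As a rough illustration of the difference in getting started, a single-node setup in each system looks something like the sketch below. This is a hedged example, not a full guide: the service name and image are placeholders, Swarm assumes Docker Engine 1.12+, and the Kubernetes side assumes a recent kubectl pointed at an existing cluster.

    ```shell
    # Docker Swarm: orchestration is built into the Docker Engine (1.12+)
    docker swarm init                      # turn this node into a swarm manager
    docker service create --name web --replicas 3 nginx   # run a replicated service

    # Kubernetes: a separate control plane, driven through kubectl
    kubectl create deployment web --image=nginx   # create a deployment
    kubectl scale deployment web --replicas=3     # scale it to 3 replicas
    ```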

Friday, March 3, 2017

Kafka Multiple Topics

Producer and multiple Topics

    Download a recent stable version of Apache Kafka.
  1. Untar the package.
  2. Enter the Kafka directory.
  3. Start the ZooKeeper server:
  4. bin/zookeeper-server-start.sh config/zookeeper.properties
  5. In a different terminal, start the Kafka server:
  6. bin/kafka-server-start.sh config/server.properties
  7. Create the topic example2 (if it does not exist):
  8. bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic example2
  9. Create the topic example3 (if it does not exist):
  10. bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic example3
  11. Start a consumer on topic example2:
  12. bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic example2 --from-beginning
  13. Start a consumer on topic example3:
  14. bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic example3 --from-beginning
  15. Run mvn clean compile exec:exec
  16. You should see the messages in the consumer terminals.
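    Before running the producer, you can confirm that both topics were created by listing them (assuming the same ZooKeeper address as in the steps above):

    ```shell
    bin/kafka-topics.sh --list --zookeeper localhost:2181
    ```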
    Here are the project sources:
  1. Project POM.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>

      <parent>
        <relativePath>../pom.xml</relativePath>
        <version>1.0.0-SNAPSHOT</version>
        <groupId>com.example.kafka</groupId>
        <artifactId>kafka-examples</artifactId>
      </parent>

      <artifactId>kafka-producer-multiple-topics</artifactId>
      <name>kafka-producer-multiple-topics</name>

      <build>
        <plugins>
          <plugin>
            <groupId>org.codehaus.mojo</groupId>
            <artifactId>exec-maven-plugin</artifactId>
            <executions>
              <execution>
                <goals>
                  <goal>exec</goal>
                </goals>
              </execution>
            </executions>
            <configuration>
              <executable>java</executable>
              <arguments>
                <argument>-classpath</argument>
                <classpath/>
                <argument>com.example.kafka.ProducerMultipleTopic</argument>
              </arguments>
            </configuration>
          </plugin>
        </plugins>
      </build>
    </project>


  2. Producer Java File
    package com.example.kafka;
    
    import java.util.Properties;
    
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;
    
    public class ProducerMultipleTopic {
    
        public static void main(String[] args) {
    
            // Producer configuration: broker address and key/value serializers
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());
    
            KafkaProducer<String, String> prod = new KafkaProducer<String, String>(props);
    
            // One record per topic: a single producer can write to any number of topics
            ProducerRecord<String, String> data1 = new ProducerRecord<String, String>("example2", "example2");
            ProducerRecord<String, String> data2 = new ProducerRecord<String, String>("example3", "example3");
    
            prod.send(data1);
            prod.send(data2);
    
            prod.close();
        }
    }
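    Note that send() is asynchronous and returns immediately. If you want confirmation of where each record landed, send() also accepts a callback. The fragment below is a sketch meant to replace a plain send() call inside main() above (it reuses the prod variable and the placeholder topic from the example, and needs a running broker to do anything):

    ```java
    import org.apache.kafka.clients.producer.Callback;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    // Inside main(), after the producer has been created:
    prod.send(new ProducerRecord<String, String>("example2", "example2"), new Callback() {
        public void onCompletion(RecordMetadata metadata, Exception e) {
            if (e != null) {
                e.printStackTrace();  // delivery failed
            } else {
                // metadata reports where the record landed: topic, partition, offset
                System.out.printf("sent to %s-%d@%d%n",
                        metadata.topic(), metadata.partition(), metadata.offset());
            }
        }
    });
    ```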
    
  3. Logger file (log4j.properties)
    # Root logger option
    log4j.rootLogger=INFO, stdout
    
    # Direct log messages to stdout
    log4j.appender.stdout=org.apache.log4j.ConsoleAppender
    log4j.appender.stdout.Target=System.out
    log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
    log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
    
    


  4. This is the parent POM
    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>
      <groupId>com.example.kafka</groupId>
      <artifactId>kafka-examples</artifactId>
      <version>1.0.0-SNAPSHOT</version>
      <packaging>pom</packaging>

      <dependencies>
        <dependency>
          <groupId>org.apache.kafka</groupId>
          <artifactId>kafka_2.10</artifactId>
          <version>0.9.0.1</version>
        </dependency>
        <dependency>
          <groupId>org.apache.kafka</groupId>
          <artifactId>kafka-clients</artifactId>
          <version>0.9.0.1</version>
        </dependency>
        <dependency>
          <groupId>log4j</groupId>
          <artifactId>log4j</artifactId>
          <version>1.2.17</version>
        </dependency>
      </dependencies>

      <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
      </properties>

      <build>
        <pluginManagement>
          <plugins>
            <plugin>
              <groupId>org.apache.maven.plugins</groupId>
              <artifactId>maven-compiler-plugin</artifactId>
              <version>3.2</version>
              <configuration>
                <source>1.7</source>
                <target>1.7</target>
              </configuration>
            </plugin>
            <plugin>
              <groupId>org.codehaus.mojo</groupId>
              <artifactId>exec-maven-plugin</artifactId>
              <version>1.3.2</version>
            </plugin>
          </plugins>
        </pluginManagement>
      </build>

      <modules>
        <module>kafka-producer-multiple-topics</module>
      </modules>
    </project>