My Quotes


Wednesday, January 10, 2018

ELK - Elasticsearch, Logstash and Kibana - an alternative to Splunk

ELK - Architecture

For more information on Kibana here is a nice article

  1. Step 1- Install Elasticsearch
    1. Download elasticsearch zip file from
    2. Extract it to a directory (unzip it)
    3. Run it (bin/elasticsearch or bin/elasticsearch.bat on Windows)
    4. Check that it runs using curl -XGET http://localhost:9200
    5. Here's how to do it (steps are written for OS X but should be similar on other systems):
cd elasticsearch-1.7.1
  1. Elasticsearch should be running now. You can verify it's running using curl. In a separate terminal window execute a GET request to Elasticsearch's status page:
curl -XGET http://localhost:9200
  1. If all is well, you should get the following result:
{
  "status" : 200,
  "name" : "Tartarus",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "1.7.1",
    "build_hash" : "b88f43fc40b0bcd7f173a1f9ee2e97816de80b19",
    "build_timestamp" : "2015-07-29T09:54:16Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.4"
  },
  "tagline" : "You Know, for Search"
}
  1. Step 2 - Install Kibana 4
  2. Download Kibana archive from
  3. Note that you need to download the appropriate distribution for your OS; the URL given in the examples below is for OS X
  4. Extract the archive
  5. Run it (bin/kibana)
  6. Check that it runs by pointing the browser to Kibana's web UI
tar xvzf kibana-4.1.1-darwin-x64.tar.gz
cd kibana-4.1.1-darwin-x64
  1. Point your browser to http://localhost:5601 (if Kibana page shows up, we're good - we'll configure it later)
  1. Step 3) Install Logstash
  2. Download Logstash zip from
  3. Extract it (unzip it)
  1. Step 4) Configure Spring Boot's Log File
  2. In order to have Logstash ship log files to Elasticsearch, we must first configure Spring Boot to store log entries into a file.
  3. We will establish the following pipeline: Spring Boot App --> Log File --> Logstash --> Elasticsearch.
  4. There are other ways of accomplishing the same thing, such as configuring logback to use TCP appender to send logs to a remote Logstash instance via TCP, and many other configurations.
  5. Anyhow, let's configure Spring Boot's log file.
  6. The simplest way to do this is to configure log file name in
  7. It's enough to add the following line:
Spring Boot will now log ERROR, WARN and INFO level messages to the application.log file and will rotate it when it reaches 10 MB.
  1. Step 5) Configure Logstash to Understand Spring Boot's Log File Format
  2. A typical Logstash config file consists of three main sections: input, filter and output.
  3. Each section contains plugins that do the relevant part of the processing, such as the file input plugin that reads log events from a file or the elasticsearch output plugin that sends log events to Elasticsearch.
  4. The input section defines where Logstash reads its input data from. In our case it is a file, so we use the file plugin with the multiline codec, which means that our input file may have multiple lines per log entry.
input {
  file {
    type => "java"
    path => "/path/to/application.log"
    codec => multiline {
      pattern => "^%{YEAR}-%{MONTHNUM}-%{MONTHDAY} %{TIME}.*"
      negate => "true"
      what => "previous"
    }
  }
}
  1. Explanation
  2. We're using the file plugin.
  3. type is set to java. It's just an additional piece of metadata in case you use multiple types of log files in the future.
  4. path is the absolute path to the log file. It must be absolute; Logstash is picky about this.
  5. We're using the multiline codec, which means that multiple lines may correspond to a single log event.
  6. In order to detect lines that should logically be grouped with a previous line we use a detection pattern:
  7. pattern => "^%{YEAR}-%{MONTHNUM}-%{MONTHDAY} %{TIME}.*": each new log event needs to start with a date.
  8. negate => "true": if a line doesn't start with a date ...
  9. what => "previous": ... then it should be grouped with the previous line.
  10. The file input plugin, as configured, will tail the log file (i.e. only read new entries at the end of the file). Therefore, when testing, you will need to generate new log entries for Logstash to read something.
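To make the grouping rule concrete, here is a minimal Python sketch of what the multiline codec does with negate and what => "previous". It is only an illustration, not Logstash's implementation: the names EVENT_START and group_multiline are made up for this example, and the regular expression is a simplified stand-in for the grok date pattern.

```python
import re

# Simplified stand-in for "^%{YEAR}-%{MONTHNUM}-%{MONTHDAY} %{TIME}".
EVENT_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def group_multiline(lines):
    """Merge every line that does not start with a date into the previous event."""
    events = []
    for line in lines:
        if EVENT_START.match(line) or not events:
            events.append(line)            # a new log event starts here
        else:
            events[-1] += "\n" + line      # negate => "true", what => "previous"
    return events


sample = [
    "2018-01-10 12:00:00.000  INFO 1 --- [main] a.B : started",
    "2018-01-10 12:00:01.000 ERROR 1 --- [main] a.B : failed",
    "java.lang.IllegalStateException: boom",
    "\tat a.B.run(B.java:10)",
    "2018-01-10 12:00:02.000  INFO 1 --- [main] a.B : done",
]
events = group_multiline(sample)
```

With the sample above, the exception line and the stack frame end up attached to the ERROR event, so the five input lines become three events.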
  1. Filter Section
  2. The filter section contains plugins that perform intermediary processing on a log event.
  3. In our case, an event will either be a single log line or a multiline log event grouped according to the rules described above.
  4. In the filter section we will do several things:
  5. Tag a log event if it contains a stacktrace. This will be useful when searching for exceptions later on.
  6. Parse out (or grok, in Logstash terminology) the timestamp, log level, pid, thread, class name (logger, actually) and log message.
  7. Specify the timestamp field and format. Kibana will use that later for time-based searches.
  8. The filter section for Spring Boot's log format that does the aforementioned things looks like this:
filter {
  #If log line contains tab character followed by 'at' then we will tag that entry as stacktrace
  if [message] =~ "\tat" {
    grok {
      match => ["message", "^(\tat)"]
      add_tag => ["stacktrace"]
    }
  }

  #Grokking Spring Boot's default log format
  grok {
    match => [ "message",
               "(?<timestamp>%{YEAR}-%{MONTHNUM}-%{MONTHDAY} %{TIME})  %{LOGLEVEL:level} %{NUMBER:pid} --- \[(?<thread>[A-Za-z0-9-]+)\] [A-Za-z0-9.]*\.(?<class>[A-Za-z0-9#_]+)\s*:\s+(?<logmessage>.*)",
               "(?<timestamp>%{YEAR}-%{MONTHNUM}-%{MONTHDAY} %{TIME})  %{LOGLEVEL:level} %{NUMBER:pid} --- .+? :\s+(?<logmessage>.*)"
             ]
  }

  #Parsing out timestamps which are in timestamp field thanks to previous grok section
  date {
    match => [ "timestamp" , "yyyy-MM-dd HH:mm:ss.SSS" ]
  }
}
  1. Explanation:
  2. if [message] =~ "\tat": if the message contains a tab character followed by at (this is Ruby syntax) then...
  3. Use the grok plugin to tag stacktraces:
  4. match => ["message", "^(\tat)"]: when the message matches the beginning of a line followed by a tab followed by at, then...
  5. add_tag => ["stacktrace"]: ... tag the event with the stacktrace tag.
  6. Use the grok plugin for regular Spring Boot log message parsing:
  7. The first pattern extracts the timestamp, level, pid, thread, class name (this is actually the logger name) and the log message.
  8. Unfortunately, some log messages don't have a logger name that resembles a class name (for example, Tomcat logs), hence the second pattern, which skips the thread and logger/class fields and parses out the timestamp, level, pid and the log message.
  9. Use the date plugin to parse and set the event date:
  10. match => [ "timestamp" , "yyyy-MM-dd HH:mm:ss.SSS" ]: the timestamp field (grokked earlier) contains the timestamp in the specified format
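The same three filter steps (tag stacktraces, grok the fields, parse the date) can be sketched in plain Python to see what ends up in each field. This is a rough approximation under my own assumptions: LOG_LINE is a hand-written regular expression, not a translation of the grok patterns, and parse_event is a hypothetical helper, not part of Logstash.

```python
import re
from datetime import datetime

# Hand-written approximation of Spring Boot's default log line layout.
LOG_LINE = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<pid>\d+) --- "
    r"\[\s*(?P<thread>[^\]]+?)\s*\]\s+"
    r"(?P<logger>\S+)\s*:\s+(?P<logmessage>.*)",
    re.DOTALL,  # let the message span the lines merged by the multiline step
)


def parse_event(message):
    """Return the parsed fields of one log event, or None if it does not match."""
    match = LOG_LINE.match(message)
    if match is None:
        return None
    event = match.groupdict()
    event["pid"] = int(event["pid"])
    # Keep only the short class name, e.g. "DemoApplication".
    event["class"] = event.pop("logger").rsplit(".", 1)[-1]
    # Counterpart of the date filter with format yyyy-MM-dd HH:mm:ss.SSS.
    event["timestamp"] = datetime.strptime(event["timestamp"], "%Y-%m-%d %H:%M:%S.%f")
    # Counterpart of the stacktrace tagging: a tab followed by "at".
    event["tags"] = ["stacktrace"] if "\tat " in message else []
    return event


line = ("2018-01-10 12:34:56.789  INFO 4242 --- [           main] "
        "com.example.DemoApplication          : Started DemoApplication in 3.2 seconds")
event = parse_event(line)
failed = parse_event(line + "\njava.lang.IllegalStateException: boom\n\tat com.example.Foo.bar(Foo.java:10)")
```

The sample line and the com.example names are invented for the example; real log lines may need a more forgiving expression.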
  1. Output Section
  2. Output section contains output plugins that send event data to a particular destination.
  3. Outputs are the final stage in the event pipeline.
  4. We will be sending our log events to stdout (console output, for debugging) and to Elasticsearch.
  5. Compared to the filter section, the output section is rather straightforward:
output {
  # Print each event to stdout, useful for debugging. Should be commented out in production.
  # Enabling 'rubydebug' codec on the stdout output will make logstash
  # pretty-print the entire event as something similar to a JSON representation.
  stdout {
    codec => rubydebug
  }

  # Sending properly parsed log events to elasticsearch
  elasticsearch {
   hosts => [""]  #  takes an array of hosts (e.g. elasticsearch cluster) as value.
  }
}
  1. Putting it all together
  2. Finally, the three parts - input, filter and output - need to be copy pasted together and saved into logstash.conf config file.
  3. Once the config file is in place and Elasticsearch is running, we can run Logstash:
  4. /path/to/logstash/bin/logstash -f logstash.conf
  5. If everything went well, Logstash is now shipping log events to Elasticsearch.
  1. Step 6) Configure Kibana
  2. Ok, now it's time to visit the Kibana web UI again.
  3. We have started it in step 2 and it should be running at http://localhost:5601.
  4. First, you need to point Kibana to the Elasticsearch index (or indices) of your choice.
  5. Logstash creates indices with the name pattern of logstash-YYYY.MM.DD.
  6. In Kibana Settings --> Indices configure the indices:
  7. Index contains time-based events (select this option)
  8. Use event times to create index names (select this option)
  9. Index pattern interval: Daily
  10. Index name or pattern: [logstash-]YYYY.MM.DD
  11. Click on "Create Index"
  12. Now click on "Discover" tab.
  13. This is the place for searching: it allows you to perform new searches and also to save and manage them.
  14. Log events should be showing up now in the main window.
  15. If they're not, then double check the time period filter in the top right corner of the screen.
  16. The default table has two columns: Time and _source.
  17. In order to make the listing more useful, we can configure the displayed columns.
  18. From the menu on the left select level, class and logmessage.
 Here is a sample output screenshot of the Kibana console
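The daily index pattern used in step 6 can be illustrated with a small Python sketch that derives the index name from an event's timestamp. It assumes, as far as I know Logstash's default behaviour, that the date part is taken from the event time in UTC; daily_index_name is a hypothetical helper written for this example.

```python
from datetime import datetime, timedelta, timezone


def daily_index_name(event_time, prefix="logstash-"):
    """Build an index name of the form logstash-YYYY.MM.DD from a timezone-aware timestamp."""
    if event_time.tzinfo is None:
        raise ValueError("event_time must be timezone-aware")
    utc_time = event_time.astimezone(timezone.utc)
    return prefix + utc_time.strftime("%Y.%m.%d")


# 01:00 on January 11 at UTC+2 is still January 10 in UTC.
late_event = datetime(2018, 1, 11, 1, 0, tzinfo=timezone(timedelta(hours=2)))
index_name = daily_index_name(late_event)
```

This is why events logged shortly after local midnight can show up in the index of the previous day.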

Tuesday, April 4, 2017

Fetch TWITTER data using R


  1. twitteR Package:
    1. One of the packages available in R for fetching Twitter data. The package can be obtained from CRAN (cran.r-project.org).
    2. This package allows us to make REST API calls to Twitter using the ConsumerKey and ConsumerSecret codes. The code below illustrates how to extract the Twitter data.
    3. This package offers below functionality:
      1. Authenticate with Twitter API
      2. Fetch User timeline
      3. User Followers
      4. User Mentions
      5. Search twitter
      6. User Information
      7. User Trends
      8. Convert JSON object to dataframes
  2. REST API CALLS using R - twitteR package:
    1. Register your application with twitter.
    2. After registration, you will be getting ConsumerKey & ConsumerSecret code which needs to be used for calling twitter API.
    3. Load the twitteR library in the R environment.
    4. Call the Twitter API using the OAuthFactory$new() method with the ConsumerKey and ConsumerSecret codes as input parameters.
    5. The above step will return an authorization link, which needs to be copied and pasted into the internet browser.
    6. You will be redirected to the Twitter application authentication page, where you need to authenticate yourself by providing your Twitter credentials.
    7. After authenticating, you will be provided with an authorization code, which needs to be pasted into the R console.
    8. Call registerTwitterOAuth().
    9. friends information
    10. Location based
  3. Source Code:
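    library(ROAuth)   # provides OAuthFactory
    library(twitteR)  # provides registerTwitterOAuth
    requestURL <- ""
    accessURL <- ""
    authURL <- ""
    consumerKey <- "XXXXXXXXXXXX"
    consumerSecret <- "XXXXXXXXXXXXXXXX"
    twitCred <- OAuthFactory$new(consumerKey = consumerKey,
                                 consumerSecret = consumerSecret,
                                 requestURL = requestURL,
                                 accessURL = accessURL,
                                 authURL = authURL)
    twitCred$handshake()  # prints the authorization link and asks for the code
    save(list = "twitCred", file = "twitteR_credentials")
    registerTwitterOAuth(twitCred)  # register the credentials with twitteR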
  4. StreamR Package:
    1. This package allows users to fetch twitter Data in real time by connecting to Twitter Stream API.
    2. We can obtain the package from STREAM.R.PROJECT
    3. Its most important functions allow R users to access Twitter's search streams and user streams and to parse the output into data frames.
    4. filterStream() - filterStream method opens a connection to Twitter’s Streaming API that will return public statuses that match one or more filter predicates like search keywords.
    5. Tweets can be filtered by keywords, users, language, and location.
    6. The output can be saved as an object in memory or written to a text file.
    7. parseTweets() - This function parses tweets downloaded using filterStream, sampleStream or userStream and returns a data frame.
  5. The code example below shows how to fetch data in real time using streamR:
    library(streamR)
    load("twitteR_credentials")  # load the credentials saved by the previous code
    filterStream(file.name = "tweets.json", track = "#bigdata", timeout = 0, locations = c(-74, 40, -73, 41), oauth = twitCred)
    Executing the above captures tweets that mention "#bigdata" or that come from the New York area (Twitter's Streaming API combines the track and locations filters with OR, not AND). With timeout = 0 the stream is fetched continuously; to fetch records only for a certain time, set for example timeout = 300 (fetches data for 300 seconds).
    To parse the fetched tweets use the code below:
    tweets.df <- parseTweets("tweets.json")
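The locations argument is a bounding box given as longitude/latitude pairs with the southwest corner first. The Python sketch below shows how such a box can be read and how it combines with a keyword; it is my own illustration (in_bounding_box and matches_filter are hypothetical helpers, not part of streamR), and it assumes the keyword and the box are combined with OR, which is how I understand Twitter's Streaming API to treat track and locations.

```python
# (southwest longitude, southwest latitude, northeast longitude, northeast latitude)
NEW_YORK_BOX = (-74.0, 40.0, -73.0, 41.0)


def in_bounding_box(lon, lat, box):
    """Return True if the point lies inside the box (edges included)."""
    sw_lon, sw_lat, ne_lon, ne_lat = box
    return sw_lon <= lon <= ne_lon and sw_lat <= lat <= ne_lat


def matches_filter(text, lon, lat, keyword="#bigdata", box=NEW_YORK_BOX):
    """A tweet matches if it mentions the keyword OR was sent from inside the box."""
    has_keyword = keyword.lower() in text.lower()
    in_box = lon is not None and lat is not None and in_bounding_box(lon, lat, box)
    return has_keyword or in_box
```

To keep only tweets that satisfy both conditions, the data frame returned by parseTweets would have to be filtered again after download.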

Tuesday, March 21, 2017

Technology stack of BigData

Docker versus Kubernetes - comparison

  1. Introduction
    1. Containers have become popular thanks to their focus on consistency across platforms from development to production.
    2. The rise in interest in containers has in turn brought higher demands for their deployment and management.
    3. The need for better control attracted a number of software options as solutions for container orchestration, which allows for abstraction of individual containers to services with a number of instances or replicas.
    4. Two of the major players developing container orchestration are Docker and Kubernetes.
  2. Kubernetes
    1. Kubernetes is an open-source platform for container deployment automation, scaling, and operations across clusters of hosts. The production-ready orchestrator draws on Google’s extensive experience of years of working with Linux containers.
    2. Kubernetes aims to provide the components and tools to relieve the burden of running applications in public and private clouds by grouping containers into logical units. Its strengths lie in flexible growth, environment-agnostic portability, and easy scaling.
  3. Docker Swarm
    1. Swarm is the native clustering for Docker. Originally Docker Swarm did not provide much in the sense of container automation, but with the update to Docker Engine 1.12, container orchestration is now built into its core with first party support.
    2. Docker Swarm is designed around four core principles:
      1. Simple yet powerful with a “just works” user experience,
      2. Resilient zero single-point-of-failure architecture,
      3. Secure by default with automatically generated certificates, and
      4. Backwards compatibility with existing components.
    3. The promise of backwards compatibility is especially important to the existing users. Any tools or containers that work with Docker run equally well in Docker Swarm.
  4. Comparisons
    Although both orchestrators provide much of the same functionality, there are fundamental differences in how the two operate. Listed below are some of the most notable points where these rivals diverge.

  5. Summary
    1. Throughout the comparison, it is possible to note how Kubernetes and Docker Swarm fundamentally differ.
    2. Swarm focuses on ease of use with integration with Docker core components while Kubernetes remains open and modular.
    3. The same difference can be noticed while installing and configuring each of the orchestrators.
    4. Docker provides a simple solution that is fast to get started with while Kubernetes aims to support higher demands with higher complexity.
    5. For much the same reasons, Docker has been popular among developers who prefer simplicity and fast deployments.
    6. At the same time, Kubernetes is used in production environments by many high-profile internet companies running popular services.

Friday, March 3, 2017

Kafka Multiple Topic

Producer and multiple Topics

    Download a recent stable version of Apache Kafka
  1. Untar the package
  2. Enter into Kafka directory
  3. Start Zookeeper Server
  4. bin/ config/
  5. In a different terminal start Kafka Server
  6. bin/ config/
  7. Create the topic example2 (if it does not exist)
  8. bin/ --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic example2
  9. Create the topic example3 (if it does not exist)
  10. bin/ --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic example3
  11. Start a consumer on topic example2
  12. bin/ --zookeeper localhost:2181 --topic example2 --from-beginning
  13. Start a consumer on topic example3
  14. bin/ --zookeeper localhost:2181 --topic example3 --from-beginning
  15. Run mvn clean compile exec:exec
  16. You should see the message in the consumer terminal.
    Here are the project sources
  1. Project POM.xml

  2. Producer Java File
    package com.example.kafka;

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ProducerMultipleTopic {
        public static void main(String[] args) {
            // Connection and serializer settings for a local single-broker setup
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            // try-with-resources closes the producer, which also flushes pending records
            try (KafkaProducer<String, String> prod = new KafkaProducer<>(props)) {
                // One record per topic; the two-argument constructor takes (topic, value)
                ProducerRecord<String, String> data1 = new ProducerRecord<>("example2", "example2");
                ProducerRecord<String, String> data2 = new ProducerRecord<>("example3", "example3");
                prod.send(data1);
                prod.send(data2);
            }
        }
    }
  3. Logger file
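    # Root logger option
    log4j.rootLogger=INFO, stdout
    # Direct log messages to stdout
    log4j.appender.stdout=org.apache.log4j.ConsoleAppender
    log4j.appender.stdout.Target=System.out
    log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
    log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n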

  4. This is the PARENT POM

Thursday, December 1, 2016

Application Container versus System Container

When people talk about containers, they usually mean application containers. Docker is automatically associated with application containers and is widely used to package applications and services. But there is another type of container: system containers. Let us look at the differences between application containers vs. system containers and see how each type of container is used:

  1. Application Containers 
    • Application/service centric
    • Growing tool ecosystem
    • Security concerns
    • Networking challenges
    • Hampered by base OS limitations 
  2. System Containers

    • Machine-centric
    • Limited tool ecosystem
    • Datacenter-centric
    • Isolated & secure
    • Optimized networking

  • Application Containers

Application containers are used to package applications without launching a virtual machine for each app or each service within an app. They are especially beneficial when making the move to a microservices architecture, as they allow you to create a separate container for each application component and provide greater control, security and process restriction. Ultimately, what you get from application containers is easier distribution. The risks of inconsistency, unreliability and compatibility issues are reduced significantly if an application is placed and shipped inside a container.
Docker is currently the most widely adopted container service provider with a focus on application containers. However, there are other container technologies like CoreOS’s Rocket. Rocket promises better security, portability and flexibility of image sharing. Docker already enjoys the advantage of mass adoption, and Rocket might just be too late to the container party. Even with its differences, Docker is still the unofficial standard for application containers today.

Docker Datacenter enables the deployment of containerized apps across multiple environments, from on-premises to virtual private cloud infrastructure.

With Docker Datacenter you can provide a Containers as a Service (CaaS) environment for your teams.

Deploying Docker Datacenter provides options for container deployment:

  • On-premises. Docker can be deployed to on-premises datacenters.
  • Virtual Private Cloud. Docker can be deployed to virtual private cloud environments including Microsoft Azure and Amazon Web Services.
  • Portability. With Docker, you retain control of where you deploy your app.

As the use of containers increases and organizations deploy them more widely, the need for tools to manage containers across the infrastructure also increases. Orchestrating a cluster of containers is a competitive and rapidly evolving area, and many tools exist offering various feature sets.

Container orchestration tools can be broadly defined as providing an enterprise-level framework for integrating and managing containers at scale. Such tools aim to simplify container management and provide a framework not only for defining initial container deployment but also for managing multiple containers as one entity -- for purposes of availability, scaling, and networking.

Some container orchestration tools to know about include:

  • Amazon ECS -- The Amazon EC2 Container Service (ECS) supports Docker containers and lets you run applications on a managed cluster of Amazon EC2 instances.
  • Azure Container Service (ACS) -- ACS lets you create a cluster of virtual machines that act as container hosts along with master machines that are used to manage your application containers.
  • Cloud Foundry’s Diego -- Diego is a container management system that combines a scheduler, runner, and health manager. It is a rewrite of the Cloud Foundry runtime.
  • CoreOS Fleet -- Fleet is a container management tool that lets you deploy Docker containers on hosts in a cluster as well as distribute services across a cluster.
  • Docker Swarm -- Docker Swarm provides native clustering functionality for Docker containers, which lets you turn a group of Docker engines into a single, virtual Docker engine.
  • Docker Shipyard -- Shipyard is a handy tool for people who love Docker Swarm but wish it did even more. While Swarm focuses on container orchestration through the CLI, Docker Shipyard takes things further by letting you manage app images and container registries in addition to containers themselves. Plus, Shipyard offers a Web-based graphical front-end and a rich API in addition to a CLI.
  • Google Container Engine -- Google Container Engine, which is built on Kubernetes, lets you run Docker containers on the Google Cloud platform. It schedules containers into the cluster and manages them based on user-defined requirements.
  • Kubernetes -- Kubernetes is an orchestration system for Docker containers. It handles scheduling and manages workloads based on user-defined parameters.
  • Mesosphere Marathon -- Marathon is a container orchestration framework for Apache Mesos that is designed to launch long-running applications. It offers key features for running applications in a clustered environment.

Additionally, the Cloud Native Computing Foundation (CNCF) is focused on integrating the orchestration layer of the container ecosystem. The CNCF’s stated goal is to create and drive adoption of a new set of common container technologies, and it recently selected Google’s Kubernetes container orchestration tool as its first containerization technology.

  • System Containers: How They’re Used

System containers play a similar role to virtual machines, as they share the kernel of the host operating system and provide user space isolation. However, system containers do not use hypervisors. (Any container that runs an OS is a system container.) They also allow you to install different libraries, languages, and databases. Services running in each container use resources that are assigned to just that container.

System containers let you run multiple processes at the same time, all under the same OS and not a separate guest OS. This lowers the performance impact, and provides the benefits of VMs, like running multiple processes, along with the new benefits of containers like better portability and quick startup times.

  • Useful System Container Tools
    • Joyent’s Triton is a Container as a Service that implements its proprietary OS called SmartOS. It not only focuses on packing apps into containers but also provides the benefits of added security, networking and storage, while keeping things lightweight, with very little performance impact. The key differentiator is that Triton delivers bare-metal performance. With Samsung’s recent acquisition of Joyent, it’s left to be seen how Triton progresses.
    • Giant Swarm is a hosted cloud platform that offers a Docker-based virtualization system that is configured for microservices. It helps businesses manage their development stack, spend less time on operations setup, and more time on active development.
    • LXD is a fairly new OS container that was released in 2016 by Canonical, the creators of Ubuntu. It combines the speed and efficiency of containers with the famed security of virtual machines. Since Docker and LXD share the same kernels, it is easy to run Docker containers inside LXD containers.

Friday, May 13, 2016

Web Services best practices

  1. Use XML Schema to define the input and output of your Web Service operations
  2. A Web Service should be defined with a WSDL (or WADL in case of REST) and all responses returned by the Web Service should comply with the advertised WSDL
  3. Do not use a proprietary authentication protocol for your Web Service.
  4. Rather, use common standards like HTTP authentication or Kerberos.
  5. Or define username/password as part of your XML payload and expose your Web Service via SSL.
  6. Make sure your Web Service returns error messages that are useful for debugging/tracking problems.
  7. Make sure to offer a development environment for your service, which preferably runs the same Web Service version as production, but off of a test database rather than production data.
  8. Important to retain
    • Naming conventions
    • parameter validation
    • parameter order
  9. Keep no session data
  10. A resource does not need to be in a known state
  11. The request alone contains all the information needed
  12. Always include a version parameter
  13. Handle multiple formats
  14. Use heartbeat methods
    • method that does nothing with no authentication
    • shows service is alive
  15. All services should be
    • accessible
    • documented
    • robust
    • reliable
    • simple
    • predictable
  16. Always implement a reliability error listener.
  17. Group messages into units of work
  18. Set the acknowledgement interval to a realistic value for your particular scenario.
  19. Set timeouts (inactivity and sequence expiration) to realistic values for your particular scenario.
  20. Configure Web service persistence and buffering (optional) to support asynchronous Web service invocation.
  21. Choose between three transport types: asynchronous client transport, MakeConnection transport, and synchronous transport.
  22. Using WS-Policy to Specify Reliable Messaging Policy Assertions
    • At Most Once
    • At Least Once
    • Exactly Once
    • In Order
  23. Define a logical store for each administrative unit (for example, business unit, department, and so on).
  24. Use the correct logical store for each client or service related to the administrative unit.
  25. Define separate physical stores and buffering queues for each logical store.
  26. Using the @Transactional Annotation
  27. Enabling Web Services Atomic Transactions on Web Services
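A few of the practices above (no session data, a version parameter on every call, a heartbeat method without authentication, and error messages that help with debugging) can be sketched as a single stateless request handler. This is only a toy illustration in Python under my own assumptions: the request is modelled as a plain dictionary, and handle_request, SUPPORTED_VERSIONS and the echo operation are hypothetical names, not part of any Web Service framework.

```python
SUPPORTED_VERSIONS = ("1", "2")  # hypothetical version identifiers


def handle_request(request):
    """Handle one request using only the data carried by the request itself."""
    method = request.get("method")

    # Heartbeat: does nothing, needs no authentication and no version.
    if method == "heartbeat":
        return {"status": 200, "body": "alive"}

    # Every other call must state which service version it targets.
    version = request.get("version")
    if version not in SUPPORTED_VERSIONS:
        supported = ", ".join(SUPPORTED_VERSIONS)
        return {
            "status": 400,
            "error": f"unsupported or missing version {version!r}; supported versions: {supported}",
        }

    if method == "echo":  # hypothetical business operation
        return {"status": 200, "body": request.get("payload", "")}

    return {"status": 404, "error": f"unknown method {method!r}"}
```

Because the handler keeps nothing between calls, any instance of the service can answer any request, and the error text tells the caller exactly which parameter to fix.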