My Quotes


When you were born, you cried and the world rejoiced.
Live your life in such a way that when you go,
THE WORLD SHOULD CRY






Tuesday, April 4, 2017

Fetch TWITTER data using R



  1. twitteR Package:
    1. One of the packages available in R for fetching Twitter data. The package can be obtained from CRAN (cran.r-project.org).
    2. This package allows us to make REST API calls to Twitter using a ConsumerKey & ConsumerSecret. The code below illustrates how to extract Twitter data.
    3. The package offers the following functionality (a short search example appears at the end of this post):
      1. Authenticate with the Twitter API
      2. Fetch a user's timeline
      3. User followers
      4. User mentions
      5. Search Twitter
      6. User information
      7. User trends
      8. Convert JSON objects to data frames
  2. REST API calls using R - twitteR package:
    1. Register your application with Twitter.
    2. After registration you will receive a ConsumerKey & ConsumerSecret, which are needed when calling the Twitter API.
    3. Load the twitteR library into the R environment.
    4. Call the Twitter API using the OAuthFactory$new() method, with the ConsumerKey & ConsumerSecret as input parameters.
    5. The above step returns an authorization link, which needs to be copied & pasted into a web browser.
    6. You will be redirected to the Twitter application authentication page, where you authenticate yourself by providing your Twitter credentials.
    7. After authenticating, you will be given an authorization code, which needs to be pasted into the R console.
    8. Call registerTwitterOAuth().
    9. Fetch friends information.
    10. Run location-based queries.
  3. Source Code:
    library(twitteR)
    library(ROAuth)    # provides OAuthFactory, used for the OAuth handshake below
    requestURL     <- "https://api.twitter.com/oauth/request_token"
    accessURL      <- "https://api.twitter.com/oauth/access_token"
    authURL        <- "https://api.twitter.com/oauth/authorize"
    consumerKey    <- "XXXXXXXXXXXX"
    consumerSecret <- "XXXXXXXXXXXXXXXX"
    twitCred <- OAuthFactory$new(consumerKey=consumerKey,
                                 consumerSecret=consumerSecret,
                                 requestURL=requestURL,
                                 accessURL=accessURL,
                                 authURL=authURL)
    # Download the CA certificate bundle so the SSL handshake works on every platform
    download.file(url="http://curl.haxx.se/ca/cacert.pem",
                  destfile="cacert.pem")
    twitCred$handshake(cainfo="cacert.pem")
    # Save the credentials so they can be reused later (e.g. by streamR below)
    save(list="twitCred", file="twitteR_credentials")
    load("twitteR_credentials")
    registerTwitterOAuth(twitCred)  # register your app with Twitter
    
    
  4. streamR Package:
    1. This package allows users to fetch Twitter data in real time by connecting to the Twitter Streaming API.
    2. We can obtain the package from CRAN (cran.r-project.org).
    3. It gives R users access to Twitter's search and user streams and parses the output into data frames.
    4. filterStream() - opens a connection to Twitter's Streaming API that returns public statuses matching one or more filter predicates, such as search keywords.
    5. Tweets can be filtered by keywords, users, language, and location.
    6. The output can be saved as an object in memory or written to a text file.
    7. parseTweets() - parses tweets downloaded using filterStream(), sampleStream() or userStream() and returns a data frame.
  5. The code example below shows how to fetch data in real time using streamR:
    library(streamR)
    library(twitteR)
    load("twitteR_credentials")  # reuse the credentials saved in the previous example
    registerTwitterOAuth(twitCred)
    filterStream(file.name = "tweets.json", track = "#bigdata", timeout = 0, locations = c(-74,40,-73,41), oauth = twitCred)
    Executing the above captures tweets on "#bigdata" from the New York area (the bounding box -74,40,-73,41). Setting timeout = 0 makes the stream fetch continuously; to fetch records for a fixed period, use e.g. timeout = 300 (fetches data for 300 seconds).
    To parse the fetched tweets, use the code below:
    tweets.df <- parseTweets("tweets.json")
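
  To round off the functionality listed under the twitteR package above, here is a minimal sketch of searching Twitter and converting the results to a data frame. It assumes the OAuth handshake from the earlier example has already completed; the hashtag and the counts are placeholder values.
    library(twitteR)
    load("twitteR_credentials")          # credentials saved during the OAuth handshake above
    registerTwitterOAuth(twitCred)
    # Search recent tweets; the query and n = 50 are arbitrary illustration values
    tweets <- searchTwitter("#bigdata", n = 50)
    tweets.df <- twListToDF(tweets)      # convert the list of status objects to a data frame
    # A user timeline can be fetched and converted the same way
    timeline <- userTimeline("twitter", n = 20)
    timeline.df <- twListToDF(timeline)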
    
    

Tuesday, March 21, 2017

Technology stack of BigData

Docker versus Kubernetes - comparison

  1. Introduction
    1. Containers have become popular thanks to their focus on consistency across platforms from development to production.
    2. The rise in interest in containers has in turn brought higher demands for their deployment and management.
    3. The need for better control has attracted a number of software solutions for container orchestration, which abstracts individual containers into services with a number of instances or replicas.
    4. Two of the major players developing container orchestration are Docker and Kubernetes.
  2. Kubernetes
    1. Kubernetes is an open-source platform for automating container deployment, scaling, and operations across clusters of hosts. This production-ready orchestrator draws on Google's years of experience working with Linux containers.
    2. Kubernetes aims to provide the components and tools to relieve the burden of running applications in public and private clouds by grouping containers into logical units. Its strengths are flexible growth, environment-agnostic portability, and easy scaling.
  3. Docker Swarm
    1. Swarm is the native clustering solution for Docker. Originally Docker Swarm did not provide much in the way of container orchestration, but with the update to Docker Engine 1.12, orchestration is now built into its core with first-party support.
    2. Docker Swarm is designed around four core principles:
      1. Simple yet powerful with a “just works” user experience,
      2. Resilient zero single-point-of-failure architecture,
      3. Secure by default with automatically generated certificates, and
      4. Backwards compatibility with existing components.
    3. The promise of backwards compatibility is especially important to existing users. Any tools or containers that work with Docker run equally well in Docker Swarm.
  4. Comparisons
    Although both orchestrators provide much of the same functionality, there are fundamental differences in how the two operate. Listed below are some of the most notable points where these rivals diverge.


  5. Summary
    1. Throughout the comparison, it is possible to note how Kubernetes and Docker Swarm fundamentally differ.
    2. Swarm focuses on ease of use and tight integration with the Docker core components, while Kubernetes remains open and modular.
    3. The same difference can be noticed while installing and configuring each of the orchestrators.
    4. Docker provides a simple solution that is fast to get started with, while Kubernetes aims to support higher demands with higher complexity.
    5. For much the same reasons, Docker has been popular among developers who prefer simplicity and fast deployments.
    6. At the same time, Kubernetes is used in production environments by many high-profile internet companies running popular services.

Friday, March 3, 2017

Kafka Multiple Topic

Producer and multiple Topics

    Download a recent stable version of Apache Kafka.
  1. Untar the package.
  2. Enter the Kafka directory.
  3. Start the Zookeeper server:
  4. bin/zookeeper-server-start.sh config/zookeeper.properties
  5. In a different terminal, start the Kafka server:
  6. bin/kafka-server-start.sh config/server.properties
  7. Create the topic example2 (if it does not exist):
  8. bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic example2
  9. Create the topic example3 (if it does not exist):
  10. bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic example3
  11. Start a consumer on topic example2:
  12. bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic example2 --from-beginning
  13. Start a consumer on topic example3:
  14. bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic example3 --from-beginning
  15. Run mvn clean compile exec:exec
  16. You should see the messages in both consumer terminals.
    Here are the project sources
  1. Project POM.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Module POM; the XML element tags were stripped in the original post, so the
         structure below is reconstructed around the values that were preserved. -->
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>

      <parent>
        <relativePath>../pom.xml</relativePath>
        <version>1.0.0-SNAPSHOT</version>
        <groupId>com.example.kafka</groupId>
        <artifactId>kafka-examples</artifactId>
      </parent>

      <artifactId>kafka-producer-multiple-topics</artifactId>
      <name>kafka-producer-multiple-topics</name>

      <build>
        <plugins>
          <plugin>
            <groupId>org.codehaus.mojo</groupId>
            <artifactId>exec-maven-plugin</artifactId>
            <executions>
              <execution>
                <goals>
                  <goal>exec</goal>
                </goals>
              </execution>
            </executions>
            <configuration>
              <executable>java</executable>
              <arguments>
                <argument>-classpath</argument>
                <classpath/>
                <argument>com.example.kafka.ProducerMultipleTopic</argument>
              </arguments>
            </configuration>
          </plugin>
        </plugins>
      </build>
    </project>
    
    
    

  2. Producer Java File
    package com.example.kafka;

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ProducerMultipleTopic {

        public static void main(String[] args) {

            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            // Typed producer: both keys and values are plain strings
            KafkaProducer<String, String> prod = new KafkaProducer<>(props);

            // One record per topic; send() is asynchronous (see the callback sketch after the parent POM)
            ProducerRecord<String, String> data1 = new ProducerRecord<>("example2", "example2");
            ProducerRecord<String, String> data2 = new ProducerRecord<>("example3", "example3");

            prod.send(data1);
            prod.send(data2);

            prod.close();
        }
    }
    
  3. Logger file
    # Root logger option
    log4j.rootLogger=INFO, stdout
    
    # Direct log messages to stdout
    log4j.appender.stdout=org.apache.log4j.ConsoleAppender
    log4j.appender.stdout.Target=System.out
    log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
    log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
    
    


  4. This is the parent POM
    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Parent POM; the XML element tags were stripped in the original post, so the
         structure below is reconstructed around the values that were preserved. -->
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>
      <groupId>com.example.kafka</groupId>
      <artifactId>kafka-examples</artifactId>
      <version>1.0.0-SNAPSHOT</version>
      <packaging>pom</packaging>

      <dependencies>
        <dependency>
          <groupId>org.apache.kafka</groupId>
          <artifactId>kafka_2.10</artifactId>
          <version>0.9.0.1</version>
        </dependency>
        <dependency>
          <groupId>org.apache.kafka</groupId>
          <artifactId>kafka-clients</artifactId>
          <version>0.9.0.1</version>
        </dependency>
        <dependency>
          <groupId>log4j</groupId>
          <artifactId>log4j</artifactId>
          <version>1.2.17</version>
        </dependency>
      </dependencies>

      <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
      </properties>

      <build>
        <pluginManagement>
          <plugins>
            <plugin>
              <groupId>org.apache.maven.plugins</groupId>
              <artifactId>maven-compiler-plugin</artifactId>
              <version>3.2</version>
              <configuration>
                <source>1.7</source>
                <target>1.7</target>
              </configuration>
            </plugin>
            <plugin>
              <groupId>org.codehaus.mojo</groupId>
              <artifactId>exec-maven-plugin</artifactId>
              <version>1.3.2</version>
            </plugin>
          </plugins>
        </pluginManagement>
      </build>

      <modules>
        <module>kafka-producer-multiple-topics</module>
      </modules>
    </project>
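
  A side note on the ProducerMultipleTopic class above: send() is asynchronous, so the program gives no feedback on whether the broker actually accepted the records. Below is a minimal sketch (not part of the original project sources; the class name is hypothetical) that attaches a delivery callback using the Callback interface from kafka-clients.
    package com.example.kafka;

    import java.util.Properties;

    import org.apache.kafka.clients.producer.Callback;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ProducerWithCallback {

        public static void main(String[] args) {

            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            KafkaProducer<String, String> prod = new KafkaProducer<>(props);

            // Report where each record landed, or why delivery failed
            Callback report = new Callback() {
                public void onCompletion(RecordMetadata metadata, Exception exception) {
                    if (exception != null) {
                        exception.printStackTrace();
                    } else {
                        System.out.println("Sent to " + metadata.topic()
                                + " partition " + metadata.partition()
                                + " offset " + metadata.offset());
                    }
                }
            };

            prod.send(new ProducerRecord<String, String>("example2", "example2"), report);
            prod.send(new ProducerRecord<String, String>("example3", "example3"), report);

            prod.close();   // close() waits for outstanding sends to complete
        }
    }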
      
    
    
    

Thursday, December 1, 2016

Application Container versus System Container



When people talk about containers, they usually mean application containers. Docker is automatically associated with application containers and is widely used to package applications and services. But there is another type of container: system containers. Let us look at the differences between application containers and system containers and see how each type is used:


  1. Application Containers
    • Application/service centric
    • Growing tool ecosystem
    • Security concerns
    • Networking challenges
    • Hampered by base OS limitations
  2. System Containers
    • Machine-centric
    • Limited tool ecosystem
    • Datacenter-centric
    • Isolated & secure
    • Optimized networking



  • Application Containers

Application containers are used to package applications without launching a virtual machine for each app or each service within an app. They are especially beneficial when making the move to a microservices architecture, as they allow you to create a separate container for each application component and provide greater control, security and process restriction. Ultimately, what you get from application containers is easier distribution. The risks of inconsistency, unreliability and compatibility issues are reduced significantly if an application is placed and shipped inside a container.
Docker is currently the most widely adopted container service provider with a focus on application containers. However, there are other container technologies like CoreOS’s Rocket. Rocket promises better security, portability and flexibility of image sharing. Docker already enjoys the advantage of mass adoption, and Rocket might just be too late to the container party. Even with its differences, Docker is still the unofficial standard for application containers today.

Docker Datacenter enables the deployment of containerized apps across multiple environments, from on-premises to virtual private cloud infrastructure.

With Docker Datacenter you can provide a Containers as a Service (CaaS) environment for your teams.

Deploying Docker Datacenter provides options for container deployment:

  • On-premises. Docker can be deployed to on-premises datacenters.
  • Virtual Private Cloud. Docker can be deployed to virtual private cloud environments including Microsoft Azure and Amazon Web Services.
  • Portability. With Docker, you retain control of where you deploy your app.

As the use of containers increases and organizations deploy them more widely, the need for tools to manage containers across the infrastructure also increases. Orchestrating a cluster of containers is a competitive and rapidly evolving area, and many tools exist offering various feature sets.

Container orchestration tools can be broadly defined as providing an enterprise-level framework for integrating and managing containers at scale. Such tools aim to simplify container management and provide a framework not only for defining initial container deployment but also for managing multiple containers as one entity -- for purposes of availability, scaling, and networking.

Some container orchestration tools to know about include:

  • Amazon ECS -- The Amazon EC2 Container Service (ECS) supports Docker containers and lets you run applications on a managed cluster of Amazon EC2 instances.
  • Azure Container Service (ACS) -- ACS lets you create a cluster of virtual machines that act as container hosts along with master machines that are used to manage your application containers.
  • Cloud Foundry’s Diego -- Diego is a container management system that combines a scheduler, runner, and health manager. It is a rewrite of the Cloud Foundry runtime.
  • CoreOS Fleet -- Fleet is a container management tool that lets you deploy Docker containers on hosts in a cluster as well as distribute services across a cluster.
  • Docker Swarm -- Docker Swarm provides native clustering functionality for Docker containers, which lets you turn a group of Docker engines into a single, virtual Docker engine.
  • Docker Shipyard -- Shipyard is a handy tool for people who love Docker Swarm but wish it did even more. While Swarm focuses on container orchestration through the CLI, Shipyard takes things further by letting you manage app images and container registries in addition to containers themselves. Plus, Shipyard offers a Web-based graphical front-end and a rich API in addition to a CLI.
  • Google Container Engine -- Google Container Engine, which is built on Kubernetes, lets you run Docker containers on the Google Cloud platform. It schedules containers into the cluster and manages them based on user-defined requirements.
  • Kubernetes -- Kubernetes is an orchestration system for Docker containers. It handles scheduling and manages workloads based on user-defined parameters.
  • Mesosphere Marathon -- Marathon is a container orchestration framework for Apache Mesos that is designed to launch long-running applications. It offers key features for running applications in a clustered environment.

Additionally, the Cloud Native Computing Foundation (CNCF) is focused on integrating the orchestration layer of the container ecosystem. The CNCF’s stated goal is to create and drive adoption of a new set of common container technologies, and it recently selected Google’s Kubernetes container orchestration tool as its first containerization technology.

  • System Containers: How They’re Used

System containers play a similar role to virtual machines, as they share the kernel of the host operating system and provide user space isolation. However, system containers do not use hypervisors. (Any container that runs an OS is a system container.) They also allow you to install different libraries, languages, and databases. Services running in each container use resources that are assigned to just that container.

System containers let you run multiple processes at the same time, all under the same OS and not a separate guest OS. This lowers the performance impact, and provides the benefits of VMs, like running multiple processes, along with the new benefits of containers like better portability and quick startup times.


  • Useful System Container Tools
    • Joyent’s Triton is a Container as a Service that implements its proprietary OS called SmartOS. It not only focuses on packing apps into containers but also provides the benefits of added security, networking and storage, while keeping things lightweight, with very little performance impact. The key differentiator is that Triton delivers bare-metal performance. With Samsung’s recent acquisition of Joyent, it’s left to be seen how Triton progresses.
    • Giant Swarm is a hosted cloud platform that offers a Docker-based virtualization system that is configured for microservices. It helps businesses manage their development stack, spend less time on operations setup, and more time on active development.
    • LXD is a fairly new OS container technology that was released in 2016 by Canonical, the creators of Ubuntu. It combines the speed and efficiency of containers with the famed security of virtual machines. Since Docker and LXD share the same kernel, it is easy to run Docker containers inside LXD containers.


Friday, May 13, 2016

Web Services best practices

  1. Use XML Schema to define the input and output of your Web Service operations
  2. A Web Service should be defined with a WSDL (or WADL in case of REST) and all responses returned by the Web Service should comply with the advertised WSDL
  3. Do not use a proprietary authentication protocol for your Web Service.
  4. Rather, use common standards like HTTP authentication or Kerberos.
  5. Or define username/password as part of your XML payload and expose your Web Service via SSL.
  6. Make sure your Web Service returns error messages that are useful for debugging/tracking problems.
  7. Make sure to offer a development environment for your service, which preferably runs the same Web Service version as production, but off of a test database rather than production data.
  8. Important to retain
    • Naming conventions
    • parameter validation
    • parameter order
  9. No session data
  10. A resource does not need to be in a known state
  11. The request alone contains all required information
  12. Always include a version parameter
  13. Handle multiple formats
  14. Use heartbeat methods (a minimal sketch follows this list)
    • A method that does nothing and requires no authentication
    • Shows the service is alive
  15. All services should be
    • accessible
    • documented
    • robust
    • reliable
    • simple
    • predictable
  16. Always implement a reliability error listener.
  17. Group messages into units of work
  18. Set the acknowledgement interval to a realistic value for your particular scenario.
  19. Set timeouts (inactivity and sequence expiration) to realistic values for your particular scenario.
  20. Configure Web service persistence and buffering (optional) to support asynchronous Web service invocation.
  21. Choose between three transport types: asynchronous client transport, MakeConnection transport, and synchronous transport.
  22. Using WS-Policy to Specify Reliable Messaging Policy Assertions
    • At Most Once
    • At Least Once
    • Exactly Once
    • In Order
  23. Define a logical store for each administrative unit (for example, business unit, department, and so on).
  24. Use the correct logical store for each client or service related to the administrative unit.
  25. Define separate physical stores and buffering queues for each logical store.
  26. Using the @Transactional Annotation
  27. Enabling Web Services Atomic Transactions on Web Services
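
  To illustrate item 14, here is a minimal heartbeat sketch using JAX-WS; the package, class, and endpoint address are hypothetical and only show the idea of an unauthenticated method that does nothing except confirm the service is alive.
    package com.example.ws;                       // hypothetical package

    import javax.jws.WebMethod;
    import javax.jws.WebService;
    import javax.xml.ws.Endpoint;

    @WebService
    public class HeartbeatService {

        // Heartbeat: no parameters, no authentication, no side effects
        @WebMethod
        public String ping() {
            return "OK";
        }

        public static void main(String[] args) {
            // Publish on a local address for testing; the WSDL is served at <address>?wsdl
            Endpoint.publish("http://localhost:8080/heartbeat", new HeartbeatService());
        }
    }
  Monitoring tools can call ping() on a schedule; any response at all means the service is up, and the method is cheap enough to call frequently.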

Thursday, May 5, 2016

Hibernate best practices

  1. Prefer crawling the object model over running queries
    • Querying in Hibernate always causes a flush
  2. Make everything lazy
    • The first read will be slow but everything else will be cached
  3. Use the second-level cache
  4. Use cascade cautiously
    • Hibernate is not good at saving a whole object tree in one go
  5. Use field access over method access
    • Will be faster since no reflection is used
  6. Use instrumentation
  7. Don't use auto-generated keys
    • You have to wait until the object is persisted before its equals method works
  8. Use id-based equality
  9. Write fine-grained classes and map them using <component>.
    • Use an Address class to encapsulate street, suburb, state, postcode. This encourages code reuse and simplifies refactoring.
  10. Declare identifier properties on persistent classes.
    • Hibernate makes identifier properties optional. There are all sorts of reasons why you should use them. We recommend that identifiers be 'synthetic' (generated, with no business meaning) and of a non-primitive type. For maximum flexibility, use java.lang.Long or java.lang.String.
  11. Place each class mapping in its own file.
    • Don't use a single monolithic mapping document. Map com.eg.Foo in the file com/eg/Foo.hbm.xml. This makes particularly good sense in a team environment.
  12. Load mappings as resources.
    • Deploy the mappings along with the classes they map.
  13. Consider externalising query strings.
    • Externalising the query strings to mapping files will make the application more portable.
  14. Use bind variables.
    • Even better, consider using named parameters in queries.
  15. Don't manage your own JDBC connections.
    • Hibernate lets the application manage JDBC connections. This approach should be considered a last-resort.
    • If you can't use the built-in connections providers, consider providing your own implementation of net.sf.hibernate.connection.ConnectionProvider.
  16. Consider using a custom type.
    • Suppose you have a Java type, say from some library, that needs to be persisted but doesn't provide the accessors needed to map it as a component.
    • You should consider implementing net.sf.hibernate.UserType.
    • This approach frees the application code from implementing transformations to / from a Hibernate type.
  17. Understand Session flushing.
    • From time to time the Session synchronizes its persistent state with the database.
    • Performance will be affected if this process occurs too often.
    • You may sometimes minimize unnecessary flushing by disabling automatic flushing or even by changing the order of queries and other operations
      within a particular transaction.
  18. In a three tiered architecture, consider using saveOrUpdate().
    • When using a servlet / session bean architecture, you could pass persistent objects loaded in the session bean to and from the servlet / JSP layer.
    • Use a new session to service each request. Use Session.update() or Session.saveOrUpdate() to update the persistent state of an object.
  19. In a two tiered architecture, consider using session disconnection.
    • Database Transactions have to be as short as possible for best scalability.
    • This Application Transaction might span several client requests and response cycles.
    • Either use Detached Objects or, in two tiered architectures, simply disconnect the Hibernate Session from the JDBC connection and reconnect
      it for each subsequent request.
    • Never use a single Session for more than one Application Transaction use case; otherwise, you will run into stale data.
  20. Don't treat exceptions as recoverable.
    • This is more of a necessary practice than a "best" practice.
    • When an exception occurs, roll back the Transaction and close the Session.
    • If you don't, Hibernate can't guarantee that in-memory state accurately represents persistent state.
    • As a special case of this, do not use Session.load() to determine if an instance with the given identifier exists in the database; use find() instead.
  21. Prefer lazy fetching for associations.
    • Use eager (outer-join) fetching sparingly.
    • Use proxies and/or lazy collections for most associations to classes that are not cached at the JVM-level.
    • For associations to cached classes, where there is a high probability of a cache hit, explicitly disable eager fetching using outer-join="false".
    • When an outer-join fetch is appropriate to a particular use case, use a query with a left join fetch.
  22. Consider abstracting your business logic from Hibernate.
    • Hide (Hibernate) data-access code behind an interface.
    • Combine the DAO and Thread Local Session patterns.
    • You can even have some classes persisted by handcoded JDBC, associated to Hibernate via a UserType.
  23. Implement equals() and hashCode() using a unique business key (a short sketch follows this list).
    • If you compare objects outside of the Session scope, you have to implement equals() and hashCode().
    • If you implement these methods, never ever use the database identifier!
    • To implement equals() and hashCode(), use a unique business key, that is, compare a unique combination of class properties.
    • Never use collections in the equals() comparison (lazy loading) and be careful with other associated classes that might be proxied.
  24. Don't use exotic association mappings.
    • Good usecases for a real many-to-many associations are rare.
    • Most of the time you need additional information stored in the "link table".
    • In this case, it is much better to use two one-to-many associations to an intermediate link class.
    • In fact, we think that most associations are one-to-many and many-to-one; be careful when using any other association style and ask yourself if it is really necessary.
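
  To illustrate item 23, here is a minimal sketch of equals() and hashCode() built on a unique business key instead of the database identifier. The User class and its username key are hypothetical placeholders.
    // Hypothetical entity: the business key is the immutable username, not the generated id
    public class User {

        private Long id;            // database identifier - deliberately NOT used in equals()/hashCode()
        private String username;    // unique business key

        public String getUsername() { return username; }
        public void setUsername(String username) { this.username = username; }

        @Override
        public boolean equals(Object other) {
            if (this == other) return true;
            if (!(other instanceof User)) return false;
            User that = (User) other;
            return username != null && username.equals(that.getUsername());
        }

        @Override
        public int hashCode() {
            return username == null ? 0 : username.hashCode();
        }
    }
  Calling getUsername() on the other instance (rather than reading its field) keeps the comparison correct even when Hibernate hands you a proxy, and no collections or lazy associations are touched, in line with the advice above.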