Posts

CI/CD : In the world of Docker

Docker provides the ability to package and run an application in a loosely isolated environment called a container. The isolation and security allow you to run many containers simultaneously on a given host. You can even run Docker containers within host machines that are themselves virtual machines. Kubernetes and OpenShift are both container orchestration platforms, but they have some important differences: 1. OpenShift is a Kubernetes distribution with additional features: OpenShift is built on top of Kubernetes and adds several capabilities that Kubernetes does not have out of the box, including an integrated container registry, integrated CI/CD pipelines, integrated monitoring, and integrated security features. 2. OpenShift has a more opinionated approach: OpenShift is designed to provide a more opinionated and integrated platform for building and deploying containerized applications. This means that OpenShift has a more prescriptive ...

SCALA : Knowing the #TOP10 basics, and why it is so functional

1) Types of Variables. A variable is simply a storage location. Every variable is known by its name and stores some piece of information known as its value, so a variable is defined by its data type and name; the data type is responsible for allocating memory for the variable. In Scala there are two kinds of variables:

Mutable variables:     var Variable_name: Data_type = "value"
Immutable variables:   val Variable_name: Data_type = "value"

package com.test.util
import scala.util.control.Breaks._
object BasicSample {
  def main(args: Array[String]) {
    var obj = new BasicClass()
    obj.show()
    obj.valueDemo()
    obj.loopDemo()
  }
}
class BasicClass {
  var name = "test name"
  var age = 10
  def show() { ...
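The var/val distinction above can be shown in a small self-contained snippet (the names `greeting` and `maxRetries` are illustrative, not from the original sample):

```scala
object VariableDemo {
  def main(args: Array[String]): Unit = {
    // Mutable variable: can be reassigned after initialization
    var greeting: String = "hello"
    greeting = "hello, Scala"   // allowed, because greeting is a var

    // Immutable variable: fixed once assigned
    val maxRetries: Int = 3
    // maxRetries = 5           // would not compile: reassignment to val

    println(s"$greeting, retrying up to $maxRetries times")
  }
}
```

Preferring `val` wherever possible is one of the habits that makes Scala code more functional: values that never change are easier to reason about and safe to share between threads.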

HADOOP / SPARK Ecosystem

The big data ecosystem is built from multiple tools and systems. 1) What is the difference between Sqoop and Flume? Sqoop and Flume are both meant to fulfil data-ingestion needs, but they serve different purposes. Apache Flume works well for streaming data sources that are generated continuously in a Hadoop environment, such as log files from multiple servers, whereas Apache Sqoop works well with any RDBMS that has JDBC connectivity. 2) What is Hive? Hive is a data warehouse infrastructure tool that processes structured data in Hadoop. It was produced by Facebook to help data analytics / RDBMS people query a Hadoop cluster directly, since most data warehousing applications work with a SQL-based querying language. Features of Hive: it accelerates queries as it provides indexes, including bitmap indexes.

Spark Cluster

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program). Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark's own standalone cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run. There are several useful things to note about this architecture: each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from ...
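The driver/executor split can be loosely mimicked in plain Scala with a thread pool; this is only an analogy (no Spark dependency, and `MiniCluster` is an invented name): the "driver" splits the data into partitions, "executor" threads each run a task on one partition, and the driver combines the partial results.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object MiniCluster {
  def run(): Int = {
    // "Driver": holds the data and splits it into partitions (one task each)
    val data       = (1 to 100).toVector
    val partitions = data.grouped(25).toVector

    // "Executors": a fixed pool of worker threads that run the tasks
    val pool = Executors.newFixedThreadPool(4)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

    // Each task computes a partial sum over its own partition
    val tasks    = partitions.map(p => Future(p.sum))
    val partials = Await.result(Future.sequence(tasks), 10.seconds)
    pool.shutdown()

    // The driver combines the partial results into the final answer
    partials.sum
  }

  def main(args: Array[String]): Unit =
    println(s"total = ${run()}")   // 1 + 2 + ... + 100 = 5050
}
```

In real Spark the partitions live on different machines and the scheduler handles task placement and retries, but the shape of the computation (driver plans, executors compute, driver collects) is the same.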

Data Mining : Handling large data sets

Data mining is the process of discovering patterns in large data sets using methods at the intersection of machine learning, statistics, and database systems. In practice, it means finding anomalies, patterns and correlations within large data sets to predict outcomes; using a broad range of techniques, you can use this information to increase revenues, cut costs, improve customer relationships, reduce risks and more. Data mining involves exploring and analyzing large blocks of information to glean meaningful patterns and trends. It can be used in a variety of ways, such as database marketing, credit risk management, fraud detection, spam email filtering, or even to discern the sentiment or opinion of users. Type - Data mining has several types, including pictorial data mining, text mining, social media mining, web mining, and audio and video mining, amongst others. Another example of Data Mining an...

DATA Scoring : Getting value from BIG Data

Data scoring is a key component of understanding machine learning model outcomes and choosing the most accurate model that produces the most valuable insights. Once you have a model in production scoring new data, you'll uncover insights that you can use to create business value. For example, model scores can identify which current customers are at a high risk of churning, enabling you to plan outreach or special offers to prevent that from happening. Model development is generally a two-stage process. The first stage is training and validation, during which you apply algorithms to data for which you know the outcomes, to uncover patterns between its features and the target variable. The second stage is scoring, in which you apply the trained model to a new dataset. The model then returns outcomes in the form of probability scores for classification problems and estimated averages for regression problems. Finally, you deploy the trained model into a production application o...
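The two-stage train/score process can be sketched with a deliberately tiny toy model, assuming a single hypothetical "usage minutes" feature and a made-up nearest-centroid scorer (not any particular library's algorithm): training learns one centroid per class from labeled history, and scoring turns a new point's relative closeness to the centroids into a probability-style score for the churn class.

```scala
object ScoringDemo {
  // Stage 1 (training): learn one centroid (mean) per class from labeled data.
  // Input rows are (featureValue, label) with label 1 = churn, 0 = no churn.
  def train(labeled: Seq[(Double, Int)]): Map[Int, Double] =
    labeled.groupBy(_._2).map { case (cls, rows) =>
      cls -> rows.map(_._1).sum / rows.size
    }

  // Stage 2 (scoring): a probability-style score for class 1, based on
  // relative closeness to the two class centroids.
  def score(centroids: Map[Int, Double], x: Double): Double = {
    val d0 = math.abs(x - centroids(0))   // distance to "no churn" centroid
    val d1 = math.abs(x - centroids(1))   // distance to "churn" centroid
    if (d0 + d1 == 0) 0.5 else d0 / (d0 + d1)
  }

  def main(args: Array[String]): Unit = {
    // Toy history: low usage churned (label 1), high usage stayed (label 0)
    val history = Seq((10.0, 1), (12.0, 1), (50.0, 0), (54.0, 0))
    val model   = train(history)               // "training and validation" stage
    val p       = score(model, 11.0)           // "scoring" stage on new data
    println(f"P(churn | usage=11) = $p%.2f")   // high: 11 sits near the churn centroid
  }
}
```

A customer scoring close to 1.0 would land on the outreach list from the example above; a real model would of course use many features and a proper algorithm, but the two stages are the same.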

KAFKA : Fastest messaging system

Why KAFKA is so fast - 1. Low-latency I/O: there are two possible places for storing and caching data: random access memory (RAM) and disk. A common way to achieve low latency while delivering messages is to keep the data in RAM. RAM is preferred over disk because disks have a high seek time, which makes them slower. The downside of this approach is that RAM becomes expensive when the data flowing through your system is around 10 to 500 GB per second or even more. 2. Kafka avoids the seek time: yes! Kafka smartly avoids the seek time by using a concept called Sequential I/O. It uses a data structure called a 'log', which is an append-only sequence of records, ordered by time. The log is basically a queue: the producer appends records at its end, and subscribers process the messages at their own pace by maintaining pointers (offsets). The first record published gets an offset of 0, the second gets an offse...