Posts

CI/CD : In the world of Docker

Docker provides the ability to package and run an application in a loosely isolated environment called a container. The isolation and security allow you to run many containers simultaneously on a given host. You can even run Docker containers within host machines that are themselves virtual machines. Kubernetes and OpenShift are both container orchestration platforms, but they have some important differences: 1. OpenShift is a Kubernetes distribution with additional features: OpenShift is built on top of Kubernetes and adds several capabilities that Kubernetes does not have out of the box, including an integrated container registry, integrated CI/CD pipelines, integrated monitoring, and integrated security features. 2. OpenShift has a more opinionated approach: OpenShift is designed to provide a more opinionated and integrated platform for building and deploying containerized applications. This means that OpenShift has a more prescriptive ...

SCALA : Knowing the #TOP10 basics, and why it is so functional

1) Types of Variables. A variable is simply a storage location. Every variable is known by its name and stores some piece of information known as its value, so a variable is defined by its data type and name; the data type is responsible for allocating memory for the variable. In Scala there are two kinds of variables:

Mutable variables:     var Variable_name: Data_type = "value"
Immutable variables:   val Variable_name: Data_type = "value"

package com.test.util
import scala.util.control.Breaks._
object BasicSample {
  def main(args: Array[String]) {
    var obj = new BasicClass()
    obj.show()
    obj.valueDemo()
    obj.loopDemo()
  }
}
class BasicClass {
  var name = "test name"
  var age = 10
  def show() { ...
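The var/val distinction above can be shown in a small self-contained snippet (the names `greeting` and `maxRetries` are illustrative, not from the original sample):

```scala
object VariableDemo {
  def main(args: Array[String]): Unit = {
    // Mutable variable: can be reassigned after initialization
    var greeting: String = "hello"
    greeting = "hello, Scala"   // allowed, because greeting is a var

    // Immutable variable: fixed once assigned
    val maxRetries: Int = 3
    // maxRetries = 5           // would not compile: reassignment to val

    println(s"$greeting, retrying up to $maxRetries times")
  }
}
```

Preferring `val` wherever possible is one of the habits that makes Scala code more functional: values that never change are easier to reason about and safe to share between threads.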

HADOOP / SPARK Ecosystem

The big data ecosystem is built from multiple tools and systems. 1) What is the difference between Sqoop and Flume? Sqoop and Flume are both meant to fulfil data-ingestion needs, but they serve different purposes. Apache Flume works well for streaming data sources that are generated continuously in a Hadoop environment, such as log files from multiple servers, whereas Apache Sqoop works well with any RDBMS that has JDBC connectivity. 2) What is Hive? Hive is a data warehouse infrastructure tool that processes structured data in Hadoop. It was produced by Facebook to help data analytics / RDBMS people query a Hadoop cluster directly, since most data warehousing applications work with a SQL-based querying language. Features of Hive: it accelerates queries as it provides indexes, including bitmap indexes.

Spark Cluster

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program). Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark's own standalone cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run. There are several useful things to note about this architecture: each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from ...
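The driver/executor split can be loosely mimicked in plain Scala with a thread pool; this is only an analogy (no Spark dependency, and `MiniCluster` is an invented name): the "driver" splits the data into partitions, "executor" threads each run a task on one partition, and the driver combines the partial results.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object MiniCluster {
  def run(): Int = {
    // "Driver": holds the data and splits it into partitions (one task each)
    val data       = (1 to 100).toVector
    val partitions = data.grouped(25).toVector

    // "Executors": a fixed pool of worker threads that run the tasks
    val pool = Executors.newFixedThreadPool(4)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

    // Each task computes a partial sum over its own partition
    val tasks    = partitions.map(p => Future(p.sum))
    val partials = Await.result(Future.sequence(tasks), 10.seconds)
    pool.shutdown()

    // The driver combines the partial results into the final answer
    partials.sum
  }

  def main(args: Array[String]): Unit =
    println(s"total = ${run()}")   // 1 + 2 + ... + 100 = 5050
}
```

In real Spark the partitions live on different machines and the scheduler handles task placement and retries, but the shape of the computation (driver plans, executors compute, driver collects) is the same.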

Data Mining : Handling large data sets

Data mining is the process of discovering patterns in large data sets using methods at the intersection of machine learning, statistics, and database systems. In practice, it means finding anomalies, patterns and correlations within large data sets to predict outcomes; using a broad range of techniques, you can use this information to increase revenues, cut costs, improve customer relationships, reduce risks and more. Data mining involves exploring and analyzing large blocks of information to glean meaningful patterns and trends. It can be used in a variety of ways, such as database marketing, credit risk management, fraud detection, spam email filtering, or even to discern the sentiment or opinion of users. Type - Data mining has several types, including pictorial data mining, text mining, social media mining, web mining, and audio and video mining, amongst others. Another example of Data Mining an...

DATA Scoring : Getting value from BIG Data

Data scoring is a key component of understanding machine learning model outcomes and choosing the most accurate model that produces the most valuable insights. Once you have a model in production scoring new data, you'll uncover insights that you can use to create business value. For example, model scores can identify which current customers are at a high risk of churning, enabling you to plan outreach or special offers to prevent that from happening. Model development is generally a two-stage process. The first stage is training and validation, during which you apply algorithms to data for which you know the outcomes, to uncover patterns between its features and the target variable. The second stage is scoring, in which you apply the trained model to a new dataset. The model then returns outcomes in the form of probability scores for classification problems and estimated averages for regression problems. Finally, you deploy the trained model into a production application o...
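The two-stage train/score process can be sketched with a deliberately tiny toy model, assuming a single hypothetical "usage minutes" feature and a made-up nearest-centroid scorer (not any particular library's algorithm): training learns one centroid per class from labeled history, and scoring turns a new point's relative closeness to the centroids into a probability-style score for the churn class.

```scala
object ScoringDemo {
  // Stage 1 (training): learn one centroid (mean) per class from labeled data.
  // Input rows are (featureValue, label) with label 1 = churn, 0 = no churn.
  def train(labeled: Seq[(Double, Int)]): Map[Int, Double] =
    labeled.groupBy(_._2).map { case (cls, rows) =>
      cls -> rows.map(_._1).sum / rows.size
    }

  // Stage 2 (scoring): a probability-style score for class 1, based on
  // relative closeness to the two class centroids.
  def score(centroids: Map[Int, Double], x: Double): Double = {
    val d0 = math.abs(x - centroids(0))   // distance to "no churn" centroid
    val d1 = math.abs(x - centroids(1))   // distance to "churn" centroid
    if (d0 + d1 == 0) 0.5 else d0 / (d0 + d1)
  }

  def main(args: Array[String]): Unit = {
    // Toy history: low usage churned (label 1), high usage stayed (label 0)
    val history = Seq((10.0, 1), (12.0, 1), (50.0, 0), (54.0, 0))
    val model   = train(history)               // "training and validation" stage
    val p       = score(model, 11.0)           // "scoring" stage on new data
    println(f"P(churn | usage=11) = $p%.2f")   // high: 11 sits near the churn centroid
  }
}
```

A customer scoring close to 1.0 would land on the outreach list from the example above; a real model would of course use many features and a proper algorithm, but the two stages are the same.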

KAFKA : Fastest messaging system

Why KAFKA is so fast - 1. Low-latency I/O: there are two possible places for storing and caching data: random access memory (RAM) and disk. A common way to achieve low latency while delivering messages is to keep the data in RAM. RAM is preferred over disk because disks have a high seek time, which makes them slower. The downside of this approach is that RAM becomes expensive when the data flowing through your system is around 10 to 500 GB per second or even more. 2. Kafka avoids the seek time: yes! Kafka smartly avoids the seek time by using a concept called Sequential I/O. It uses a data structure called a 'log', which is an append-only sequence of records, ordered by time. The log is basically a queue: the producer appends records at its end, and subscribers process the messages at their own pace by maintaining pointers (offsets). The first record published gets an offset of 0, the second gets an offse...