DataScience

Posts

Showing posts with the label spark

DATA Scoring : Getting value from BIG Data

- October 20, 2020

Data Scoring is a key component of understanding machine learning model outcomes and choosing the most accurate model that produces the most valuable insights. Once you have a model in production scoring new data, you’ll uncover insights that you can use to create business value. Using the above example, the model scores identify which current customers are at a high risk of churning, enabling you to plan outreach or special offers to prevent that from happening. Model development is generally a two-stage process. The first stage is training and validation, during which you apply algorithms to data for which you know the outcomes to uncover patterns between its features and the target variable. The second stage is scoring, in which you apply the trained model to a new dataset. Then, the model returns outcomes in the form of probability scores for classification problems and estimated averages for regression problems. Finally, you deploy the trained model into a production application o...
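The scoring stage described above can be sketched in a few lines of Java. This is a minimal illustration, not the post's own implementation: it assumes a logistic model whose intercept and weights were already produced by some earlier training stage, and the feature names, coefficient values, and 0.5 threshold are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the scoring stage: apply an already-trained model to new rows.
// The weights, feature names, and threshold are hypothetical values chosen
// for illustration; they do not come from a real training run.
class ChurnScorer {
    // Hypothetical coefficients a training stage might have produced.
    static final double INTERCEPT = -1.0;
    static final double[] WEIGHTS = {0.8, -0.5}; // {supportTickets, yearsAsCustomer}

    // Logistic function: maps any real number to a probability in (0, 1).
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Score one customer: weighted sum of the features passed through the sigmoid.
    static double score(double[] features) {
        double z = INTERCEPT;
        for (int i = 0; i < WEIGHTS.length; i++) {
            z += WEIGHTS[i] * features[i];
        }
        return sigmoid(z);
    }

    // Return the indices of rows whose churn probability exceeds the threshold.
    static List<Integer> highRisk(double[][] rows, double threshold) {
        List<Integer> flagged = new ArrayList<>();
        for (int i = 0; i < rows.length; i++) {
            if (score(rows[i]) > threshold) {
                flagged.add(i);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        double[][] newCustomers = {
            {5.0, 1.0}, // many support tickets, short tenure
            {0.0, 6.0}, // no support tickets, long tenure
        };
        for (double[] row : newCustomers) {
            System.out.printf("churn probability = %.3f%n", score(row));
        }
        System.out.println("high-risk rows: " + highRisk(newCustomers, 0.5));
    }
}
```

The probability scores are what a classification model returns at scoring time; the threshold that turns a score into an outreach decision is a business choice, not part of the model.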

KAFKA : Fastest messaging system

- October 20, 2020

Why Kafka is so fast:

1. Low-latency I/O: Data can be stored and cached in two places: random access memory (RAM) and disk. The conventional way to achieve low latency when delivering messages is to use RAM. It is preferred over disk because disks have high seek times, which makes them slower. The downside of this approach is that RAM becomes expensive when the data flowing through your system is around 10 to 500 GB per second or even more.

2. Kafka avoids the seek time: Kafka avoids seek time by using sequential I/O. It uses a data structure called a ‘log’, which is an append-only sequence of records ordered by time. The log is essentially a queue: the producer appends to its end, and subscribers process the messages at their own pace by maintaining pointers. The first record published gets an offset of 0, the second gets an offse...
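The append-only log with consumer-held pointers can be illustrated with a small in-memory sketch. This is not how Kafka itself is implemented (Kafka persists its log on disk); the class name `AppendOnlyLog` and its methods are hypothetical and exist only to show how offsets and independent read positions fit together.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory illustration of an append-only log with per-consumer pointers.
// NOT Kafka's implementation; all names here are hypothetical.
class AppendOnlyLog {
    private final List<String> records = new ArrayList<>();
    private final Map<String, Integer> positions = new HashMap<>();

    // Producers only ever append at the end; the return value is the record's offset.
    int append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Each consumer reads from its own pointer and advances it by one.
    // Returns null when the consumer has caught up with the end of the log.
    String poll(String consumerId) {
        int next = positions.getOrDefault(consumerId, 0);
        if (next >= records.size()) {
            return null;
        }
        positions.put(consumerId, next + 1);
        return records.get(next);
    }

    public static void main(String[] args) {
        AppendOnlyLog log = new AppendOnlyLog();
        System.out.println(log.append("first"));  // offset 0
        System.out.println(log.append("second")); // offset 1
        System.out.println(log.poll("fast"));     // first
        System.out.println(log.poll("fast"));     // second
        System.out.println(log.poll("slow"));     // first: "slow" has its own pointer
    }
}
```

Because records are never modified or removed by a read, a slow consumer does not hold back a fast one; each simply remembers how far it has read.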

SCALA : Knowing #TOP10 facts, Why Scala is so popular

- October 16, 2020

Scala is a pure object-oriented programming language, and it has many improvements over Java semantics, such as generics and type casting. It has cleaned up issues like the diamond relationship and implicit objects to move closer to a functional programming language.

Is Scala better than other programming languages?

· A strong focus on generics, as in Arrays, together with the implementation of Any, AnyRef and AnyVal, makes it well suited to processing unstructured data.
· Scala has the immutable “val” as a first-class language feature. A Scala “val” is similar to a Java final variable. There is also “var”, which is mutable.
· Scala lets ‘if blocks’, ‘for-yield loops’, and code in braces return a value. This is often preferable and eliminates the need for a separate ternary operator.
· Scala has singleton objects rather than C++/Java/C# classi...
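For readers coming from Java, the Java side of these comparisons looks roughly like the sketch below. The class and member names are hypothetical; the comments note the Scala form that each Java construct is being compared against.

```java
// Java-side counterparts of the Scala features compared above.
// All names are hypothetical and exist only for illustration.
class JavaCounterparts {
    // Java has no `object` keyword, so shared values and helpers
    // usually live in static fields and static methods of a class.
    static final String GREETING = "hello";

    // Java's `if` is a statement, so choosing a value inline needs the
    // ternary operator. In Scala, `if` is an expression:
    //   val s = if (n < 0) "negative" else "non-negative"
    static String sign(int n) {
        return n < 0 ? "negative" : "non-negative";
    }

    public static void main(String[] args) {
        final int fixed = 1; // cannot be reassigned, like a Scala val
        int counter = 0;     // can be reassigned, like a Scala var
        counter += fixed;
        System.out.println(GREETING + " " + counter + " " + sign(-3));
    }
}
```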

Spark BIGDATA Processing Sample

- October 10, 2020

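Java-based code for calculating the word count:

    // Imports below are added on the assumption that Logger and Level come
    // from log4j and StringDeserializer from the Kafka client library.
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.log4j.Level;
    import org.apache.log4j.Logger;
    import scala.Tuple2;

    public class WordCountingApp {
        public static void main(String[] args) throws InterruptedException {
            // Silence framework logging so the output stays readable.
            Logger.getLogger("org").setLevel(Level.OFF);
            Logger.getLogger("akka").setLevel(Level.OFF);

            // Kafka consumer settings: broker address, deserializers, group, start offset.
            Map<String, Object> kafkaParams = new HashMap<>();
            kafkaParams.put("bootstrap.servers", "localhost:9092");
            kafkaParams.put("key.deserializer", StringDeserializer.class);
            kafkaParams.put("value.deserializer", StringDeserializer.class);
            kafkaParams.put("group.id", "use_a_separate_group_id_for_each_stream");
            kafkaParams.put("auto.offset.reset", "latest");
            ...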

Data Science with BIGDATA

- October 09, 2020

Data Science -