Learning Spark

- May 17, 2020

Apache Spark is one of the most widely used engines for big data processing. It is a framework for batch and near-real-time data analytics in a distributed computing environment. Spark is written in Scala and was originally developed at the University of California, Berkeley. It performs computations in memory to speed up data processing relative to MapReduce. By exploiting in-memory computation and other optimizations, it is reported to run some large-scale workloads up to 100x faster than Hadoop MapReduce. The trade-off is that it typically demands more resources, memory in particular, than MapReduce.

Spark provides high-level APIs in Scala, Java, Python, and R, along with SQL support. These make it easier to integrate Spark into complex workflows. On top of the core engine, it also ships with libraries such as MLlib, GraphX, Spark SQL and DataFrames, and Spark Streaming, which extend its capabilities.

The RDD (Resilient Distributed Dataset) is the fundamental data structure of Spark.

■ It is an immutable distributed collection of objects that can be stored in memory or disk across a cluster.

■ Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster.

■ It supports parallel functional transformations (map, filter, …).

■ Lost partitions are automatically rebuilt on failure by replaying the transformations that produced them.

■ RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
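
The properties listed above can be illustrated without a cluster. The following is a toy, single-process model in plain Python, not Spark's API: the class name ToyRDD and all of its methods are hypothetical, and a "partition" here is just a list held in one process. It is only meant to show immutability (each transformation returns a new object), logical partitioning, lazy map/filter, and rebuilding a lost partition by replaying the recorded transformations.

```python
from typing import Any, Callable, List, Optional, Tuple


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; this is NOT the Spark API)."""

    def __init__(self, source: List[List[Any]], lineage: Tuple[Callable, ...] = ()):
        # Source partitions are stored as tuples so they cannot be mutated.
        self._source = tuple(tuple(part) for part in source)
        # Lineage: the ordered per-partition steps needed to compute this dataset.
        self._lineage = lineage
        # Computed partitions; None means "not computed yet" or "lost".
        self._cache: List[Optional[list]] = [None] * len(self._source)

    @classmethod
    def parallelize(cls, data, num_partitions: int) -> "ToyRDD":
        data = list(data)
        size = -(-len(data) // num_partitions)  # ceiling division
        # Split the data into contiguous slices, one per logical partition.
        return cls([data[i * size:(i + 1) * size] for i in range(num_partitions)])

    def map(self, fn: Callable) -> "ToyRDD":
        # Transformations are lazy: they only extend the lineage of a new object.
        return ToyRDD(self._source, self._lineage + (lambda part: [fn(x) for x in part],))

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(self._source, self._lineage + (lambda part: [x for x in part if pred(x)],))

    def compute_partition(self, index: int) -> list:
        # Each partition is computed independently of the others, which is
        # what would allow different nodes to work on different partitions.
        if self._cache[index] is None:
            part = list(self._source[index])
            for step in self._lineage:
                part = step(part)
            self._cache[index] = part
        return self._cache[index]

    def lose_partition(self, index: int) -> None:
        # Simulate a node failure that drops one computed partition.
        self._cache[index] = None

    def collect(self) -> list:
        # Any missing partition is rebuilt from source data plus lineage.
        return [x for i in range(len(self._source)) for x in self.compute_partition(i)]


numbers = ToyRDD.parallelize(range(10), num_partitions=3)
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

result = even_squares.collect()   # [0, 4, 16, 36, 64]
even_squares.lose_partition(1)    # pretend the node holding partition 1 failed
rebuilt = even_squares.collect()  # partition 1 is recomputed from lineage
```

Note that `numbers` is unchanged by the map and filter calls; only the derived object carries the extra lineage. A real RDD would additionally distribute the partitions across executors, which this sketch does not attempt.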

Running jobs on a Spark cluster

If your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster. To do this, create an assembly jar (or “uber” jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark and Hadoop as provided dependencies; these need not be bundled since they are provided by the cluster manager at runtime. Once you have an assembled jar, you can call the bin/spark-submit script and pass it your jar, as in the example commands further down.

For Python, you can use the --py-files argument of spark-submit to add .py, .zip or .egg files to be distributed with your application. If you depend on multiple Python files we recommend packaging them into a .zip or .egg.
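
As a rough sketch of the packaging step for Python, the standard-library zipfile module can bundle helper modules into a single archive suitable for --py-files. Everything project-specific here is hypothetical: the function name build_py_files_zip, the helpers package, and the deps.zip file name are made up for illustration, and the demo writes to a temporary directory.

```python
import tempfile
import zipfile
from pathlib import Path


def build_py_files_zip(source_dir: Path, archive_path: Path) -> list:
    """Zip every .py file under source_dir, keeping package-relative paths."""
    names = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for py_file in sorted(source_dir.rglob("*.py")):
            # Store paths relative to source_dir so imports resolve from the zip root.
            arcname = py_file.relative_to(source_dir).as_posix()
            archive.write(py_file, arcname)
            names.append(arcname)
    return names


# Demo with a throwaway project layout (hypothetical module names).
workdir = Path(tempfile.mkdtemp())
package = workdir / "src" / "helpers"
package.mkdir(parents=True)
(package / "__init__.py").write_text("")
(package / "cleaning.py").write_text("def strip(s):\n    return s.strip()\n")

archive_path = workdir / "deps.zip"
bundled = build_py_files_zip(workdir / "src", archive_path)
```

The resulting archive could then be passed to spark-submit with --py-files, next to the main script of the application.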

# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100

# Run on a Spark standalone cluster in client deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a Spark standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000
