docker-jupyter-spark-on-yarn

A convenient Docker container for running Jupyter notebooks with Spark on Yarn. The image is available on Docker Hub at https://hub.docker.com/r/fjehl/docker-jupyter-spark-on-yarn/
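Since the image is published there, you should be able to pull it directly instead of building it yourself:

docker pull fjehl/docker-jupyter-spark-on-yarn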

Build it

Make sure you have Docker installed, then build the image from within the repository folder:

docker build -t fjehl/docker-jupyter-spark-on-yarn .

Once built, the container needs some configuration before you can run it.

First, you need to supply three external volumes:

  • Hadoop configuration files such as yarn-site.xml to be mounted in the guest as /etc/hadoop/conf/*
  • krb5.conf to be mounted in the guest as /etc/krb5.conf
  • hive-site.xml to be mounted in the guest as /usr/local/spark/conf/hive-site.xml

The container also needs your Kerberos credentials, passed as environment variables:

  • KRB_LOGIN should be your fully qualified Kerberos login (j.doe@REALM.COM)
  • KRB_PASSWORD should be your password

Finally, to be able to run without --net=host, your host IP must be exposed to Spark so that the driver is reachable from Yarn:

  • DOCKER_HOST_IP should hence be set to the host IP (one way to determine it is sketched below)
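For example, on a Linux host (a minimal sketch, assuming the first address reported by hostname -I is the one reachable from the Yarn cluster):

export DOCKER_HOST_IP=$(hostname -I | awk '{print $1}')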

Then you can run it as follows:

docker run -ti \
  -v <path_containing_hadoop_conf>:/etc/hadoop \
  -v <krb5_conf_file>:/etc/krb5.conf \
  -v <hive_site_xml_file>:/usr/local/spark/conf/hive-site.xml \
  -e DOCKER_HOST_IP=<ip> \
  -e KRB_LOGIN=<user> \
  -e KRB_PASSWORD=<password> \
  fjehl/docker-jupyter-spark-on-yarn

Don't worry about credential expiration: the container embeds k5start, which keeps your Kerberos ticket active.
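Conceptually, this amounts to obtaining a ticket with the credentials you passed in and renewing it on a schedule; a rough shell equivalent of what k5start automates might look like this (a hypothetical sketch, not the exact invocation used inside the image):

# Obtain the initial ticket, then renew it hourly in the background
echo "$KRB_PASSWORD" | kinit "$KRB_LOGIN"
while true; do sleep 3600; kinit -R; done &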

Use it

Once the container has started, it will display a link in the console. Click on it to enter Jupyter, then start an R notebook. You can test all the features (reading from Hive, training a model, writing to HDFS) with a script like the following.

# Connect to Yarn
library(sparklyr)
library(dplyr)
library(DBI)
sc <- spark_connect(master = "yarn-client")


# Get data from Hive
dbGetQuery(sc, "USE fjehl")
mtcars_df <- tbl(sc, "mtcars")

# Train a linear regression model
mt_cars_partitions <- mtcars_df %>%
  sdf_partition(training = 0.5, test = 0.5, seed = 1099)

linreg_model <- mt_cars_partitions$training %>%
  ml_linear_regression(response = "mpg", features = c("wt", "cyl"))

# Graph the results
library(plotly)
result <- select(sdf_predict(linreg_model, mt_cars_partitions$test), wt, cyl, mpg, prediction)
p <- plot_ly(collect(result), x = ~wt, y = ~cyl, z = ~mpg) %>%
  add_markers(name='actual') %>%
  add_markers(z=~prediction, name='prediction') %>%
  layout(scene = list(xaxis = list(title = 'Weight'),
                     yaxis = list(title = 'Cylinder'),
                     zaxis = list(title = 'Miles per gallon')))
embed_notebook(p)

# Write the results in HDFS
spark_write_csv(result, header=FALSE, path="hdfs://root/user/f.jehl/spark_results")
dbGetQuery(sc,"CREATE EXTERNAL TABLE IF NOT EXISTS fjehl.results(
  wt double,
  cyl int,
  mpg double,
  prediction double)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\\n'
STORED AS TEXTFILE
LOCATION
  'hdfs://root/user/f.jehl/spark_results'")
dbGetQuery(sc,"SELECT * FROM fjehl.results")
