docker-jupyter-spark-on-yarn

A convenient Docker container for running Jupyter notebooks with Spark on YARN. The image is available on Docker Hub at https://hub.docker.com/r/fjehl/docker-jupyter-spark-on-yarn/

Build it

Make sure you have Docker installed, then build the image from within the repository folder.

docker build -t fjehl/docker-jupyter-spark-on-yarn .

Once the image is built, you can run it after providing the configuration described below.

You need to supply three external volumes for it to run:

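Hadoop configuration files such as yarn-site.xml, to be mounted in the guest as /etc/hadoop/conf/* (the run command below mounts a host directory at /etc/hadoop, so that directory must contain a conf subdirectory)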
krb5.conf to be mounted in the guest as /etc/krb5.conf
hive-site.xml to be mounted in the guest as /usr/local/spark/conf/hive-site.xml

The container also needs your Kerberos credentials as environment variables:

KRB_LOGIN should be your fully qualified Kerberos login (j.doe@REALM.COM)
KRB_PASSWORD should be your Kerberos password

Finally, so that the container can run without --net=host, the host's IP address must be exposed to Spark so that the driver is reachable from YARN:

DOCKER_HOST_IP should therefore be set to the host IP
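
The inputs listed above can be checked before the container is started. The following is a minimal sketch, not part of this repository: the function name preflight and its argument order are hypothetical, and it only verifies that the three volume sources exist on the host and that the three environment variables are non-empty. It does not validate their contents.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not shipped with the image).
# Usage: preflight HADOOP_DIR KRB5_FILE HIVE_SITE_FILE
# Returns 0 if every input is present, 1 otherwise; problems go to stderr.
preflight() {
  hadoop_dir=$1
  krb5_file=$2
  hive_site=$3
  rc=0

  # The three volume sources must exist on the host.
  [ -d "$hadoop_dir" ] || { echo "missing Hadoop configuration directory: $hadoop_dir" >&2; rc=1; }
  [ -f "$krb5_file" ]  || { echo "missing krb5.conf file: $krb5_file" >&2; rc=1; }
  [ -f "$hive_site" ]  || { echo "missing hive-site.xml file: $hive_site" >&2; rc=1; }

  # The three variables passed with -e must be set and non-empty.
  for name in DOCKER_HOST_IP KRB_LOGIN KRB_PASSWORD; do
    eval "value=\${$name:-}"
    [ -n "$value" ] || { echo "variable $name is not set" >&2; rc=1; }
  done

  return "$rc"
}
```

A call such as preflight ./hadoop ./krb5.conf ./hive-site.xml (the paths are placeholders) prints one line per missing input and returns a non-zero status, so it can be chained in front of the run command with &&.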

Then you can run it as follows:

docker run -ti \
  -v <path_containing_hadoop_conf>:/etc/hadoop \
  -v <krb5_conf_file>:/etc/krb5.conf \
  -v <hive_site_xml_file>:/usr/local/spark/conf/hive-site.xml \
  -e DOCKER_HOST_IP=<ip> \
  -e KRB_LOGIN=<user> \
  -e KRB_PASSWORD=<password> \
  fjehl/docker-jupyter-spark-on-yarn
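
Passing KRB_PASSWORD with -e leaves the password in the shell history and in the process list. One possible alternative, sketched here as an assumption rather than something this repository provides, is to write the Kerberos variables to a file readable only by the current user and hand it to docker run through its --env-file option. The helper name write_krb_env_file is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the image).
# Usage: write_krb_env_file ENV_FILE
# Writes KRB_LOGIN and KRB_PASSWORD from the current shell to ENV_FILE,
# one VAR=value pair per line, with permissions restricted to the owner.
write_krb_env_file() {
  env_file=$1
  (
    umask 077            # files created in this subshell get mode 600
    rm -f "$env_file"    # recreate the file so the umask applies
    # Values are written without quotes: Docker keeps everything after
    # the '=' literally when it reads an env file.
    printf 'KRB_LOGIN=%s\nKRB_PASSWORD=%s\n' "$KRB_LOGIN" "$KRB_PASSWORD" > "$env_file"
  )
}
```

With the file in place, the two -e KRB_... options in the command above can be replaced by --env-file <env_file>. The variables remain visible through docker inspect on the running container, so this only limits exposure on the command line.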

There is no need to worry about credential expiration: the container embeds k5start, which keeps your Kerberos ticket active.

Use it

Once the container has started, it displays a link in the console. Click it to open Jupyter, then start an R notebook. You can test all the features (reading from Hive, training a model, writing to HDFS) with a script like the following.

# Connect to Yarn
library(sparklyr)
library(dplyr)
library(DBI)
sc <- spark_connect(master = "yarn-client")


# Get data from Hive
dbGetQuery(sc,'USE fjehl')
mtcars_df <- tbl(sc,"mtcars")

# Train a linreg model
mt_cars_partitions <- mtcars_df %>%
  sdf_partition(training = 0.5, test = 0.5, seed = 1099)

linreg_model <- mt_cars_partitions$training %>%
  ml_linear_regression(response = "mpg", features = c("wt", "cyl"))

# Graph the results
library(plotly)
result <- select(sdf_predict(linreg_model, mt_cars_partitions$test), wt, cyl, mpg, prediction)
p <- plot_ly(collect(result), x = ~wt, y = ~cyl, z = ~mpg) %>%
  add_markers(name='actual') %>%
  add_markers(z=~prediction, name='prediction') %>%
  layout(scene = list(xaxis = list(title = 'Weight'),
                     yaxis = list(title = 'Cylinder'),
                     zaxis = list(title = 'Miles per gallon')))
embed_notebook(p)

# Write the results in HDFS
spark_write_csv(result, header=FALSE, path="hdfs://root/user/f.jehl/spark_results")
dbGetQuery(sc,"CREATE EXTERNAL TABLE IF NOT EXISTS fjehl.results(
  wt double,
  cyl int,
  mpg double,
  prediction double)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION
  'hdfs://root/user/f.jehl/spark_results'")
dbGetQuery(sc,"SELECT * FROM fjehl.results")
