This repository contains the Connected Component Finder (CCF) project in both
Python and Scala, using Spark RDD and DataFrame APIs.
The Python entry points are now cluster-safe:

- they only force `local[*]` when no Spark master is already configured
- they accept a `--checkpoint-dir` flag
- `experiments.py` accepts Dataproc-friendly CLI flags
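The master-fallback rule above can be sketched in plain Python. The `resolve_master` helper below is hypothetical (the real code would consult `SparkConf`); it only illustrates the precedence: an explicit value wins, an already-configured master is respected, and `local[*]` is a last resort.

```python
def resolve_master(cli_master=None, env=None):
    """Pick a Spark master without clobbering cluster settings.

    Precedence (an assumption mirroring the rule above): an explicit
    CLI value wins, then any master already configured in the
    environment, and only as a last resort do we force local[*].
    """
    env = {} if env is None else env
    configured = env.get("MASTER") or env.get("SPARK_MASTER")
    return cli_master or configured or "local[*]"

# Bare local run: nothing configured, so fall back to local[*].
print(resolve_master(env={}))                  # local[*]
# Cluster run: a master is already set, so it is respected.
print(resolve_master(env={"MASTER": "yarn"}))  # yarn
```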
Set the shared values first:

```bash
GCLOUD=./google-cloud-sdk/bin/gcloud
PROJECT_ID=project-7fd33cb0-5f93-49dc-bf4
REGION=europe-west1
CLUSTER=ccf-test
BUCKET=gs://YOUR_BUCKET/ccf
```

Upload the scripts and datasets to the bucket:

```bash
gsutil -m cp python/*.py "$BUCKET/python/"
gsutil -m cp data/graph_*.txt "$BUCKET/data/"
```

Submit the RDD implementation:

```bash
$GCLOUD dataproc jobs submit pyspark python/ccf_rdd.py \
  --project="$PROJECT_ID" \
  --region="$REGION" \
  --cluster="$CLUSTER" \
  -- \
  "$BUCKET/data/graph_1m.txt" \
  --checkpoint-dir "$BUCKET/checkpoints/rdd"
```

Submit the DataFrame implementation:

```bash
$GCLOUD dataproc jobs submit pyspark python/ccf_dataframe.py \
  --project="$PROJECT_ID" \
  --region="$REGION" \
  --cluster="$CLUSTER" \
  -- \
  "$BUCKET/data/graph_1m.txt" \
  --checkpoint-dir "$BUCKET/checkpoints/df"
```

`experiments.py` imports `ccf_rdd.py` and `ccf_dataframe.py`, so provide them
with `--py-files` when submitting to Dataproc.
```bash
$GCLOUD dataproc jobs submit pyspark python/experiments.py \
  --project="$PROJECT_ID" \
  --region="$REGION" \
  --cluster="$CLUSTER" \
  --py-files=python/ccf_rdd.py,python/ccf_dataframe.py \
  -- \
  --environment dataproc \
  --graph-set small \
  --data-dir "$BUCKET/data" \
  --checkpoint-dir "$BUCKET/checkpoints/experiments"
```

Use `--graph-set large` once the `graph_10m`, `graph_50m`, and `graph_200m`
datasets exist in GCS.
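For reference, the Dataproc-friendly flags exercised by the submit command above could be declared with `argparse` roughly as follows. This is a sketch, not the actual `experiments.py` source: the flag names come from the command above, but the choices and defaults are assumptions.

```python
import argparse

def parse_args(argv=None):
    # Sketch of the CLI surface used by the Dataproc submit command.
    # Flag names match that command; choices/defaults are guesses.
    p = argparse.ArgumentParser(description="CCF experiment runner (sketch)")
    p.add_argument("--environment", choices=["local", "dataproc"], default="local")
    p.add_argument("--graph-set", choices=["small", "large"], default="small")
    p.add_argument("--data-dir", required=True,
                   help="directory (local path or gs:// URI) holding graph_*.txt")
    p.add_argument("--checkpoint-dir", required=True,
                   help="directory for Spark checkpoints")
    return p.parse_args(argv)

args = parse_args([
    "--environment", "dataproc",
    "--graph-set", "small",
    "--data-dir", "gs://YOUR_BUCKET/ccf/data",
    "--checkpoint-dir", "gs://YOUR_BUCKET/ccf/checkpoints/experiments",
])
print(args.environment, args.graph_set)  # dataproc small
```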