leottawa/Database-CCF

CCF Spark Project

This repository contains the Connected Component Finder (CCF) project in both Python and Scala, using Spark RDD and DataFrame APIs.

Dataproc

The Python entry points are now cluster-safe:

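  • they set the master to local[*] only when no Spark master is already configured
  • they accept --checkpoint-dir, so checkpoints can be written to shared storage such as GCS
  • experiments.py accepts Dataproc-friendly CLI flags (--environment, --graph-set, --data-dir)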
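
The first two points can be sketched roughly as follows. This is an illustration, not the code from ccf_rdd.py or ccf_dataframe.py: the helper names (parse_args, master_override, build_session) are hypothetical, and how the real scripts detect an already-configured master is not shown in this README, so the sketch takes that value as a plain argument supplied by the caller.

```python
import argparse
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse a positional input path plus an optional --checkpoint-dir."""
    parser = argparse.ArgumentParser(description="CCF job (illustrative sketch)")
    parser.add_argument("input_path", help="edge list, e.g. a gs:// or local path")
    parser.add_argument("--checkpoint-dir", default=None,
                        help="directory for Spark checkpoints")
    return parser.parse_args(argv)


def master_override(configured_master: Optional[str]) -> Optional[str]:
    """Return "local[*]" only when nothing else has already chosen a master."""
    return None if configured_master else "local[*]"


def build_session(app_name: str,
                  configured_master: Optional[str],
                  checkpoint_dir: Optional[str]):
    """Create a SparkSession without overriding a cluster-provided master."""
    # Imported lazily so the helpers above stay usable without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    override = master_override(configured_master)
    if override is not None:
        # Assumption: no master was configured, so fall back to local mode.
        builder = builder.master(override)
    spark = builder.getOrCreate()
    if checkpoint_dir:
        spark.sparkContext.setCheckpointDir(checkpoint_dir)
    return spark
```

Leaving the master unset on a cluster lets the value chosen at submission time take effect, while a plain local run still works without extra flags.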

Set the shared values first:

GCLOUD=./google-cloud-sdk/bin/gcloud
PROJECT_ID=project-7fd33cb0-5f93-49dc-bf4
REGION=europe-west1
CLUSTER=ccf-test
BUCKET=gs://YOUR_BUCKET/ccf

Upload code and data

gsutil -m cp python/*.py "$BUCKET/python/"
gsutil -m cp data/graph_*.txt "$BUCKET/data/"

Run the Python RDD job

$GCLOUD dataproc jobs submit pyspark python/ccf_rdd.py \
  --project="$PROJECT_ID" \
  --region="$REGION" \
  --cluster="$CLUSTER" \
  -- \
  "$BUCKET/data/graph_1m.txt" \
  --checkpoint-dir "$BUCKET/checkpoints/rdd"

Run the Python DataFrame job

$GCLOUD dataproc jobs submit pyspark python/ccf_dataframe.py \
  --project="$PROJECT_ID" \
  --region="$REGION" \
  --cluster="$CLUSTER" \
  -- \
  "$BUCKET/data/graph_1m.txt" \
  --checkpoint-dir "$BUCKET/checkpoints/df"

Run the benchmark script

experiments.py imports ccf_rdd.py and ccf_dataframe.py, so provide them with --py-files when submitting to Dataproc.

$GCLOUD dataproc jobs submit pyspark python/experiments.py \
  --project="$PROJECT_ID" \
  --region="$REGION" \
  --cluster="$CLUSTER" \
  --py-files=python/ccf_rdd.py,python/ccf_dataframe.py \
  -- \
  --environment dataproc \
  --graph-set small \
  --data-dir "$BUCKET/data" \
  --checkpoint-dir "$BUCKET/checkpoints/experiments"

Use --graph-set large once the graph_10m, graph_50m, and graph_200m datasets exist in GCS.
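
As a rough illustration of how a --graph-set value might map to input files under --data-dir, here is a minimal sketch. The names GRAPH_SETS and graph_paths are hypothetical and not taken from experiments.py. The large set lists the three datasets named above; the contents of the small set are an assumption (graph_1m, the file used in the earlier commands), as is the .txt suffix, which follows the upload pattern data/graph_*.txt.

```python
from typing import Dict, List

# Assumption: "small" contains only graph_1m; "large" follows the README text.
GRAPH_SETS: Dict[str, List[str]] = {
    "small": ["graph_1m"],
    "large": ["graph_10m", "graph_50m", "graph_200m"],
}


def graph_paths(graph_set: str, data_dir: str) -> List[str]:
    """Return the input path for every graph in the chosen set."""
    if graph_set not in GRAPH_SETS:
        raise ValueError(f"unknown graph set: {graph_set!r}")
    # Plain string joining keeps gs:// URIs intact on every platform.
    base = data_dir.rstrip("/")
    return [f"{base}/{name}.txt" for name in GRAPH_SETS[graph_set]]


if __name__ == "__main__":
    for path in graph_paths("large", "gs://YOUR_BUCKET/ccf/data"):
        print(path)
```

Checking that each of these paths exists in GCS before switching to --graph-set large avoids a job that fails partway through the benchmark.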
