Crawling for names

Welcome to my Insight data engineering project

Written over the course of three weeks, this project has two goals. The first is to count the web pages that mention some of the presidential front-runners as of January/February 2016. The program uses data from the Common Crawl, which consists of more than 150 terabytes of data scraped from 1.2 billion URIs during November 2015.

The task itself is a simple counting exercise, but the sheer size of the data and some of its characteristics (a multi-line record format spread across 37,500 files) presented the biggest engineering challenge. The tools and languages used were Scala, Spark, HBase, and Python/Flask scripts.
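To make the counting task concrete, here is a minimal sketch in plain Scala (no Spark) of grouping multi-line records and counting the pages that mention each name. It is an illustration under stated assumptions, not the code in src/main/scala: it assumes each record starts with a `WARC/1.0` header line, as in Common Crawl's WARC-style files, and the object name, the list of names, and the sample data are all hypothetical.

```scala
object NameCountSketch {
  // Hypothetical list of names; the project's actual list is not shown in this README.
  val names: Seq[String] = Seq("clinton", "sanders", "trump", "cruz")

  // Tiny made-up sample of three multi-line records (assumed WARC-style layout).
  val sample: String =
    """|WARC/1.0
       |WARC-Target-URI: http://example.com/a
       |
       |Clinton and Sanders debate
       |WARC/1.0
       |WARC-Target-URI: http://example.com/b
       |
       |Trump rally
       |Trump again
       |WARC/1.0
       |WARC-Target-URI: http://example.com/c
       |
       |nothing relevant""".stripMargin

  // Group lines into records: a line starting with "WARC/1.0" opens a new record.
  def splitRecords(lines: Iterator[String]): List[List[String]] =
    lines.foldLeft(List.empty[List[String]]) { (acc, line) =>
      if (line.startsWith("WARC/1.0")) List(line) :: acc
      else acc match {
        case current :: rest => (line :: current) :: rest
        case Nil             => acc // ignore anything before the first header
      }
    }.map(_.reverse).reverse

  // Count records (pages) that mention each name at least once, ignoring case.
  // Header lines are searched too, so a name inside a URI also counts.
  def countPages(records: Seq[Seq[String]]): Map[String, Int] =
    names.map { name =>
      name -> records.count(rec => rec.exists(_.toLowerCase.contains(name)))
    }.toMap

  def main(args: Array[String]): Unit = {
    val records = splitRecords(sample.split("\n").iterator)
    countPages(records).toSeq.sortBy(_._1).foreach { case (name, n) =>
      println(s"$name: $n")
    }
  }
}
```

Each page is counted once per name no matter how often the name appears, which matches the goal of counting pages rather than mentions. On a cluster, a common way to get whole records instead of single lines is to set Hadoop's `textinputformat.record.delimiter` property before reading the files; whether this project does so is not stated here.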

In addition to the code in this repository, there are important configuration files to update. Of particular note is spark-defaults.conf, which is maintained at the root of this repository but actually resides on the master node in the directory /usr/local/spark/conf.
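For orientation, a spark-defaults.conf file is a plain list of property names and values separated by whitespace, one per line. The fragment below only illustrates that format: the property keys are standard Spark settings, but the values are placeholders and are not the ones used in this project's file.

```
# Illustrative values only; see the spark-defaults.conf at the root of this repository.
spark.master            spark://<master-hostname>:7077
spark.executor.memory   4g
spark.driver.memory     2g
spark.serializer        org.apache.spark.serializer.KryoSerializer
```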

For an overview of this project, refer to the presentation: http://www.slideshare.net/HoaNguyen317/insight-data-engineering-project
