"Using Cascalog to build an app with City of Palo Alto Open Data" by Paco Nathan, presented at OSCON 2013 in Portland. Based on a case study from "Enterprise Data Workflows with Cascading" http://shop.oreilly.com/product/0636920028536.do
1. Using Cascalog
to build an app with
City of Palo Alto
Open Data
Paco Nathan
http://liber118.com/pxn/
1Sunday, 28 July 13
2. GitHub repo for the open source project:
github.com/Cascading/CoPA/wiki
This project began as a Big Data workshop
for a graduate seminar at CMU West
Many thanks to:
Stuart Evans
CMU Distinguished Service Professor
Jonathan Reichental
City of Palo Alto CIO
Peter Pirnejad
City of Palo Alto Dev Center Director
Diego May
Junar CEO & Co-founder
3. Cascading, a workflow abstraction
Cascalog ➟ 2.0
Palo Alto case study
Open Data insights
4. Cascading – origins
API author Chris Wensel worked as a system architect
at an Enterprise firm well-known for many popular
data products.
Wensel was following the Nutch open source project –
where Hadoop started.
Observation: would be difficult to find Java developers
to write complex Enterprise apps in MapReduce –
potential blocker for leveraging new open source
technology.
5. Cascading – functional programming
Key insight: MapReduce is based on functional programming
– back to LISP in the 1970s. Apache Hadoop use cases are
mostly about data pipelines, which are functional in nature.
To ease staffing problems as “Main Street” Enterprise firms
began to embrace Hadoop, Cascading was introduced
in late 2007, as a new Java API to implement functional
programming for large-scale data workflows:
• leverages JVM and Java-based tools without any
need to create new languages
• allows programmers who have J2EE expertise
to leverage the economics of Hadoop clusters
6. Cascading – functional programming
• Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc.,
have invested in open source projects atop Cascading
– used for their large-scale production deployments
• new case studies for Cascading apps are mostly based
on domain-specific languages (DSLs) in JVM languages
which emphasize functional programming:
Cascalog in Clojure (2010)
Scalding in Scala (2012)
github.com/nathanmarz/cascalog/wiki
github.com/twitter/scalding/wiki
Why Adopting the Declarative Programming Practices Will Improve Your Return from Technology
Dan Woods, 2013-04-17 Forbes
forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-practices-will-improve-your-return-from-technology/
8. Cascading – deployments
• case studies: Climate Corp, Twitter, Etsy,
Williams-Sonoma, uSwitch, Airbnb, Nokia,
YieldBot, Square, Harvard, Factual, etc.
• use cases: ETL, marketing funnel, anti-fraud,
social media, retail pricing, search analytics,
recommenders, eCRM, utility grids, telecom,
genomics, climatology, agronomics, etc.
9. Cascading – deployments
workflow abstraction addresses:
• staffing bottleneck;
• system integration;
• operational complexity;
• test-driven development
10. Workflow Abstraction – pattern language
Cascading uses a “plumbing” metaphor in Java
to define workflows out of familiar elements:
Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.
[Flow diagram: Document Collection, Tokenize, Scrub token, Stop Word List, HashJoin (Left, RHS), Regex token, GroupBy token, Count, Word Count; map (M) and reduce (R) phases]
Data is represented as flows of tuples. Operations
in the flows bring functional programming aspects
into Java
A Pattern Language
Christopher Alexander, et al.
amazon.com/dp/0195019199
11. Workflow Abstraction – literate programming
Cascading workflows generate their own visual
documentation: flow diagrams
in formal terms, flow diagrams leverage a methodology
called literate programming
provides intuitive, visual representations for apps –
great for cross-team collaboration
[Flow diagram: the Word Count workflow, with Document Collection, Tokenize, Scrub token, Stop Word List, HashJoin (Left, RHS), Regex token, GroupBy token, Count, Word Count; map (M) and reduce (R) phases]
Literate Programming
Don Knuth
literateprogramming.com
12. Workflow Abstraction – business process
following the essence of literate programming, Cascading
workflows provide statements of business process
this recalls a sense of business process management
for Enterprise apps (think BPM/BPEL for Big Data)
Cascading creates a separation of concerns between
business process and implementation details (Hadoop, etc.)
this is especially apparent in large-scale Cascalog apps:
“Specify what you require, not how to achieve it.”
by virtue of the pattern language, the flow planner then
determines how to translate business process into efficient,
parallel jobs at scale
13. Cascading, a workflow abstraction
Cascalog ➟ 2.0
Palo Alto case study
Open Data insights
15. For the process used with this Open Data app,
we chose to use Cascalog
github.com/nathanmarz/cascalog/wiki
by Nathan Marz, Sam Ritchie, et al., 2010
a DSL in Clojure which implements
Datalog, backed by Cascading
Some aspects of CS theory:
• Functional Relational Programming
• mitigates Accidental Complexity
• has been compared with Codd 1969
16. Accidental Complexity:
Not O(N) complexity, but the costs of software
engineering at scale over time
What happens when you build recommenders,
then go work on other projects for six months?
What does it cost others to maintain your apps?
“Out of the Tar Pit”, Moseley & Marks, 2006
goo.gl/SKspn
Cascalog allows for leveraging the same framework,
same code base, from ad-hoc queries… to modeling…
to unit tests… to checkpoints in production use
This focuses on the process of structuring data:
specify what you require, not how it must be achieved
Huge implications for software engineering
17. pros:
• most of the largest use cases for Cascading
• 10:1 reduction in code volume compared to SQL
• Leiningen build: simple, no surprises, in Clojure itself
• test-driven development (TDD) for Big Data
• fault-tolerant workflows which are simple to follow
• machine learning, map-reduce, etc., started in LISP
years ago anywho...
cons:
• learning curve, limited number of Clojure developers
• aggregators are the magic, those take effort to learn
18. Q:
Who uses Cascalog, other than Twitter?
A:
• Climate Corp
• Factual
• Nokia
• Telefonica
• Harvard School of Public Health
• YieldBot
• uSwitch
• etc.
22. WordCount – Cascalog / Clojure
(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

;; tokenizer: splits on brackets, backslashes, parentheses,
;; commas, periods, and whitespace
(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\\\(\),.)\s]+"))

;; reads the delimited input, emits one ?word per token,
;; and counts the occurrences of each word
(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))

; Paul Lam
; github.com/Quantisan/Impatient
[Flow diagram: Document Collection, Tokenize, GroupBy token, Count, Word Count; map (M) and reduce (R) phases]
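The tokenize-and-count logic of the WordCount query can also be tried in a plain Clojure REPL, without Hadoop. This is an illustrative sketch only, not part of the Impatient project: `tokenize` and `word-count` are hypothetical names, and `frequencies` stands in for the GroupBy and Count steps that Cascalog would plan as MapReduce jobs.

```clojure
(require '[clojure.string :as s])

(defn tokenize
  "Splits a line on brackets, backslashes, parentheses, commas,
  periods, and whitespace, dropping any empty tokens."
  [line]
  (remove empty? (s/split line #"[\[\]\\\(\),.)\s]+")))

(defn word-count
  "Counts tokens across a collection of lines, entirely in memory."
  [lines]
  (frequencies (mapcat tokenize lines)))

;; example input is made up for illustration
(word-count ["A rain shadow is a dry area"
             "a dry (leeward) side."])
```

The in-memory version is only useful for small samples, but it makes the behaviour of the split regex easy to check before running the full workflow.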
23. Cascading, a workflow abstraction
Cascalog ➟ 2.0
Palo Alto case study
Open Data insights
24. Palo Alto is quite a pleasant place
• temperate weather
• lots of parks, enormous trees
• great coffeehouses
• walkable downtown
• not particularly crowded
On a nice summer day, who wants to be stuck
indoors on a phone call?
Instead, take it outside – go for a walk
25. 1. Open Data about municipal infrastructure
(GIS data: trees, roads, parks)
✚
2. Big Data about where people like to walk
(smartphone GPS logs)
✚
3. some curated metadata
(which surfaces the value)
4. personalized recommendations:
“Find a shady spot on a summer day in which to walk
near downtown Palo Alto. While on a long conference call.
Sipping a latte or enjoying some fro-yo.”
26. The City of Palo Alto recently began to support Open Data
to give the local community greater visibility into how
their city government operates
This effort is intended to encourage students, entrepreneurs,
local organizations, etc., to build new apps which contribute
to the public good
paloalto.opendata.junar.com/dashboards/7576/geographic-information/
discovery
27. GIS about trees in Palo Alto:
discovery
28. Geographic_Information,,,
"Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl"," Private: -1 Tree ID: 29
Street_Name: ADDISON AV Situs Number: 203 Tree Site: 2 Species: Celtis australis
Source: davey tree Protected: Designated: Heritage: Appraised Value:
Hardscape: None Identifier: 40 Active Numeric: 1 Location Feature ID: 13872
Provisional: Install Date: ","37.4409634615283,-122.15648458861,0.0 ","Point"
"Wilkie Way from West Meadow Drive to Victoria Place"," Sequence: 20 Street_Name: Wilkie Way
From Street PMMS: West Meadow Drive To Street PMMS: Victoria Place Street ID: 598 (Wilkie
Wy, Palo Alto) From Street ID PMMS: 689 To Street ID PMMS: 567 Year Constructed: 1950
Traffic Count: 596 Traffic Index: residential local Traffic Class: local residential
Traffic Date: 08/24/90 Paving Length: 208 Paving Width: 40 Paving Area: 8320
Surface Type: asphalt concrete Surface Thickness: 2.0 Base Type Pvmt: crusher run base
Base Thickness: 6.0 Soil Class: 2 Soil Value: 15 Curb Type: Curb Thickness:
Gutter Width: 36.0 Book: 22 Page: 1 District Number: 18 Land Use PMMS: 1
Overlay Year: 1990 Overlay Thickness: 1.5 Base Failure Year: 1990 Base Failure
Thickness: 6 Surface Treatment Year: Surface Treatment Type: Alligator Severity:
none Alligator Extent: 0 Block Severity: none Block Extent: 0 Longitude and
Transverse Severity: none Longitude and Transverse Extent: 0 Ravelling Severity: none
Ravelling Extent: 0 Ridability Severity: none Trench Severity: none Trench Extent: 0
Rutting Severity: none Rutting Extent: 0 Road Performance: UL (Urban Local) Bike Lane:
0 Bus Route: 0 Truck Route: 0 Remediation: Deduct Value: 100 Priority:
Pavement Condition: excellent Street Cut Fee per SqFt: 10.00 Source Date: 6/10/2009
User Modified By: mnicols Identifier System: 21410 ","-122.1249640794,37.4155803115645,0.0
-122.124661859039,37.4154224594993,0.0 -122.124587720719,37.4153758330704,0.0
-122.12451895942,37.4153242300888,0.0 -122.124456098457,37.4152680432944,0.0
-122.124399616238,37.4152077003122,0.0 -122.124374937753,37.4151774433318,0.0 ","Line"
discovery
(unstructured data…)
29.
;; csv is assumed to alias clojure-csv.core (required in the ns form, not shown)
(defn parse-gis
  "leverages parse-csv for complex CSV format in GIS export"
  [line]
  (first (csv/parse-csv line)))

(defn etl-gis
  "subquery to parse data sets from the GIS source tap"
  [gis trap]
  (<- [?blurb ?misc ?geo ?kind]
      (gis ?line)
      (parse-gis ?line :> ?blurb ?misc ?geo ?kind)
      ;; tuples which fail to parse are diverted to the trap tap
      (:trap (hfs-textline trap))))
discovery
(specify what you require,
not how to achieve it…
80:20 cost of data prep)
30. discovery
(ad-hoc queries get refined into
composable predicates)
Identifier: 474
Tree ID: 412
Tree: 412 site 1 at 115 HAWTHORNE AV
Tree Site: 1
Street_Name: HAWTHORNE AV
Situs Number: 115
Private: -1
Species: Liquidambar styraciflua
Source: davey tree
Hardscape: None
37.446001565119,-122.167713417554,0.0
Point
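One way to get from the raw `?misc` text to the key/value view above is a regular expression over the field labels. The sketch below is an assumption about how that step could look in plain Clojure, not the project's actual predicate: `tree-pattern` and `parse-tree` are hypothetical names, and the pattern only covers the fields visible in the sample tree record.

```clojure
;; hypothetical pattern, built from the labels in the sample tree record
(def tree-pattern
  #"Tree ID:\s+(\d+)\s+Street_Name:\s+(.+?)\s+Situs Number:\s+(\d+)\s+Tree Site:\s+(\d+)\s+Species:\s+(.+?)\s+Source:")

(defn parse-tree
  "Extracts tree fields from the misc text; returns nil when the
  record does not match (for example, a road record)."
  [misc]
  (when-let [[_ tree-id street situs site species] (re-find tree-pattern misc)]
    {:tree-id (Long/parseLong tree-id)
     :street  street
     :situs   (Long/parseLong situs)
     :site    (Long/parseLong site)
     :species species}))

;; sample text taken from the GIS export shown earlier
(parse-tree " Private: -1 Tree ID: 29 Street_Name: ADDISON AV Situs Number: 203 Tree Site: 2 Species: Celtis australis Source: davey tree")
```

Returning nil for non-matching records plays a role similar to the trap in the ETL subquery: records of another kind are set aside instead of failing the whole run.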
37. ?blurb          Hawthorne Avenue from Alma Street to High Street
?traffic_count  3110
?traffic_class  local residential
?surface_type   asphalt concrete
?albedo         0.12
?min_lat        37.446140860599854
?min_lng        -122.1674652295435
?min_alt        0.0
?geohash        9q9jh0
(another data product)
discovery
38. The road data provides:
• traffic class (arterial, truck route, residential, etc.)
• traffic counts distribution
• surface type (asphalt, cement; age)
This leads to estimators for noise, reflection, etc.
discovery
40. GIS data from Palo Alto provides us with geolocation about each
item in the export: latitude, longitude, altitude
Geo data is great for managing municipal infrastructure as well as
for mobile apps
Predictive modeling in our Open Data
example focuses on leveraging geolocation
We use spatial indexing by creating
a grid of geohash values, for efficient
parallel processing
Cascalog queries collect items with the
same geohash values – using them as keys
for large-scale joins (Hadoop)
modeling
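The idea of using geohash values as join keys can be sketched in memory. This is a minimal illustration in plain Clojure, not the project's Cascalog query: `join-by-geohash`, the record keys, and the sample records are hypothetical, and a hash map lookup stands in for the large-scale join that Cascading would plan on Hadoop.

```clojure
(defn join-by-geohash
  "Pairs each road with every tree that falls in the same geohash cell."
  [trees roads]
  (let [trees-by-hash (group-by :geohash trees)]
    (for [road roads
          tree (get trees-by-hash (:geohash road))]
      {:geohash (:geohash road)
       :road    (:name road)
       :tree    (:id tree)})))

;; hypothetical sample records; only the geohash 9q9jh0 comes from the slides
(join-by-geohash
 [{:id "tree-a" :geohash "9q9jh0"}
  {:id "tree-b" :geohash "zzzzzz"}]
 [{:name "Hawthorne Avenue" :geohash "9q9jh0"}])
```

Because items only meet when their keys match, each cell can be processed independently, which is what makes the grid suitable for parallel processing.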
41. 9q9jh0
geohash with 6-digit resolution
approximates a 5-block square
centered lat: 37.445, lng: -122.162
modeling
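For readers unfamiliar with geohashes, the encoding can be sketched in a few lines of plain Clojure. This follows the standard geohash scheme (alternating longitude and latitude bisections, five bits per base-32 character); it is an illustration rather than the library call used in the project, and `bisect` and `geohash-encode` are hypothetical names.

```clojure
(def base32 "0123456789bcdefghjkmnpqrstuvwxyz")

(defn- bisect
  "One bisection step: returns [bit narrowed-range] for value v in [lo hi]."
  [[lo hi] v]
  (let [mid (/ (+ lo hi) 2.0)]
    (if (>= v mid)
      [1 [mid hi]]
      [0 [lo mid]])))

(defn geohash-encode
  "Encodes lat/lng as a geohash string with `precision` characters."
  [lat lng precision]
  (loop [bits      []
         lat-range [-90.0 90.0]
         lng-range [-180.0 180.0]
         lng-turn? true]                 ; longitude supplies the first bit
    (if (= (count bits) (* 5 precision))
      (apply str
             (for [chunk (partition 5 bits)]
               (nth base32 (reduce (fn [acc b] (+ (* 2 acc) b)) 0 chunk))))
      (if lng-turn?
        (let [[bit r] (bisect lng-range lng)]
          (recur (conj bits bit) lat-range r false))
        (let [[bit r] (bisect lat-range lat)]
          (recur (conj bits bit) r lng-range true))))))

;; road coordinates from the Hawthorne Avenue record
(geohash-encode 37.446140860599854 -122.1674652295435 6) ; => "9q9jh0"
```

Dropping trailing characters widens the cell, so the precision argument controls the trade-off between grid resolution and the number of items sharing each key.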
42. Each road in the GIS export is listed as a block between two
cross roads, and each may have multiple road segments to
represent turns:
-122.161776959558,37.4518836690781,0.0
-122.161390381489,37.4516410983794,0.0
-122.160786011735,37.4512589903357,0.0
-122.160531178368,37.4510977281699,0.0
modeling
( lat0, lng0, alt0 )
( lat1, lng1, alt1 )
( lat2, lng2, alt2 )
( lat3, lng3, alt3 )
NB: segments in the raw GIS have the order of geo coordinates
scrambled: (lng, lat, alt)
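Reordering the raw coordinates is a small but easy-to-miss step, so here is a hedged sketch of it in plain Clojure. `parse-segment` is a hypothetical name, and the function assumes the geo field is a whitespace-separated list of `lng,lat,alt` triples, as in the export shown above.

```clojure
(require '[clojure.string :as s])

(defn parse-segment
  "Parses a geo field of whitespace-separated lng,lat,alt triples and
  returns a sequence of [lat lng alt] vectors of doubles."
  [geo]
  (for [triple (s/split (s/trim geo) #"\s+")
        :let [[lng lat alt] (map #(Double/parseDouble %) (s/split triple #","))]]
    [lat lng alt]))

;; first two points of the road segment listed above
(parse-segment "-122.161776959558,37.4518836690781,0.0 -122.161390381489,37.4516410983794,0.0 ")
```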
43. 9q9jh0
[Figure: map of geohash cell 9q9jh0 with three points marked X]
Filter out trees which are too far away to provide shade. Calculate a sum
of moments (tree height ÷ distance), as an estimator for shade:
modeling
44.
(defn get-shade
  "subquery to join tree and road estimates, maximize for shade"
  [trees roads]
  (<- [?road_name ?geohash ?road_lat ?road_lng
       ?road_alt ?road_metric ?tree_metric]
      (roads ?road_name _ _ _
             ?albedo ?road_lat ?road_lng ?road_alt ?geohash
             ?traffic_count _ ?traffic_class _ _ _ _)
      (road-metric
       ?traffic_class ?traffic_count ?albedo :> ?road_metric)
      ;; trees join to roads on the shared ?geohash variable
      (trees _ _ _ _ _ _ _
             ?avg_height ?tree_lat ?tree_lng ?tree_alt ?geohash)
      (read-string ?avg_height :> ?height)
      ;; limit to trees which are higher than people
      (> ?height 2.0)
      (tree-distance
       ?tree_lat ?tree_lng ?road_lat ?road_lng :> ?distance)
      ;; limit to trees within a one-block radius (not meters)
      (<= ?distance 25.0)
      (/ ?height ?distance :> ?tree_moment)
      (c/sum ?tree_moment :> ?sum_tree_moment)
      ;; magic number 200000.0 used to scale tree moment
      ;; based on median
      (/ ?sum_tree_moment 200000.0 :> ?tree_metric)))
modeling
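The arithmetic inside the shade query can be checked on a handful of trees in plain Clojure. This sketch mirrors the filters and constants in `get-shade` (height above 2.0, distance at most 25.0, scale factor 200000.0) but is not the Cascalog code itself: `tree-metric` and the map keys are hypothetical, and distances are assumed to be precomputed in the same non-metric units the query uses.

```clojure
(defn tree-metric
  "Sums height/distance over trees that pass both filters, then applies
  the same scale factor as the get-shade query."
  [trees]
  (let [moments (for [{:keys [height distance]} trees
                      :when (and (> height 2.0)        ; taller than people
                                 (<= distance 25.0))]  ; within one block
                  (/ height distance))]
    (/ (reduce + 0.0 moments) 200000.0)))

;; made-up trees: the second is too short, the third is too far away
(tree-metric [{:height 10.0 :distance 5.0}
              {:height 1.5  :distance 2.0}
              {:height 20.0 :distance 40.0}
              {:height 8.0  :distance 4.0}])
```

Tall trees close to the road dominate the sum, which is the intended behaviour for a shade estimator; a distance of zero would need separate handling.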
48. Recommenders often combine multiple signals, via weighted
averages, to rank personalized results:
• GPS of person ∩ road segment
• frequency and recency of visit
• traffic class and rate
• road albedo (sunlight reflection)
• tree shade estimator
Adjusting the mix allows for further personalization at the end use
(defn get-reco
  "subquery to recommend road segments based on GPS tracks"
  [tracks shades]
  (<- [?uuid ?road ?geohash ?lat ?lng ?alt
       ?gps_count ?recent_visit ?road_metric ?tree_metric]
      ;; tracks and shades join on the shared ?geohash variable
      (tracks ?uuid ?geohash ?gps_count ?recent_visit)
      (shades ?road ?geohash ?lat ?lng ?alt ?road_metric ?tree_metric)))
apps
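The query above only assembles the signals per user and road segment; the weighted average itself is left to the end use. A minimal sketch of that final ranking step, in plain Clojure, might look like the following. The function names and the weights are hypothetical, and the signals are assumed to be already normalized to comparable scales.

```clojure
(defn weighted-score
  "Weighted average of the signals named in `weights`;
  a signal missing from the segment counts as 0.0."
  [weights signals]
  (/ (reduce + (for [[k w] weights] (* w (get signals k 0.0))))
     (reduce + (vals weights))))

(defn rank-segments
  "Sorts candidate road segments by descending weighted score."
  [weights segments]
  (sort-by #(weighted-score weights %) > segments))

;; hypothetical mix: favour shade, then quiet roads, then familiarity
(def example-weights {:tree_metric 0.5 :road_metric 0.3 :gps_count 0.2})

(rank-segments example-weights
               [{:road "B" :tree_metric 1.0 :road_metric 2.0 :gps_count 5.0}
                {:road "A" :tree_metric 4.0 :road_metric 1.0 :gps_count 1.0}])
```

Changing the entries in the weights map is the "adjusting the mix" step: a user who cares more about familiar routes than shade would simply raise the weight on the GPS signal.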
50. ‣ addr: 115 HAWTHORNE AVE
‣ lat/lng: 37.446, -122.168
‣ geohash: 9q9jh0
‣ tree: 413 site 2
‣ species: Liquidambar styraciflua
‣ est. height: 23 m
‣ shade metric: 4.363
‣ traffic: local residential, light traffic
‣ recent visit: 1972376952532
‣ a short walk from my train stop ✔
apps
51. Could combine this with a variety of data APIs:
• Trulia neighborhood data, housing prices
• Factual local business (FB Places, etc.)
• CommonCrawl open source full web crawl
• Wunderground local weather data
• WalkScore neighborhood data, walkability
• Data.gov US federal open data
• Data.NASA.gov NASA open data
• DBpedia datasets derived from Wikipedia
• GeoWordNet semantic knowledge base
• Geolytics demographics, GIS, etc.
• Foursquare, Yelp, CityGrid, Localeze, YP
• various photo sharing
apps
walkscore.com/CA/Palo_Alto
52. Cascading, a workflow abstraction
Cascalog ➟ 2.0
Palo Alto case study
Open Data insights
53. Trends in Public Administration
late 1880s – late 1920s (Woodrow Wilson)
as hierarchy, bureaucracy → only for the most educated, elite
late 1920s – late 1930s
as a business, relying on “Scientific Method”, gov as a process
late 1930s – late 1940s (Robert Dale)
relationships, behavioral-based → policy not separate from politics
late 1940s – 1980s
yet another form of management → less “command and control”
1980s – 1990s (David Osborne, Ted Gaebler)
New Public Management → service efficiency, more private sector
1990s – present (Janet & Robert Denhardt)
Digital Age → transparency, citizen-based “debugging”, bankruptcies
The Roles, Actors, and Norms Necessary to
Institutionalize Sustainable Collaborative Governance
Peter Pirnejad
USC Price School of Policy
2013-05-02
54. Trends in Public Administration
Drivers, circa 2013
• governments running out of money,
cannot increase staff and services
• better data infra at scale (cloud, OSS, etc.)
• machine learning techniques to monetize
• viable ecosystem for data products, APIs
• mobile devices enabling use cases
55. Open Data notes
Successful apps incorporate three components:
• Big Data (consumer interest, personalization)
• Open Data (monetizing public data)
• Curated Metadata
Most of the largest Cascading deployments leverage some
Open Data components: Climate Corp, Factual, Nokia, etc.
Notes about Open Data use cases: goo.gl/cd995T
Consider buildingeye.com, aggregate building permits:
• pricing data for home owners looking to remodel
• sales data for contractors
• imagine joining data with building inspection history,
for better insights about properties for sale…
56. Open Data ecosystem
municipal departments: e.g., Palo Alto, San Francisco, etc.
publishing platforms: e.g., Junar, Socrata, etc.
aggregators: e.g., OpenStreetMap, WalkScore, etc.
data product vendors: e.g., Factual, Marinexplore, etc.
end use cases: e.g., Facebook
Data feeds structured for
public-private partnerships
57. Open Data ecosystem – caveats
Required Focus
• respond to viable use cases
• not budgeting hackathons
58. Open Data ecosystem – caveats
Required Focus
• surface the metadata
• curate, allowing for joins/aggregation
• not scans as PDFs
59. Open Data ecosystem – caveats
Required Focus
• make APIs consumable by automation
• allow for probabilistic usage
• not OSS licensing for data
60. Open Data ecosystem – caveats
Required Focus
• supply actionable data
• track data provenance carefully
• provide feedback upstream,
i.e., cleaned data at source
• focus on core verticals
61. Open Data ecosystem – caveats
Required Focus
• address consumer needs
• identify community benefits
of the data