Data scientists spend too much of their time collecting, cleaning, and wrangling data, as well as curating and enriching it. Some of this work is inevitable given the variety of data sources, but tools and frameworks exist that automate many of these non-creative tasks. A unifying feature of these tools is support for rich metadata for datasets, jobs, and data policies. In this talk, I introduce state-of-the-art tools for automating data science and show how you can use metadata to help automate common tasks. I also introduce a new architecture for extensible, distributed metadata in Hadoop, called Hops (Hadoop Open Platform-as-a-Service), and show how tinker-friendly metadata (for jobs, files, users, and projects) opens up new ways to build smarter applications.
Data Science with the Help of Metadata
1. Data Science with the Help of Metadata
Jim Dowling
Associate Prof @ KTH
Senior Researcher @ SICS
CEO @ Logical Clocks AB
www.hops.io
@hopshadoop
2. Metadata for Source Code
•Metadata for Source Code
- Enables questions like: who, when, what, why?
•Metadata for Automation
- Enables testing, quality-control, deployment.
•Metadata for Collaboration
- Github projects, teams
3. Metadata for Datasets?
•Access Control
•Data provenance
•Auditing
•Development
- Schema for the dataset
- How can I load/download this dataset?
- Quality control
4. Metadata can simplify development

Hive-on-Spark (the schema is already in the Hive metastore, so a query is one call):

sqlContext = HiveContext(sc)
f1_df = sqlContext.sql(
    """SELECT id, count(*) AS nb_entries
       FROM my_db.log
       WHERE ts = '20160515'
       GROUP BY id"""
)

SparkSQL (no metadata: the schema must be declared and attached by hand):

from pyspark.sql.types import StructField, StructType, StringType

sqlContext = SQLContext(sc)
# Parse each line into fields first (assuming tab-separated records);
# createDataFrame needs row tuples/lists, not raw strings.
f0 = sc.textFile('logfile').map(lambda line: line.split('\t'))
fpFields = [
    StructField('ts', StringType(), True),
    StructField('id', StringType(), True),
    StructField('it', StringType(), True)
]
fpSchema = StructType(fpFields)
df_f0 = sqlContext.createDataFrame(f0, fpSchema)
df_f0.registerTempTable('log')
f1_df = sqlContext.sql(
    """SELECT log.id, count(*) AS nb_entries
       FROM log WHERE ts = '20160515'
       GROUP BY id"""
)
6. Metadata for Files/Directories in HDFS
•Add Schemas using the Filesystem API
•Add auditing using the FSImage API
•Add access control using a Filesystem Plugin
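The Filesystem-API idea can be sketched with POSIX extended attributes, which HDFS also exposes (`hdfs dfs -setfattr`/`-getfattr`). A minimal local-filesystem sketch, assuming Linux with `user.*` xattr support; the attribute name `user.schema` is my own:

```python
import json
import os
import tempfile

# Schema to attach; mirrors the log-file fields used earlier in the talk.
schema = {"fields": [{"name": "ts", "type": "string"},
                     {"name": "id", "type": "string"},
                     {"name": "it", "type": "string"}]}

with tempfile.NamedTemporaryFile(dir=".", suffix=".log") as f:
    # Store the schema in the file's own metadata ("user." namespace),
    # so readers need no side channel to learn how to parse the file.
    os.setxattr(f.name, "user.schema", json.dumps(schema).encode())

    # Any consumer can recover the schema straight from the file.
    recovered = json.loads(os.getxattr(f.name, "user.schema"))

print([fld["name"] for fld in recovered["fields"]])
```

HDFS extended attributes behave analogously, which is one way a filesystem API can carry per-file schema metadata.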
7. Access Control in Hadoop
hdfs dfs -chmod -R 000 /apps/hive
[http://hortonworks.com/blog/best-practices-in-hdfs-authorization-with-apache-ranger]
8. Metadata Totem Poles in Hadoop
How do you ensure the consistency of the metadata and the data?
17. Metadata for HDFS and YARN
Entities with metadata: Files, Directories, Containers, Provenance, Security, Quotas, Projects, Datasets.
•Metadata + Data in the same Database
•Strong Consistency for Metadata
•Metadata integrity maintained using 2-phase commit (transactions) and foreign keys
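A minimal single-process illustration of this guarantee (SQLite here, purely for illustration; Hops uses a distributed database): with metadata rows foreign-keyed to the inodes they describe, the database itself rejects dangling metadata.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs per connection

conn.execute("CREATE TABLE inodes (id INTEGER PRIMARY KEY, path TEXT)")
conn.execute("""CREATE TABLE schemas (
                  inode_id INTEGER REFERENCES inodes(id),
                  schema   TEXT)""")

conn.execute("INSERT INTO inodes VALUES (1, '/Projects/NSA/logs')")
conn.execute("INSERT INTO schemas VALUES (1, 'ts:string,id:string')")

# Metadata pointing at a non-existent file is rejected by the database
# itself, not by application code -- the integrity property above.
try:
    conn.execute("INSERT INTO schemas VALUES (99, 'orphan')")
    dangling_allowed = True
except sqlite3.IntegrityError:
    dangling_allowed = False

print(dangling_allowed)  # False
```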
21. Problem: Sensitive Data needs its own Cluster
•Alice has only one Kerberos Identity, so she can copy/cross-link between the NSA DataSet and the User DataSet.
•Neither attribute-based access control nor dynamic roles are supported in Hadoop.
22. Solution: Project-Specific UserIDs
•Alice is a member of Project NSA as NSA__Alice, and of Project Users as Users__Alice.
•HDFS enforces access control on these project-specific users.
How can we share DataSets between Projects?
23. Sharing DataSets between Projects
•A DataSet is owned by one project, whose members (e.g., Users__Alice) are in the DataSet's group.
•To share it with another project (e.g., Project NSA), add that project's members (e.g., NSA__Alice) to the DataSet group.
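The two slides above can be sketched in a few lines (all helper names are my own, not Hops APIs): each project membership yields a distinct principal `Project__user`, and sharing a DataSet means adding another project's principals to the DataSet's group.

```python
# Sketch of project-specific userIDs and group-based DataSet sharing.

def project_user(project: str, user: str) -> str:
    # One Kerberos identity, many project-specific principals.
    return f"{project}__{user}"

# The NSA project's DataSet group initially holds only NSA members.
dataset_group = {"NSA_dataset": {project_user("NSA", "Alice")}}

def share(dataset: str, project: str, members: list[str]) -> None:
    # Sharing = adding the other project's principals to the group.
    dataset_group[dataset].update(project_user(project, m) for m in members)

def can_read(dataset: str, project: str, user: str) -> bool:
    # HDFS group permissions do the enforcement; NSA__Alice and
    # Users__Alice are distinct principals, so access never leaks.
    return project_user(project, user) in dataset_group[dataset]

print(can_read("NSA_dataset", "Users", "Alice"))  # False before sharing
share("NSA_dataset", "Users", ["Alice"])
print(can_read("NSA_dataset", "Users", "Alice"))  # True afterwards
```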
25. X.509 Certificate Per Project-Specific User
•Alice@gmail.com authenticates, and a Project Mgr adds/deletes users.
•Cert Signing Requests go to a Root CA; certs are inserted into/removed from the Distributed Database.
•Services: Hadoop, Spark, Kafka, etc.
26. Project
•A project is a collection of
- Members
- HDFS DataSets
- Kafka Topics
- Notebooks, Jobs
•A project has an owner
•A project has quotas
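An illustrative data model for such a project (a sketch; the field names and default quotas are my own, not Hopsworks'):

```python
from dataclasses import dataclass, field

@dataclass
class Project:
    """A project: an owner, members, HDFS DataSets, Kafka topics, quotas."""
    name: str
    owner: str
    members: set[str] = field(default_factory=set)
    datasets: list[str] = field(default_factory=list)   # HDFS DataSets
    topics: list[str] = field(default_factory=list)     # Kafka topics
    cpu_quota_hours: float = 100.0                      # YARN CPU quota
    storage_quota_gb: float = 50.0                      # HDFS storage quota

p = Project(name="NSA", owner="Alice")
p.members.add("Alice")
p.datasets.append("logs")
p.topics.append("clickstream")
print(p.name, len(p.datasets), len(p.topics))  # NSA 1 1
```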
27. Project Roles
Data Owner Privileges
- Import/Export data
- Manage Membership
- Share DataSets, Topics
Data Scientist Privileges
- Write and Run code
We delegate administration of privileges to users
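The role split above can be sketched as a privilege table (a sketch, not the Hopsworks implementation): a Data Owner holds every Data Scientist privilege plus the administrative ones, which is what lets privilege administration be delegated to users.

```python
# Data Scientist privileges are a subset of Data Owner privileges.
SCIENTIST_PRIVILEGES = {"write_code", "run_code"}
OWNER_PRIVILEGES = SCIENTIST_PRIVILEGES | {
    "import_data", "export_data", "manage_membership",
    "share_datasets", "share_topics",
}

ROLES = {"data_owner": OWNER_PRIVILEGES,
         "data_scientist": SCIENTIST_PRIVILEGES}

def allowed(role: str, action: str) -> bool:
    # Access decision is a simple set lookup per role.
    return action in ROLES[role]

print(allowed("data_scientist", "run_code"))           # True
print(allowed("data_scientist", "manage_membership"))  # False
```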
28. Elastic Hadoop
Each Project has a:
• YARN CPU Quota
• HDFS Storage Quota
Uber-style pricing to incentivize cluster usage
31. www.hops.site
A 2 MW datacenter research and test environment
5 lab modules, planned up to 3-4000 servers, 2-3000 square meters
[Slide by Prof. Tor Björn Minde, CEO SICS North Swedish ICT AB]
33. Status and Upcoming
•Automated installation support using Vagrant/Chef
or Karamel/Chef
•First official release of Hopsworks coming soon
•Globally shared datasets with peer-to-peer
technology, backed by our data center.
•Support for Apache Beam
34. Summing Up
•Metadata services have the potential to make your life easier as a Data Scientist.
•Most Hadoop metadata services are proprietary and require an administrator-in-the-loop.
•Hops provides an open, tinker-friendly platform for building consistent metadata.
•Hopsworks shows how you can leverage metadata to build a self-service, project-based model for Hadoop/Spark/Flink applications.
35. The Team
Active: Jim Dowling, Seif Haridi, Tor Björn Minde,
Gautier Berthou, Salman Niazi, Mahmoud Ismail,
Theofilos Kakantousis, Johan Svedlund Nordström,
Vasileios Giannokostas, Ermias Gebremeskel,
Antonios Kouzoupis, Misganu Dessalegn, Rizvi Hasan,
Paul Mälzer, Bram Leenders, Juan Roca.
Alumni: K. “Sri” Srijeyanthan, Steffen Grohsschmiedt,
Alberto Lorente, Andre Moré, Ali Gholami,
Stig Viaene, Hooman Peiro, Evangelos Savvidis,
Jude D’Souza, Qi Qi, Gayana Chandrasekara,
Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos,
Peter Buechler, Pushparaj Motamari, Hamid Afzali,
Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.