2. Whoami
I work at TVN S.A., the biggest Polish commercial TV broadcaster. We
have a broad variety of TV channels (linear TV) like TVN, TVN24, HGTV, TVN Style
and TVN Turbo. We produce great content, which we share on our VoD platform
player.pl (non-linear TV). www.tvn24.pl is our news portal, where people can watch
a live stream of the TVN24 channel and read breaking news.
As a Big Data Solutions Architect I lead the Data Engineers team.
We support the Data Lake and make data available to business users.
We also build data processing pipelines.
I am a Cloudera Certified Administrator for Apache Hadoop.
https://www.linkedin.com/in/arturr/
3. How to start with
the Data Lake
Tip
Dig a great hole and ask friends to dive into the lake with you.
4. Business clients
Write down who/what will use the Data Lake. It can
be Data Scientists, Data Analysts or applications.
This will affect SLAs and security levels.
The users will take responsibility for extracting the
information from the data and interpreting it
properly.
5. Data sources inventory
Check sizing
The daily size will tell you how much storage you will need for the data. HDFS stores 3 copies by
default, but data can be compressed.
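As a back-of-the-envelope sketch of the sizing math (a Python illustration; the formula and numbers are assumptions for this example, not from this deck):

```python
# Rough HDFS sizing sketch (illustrative assumptions, not a vendor formula).
# daily_gb: uncompressed daily ingest; compression_ratio: e.g. 0.4 means the
# data shrinks to 40% of its original size; replication: HDFS default is 3.
def raw_hdfs_gb(daily_gb, retention_days, replication=3, compression_ratio=1.0):
    return daily_gb * retention_days * replication * compression_ratio

# Example: 100 GB/day, 1 year retention, compressed to ~40% of original size:
needed = raw_hdfs_gb(100, 365, replication=3, compression_ratio=0.4)
# -> 43800.0 GB, i.e. about 44 TB of raw HDFS capacity
```

Leave generous headroom on top of this estimate: temporary job output and intermediate data also land on the cluster.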
Format
What is the current data format? (CSV, a log with space-separated fields, multiline, streaming formats
such as Avro or JSON)
Interfaces
External/internal, firewall issues, security, encoding
Ingest frequency
How often do the business users want the data to be available to them?
For analytics it is usually 1 day. For online tracking it might be 1 minute, 10 s or even less than 1 s.
6. Teams definitions
Data Scientists
Responsible for asking the right questions against data, with properly built/chosen algorithms to find the right answers. Need to have
cross-domain skills/experience to see the "big picture". Usually should know: machine learning, programming in Java/Scala/Python, R, SQL, as well as
statistical analysis and conceptual modeling.
"A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician."
Josh Wills, former Director of Data Science at Cloudera
Data Analysts
Help Data Scientists in their jobs. Perform analysis on data sets from a particular system. Should know data mining and how to prepare
summary reports, and should be highly skilled in Business Intelligence tools like Tableau or Qlik.
"Junior Data Scientist"
Data Engineers
Build data processing systems on Hadoop, either with self-written applications in Java/Scala/Python for Spark or with software like Hive, Sqoop,
Kafka, Flume or NiFi. They decide on hardware and software needs. They keep the Data Platform up and running.
7. Technology
Hadoop distribution
Cloudera, Hortonworks, MapR, self-built, cloud, other.
Services to install
Depends on business user needs. For example, if you plan to use complex processing pipelines, install Luigi or Jenkins
instead of Oozie. A smaller number of services makes for a faster start with the Data Lake and easier support later.
BI Tools
Depends on business user needs.
8. Security
Securing your Data Lake is always a good idea, but it will delay the start of the Data Lake, so you need to balance when to enable
security.
The earlier you secure your Hadoop Data Lake, the fewer problems it will create, because the Data Lake will grow fast in the
number of users, data sources, ETLs and services.
By default Hadoop trusts that the user presented at login is that user.
Securing Hadoop means at least enabling Kerberos authentication (often with an LDAP/AD backend).
A full AAA security model with active auditing is very time-consuming to implement and might need commercial support.
Authentication needs Kerberos.
Authorization needs Sentry or Ranger.
Auditing can be passive (good for a start) or active.
9. Infrastructure
Usually it is better to have more, smaller servers than fewer strong ones - we want to parallelize the computations.
You might need strong boxes (with a great amount of RAM) for services like Impala or Spark that do in-memory
processing.
It is always a good idea to have 4x 1 Gbps network ports as an LACP bond, if your LAN switches support this.
It is also a good idea to plan HDFS tiering or to send "cold data" to cheap cloud storage.
Put worker nodes (HDFS DataNodes + YARN NodeManagers) on bare-metal machines, though virtualizing the masters is
something to consider.
11. Data Lake environments
Minimum 2 environments (RC and Prod); it is best to have at least 3.
An environment is a Hadoop cluster + BI tools + ETL jobs and any other services you will change/implement in Production.
1. Testing (R&D) - you can test new services and new functionalities there. It can be shut down at any time. Users can
have their own environments for their own tests. It might be totally virtualized.
2. Release Candidate (RC) - has the same configuration as Production, but with minimal hardware resources (can be
virtualized). For testing software upgrades and configuration changes. For example, when you implement Kerberos
authentication this is a "must have" environment. Only selected users who need to prepare a production change have
access.
3. Production - business users must obey the rules in order to keep the Data Lake services' SLAs.
12. Storage file formats
1. Row (easily human readable, slow)
a. Used in RDBMS for SUID queries that work on many columns at a time
b. Not so good for aggregations, because you need to do a full table scan
c. Not so good for compression, because neighboring values in one row have various data types
1,1,search,122.303.444,t_{0},/products/search/filters=tshirts,US; --row 1
2,1,view,122.303.444,t_{1},/products/id=11212,US; --row 2
...;
7,9,checkout,121.322.432,t_{2},/checkout/cart=478/status=confirmed,UK; --row 7
2. Columnar (very hard for a human to read, fast)
a. Values from a particular column are stored as neighbors of each other.
b. Very good for aggregations, because you only have to read the blocks of the relevant columns, without a full table scan.
c. Very good for compression, because values of the same type lie next to each other (e.g. run-length encoding for integers, dictionary encoding for strings).
1,2,3,4,5,6,7;
1,3,9;
search,view,add_to_cart,checkout;
...;
US,FR,UK;
13. Storage file formats - TEXTFILE
TEXTFILE
Row format. Very good for start. Good for storing CSV/TSV.
CREATE EXTERNAL TABLE `mydb.peoples_transactions` (
`id` INT COMMENT 'Personal id',
`name` STRING COMMENT 'Name of a person',
`day` STRING COMMENT 'Transaction date' )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/externals/mydb/peoples_transactions';
14. Storage file formats - Avro
- Self-describing - the schema can be part of the data file.
- Supports column aliases for schema evolution.
- The schema can be defined in a file (SERDEPROPERTIES) or in the table
definition (TBLPROPERTIES).
- Easy transformation from/to JSON.
CREATE TEMPORARY EXTERNAL TABLE temporary_myavro
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES
('avro.schema.url'='/avro_schemas/myavro.avsc')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/externals/myavro';
Alternatively, embed the schema literal in the table definition:
TBLPROPERTIES ('avro.schema.literal'='{
"namespace": "testing.hive.avro.serde",
"name": "peoples_transactions",
"type": "record",
"fields": [
{ "name":"id", "type":"int",
"doc":"Personal id" },
{ "name":"name", "type":"string",
"doc":"Name of a person" },
{ "name":"day", "type":"string",
"doc":"Transaction date" }
],
"doc":"Table with transactions"
}')
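The column aliases mentioned above are what make schema evolution work: a renamed field can declare its old name as an alias, so readers can still resolve data written with the old schema. As a hypothetical illustration (assuming `id` were later renamed to `person_id`):

```json
{ "name":"person_id", "type":"int",
  "aliases": ["id"],
  "doc":"Personal id (renamed from id)" }
```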
15. Storage file formats - Parquet
Parquet
The most widely used columnar file format. Supported by
Cloudera's in-memory engine Impala, by Hive and by Spark. Has basic
statistics - the number of elements in a column stored in a particular row
group. Very good for Spark when using Tungsten.
16. Storage file formats - ORC
ORC
- Has 3 index levels (file, stripe and 10k rows).
- Files can be even 78% smaller.
- Basic statistics: min, max, sum, count per column, per stripe and per file.
- When inserting into a table, try to sort the data by the most frequently used column.
- Supports predicate pushdown.
18. Storage file formats - example
ORC write and load into DataFrame
import org.apache.spark.sql._
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._ // needed for toDF()
case class Contact(name: String, phone: String)
case class Person(name: String, age: Int, contacts: Seq[Contact])
val records = (1 to 100).map { i =>
Person(s"name_$i", i, (0 to 1).map { m => Contact(s"contact_$m", s"phone_$m") })
}
sc.parallelize(records).toDF().write.format("orc").save("people")
val people = sqlContext.read.format("orc").load("people")
people.registerTempTable("people")
Predicate pushdown example
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
sqlContext.sql("SELECT name FROM people WHERE age < 15").count()
19. Storage file formats - compression
None - fast data access, but large file sizes and large network
bandwidth.
Snappy - written by Google. Very fast, but not splittable -
decoding will be done on one CPU core. Compression level about
40%. Usually the best choice.
Gzip - good compression ratio. Slower than Snappy. High CPU
usage.
20. Batch ingestion
There are various methods of batch ingestion, depending on the
source of the data. Sqoop is used mostly to import from RDBMSs.
Direct file upload to Hive is also possible. There are also other
tools, like NiFi, to monitor and orchestrate the ingestion.
21. Batch ingestion - sqoop
- Very good for copying tables from an RDBMS into HDFS files or Hive tables.
- The number of output files can be controlled by the number of mappers used.
- Connects using the JDBC driver for the particular database, but not only.
- By default stores data in text files; SequenceFiles and Avro are also supported.
- Supports HCatalog - useful for importing into other storage formats.
- Supports incremental import based on a last value (append and lastmodified modes).
- You can specify a query to be imported.
- Supports Gzip compression (not enabled by default) and other algorithms.
- Care must be taken with many pitfalls when importing, for example:
- Fields in the database can contain newline characters; this is a problem when importing into Hive,
where the table row delimiter is also '\n'.
- NULLs by default are imported as the string 'null', while Hive uses \N (use --null-string (string columns) and
--null-non-string (non-string columns), with escaping).
22. Batch ingestion - sqoop into ORC
sqoop import \
-Dmapreduce.job.queuename=your.yarn.scheduler.queuename \
--connect jdbc:mysql://myserver:3306/mydatabase \
--username srvuser \
--password-file file:///path/to/.sqoop.pw.file \
-m 1 \
--null-string '\\N' \
--null-non-string '\\N' \
--table mytable \
--hcatalog-database myhivedb \
--hcatalog-table mytable \
--hcatalog-partition-values "`date +%F`" \
--hcatalog-partition-keys 'day' \
--hcatalog-storage-stanza 'STORED AS ORCFILE'
23. Batch ingestion - Hive
1. Import to an external table
a. Copy files into HDFS, to the external table's location directory, using hdfs dfs -put.
b. Partitioning is a good practice. Usually by date, for example:
/externals/myparttab/y=2017/m=04/d=29/
After uploading a file to a new partition you need to create this partition in the Hive metastore with the MSCK
REPAIR TABLE command.
2. Import to optimized storage
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE mydb.mytable
PARTITION(`day`)
SELECT
`id`, `name`, `day`
FROM mydb.mytable_external;
24. Batch ingestion - NiFi
A powerful tool for controlling and monitoring the data flow. In the GUI
you build a graph of configurable processors and their relationships. You
can change data formats on the fly (e.g. JSON into Avro). Has many
processors for HDFS, Hive, Kafka, Flume, JDBC interfaces and many
others. Can be used for both batch (interval- or cron-based) and stream
processing.
26. Data export
Hive export to comma/tab/delimiter-separated values formats.
CONNSTRING="jdbc:hive2://my.hive.com:10000/;principal=hive/my.hive.com@MYREALM.COM?mapreduce.job.queuename=my.queue"
DAY=`date +%F`
/usr/bin/beeline -u ${CONNSTRING} --outputformat=tsv2 --showHeader=false --hivevar DAY=$DAY -e "SELECT * FROM
mydb.mytable where day='${hivevar:DAY}'" > mytable.${DAY}.tsv
For DSV use --delimiterForDSV='ł'
27. Data export - Spark
Complex exports can be run as a Spark or MapReduce application. For example, the easiest way to export to an RDBMS from
Spark is a direct write from a DataFrame:
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

val prop = new java.util.Properties
val jurl = "jdbc:sqlserver://my.sql.com:1433;databaseName=mydb"
val rdbmsTab = "mytab"
def main(args: Array[String]): Unit = {
prop.setProperty("user", "myuser")
prop.setProperty("password", "XXX")
prop.setProperty("driver", "rdbms.jdbc.driver")
val sc = new SparkContext()
val sqlContext = new HiveContext(sc)
val myDF = sqlContext.sql("""
SELECT country, count(id), day
FROM mydb.mytab
WHERE day < from_unixtime(unix_timestamp(),'yyyy-MM-dd')
GROUP BY day, country
""")
myDF.write.mode("overwrite").jdbc(jurl, rdbmsTab, prop)
}
If you need to make UPDATEs you need to use the plain JDBC DriverManager, because DataFrames can only write in the
"error", "append", "overwrite" and "ignore" modes.
28. Machine Learning model lifecycle
Lifecycle based on Spark
1. The Data Scientist trains the model and saves it as PMML or a Spark ML
Pipeline.
2. Depending on the need, the ML model can be used by a Data
Engineer to:
a. Recalculate data once a day and export it, for
example, to an RDBMS.
b. Load the saved model and expose it, for example, via a REST
API, or update an in-memory store like Druid.
c. Use Oryx for online model upgrades.
3. Data Scientists must have access to measure model
effectiveness.
30. Spark considerations
- DataFrames are better than RDDs (collections of Java objects).
- Don't cache RDDs of Java objects - serialization is costly.
- When using Spark Streaming it is better to log an error once per
batch than to throw it 100k times.
- Kafka's best message size is 10k-100k. Even 2 rows in one
message are better than one.
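The Kafka batching advice above can be sketched as follows (a Python illustration; the batch size and newline framing are assumptions, and the real Kafka producer API is not shown):

```python
# Pack many small rows into one larger message payload instead of sending one
# tiny message per row. The rows-per-message value and "\n" framing are
# illustrative choices, not Kafka requirements.
def pack_messages(rows, rows_per_message):
    return ["\n".join(rows[i:i + rows_per_message])
            for i in range(0, len(rows), rows_per_message)]

rows = ["row_%d" % i for i in range(1, 6)]
pack_messages(rows, 2)
# -> ["row_1\nrow_2", "row_3\nrow_4", "row_5"]
```

Fewer, larger messages amortize the per-message broker and network overhead, which is the point of the 10k-100k sizing rule.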
31. Environments for DA and DS
- Hue - the primary tool for Analysts and Data Scientists.
- Beeline for accessing Hive from the CLI.
- JDBC for connecting to Hive tables from Excel.
- Self-service BI tools (Tableau, Qlik, etc.).
- Jupyter - a notebook for Data Scientists. Can run
Scala/Python/R in one flow.
32. What’s next
- Data Lake 3.0 (Hortonworks)
- Application assembly - run multiple services in dockerized containers on YARN. Each can have its own
environment.
- Auto-tiering - for automatic data movement between tiers.
- Network and IO level isolation.
- Cloudera Data Science Workbench
- A collaborative platform for Data Scientists.
- Generally available since June 2017.
- Spark 2
- SparkContext and HiveContext are rebuilt into SparkSession.
- Adds a built-in CSV reader. In Spark 1.x you had to use the spark-csv library from Databricks manually.
- Global temporary views are available to other sessions.