Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data Architecture

•

24 gefällt mir•4,158 views

This session will cover a series of problems that are adequately solved with Apache Spark, as well as those that are require additional technologies to implement correctly. Here’s an example outline of some of the topics that will be covered in the talk: Problems that are perfectly solved with Apache Spark: 1) Analyzing a large set of data files. 2) Doing ETL of a large amount of data. 3) Applying Machine Learning & Data Science to a large dataset. 4) Connecting BI/Visualization tools to Apache Spark to analyze large datasets internally. By Vida Ha at Spark Summit East 2016.

Ingenieurwesen

Not Your Father’s Database:
How to Use Apache Spark Properly  
in Your Big Data Architecture
Spark Summit East 2016

About Me
2005 Mobile Web & Voice Search
3

About Me
2005 Mobile Web & Voice Search
4
2012 Reporting & Analytics

About Me
2005 Mobile Web & Voice Search
5
2012 Reporting & Analytics
2014 Solutions Engineering

This system talks like a SQL Database…
Is this your Spark infrastructure?
6
HDFS
SQL

But the performance is very different…
Is this your Spark infrastructure?
7
SQL
HDFS

Just in Time Data Warehouse w/ Spark
HDFS

Just in Time Data Warehouse w/ Spark
and more…
HDFS

11
Know when to use other data stores  
besides file systems
Today’s Goal

Good: General Purpose Processing
Types of Data Sets to Store in File Systems:
• Archival Data
• Unstructured Data
• Social Media and other web datasets
• Backup copies of data stores
12

Types of workloads
• Batch Workloads
• Ad Hoc Analysis
– Best Practice: Use in memory caching
• Multi-step Pipelines
• Iterative Workloads
13
Good: General Purpose Processing

Benefits:
• Inexpensive Storage
• Incredibly flexible processing
• Speed and Scale
14
Good: General Purpose Processing

Bad: Random Access
sqlContext.sql(
“select * from my_large_table where id=2I34823”)
Will this command run in Spark?
15

Bad: Random Access
sqlContext.sql(
“select * from my_large_table where id=2I34823”)
Will this command run in Spark?
Yes, but it’s not very efficient — Spark may have  
to go through all your files to find your row.
16

Bad: Random Access
Solution: If you frequently randomly access your
data, use a database.
• For traditional SQL databases, create an index  
on your key column.
• Key-Value NOSQL stores retrieves the value  
of a key efficiently out of the box.
17

Bad: Frequent Inserts
sqlContext.sql(“insert into TABLE myTable
select fields from my2ndTable”)
Each insert creates a new file:
• Inserts are reasonably fast.
• But querying will be slow…
18

Bad: Frequent Inserts
Solution:
• Option 1: Use a database to support the inserts.
• Option 2: Routinely compact your Spark SQL table files.
19

Good: Data Transformation/ETL
Use Spark to splice and dice your data files any way:
File storage is cheap:
Not an “Anti-pattern” to duplicately store your data.
20

Bad: Frequent/Incremental Updates
Update statements — not supported yet.
Why not?
• Random Access: Locate the row(s) in the files.
• Delete & Insert: Delete the old row and insert a new one.
• Update: File formats aren’t optimized for updating rows.
Solution: Many databases support efficient update operations.
21

Use Case: Up-to-date, live views of your SQL tables.
Tip: Use ClusterBy for fast joins or Bucketing with 2.0.
Bad: Frequent/Incremental Updates
22
Incremental
SQL Query
Database
Snapshot
+

Good: Connecting BI Tools
Tip: Cache your tables for optimal performance.
23
HDFS

Bad: External Reporting w/ load
Too many concurrent requests will overload Spark.
24
HDFS

Solution: Write out to a DB to handle load.
Bad: External Reporting w/ load
25
HDFS
DB

Good: Machine Learning & Data Science
Use MLlib, GraphX and Spark packages for machine
learning and data science.
Benefits:
• Built in distributed algorithms.
• In memory capabilities for iterative workloads.
• Data cleansing, featurization, training, testing, etc.
26

Bad: Searching Content w/ load
sqlContext.sql(“select * from mytable
where name like '%xyz%'”)
Spark will go through each row to find results.
27

Weitere ähnliche Inhalte

Was ist angesagt?

Distributed ML in Apache Spark

Databricks

R is a favorite language of many data scientists. In addition to a language and runtime, R is a rich ecosystem of libraries for a wide range of use cases from statistical inference to data visualization. However, handling large datasets with R is challenging, especially when data scientists use R with frameworks or tools written in other languages. In this mode most of the friction is at the interface of R and the other systems. For example, when data is sampled by a big data platform, results need to be transferred to and imported in R as native data structures. In this talk we show how SparkR solves these problems to enable a much smoother experience. In this talk we will present an overview of the SparkR architecture, including how data and control is transferred between R and JVM. This knowledge will help data scientists make better decisions when using SparkR. We will demo and explain some of the existing and supported use cases with real large datasets inside a notebook environment. The demonstration will emphasize how Spark clusters, R and interactive notebook environments, such as Jupyter or Databricks, facilitate exploratory analysis of large data.

Enabling exploratory data science with Spark and R

Databricks

New Developments in Spark

Databricks

Spark Summit San Francisco 2016 - Ali Ghodsi Keynote

Databricks

Spark Under the Hood - Meetup @ Data Science London

Databricks

Announcing Databricks Cloud (Spark Summit 2014)

Databricks

Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

Databricks

Elasticsearch provides native integration with Apache Spark through ES-Hadoop. However, especially during development, it is at best cumbersome to have Elasticsearch running in a separate machine/instance. Leveraging Spark Cluster with Elasticsearch Inside it is possible to run an embedded instance of Elasticsearch in the driver node of a Spark Cluster. This opens up new opportunities to develop cutting-edge applications. One such application is Dataset Search. Oscar will give a demo of a Dataset Search Engine built on Spark Cluster with Elasticsearch Inside. Motivation is that once Elasticsearch is running on Spark it becomes possible and interesting to have the Elasticsearch in-memory instance join an (existing) Elasticsearch cluster. And this in turn enables indexing of Datasets that are processed as part of Data Pipelines running on Spark. Dataset Search and Data Management are R&D topics that should be of interest to Spark Summit East attendees who are looking for a way to organize their Data Lake and make it searchable.

Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...

Spark Summit

Migration de données structurées entre Hadoop et RDBMS par Louis Rabiet (Squid Solution) Avec l'extraction de données stockées dans une base de données relationnelle à l'aide d'un outil de BI avancé, et avec l'envoi via Kafka des données vers Tachyon, plusieurs sessions Spark peuvent travailler sur le même dataset en limitant la duplication. On obtient grâce à cela une communication à coût contrôlé entre la base de données d'origine et Spark ce qui permet de réintroduire de manière dynamique les données modifiées avec MLlib tout en travaillant sur des données à jour. Les résultats préliminaires seront partagés durant cette présentation.

HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...

Modern Data Stack France

This talk outlines data lake design patterns that can yield massive performance gains for all downstream consumers. We will talk about how to optimize Parquet data lakes and the awesome additional features provided by Databricks Delta. * Optimal file sizes in a data lake * File compaction to fix the small file problem * Why Spark hates globbing S3 files * Partitioning data lakes with partitionBy * Parquet predicate pushdown filtering * Limitations of Parquet data lakes (files aren't mutable!) * Mutating Delta lakes * Data skipping with Delta ZORDER indexes Speaker: Matthew Powers

Optimizing Delta/Parquet Data Lakes for Apache Spark

Databricks

In this talk at 2015 Spark Summit East, the lead developer of Spark streaming, @tathadas, talks about the state of Spark streaming: Spark Streaming extends the core Apache Spark API to perform large-scale stream processing, which is revolutionizing the way Big “Streaming” Data application are being written. It is rapidly adopted by companies spread across various business verticals – ad and social network monitoring, real-time analysis of machine data, fraud and anomaly detections, etc. These companies are mainly adopting Spark Streaming because – Its simple, declarative batch-like API makes large-scale stream processing accessible to non-scientists. – Its unified API and a single processing engine (i.e. Spark core engine) allows a single cluster and a single set of operational processes to cover the full spectrum of uses cases – batch, interactive and stream processing. – Its stronger, exactly-once semantics makes it easier to express and debug complex business logic. In this talk, I am going to elaborate on such adoption stories, highlighting interesting use cases of Spark Streaming in the wild. In addition, this presentation will also showcase the exciting new developments in Spark Streaming and the potential future roadmap.

Spark streaming state of the union

Databricks

Spark Summit EU 2015: Reynold Xin Keynote

Databricks

Introduction to Spark (Intern Event Presentation)

Databricks

Spark Summit 2015 keynote: Making Big Data Simple with Spark

Databricks

Big data remains a rapidly evolving field with new applications and infrastructure appearing every year. In this talk, Matei Zaharia will cover new trends in 2016 / 2017 and how Apache Spark is moving to meet them. In particular, he will talk about work Databricks is doing to make Apache Spark interact better with native code (e.g. deep learning libraries), support heterogeneous hardware, and simplify production data pipelines in both streaming and batch settings through Structured Streaming. Speaker: Matei Zaharia Video: http://go.databricks.com/videos/spark-summit-east-2017/what-to-expect-big-data-apache-spark-2017 This talk was originally presented at Spark Summit East 2017.

What to Expect for Big Data and Apache Spark in 2017

Databricks

What's new in pandas and the SciPy stack for financial users

Wes McKinney

The next release of Apache Spark will be 2.0, marking a big milestone for the project. In this talk, I’ll cover how the community has grown to reach this point, and some of the major features in 2.0. The largest additions are performance improvements for Datasets, DataFrames and SQL through Project Tungsten, as well as a new Structured Streaming API that provides simpler and more powerful stream processing. I’ll also discuss a bit of what’s in the works for future versions.

Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0

Databricks

As the Apache Spark userbase grows, the developer community is working to adapt it for ever-wider use cases. 2014 saw fast adoption of Spark in the enterprise and major improvements in its performance, scalability and standard libraries. In 2015, we also want to make Spark accessible to a wider set of users, through new high-level APIs targeted at data science: machine learning pipelines, data frames, and R language bindings. In addition, we are defining extension points to let Spark grow as a platform, making it easy to plug in data sources, algorithms, and third-party packages. Like all work on Spark, these APIs are designed to plug seamlessly into existing Spark applications, giving users a unified platform for streaming, batch and interactive data processing.

New Directions for Spark in 2015 - Spark Summit East

Databricks

H2O World - H2O Rains with Databricks Cloud

Sri Ambati

A look ahead at spark 2.0

Databricks

Was ist angesagt? (20)

Distributed ML in Apache Spark

Enabling exploratory data science with Spark and R

New Developments in Spark

Spark Summit San Francisco 2016 - Ali Ghodsi Keynote

Spark Under the Hood - Meetup @ Data Science London

Announcing Databricks Cloud (Spark Summit 2014)

Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...

HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...

Optimizing Delta/Parquet Data Lakes for Apache Spark

Spark streaming state of the union

Spark Summit EU 2015: Reynold Xin Keynote

Introduction to Spark (Intern Event Presentation)

Spark Summit 2015 keynote: Making Big Data Simple with Spark

What to Expect for Big Data and Apache Spark in 2017

What's new in pandas and the SciPy stack for financial users

Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0

New Directions for Spark in 2015 - Spark Summit East

H2O World - H2O Rains with Databricks Cloud

A look ahead at spark 2.0

Andere mochten auch

Operational Tips for Deploying Spark

Databricks

Apache Spark Use case for Education Industry

Vinayak Agrawal

Cancer Outlier Profile Analysis using Apache Spark

Mahmoud Parsian

How Totango uses Apache Spark

Oren Raboy

Getting Apache Spark Customers to Production

Cloudera, Inc.

Kodu Game Lab e Project Spark

Fabrício Catae

SampleClean: Bringing Data Cleaning into the BDAS Stack

jeykottalam

Fighting Fraud with Apache Spark

Miklos Christine

Record Linkage, un cas d’utilisation en Spark ML par Alexis Seigneurin Le Record Linkage est le process qui consiste à trouver, dans un data set, les enregistrements qui représentent la même entité. Cette opération est particulièrement compliquée quand, comme nous, vous travaillez avec des données anonymisées. C’est là que le Machine Learning vient en renfort ! Nous avons implémenté un algorithme de Record Linkage en Spark SQL (DataFrames) et Spark ML plutôt que d’utiliser des règles statiques. Nous verrons le process de Feature Engineering, pourquoi nous avons dû étendre Spark DataFrames pour préserver des méta-données au travers du pipeline de traitement, et comment nous avons utilisé le Machine Learning pour réconcilier les enregistrements. Nous verrons enfin comment nous avons industrialisé cette application. Alexis Seigneurin : Développeur depuis 15 ans, j'attache beaucoup d'importance aux problématiques de traitement, d'analyse et de stockage de la donnée.Chez Ippon, j'interviens principalement sur des missions de conseil et d'architecture autour de technologies big data. Par ailleurs, j'anime la formation Spark chez Ippon.

Record linkage, a real use case with spark ml - Paris Spark meetup Dec 2015

Modern Data Stack France

Moa: Real Time Analytics for Data Streams

Albert Bifet

Breakthrough OLAP performance with Cassandra and Spark

Evan Chan

Traditionally, data warehouse platforms have been perceived as cost prohibitive, challenging to maintain and complex to scale. The combination of Apache Spark and Spark SQL – running on AWS – provides a fast, simple, and scalable way to build a new generation of data warehouses that revolutionizes how data scientists and engineers analyze their data sets. In this webinar you will learn how Databricks - a fully managed Spark platform hosted on AWS - integrates with variety of different AWS services, Amazon S3, Kinesis, and VPC. We’ll also show you how to build your own data warehousing platform in very short amount of time and how to integrate it with other tools such as Spark’s machine learning library and Spark streaming for real-time processing of your data.

Building a Turbo-fast Data Warehousing Platform with Databricks

Databricks

Consumer offset management in Kafka

Joel Koshy

Hadoop最新情報 - YARN, Omni, Drill, Impala, Shark, Vertica - MapR CTO Meetup 2014...

MapR Technologies Japan

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1FQYcP0. Gian Merlino presents the advantages, challenges, and best practices to deploying and maintaining lambda architectures in the real world, using the infrastructure at Metamarkets as a case study. Filmed at qconsf.com. Gian Merlino is a senior software engineer at Metamarkets, responsible for the infrastructure behind its data ingestion pipelines and is a committer on the Druid project.

Lambda Architectures in Practice

C4Media

Spark For Faster Batch Processing

Edureka!

Not Your Father's Database by Vida Ha

Spark Summit

Akka Streams are an implementation of the Reactive Streams specification (http://reactive-streams.org/), a joint effort that aims at standardizing the exchange of streams of data across asynchronous boundaries in a fully non-blocking way while providing flow control and mediating back pressure. In this presentation we go into the details of what this new abstraction can be used for and what the guiding principles are behind its development. We then focus on one prominent use-case which is the upcoming Akka HTTP module: a fully stream-enabled, reactive HTTP server and client implementation.

Akka Streams and HTTP

Roland Kuhn

Andere mochten auch (18)

Operational Tips for Deploying Spark

Apache Spark Use case for Education Industry

Cancer Outlier Profile Analysis using Apache Spark

How Totango uses Apache Spark

Getting Apache Spark Customers to Production

Kodu Game Lab e Project Spark

SampleClean: Bringing Data Cleaning into the BDAS Stack

Fighting Fraud with Apache Spark

Record linkage, a real use case with spark ml - Paris Spark meetup Dec 2015

Moa: Real Time Analytics for Data Streams

Breakthrough OLAP performance with Cassandra and Spark

Building a Turbo-fast Data Warehousing Platform with Databricks

Consumer offset management in Kafka

Hadoop最新情報 - YARN, Omni, Drill, Impala, Shark, Vertica - MapR CTO Meetup 2014...

Lambda Architectures in Practice

Spark For Faster Batch Processing

Not Your Father's Database by Vida Ha

Akka Streams and HTTP

Ähnlich wie Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data Architecture

During this Big Data Warehousing Meetup, Caserta Concepts and Databricks addressed the number one operational and analytic goal of nearly every organization today – to have complete view of every customer. Customer Data Integration (CDI) must be implemented to cleanse and match customer identities within and across various data systems. CDI has been a long-standing data engineering challenge, not just one of logic and complexity but also of performance and scalability. The speakers brought together best practice techniques with Apache Spark to achieve complete CDI. Speakers: Joe Caserta, President, Caserta Concepts Kevin Rasmussen, Big Data Engineer, Caserta Concepts Vida Ha, Lead Solutions Engineer, Databricks The sessions covered a series of problems that are adequately solved with Apache Spark, as well as those that are require additional technologies to implement correctly. Topics included: · Building an end-to-end CDI pipeline in Apache Spark · What works, what doesn’t, and how do we use Spark we evolve · Innovation with Spark including methods for customer matching from statistical patterns, geolocation, and behavior · Using Pyspark and Python’s rich module ecosystem for data cleansing and standardization matching · Using GraphX for matching and scalable clustering · Analyzing large data files with Spark · Using Spark for ETL on large datasets · Applying Machine Learning & Data Science to large datasets · Connecting BI/Visualization tools to Apache Spark to analyze large datasets internally The speakers also touched on data governance, on-boarding new data rapidly, how to balance rapid agility and time to market with critical decision support and customer interaction. They also shared examples of problems that Apache Spark is not optimized for. For more information on the services offered by Caserta Concepts, visit our website: http://casertaconcepts.com/

Not Your Father's Database by Databricks

Caserta

Slide 2 collecting, storing and analyzing big data

Trieu Nguyen

When Spark applications operate on distributed data coming from disparate data sources, they often have to directly query data sources external to Spark such as backing relational databases, or data warehouses. For that, Spark provides Data Source APIs, which are a pluggable mechanism for accessing structured data through Spark SQL. Data Source APIs are tightly integrated with the Spark Optimizer. They provide optimizations such as filter push down to the external data source and column pruning. While these optimizations significantly speed up Spark query execution, depending on the data source, they only provide a subset of the functionality that can be pushed down and executed at the data source. As part of our ongoing project to provide a generic data source push down API, this presentation will show our work related to join push down. An example is star-schema join, which can be simply viewed as filters applied to the fact table. Today, Spark Optimizer recognizes star-schema joins based on heuristics and executes star-joins using efficient left-deep trees. An alternative execution proposed by this work is to push down the star-join to the external data source in order to take advantage of multi-column indexes defined on the fact tables, and other star-join optimization techniques implemented by the relational data source.

Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...

Databricks

Agile data warehousing

Sneha Challa

Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)

Spark Summit

NoSQL databases

Filip Ilievski

Slides for the talk at AI in Production meetup: https://www.meetup.com/LearnDataScience/events/255723555/ Abstract: Demystifying Data Engineering With recent progress in the fields of big data analytics and machine learning, Data Engineering is an emerging discipline which is not well-defined and often poorly understood. In this talk, we aim to explain Data Engineering, its role in Data Science, the difference between a Data Scientist and a Data Engineer, the role of a Data Engineer and common concepts as well as commonly misunderstood ones found in Data Engineering. Toward the end of the talk, we will examine a typical Data Analytics system architecture.

Demystifying data engineering

Thang Bui (Bob)

Etl with apache impala by athemaster

Athemaster Co., Ltd.

The term "Data Lake" has become almost as overused and undescriptive as "Big Data". Many believe that centralizing datasets in HDFS makes a data lake, but then they struggle to realize any tangible value. This talk will redefine the "Data Lake" by describing four specific, key characteristics that we at Koverse have learned are crucial to successful enterprise data lake deployments. These characteristics are 1) indexing and search across all data sets, 2) interactive access for all users in the enterprise, 3) multi-level access control, and 4) integration with data science tools. These characteristics define a system that lets people realize value from their data versus getting lost in the hype. The talk will go on to provide a technical description of how we have integrated several projects, namely Apache Accumulo, Hadoop, and Spark, to implement an enterprise data lake with these key features.

One Large Data Lake, Hold the Hype

Jared Winick

One Large Data Lake, Hold the Hype

Koverse, Inc.

Bender kuszmaul tutorial-xldb12

Atner Yegorov

Data Structures and Algorithms for Big Databases

omnidba

The way we store and manage data is changing. In the old days, there were only a handful of file formats and databases. Now there are countless databases and numerous file formats. The methods by which we access the data has also increased in number. As R users, we often access and analyze data in highly inefficient ways. Big Data tech has solved some of those problems. This presentation will take attendees on a quick tour of the various relevant Big Data technologies. I’ll explain how these technologies fit together to form a stack for various data analysis uses cases. We’ll talk about what these technologies mean for the future of analyzing data with R. Even if you work with “small data” this presentation will still be of interest because some Big Data tech has a small data use case.

Big Data Technologies and Why They Matter To R Users

Adaryl "Bob" Wakefield, MBA

Hands On: Introduction to the Hadoop Ecosystem

Adaryl "Bob" Wakefield, MBA

Data modeling trends for analytics

Ike Ellis

No sq lv1_0

Tuan Luong

Splice machine-bloor-webinar-data-lakes

Edgar Alejandro Villegas

Speaker: Marc Fielding, Co-speaker: Maris Elsins. Oracle Database Appliance provides a robust, highly-available, cost-effective, and surprisingly scalable platform for database as a service environment. By leveraging Oracle Enterprise Manager's self-service features, databases can be provisioned on a self-service basis to a cluster of Oracle Database Appliance machines. Discover how multiple ODA devices can be managed together to provide both high availability and incremental, cost-effective scalability. Hear real-world lessons learned from successful database consolidation implementations.

Database as a Service on the Oracle Database Appliance Platform

Maris Elsins

The modern analytics architecture

Joseph D'Antoni

Connecting Hadoop and Oracle

Tanel Poder

Ähnlich wie Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data Architecture (20)

Not Your Father's Database by Databricks

Slide 2 collecting, storing and analyzing big data

Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...

Agile data warehousing

Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)

NoSQL databases

Demystifying data engineering

Etl with apache impala by athemaster

One Large Data Lake, Hold the Hype

Bender kuszmaul tutorial-xldb12

Data Structures and Algorithms for Big Databases

Big Data Technologies and Why They Matter To R Users

Hands On: Introduction to the Hadoop Ecosystem

Data modeling trends for analytics

No sq lv1_0

Splice machine-bloor-webinar-data-lakes

Database as a Service on the Oracle Database Appliance Platform

The modern analytics architecture

Connecting Hadoop and Oracle

Mehr von Databricks

DW Migration Webinar-March 2022.pptx

Databricks

The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse. Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today. Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow. This is an educational event. Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.

Data Lakehouse Symposium | Day 1 | Part 1

Databricks

Data Lakehouse Symposium | Day 1 | Part 2

Databricks

Data Lakehouse Symposium | Day 2

Databricks

Data Lakehouse Symposium | Day 4

Databricks

In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.

5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop

Databricks

Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale. At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including: Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal Performing data quality validations using libraries built to work with spark Dynamically generating pipelines that can be abstracted away from users Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time

Democratizing Data Quality Through a Centralized Platform

Databricks

Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data.. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.

Learn to Use Databricks for Data Science

Databricks

Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications. As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored. In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.

Why APM Is Not the Same As ML Monitoring

Databricks

Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.

The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix

Databricks

In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs. There are other ETL use cases where users want to change CPU and memory resources between stages, for instance there is data skew or perhaps the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the Tensorflow Keras API using GPUs. The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.

Stage Level Scheduling Improving Big Data and AI Integration

Databricks

In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks. Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model? The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity. The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters. In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.

Simplify Data Conversion from Spark to TensorFlow and PyTorch

Databricks

There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine pipelines on this infrastructure while effectively utilizing resources at disposal. In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). Following topics will be covered: – Understanding key traits of Apache Spark on Kubernetes- Things to know when running Apache Spark on Kubernetes such as autoscaling- Demonstrate running analytics pipelines on Apache Spark orchestrated with Apache Airflow on Kubernetes cluster.

Scaling your Data Pipelines with Apache Spark on Kubernetes

Databricks

Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark. Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations. Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.

Scaling and Unifying SciKit Learn and Apache Spark Pipelines

Databricks

In this talk about zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties about sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high-throughput, low-read latency and tunable write latency for serving machine learning features.We will also talk about a simple deployment strategy for correcting feature drift – due operations that are not “abelian groups”, that operate over change data.

Sawtooth Windows for Feature Aggregations

Databricks

We want to present multiple anti patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark.All examples presented are tried and tested in production at Scale at Adobe. The most common integration is spark-redis which interfaces with Redis as a Dataframe backing Store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high throughput applications in Spark. Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue · Why? o Custom queries on top a table; We load the data once and query N times · Why not Structured Streaming · Working Solution using Redis Niche 2 : Distributed Counters · Problems with Spark Accumulators · Utilize Redis Hashes as distributed counters · Precautions for retries and speculative execution · Pipelining to improve performance

Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink

Databricks

In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly, and cumbersome while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data. In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.

Re-imagine Data Monitoring with whylogs and Spark

Databricks

Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components. We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure. This allows us to introduce optimization rules that (i) reduce unnecessary computations by passing information between the data processing and ML operators (ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and (iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator. We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.

Raven: End-to-end Optimization of ML Prediction Queries

Databricks

Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis. Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them. Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy. This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.

Processing Large Datasets for ADAS Applications using Apache Spark

Databricks

At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is a bunch of complex ingestion of a mix of normalized and denormalized data with various linkage scenarios power by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements etc. We will go over how we built a cost effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences. What are we storing? Multi Source – Multi Channel Problem Data Representation and Nested Schema Evolution Performance Trade Offs with Various formats Go over anti-patterns used (String FTW) Data Manipulation using UDFs Writer Worries and How to Wipe them Away Staging Tables FTW Datalake Replication Lag Tracking Performance Time!

Massive Data Processing in Adobe Using Delta Lake

Databricks

Mehr von Databricks (20)

DW Migration Webinar-March 2022.pptx

Data Lakehouse Symposium | Day 1 | Part 1

Data Lakehouse Symposium | Day 1 | Part 2

Data Lakehouse Symposium | Day 2

Data Lakehouse Symposium | Day 4

5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop

Democratizing Data Quality Through a Centralized Platform

Learn to Use Databricks for Data Science

Why APM Is Not the Same As ML Monitoring

The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix

Stage Level Scheduling Improving Big Data and AI Integration

Simplify Data Conversion from Spark to TensorFlow and PyTorch

Scaling your Data Pipelines with Apache Spark on Kubernetes

Scaling and Unifying SciKit Learn and Apache Spark Pipelines

Sawtooth Windows for Feature Aggregations

Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink

Re-imagine Data Monitoring with whylogs and Spark

Raven: End-to-end Optimization of ML Prediction Queries

Processing Large Datasets for ADAS Applications using Apache Spark

Massive Data Processing in Adobe Using Delta Lake

Kürzlich hochgeladen

Call Girl Aurangabad Indira Call Now: 8617697112 Aurangabad Escorts Booking Contact Details WhatsApp Chat: +91-8617697112 Aurangabad Escort Service includes providing maximum physical satisfaction to their clients as well as engaging conversation that keeps your time enjoyable and entertaining. Plus they look fabulously elegant; making an impressionable. Independent Escorts Aurangabad understands the value of confidentiality and discretion - they will go the extra mile to meet your needs. Simply contact them via text messaging or through their online profiles; they'd be more than delighted to accommodate any request or arrange a romantic date or fun-filled night together. We provide –

(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7

Call Girls in Nagpur High Profile Call Girls

Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand Booking Contact Details :- WhatsApp Chat :- +91-7737669865 Call Girls In Model Towh +91-7737669865 !! Best Woman Seeking Man Call Girls Service, Escorts Service in Home Hotel in NCR 24 Hours Available Service Call Girls, Contact Us +91-7737669865 (Any Time. Any Where) Call Girls in , Noida, Gurgaon, Ghaziabad,Sexy Indian Female Escorts Service NCRWelcome To Escorts Service – An All Over New Very Sexy Hot Call Girls Agency Service Escorts In South NCR’s No. 1 High Profile Independent Female Escorts Service. We Provide Good Quality Educated Profile At #K09 Very Regnebal Price 100% Safe And Original.We Are Provide Escorts Service All OYO Hotels ,3*,4*,5* Star Hotel And Home Flat, Apartment. Guest-House. Services In -Call And Out – Call Both Are Services Available. 24Hrs. Any Time Any Where. In All Over Noida Gurgaon Ghaziabad Faridabad.More Information And Contact Profile Real Pic Visit Our Website City Wise Escorts Service Agency.Good Looking Cheap And Best Models Girls U Can Get Best Click On Link……Night Call Girls Now In Hotel Le Meridien Gurgaon Near Female Escort One Shot — 5000/in call (time 1 hour), 6000/out call Two shot with one girl — 8000/in call (time 2 hour), 10000/out call Body to body massage with sex- 8000/in call (time 1 hour) Full night Service for one person– 12000/in call, 13000/out call (shot limit 3-4 shots) Full night Service for more than 1 person — please contact Us —7737669865 We are available 24*7 all days of the year. Call us — 7737669865 Thank you for Visiting.

Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand

amitlee9823

Intze Overhead Water Tank Design by Working Stress - IS Method.pdf

Suman Jyoti

Call Girls In Mahipalpur (Delhi) Call or Whataap {+91-8377877756} Escorts Provide 24×7 Available With Room ☆☆TIMINGS 24 HOURS OPENS☆☆☆ Booking Now Gentleman Only:-Call Now Best High Class Normal Call Girls Escorts Service In Delhi NCR 24-7 Hours Available Service I provide In Delhi NCR Female Escorts Sex Service 100% Customers Satisfaction Guarantee VIP Profiles Top Grade Service 100% Cooperative All round Service ꧁❤8377877756❤꧂ InCall: – You Can Reach At Our Place in Delhi Our place Which Is Very Clean Hygienic 100% safe Accommodation OutCall: – Service For Out Call You Have To Come Pick The Girl From My Place. We Also Provide Door Step Services. Note: – Pic Collectors Time Passers And Bargainers Stay Away As We Respect The Value Of Your Money And Time And Expect The Same From You. 8377877756. Hygienic: – Full Ac Neat And Clean Rooms Available In Hotel 24 * 7 Hrs In Delhi NCR 8377877756. Place: – South Extension Nehru Place Saket Malviya Nagar Munirka Vasant Kunj Safdarjung Katwaria Sarai Lajpat Nagar Kalkaji Hauz Khas Mahipalpur Dwarka Karol Bagh Noida Gurgaon Faridabad All Outcall Only Hotel Service In Delhi Ncr 8377877756. We are providing. : – House Wife’s : – Private Independent House Wife’s : – Private Independent Collage Going Girls : – Corporate MNC Working Profiles : – Call Center Girls : – Live Band Girls : – Foreigners and Many More: – Independent Models Service type : – Incall Short 2500 : – Outcall Short 3500 : – Incall Night 8500 : – Outcall Night 95,00 For Pictures And Other Details Pls Whatsapp Me Otherwise Call Me Any Time Incall Outcall Both Are Services Available Door Step, House, Apartment, Guest House, Flate, All Star Hotel Available ꧁❤8377877756❤꧂ THANKS FOR VISITING. Booking 24×7 HRS

FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756

dollysharma2066

Double Revolving field theory-how the rotor develops torque

BhangaleSonal

Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For Sex At Your Doorstep Booking Contact Details WhatsApp Chat: +91-6297143586 pune Escort Service includes providing maximum physical satisfaction to their clients as well as engaging conversation that keeps your time enjoyable and entertaining. Plus they look fabulously elegant; making an impressionable. Independent Escorts pune understands the value of confidentiality and discretion - they will go the extra mile to meet your needs. Simply contact them via text messaging or through their online profiles; they'd be more than delighted to accommodate any request or arrange a romantic date or fun-filled night together. We provide - 02-may-2024(v.n)

Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...

tanu pandey

VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking Escorts Service Available Whatsapp SABANA ☎️ : [+91-7001035870] Escorts Service are always ready to make their clients happy. Their exotic looks and sexy personalities are sure to turn heads. You can enjoy with them, including massages and erotic encounters. Our area Escorts are young and sexy, so you can expect to have an exotic time with them. They are trained to satiate your naughty nerves and they can handle anything that you want. They are also intelligent, so they know how to make you feel comfortable and relaxed Independent Escorts Service They know all the sex positions and can satisfy you in any way that you desire. They can even give you erotic massages to help you relax before your session. This is essential, because a man who is stressed won’t be receptive to the pleasures of sex. They also know how to play with your sexy organs, so you’ll have plenty of foreplay and cuddling. P252024SS SERVICE ✅ ❣️ ⭐➡️HOT & SEXY MODELS // COLLEGE GIRLS HOUSE WIFE RUSSIAN , AIR HOSTES ,VIP MODELS . AVAILABLE FOR COMPLETE ENJOYMENT WITH HIGH PROFILE INDIAN MODEL AVAILABLE HOTEL & HOME ★ SAFE AND SECURE HIGH CLASS SERVICE AFFORDABLE RATE ★ SATISFACTION,UNLIMITED ENJOYMENT. ★ All Meetings are confidential and no information is provided to any one at any cost. ★ EXCLUSIVE PROFILes Are Safe and Consensual with Most Limits Respected ★ Service Available In: - HOME & HOTEL Star Hotel Service .In Call & Out call SeRvIcEs : ★ A-Level (star escort) ★ Strip-tease ★ BBBJ (Bareback Blowjob)Receive advanced sexual techniques in different mode make their life more pleasurable. ★ Spending time in hotel rooms ★ BJ (Blowjob Without a Condom) ★ Completion (Oral to completion) ★ Covered (Covered blowjob Without condom ★ANAL SERVICES.

VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking

dharasingh5698

VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking Escorts Service Available Whatsapp SABANA ☎️ : [+91-7001035870] Escorts Service are always ready to make their clients happy. Their exotic looks and sexy personalities are sure to turn heads. You can enjoy with them, including massages and erotic encounters. Our area Escorts are young and sexy, so you can expect to have an exotic time with them. They are trained to satiate your naughty nerves and they can handle anything that you want. They are also intelligent, so they know how to make you feel comfortable and relaxed Independent Escorts Service They know all the sex positions and can satisfy you in any way that you desire. They can even give you erotic massages to help you relax before your session. This is essential, because a man who is stressed won’t be receptive to the pleasures of sex. They also know how to play with your sexy organs, so you’ll have plenty of foreplay and cuddling. P252024SS SERVICE ✅ ❣️ ⭐➡️HOT & SEXY MODELS // COLLEGE GIRLS HOUSE WIFE RUSSIAN , AIR HOSTES ,VIP MODELS . AVAILABLE FOR COMPLETE ENJOYMENT WITH HIGH PROFILE INDIAN MODEL AVAILABLE HOTEL & HOME ★ SAFE AND SECURE HIGH CLASS SERVICE AFFORDABLE RATE ★ SATISFACTION,UNLIMITED ENJOYMENT. ★ All Meetings are confidential and no information is provided to any one at any cost. ★ EXCLUSIVE PROFILes Are Safe and Consensual with Most Limits Respected ★ Service Available In: - HOME & HOTEL Star Hotel Service .In Call & Out call SeRvIcEs : ★ A-Level (star escort) ★ Strip-tease ★ BBBJ (Bareback Blowjob)Receive advanced sexual techniques in different mode make their life more pleasurable. ★ Spending time in hotel rooms ★ BJ (Blowjob Without a Condom) ★ Completion (Oral to completion) ★ Covered (Covered blowjob Without condom ★ANAL SERVICES.

VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking

dharasingh5698

Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking Booking Now open +91- 7737669865 Why you Choose Us- +91- 7737669865 HOT⇄ 7737669865 Mr ashu ji Call Mr ashu Ji +91- 7737669865 (V020524]N) 𝐇𝐨𝐭𝐞𝐥 𝐑𝐨𝐨𝐦𝐬 𝐈𝐧𝐜𝐥𝐮𝐝𝐢𝐧𝐠 𝐑𝐚𝐭𝐞 𝐒𝐡𝐨𝐭𝐬/𝐇𝐨𝐮𝐫𝐲🆓 .█▬█⓿▀█▀ 𝐈𝐍𝐃𝐄𝐏𝐄𝐍𝐃𝐄𝐍𝐓 𝐆𝐈𝐑𝐋 𝐕𝐈𝐏 𝐄𝐒𝐂𝐎𝐑𝐓 Hello Guys ! High Profiles young Beauties and Good Looking standard Profiles Available , Enquire Now if you are interested in Hifi Service and want to get connect with someone who can understand your needs. Service offers you the most beautiful High Profile sexy independent female Escorts in genuine ✔✔✔ To enjoy with hot and sexy girls ✔✔✔ ★providing:- • Models • vip Models • Russian Models

Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking

roncy bisnoi

ONLINE FOOD ORDER SYSTEM is a website designed primarily for use in the food delivery industry. This system will allow hotels and restaurants to increase scope of business by reducing the labor cost involved. The system also allows to quickly and easily manage an online menu which customers can browse and use to place orders with just few clicks. Restaurant employees then use these orders through an easy to navigate graphical interface for efficient processing.

ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf

Kamal Acharya

Welcome to the April edition of WIPAC Monthly, the magazine brought to you by Water Industry Process Automation & Control. In this month's edition, along with the latest news from the industry we have articles on: The use of artificial intelligence and self-service platforms to improve water sustainability A feature article on measuring wastewater spills An article on the National Underground Asset Register Have a good month, Oliver

Water Industry Process Automation & Control Monthly - April 2024

Water Industry Process Automation & Control

notes on Evolution Of Analytic Scalability.ppt

MsecMca

VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to 25K High Profile Escorts In Pune Booking Now open +91- 8005736733 Why you Choose Us- +91- 8005736733 HOT⇄ 8005736733 Mr ashu ji Call Mr ashu Ji +91- 8005736733 (V030524]N) 𝐇𝐨𝐭𝐞𝐥 𝐑𝐨𝐨𝐦𝐬 𝐈𝐧𝐜𝐥𝐮𝐝𝐢𝐧𝐠 𝐑𝐚𝐭𝐞 𝐒𝐡𝐨𝐭𝐬/𝐇𝐨𝐮𝐫𝐲🆓 .█▬█⓿▀█▀ 𝐈𝐍𝐃𝐄𝐏𝐄𝐍𝐃𝐄𝐍𝐓 𝐆𝐈𝐑𝐋 𝐕𝐈𝐏 𝐄𝐒𝐂𝐎𝐑𝐓 Hello Guys ! High Profiles young Beauties and Good Looking standard Profiles Available , Enquire Now if you are interested in Hifi Service and want to get connect with someone who can understand your needs. Service offers you the most beautiful High Profile sexy independent female Escorts in genuine ✔✔✔ To enjoy with hot and sexy girls ✔✔✔ ★providing:- • Models • vip Models • Russian Models • Foreigner Models • TV Actress and Celebrities • Receptionist • Air Hostess • Call Center Working Girls/Women • Hi-Tech Co. Girls/Women • Housewife

VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...

SUHANI PANDEY

African Journal of Biological Sciences is an International peer-reviewed, Open Access journal that publishes original research articles as well as review articles in all areas of Biological Sciences. It operates a fully open access publishing model which allows open global access to its published content. This model is supported through Article Processing Charges. For more information on Article Processing charges click here. Its scope embraces Animal Sciences, Biochemistry, Bioinformatics, Biotechnology, Botany, Cell Biology, Developmental Biology, Ecology, Environmental Sciences, Ethno Medicine, Food Science, Freshwater Biology, Genetics, Immunology, Marine Biology, Microbiology, Molecular Biology, Physiology, Plant Sciences, Structural Biology,Toxicology,Zoology etc. It is essential that authors prepare their manuscripts according to established specifications. Failure to follow them may result in papers being delayed or rejected. Therefore, contributors are strongly encouraged to read the author guidelines carefully before preparing a manuscript for submission. The manuscripts should be checked carefully for grammatical, punctuation errors. All papers are subjected to peer review. All articles published in this journal represent the opinion of the authors and not reflect the official policy of the Journal of African Journal of Biological Sciences

Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...

Christo Ananth

Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Booking Booking Now open +91- 7737669865 Why you Choose Us- +91- 7737669865 HOT⇄ 7737669865 Mr ashu ji Call Mr ashu Ji +91- 7737669865 (V020524]N) 𝐇𝐨𝐭𝐞𝐥 𝐑𝐨𝐨𝐦𝐬 𝐈𝐧𝐜𝐥𝐮𝐝𝐢𝐧𝐠 𝐑𝐚𝐭𝐞 𝐒𝐡𝐨𝐭𝐬/𝐇𝐨𝐮𝐫𝐲🆓 .█▬█⓿▀█▀ 𝐈𝐍𝐃𝐄𝐏𝐄𝐍𝐃𝐄𝐍𝐓 𝐆𝐈𝐑𝐋 𝐕𝐈𝐏 𝐄𝐒𝐂𝐎𝐑𝐓 Hello Guys ! High Profiles young Beauties and Good Looking standard Profiles Available , Enquire Now if you are interested in Hifi Service and want to get connect with someone who can understand your needs. Service offers you the most beautiful High Profile sexy independent female Escorts in genuine ✔✔✔ To enjoy with hot and sexy girls ✔✔✔ ★providing:- • Models • vip Models • Russian Models

Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...

roncy bisnoi

Call Girl Meerut Indira Call Now: 8617697112 Meerut Escorts Booking Contact Details WhatsApp Chat: +91-8617697112 Meerut Escort Service includes providing maximum physical satisfaction to their clients as well as engaging conversation that keeps your time enjoyable and entertaining. Plus they look fabulously elegant; making an impressionable. Independent Escorts Meerut understands the value of confidentiality and discretion - they will go the extra mile to meet your needs. Simply contact them via text messaging or through their online profiles; they'd be more than delighted to accommodate any request or arrange a romantic date or fun-filled night together. We provide –

(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7

Call Girls in Nagpur High Profile Call Girls

Online banking management system project.pdf

Kamal Acharya

The Educational Administration: Theory and Practice publishes prominent empirical and conceptual articles focused on timely and critical leadership and policy issues of educational organizations. The journal embraces traditional and emergent research paradigms, methods, and issues. The journal particularly promotes the publication of rigorous and relevant scholarly work that enhances linkages among and utility for educational policy, practice, and research arenas. The goal of the editorial team and the journal’s editorial board is to promote sound scholarship and a clear and continuing dialogue among scholars and practitioners from a broad spectrum of education. Educational Administration: Theory and Practice presents prominent empirical and conceptual articles focused on timely and critical leadership and policy issues facing educational organizations. As an editorial team, we embrace traditional and emergent theoretical frameworks, research methods, and topics. We particularly promote the publication of rigorous and relevant scholarly work with utility for educational policy, practice, and research. The journal’s primary focus is on studies of educational leadership, organizations, leadership development, and policy as they relate to elementary and secondary levels of education. Examinations of leadership and policy that fall outside K-12 are considered insofar as there are meaningful connections to the K-12 arena (e.g., college pipeline). International comparative investigations are welcome to the extent they have implications for a broad audience.s.

Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...

Christo Ananth

N-Grade deals with the maintenance of university, department, faculty, student information within the university. N-Grade is an automation system, which is used to store the department, faculty, student, courses and information of a university. Starting from registration of a new student in the university, it maintains all the details regarding the attendance and marks of the students. The project deals with retrieval of information through an INTRANET based campus wide portal. It collects related information from all the departments of an organization and maintains files, which are used to generate reports in various forms to measure individual and overall performance of the students.

University management System project report..pdf

Kamal Acharya

Unleashing the Power of the SORA AI lastest leap

RishantSharmaFr

Kürzlich hochgeladen (20)

(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7

Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand

Intze Overhead Water Tank Design by Working Stress - IS Method.pdf

FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756

Double Revolving field theory-how the rotor develops torque

Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...

VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking

VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking

Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking

ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf

Water Industry Process Automation & Control Monthly - April 2024

notes on Evolution Of Analytic Scalability.ppt

VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...

Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...

Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...

(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7

Online banking management system project.pdf

Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...

University management System project report..pdf

Unleashing the Power of the SORA AI lastest leap

Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data Architecture

1. Not Your Father’s Database: How to Use Apache Spark Properly   in Your Big Data Architecture Spark Summit East 2016

2. Not Your Father’s Database: How to Use Apache Spark Properly   in Your Big Data Architecture Spark Summit East 2016

3. About Me 2005 Mobile Web & Voice Search 3

4. About Me 2005 Mobile Web & Voice Search 4 2012 Reporting & Analytics

5. About Me 2005 Mobile Web & Voice Search 5 2012 Reporting & Analytics 2014 Solutions Engineering

6. This system talks like a SQL Database… Is this your Spark infrastructure? 6 HDFS SQL

7. But the performance is very different… Is this your Spark infrastructure? 7 SQL HDFS

8. Just in Time Data Warehouse w/ Spark HDFS

9. Just in Time Data Warehouse w/ Spark HDFS

10. Just in Time Data Warehouse w/ Spark and more… HDFS

11. 11 Know when to use other data stores   besides file systems Today’s Goal

12. Good: General Purpose Processing Types of Data Sets to Store in File Systems: • Archival Data • Unstructured Data • Social Media and other web datasets • Backup copies of data stores 12

13. Types of workloads • Batch Workloads • Ad Hoc Analysis – Best Practice: Use in memory caching • Multi-step Pipelines • Iterative Workloads 13 Good: General Purpose Processing

14. Benefits: • Inexpensive Storage • Incredibly flexible processing • Speed and Scale 14 Good: General Purpose Processing

15. Bad: Random Access sqlContext.sql( “select * from my_large_table where id=2I34823”) Will this command run in Spark? 15

16. Bad: Random Access sqlContext.sql( “select * from my_large_table where id=2I34823”) Will this command run in Spark? Yes, but it’s not very efficient — Spark may have   to go through all your files to find your row. 16

17. Bad: Random Access Solution: If you frequently randomly access your data, use a database. • For traditional SQL databases, create an index   on your key column. • Key-Value NOSQL stores retrieves the value   of a key efficiently out of the box. 17

18. Bad: Frequent Inserts sqlContext.sql(“insert into TABLE myTable select fields from my2ndTable”) Each insert creates a new file: • Inserts are reasonably fast. • But querying will be slow… 18

19. Bad: Frequent Inserts Solution: • Option 1: Use a database to support the inserts. • Option 2: Routinely compact your Spark SQL table files. 19

20. Good: Data Transformation/ETL Use Spark to splice and dice your data files any way: File storage is cheap: Not an “Anti-pattern” to duplicately store your data. 20

21. Bad: Frequent/Incremental Updates Update statements — not supported yet. Why not? • Random Access: Locate the row(s) in the files. • Delete & Insert: Delete the old row and insert a new one. • Update: File formats aren’t optimized for updating rows. Solution: Many databases support efficient update operations. 21

22. Use Case: Up-to-date, live views of your SQL tables. Tip: Use ClusterBy for fast joins or Bucketing with 2.0. Bad: Frequent/Incremental Updates 22 Incremental SQL Query Database Snapshot +

23. Good: Connecting BI Tools Tip: Cache your tables for optimal performance. 23 HDFS

24. Bad: External Reporting w/ load Too many concurrent requests will overload Spark. 24 HDFS

25. Solution: Write out to a DB to handle load. Bad: External Reporting w/ load 25 HDFS DB

26. Good: Machine Learning & Data Science Use MLlib, GraphX and Spark packages for machine learning and data science. Benefits: • Built in distributed algorithms. • In memory capabilities for iterative workloads. • Data cleansing, featurization, training, testing, etc. 26

27. Bad: Searching Content w/ load sqlContext.sql(“select * from mytable where name like '%xyz%'”) Spark will go through each row to find results. 27

28. Thank you

Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data Architecture

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Andere mochten auch

Andere mochten auch (18)

Ähnlich wie Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data Architecture

Ähnlich wie Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data Architecture (20)

Mehr von Databricks

Mehr von Databricks (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data Architecture