A Perfect Hive Query for a Perfect Meeting


DataWorks Summit

Technology, Business

Adam Kawa

A deal was made!

Martin will invite Adam and Timbuktu, my favourite Swedish artist, for a beer or coke or whatever to drink *
by Martin

Question
Answers

Data will tell the truth!

Why?
by Adam

Introduction

HiveQL

A line where I may have a bug?!

HiveQL

Verbose and complex Java code

For Each Line

track.txt
user.txt
stream.txt
expected.txt

Bee test
Be happy!

HiveQL

Threshold

Try and see

HiveQL

2 MapReduce jobs in total

Runs many map joins in a map-only job [HIVE-3784]
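
A minimal sketch of the session settings commonly associated with this optimization. HIVE-3784 introduced `hive.auto.convert.join.noconditionaltask` and its size limit, which let Hive fold several map joins against small tables into one map-only job. The size value below is illustrative and not taken from the slides; defaults and behaviour vary by Hive version.

```sql
-- Let Hive convert common joins to map joins automatically.
SET hive.auto.convert.join=true;

-- Convert without a conditional backup task, so consecutive map joins
-- can be combined into one map-only job (setting introduced by HIVE-3784).
SET hive.auto.convert.join.noconditionaltask=true;

-- Combined size limit, in bytes, for the small tables held in memory.
-- 10000000 is an illustrative value, not one from the talk.
SET hive.auto.convert.join.noconditionaltask.size=10000000;
```

Whether a given query actually collapses into a map-only job is easiest to confirm by reading its `EXPLAIN` output before and after changing these settings.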

Runs as a single MR job [HIVE-3952]
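
A hedged sketch of how this merge could be switched on. The property usually linked to HIVE-3952 is `hive.optimize.mapjoin.mapreduce`, which asks Hive to merge a map-only map-join job into the MapReduce job that follows it. It appears to exist only in some Hive releases (around 0.11 and 0.12), so treat the snippet as an assumption to verify against the configuration reference for the version in use.

```sql
-- Automatic map-join conversion must be on for the merge to apply.
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask=true;

-- Merge a map-only map-join job into the following MapReduce job
-- (assumed property name from HIVE-3952; not available in every release).
SET hive.optimize.mapjoin.mapreduce=true;
```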

2 MapReduce jobs in total

HiveQL

My query generates a small amount of intermediate data

2 months of data
50 min 2 sec
10th place
?

Changes are needed!

File Format

16x

3.5x

32x

Computation

1.4x
2.4x

The more congested the queue/cluster, the bigger the benefit of reusing

No scheduling overhead to run a new Reduce task

Thinner tasks help to avoid stragglers

Finished within 1.5 sec.
Warm!

Feature

1.4x

14 months of data
10 min 11 sec
?

Results

That’s all!

-
-
-
-
-

-
-
-

A Perfect Hive Query for a Perfect Meeting

A Perfect Hive Query for a Perfect Meeting

Weitere ähnliche Inhalte

Mehr von DataWorks Summit

Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL). Format: An introductory lecture on several supervised and unsupervised ML techniques followed by light introduction to DL and short discussion what is current state-of-the-art. Several python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW). Objective: To provide a quick and short hands-on introduction to ML with python’s scikit-learn library. The environment in CDSW is interactive and the step-by-step guide will walk you through setting up your environment, to exploring datasets, training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models. Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud, no installation needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1hr in). Basic knowledge of python highly recommended.

Data Science Crash Course

Data Science Crash Course

Data Science Crash Course

DataWorks Summit

In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used are the specific durability requirements of HBase's write-ahead log (WAL) and HDFS providing that guarantee correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort. This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.

Floating on a RAFT: HBase Durability with Apache Ratis

Floating on a RAFT: HBase Durability with Apache Ratis

Floating on a RAFT: HBase Durability with Apache Ratis

DataWorks Summit

Utilizing Apache NiFi we read various open data REST APIs and camera feeds to ingest crime and related data real-time streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables as well as Hive external tables to HBase. Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs. Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables. Resources: https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html

Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi

Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi

Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi

DataWorks Summit

Whilst HBase is the most logical answer for use cases requiring random, realtime read/write access to Big Data, it may not be so trivial to design applications that make most of its use, neither the most simple to operate. As it depends/integrates with other components from Hadoop ecosystem (Zookeeper, HDFS, Spark, Hive, etc) or external systems ( Kerberos, LDAP), and its distributed nature requires a "Swiss clockwork" infrastructure, many variables are to be considered when observing anomalies or even outages. Adding to the equation there's also the fact that HBase is still an evolving product, with different release versions being used currently, some of those can carry genuine software bugs. On this presentation, we'll go through the most common HBase issues faced by different organisations, describing identified cause and resolution action over my last 5 years supporting HBase to our heterogeneous customer base.

HBase Tales From the Trenches - Short stories about most common HBase operati...

HBase Tales From the Trenches - Short stories about most common HBase operati...

HBase Tales From the Trenches - Short stories about most common HBase operati...

DataWorks Summit

LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.

Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...

Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...

Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...

DataWorks Summit

Managing the Dewey Decimal System

Managing the Dewey Decimal System

Managing the Dewey Decimal System

DataWorks Summit

Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL. Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist). In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.

Practical NoSQL: Accumulo's dirlist Example

Practical NoSQL: Accumulo's dirlist Example

Practical NoSQL: Accumulo's dirlist Example

DataWorks Summit

Data serves as the platform for decision-making at Uber. To facilitate data driven decisions, many datasets at Uber are ingested in a Hadoop Data Lake and exposed to querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber. Data ingestion, at its most basic form, is about organizing data to balance efficient reading and writing of newer data. Data organization for efficient reading involves factoring in query patterns to partition data to ensure read amplification is low. Data organization for efficient writing involves factoring the nature of input data - whether it is append only or updatable. At Uber we ingest terabytes of many critical tables such as trips that are updatable. These tables are fundamental part of Uber's data-driven solutions, and act as the source-of-truth for all the analytical use-cases across the entire company. Datasets such as trips constantly receive updates to the data apart from inserts. To ingest such datasets we need a critical component that is responsible for bookkeeping information of the data layout, and annotates each incoming change with the location in HDFS where this data should be written. This component is called as Global Indexing. Without this component, all records get treated as inserts and get re-written to HDFS instead of being updated. This leads to duplication of data, breaking data correctness and user queries. This component is key to scaling our jobs where we are now handling greater than 500 billion writes a day in our current ingestion systems. This component will need to have strong consistency and provide large throughputs for index writes and reads. At Uber, we have chosen HBase to be the backing store for the Global Indexing component and is a critical component in allowing us to scaling our jobs where we are now handling greater than 500 billion writes a day in our current ingestion systems. 
In this talk, we will discuss data@Uber and expound more on why we built the global index using Apache Hbase and how this helps to scale out our cluster usage. We’ll give details on why we chose HBase over other storage systems, how and why we came up with a creative solution to automatically load Hfiles directly to the backend circumventing the normal write path when bootstrapping our ingestion tables to avoid QPS constraints, as well as other learnings we had bringing this system up in production at the scale of data that Uber encounters daily.

HBase Global Indexing to support large-scale data ingestion at Uber

HBase Global Indexing to support large-scale data ingestion at Uber

HBase Global Indexing to support large-scale data ingestion at Uber

DataWorks Summit

Recently, Apache Phoenix has been integrated with Apache (incubator) Omid transaction processing service, to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions. These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.

Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix

Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix

Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix

DataWorks Summit

Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real-time. This is a challenging endeavor when considering the variety of data sources which need to be collected and analyzed. Everything from application logs, network events, authentications systems, IOT devices, business events, cloud service logs, and more need to be taken into consideration. In addition, multiple data formats need to be transformed and conformed to be understood by both humans and ML/AI algorithms. To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows with 95% doing real-time ETL and handles over 20 billion events per day. In this session learn from Aetna’s experience building an edge to AI high-speed data pipeline with Apache NiFi.

Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi

Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi

Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi

DataWorks Summit

Supporting Apache HBase : Troubleshooting and Supportability Improvements

Supporting Apache HBase : Troubleshooting and Supportability Improvements

Supporting Apache HBase : Troubleshooting and Supportability Improvements

DataWorks Summit

In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”

Security Framework for Multitenant Architecture

Security Framework for Multitenant Architecture

Security Framework for Multitenant Architecture

DataWorks Summit

Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores. With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as Geospatial analytics at scale and the project roadmap going forward.

Presto: Optimizing Performance of SQL-on-Anything Engine

Presto: Optimizing Performance of SQL-on-Anything Engine

Presto: Optimizing Performance of SQL-on-Anything Engine

DataWorks Summit

Specialized tools for machine learning development and model governance are becoming essential. MlFlow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code in the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results will be logged automatically as a byproduct of those lines of code being added, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub , almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo MlFlow Tracking , Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MlFlow on-prem or in the cloud.

Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...

Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...

Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...

DataWorks Summit

Twitter's Data Platform is built using multiple complex open source and in house projects to support Data Analytics on hundreds of petabytes of data. Our platform support storage, compute, data ingestion, discovery and management and various tools and libraries to help users for both batch and realtime analytics. Our DataPlatform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we were scaling our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use cloud as another datacenter. We walk through our evaluation process, challenges we faced supporting data analytics at Twitter scale on cloud and present our current solution. Extending Twitter's Data platform to cloud was complex task which we deep dive in this presentation.

Extending Twitter's Data Platform to Google Cloud

Extending Twitter's Data Platform to Google Cloud

Extending Twitter's Data Platform to Google Cloud

DataWorks Summit

At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.

Event-Driven Messaging and Actions using Apache Flink and Apache NiFi

Event-Driven Messaging and Actions using Apache Flink and Apache NiFi

Event-Driven Messaging and Actions using Apache Flink and Apache NiFi

DataWorks Summit

Companies are increasingly moving to the cloud to store and process data. One of the challenges companies have is in securing data across hybrid environments with easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both in on-premise as well as in cloud environments. We will go into details into the challenges of hybrid environment and how Ranger can solve it. We will also talk through how companies can further enhance the security by leveraging Ranger to anonymize or tokenize data while moving into the cloud and de-anonymize dynamically using Apache Hive, Apache Spark or when accessing data from cloud storage systems. We will also deep dive into the Ranger’s integration with AWS S3, AWS Redshift and other cloud native systems. We will wrap it up with an end to end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data and track where data is flowing.

Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger

Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger

Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger

DataWorks Summit

Advanced Big Data Processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of the Non-Volatile Memory (NVM) and NVM express (NVMe) based SSD, these designs along with the default Big Data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present, NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.

Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...

Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...

Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...

DataWorks Summit

Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as: ● Optimizing merchandising execution, in-stocks and sell-thru ● Enhancing operational efficiencies, enable real-time customer engagement ● Enhancing loss prevention capabilities, response time ● Creating frictionless experiences for shoppers Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry. We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey. Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables. We will cover the basics of object detection, then move into the advanced processing of images describing the possible ways that a retail store of the near future could operate. Identifying various storefront situations by having a deep learning system attached to a camera stream. Such things as; identifying item stocks on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance. We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing. 
Finally, we will cover the various technologies that are powering these applications today. Deep learning tools for research and development. Production tools to distribute that intelligence to an entire inventory of all the cameras situation around a retail location. Tools for exploring and understanding the new data streams produced by the computer vision systems. By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.

Computer Vision: Coming to a Store Near You

Computer Vision: Coming to a Store Near You

Computer Vision: Coming to a Store Near You

DataWorks Summit

Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires an ideal solution that both scales with data size and optimizes for individual gene or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short read and long read sequencing technologies. It achieved a near linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modifications while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from the next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.

Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark

Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark

Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark

DataWorks Summit

Mehr von DataWorks Summit (20)

Data Science Crash Course

Data Science Crash Course

Data Science Crash Course

Floating on a RAFT: HBase Durability with Apache Ratis

Floating on a RAFT: HBase Durability with Apache Ratis

Floating on a RAFT: HBase Durability with Apache Ratis

Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi

Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi

Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi

HBase Tales From the Trenches - Short stories about most common HBase operati...

HBase Tales From the Trenches - Short stories about most common HBase operati...

HBase Tales From the Trenches - Short stories about most common HBase operati...

Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...

Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...

Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...

Managing the Dewey Decimal System

Managing the Dewey Decimal System

Managing the Dewey Decimal System

Practical NoSQL: Accumulo's dirlist Example

Practical NoSQL: Accumulo's dirlist Example

Practical NoSQL: Accumulo's dirlist Example

HBase Global Indexing to support large-scale data ingestion at Uber

HBase Global Indexing to support large-scale data ingestion at Uber

HBase Global Indexing to support large-scale data ingestion at Uber

Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix

Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix

Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix

Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi

Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi

Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi

Supporting Apache HBase : Troubleshooting and Supportability Improvements

Supporting Apache HBase : Troubleshooting and Supportability Improvements

Supporting Apache HBase : Troubleshooting and Supportability Improvements

Security Framework for Multitenant Architecture

Security Framework for Multitenant Architecture

Security Framework for Multitenant Architecture

Presto: Optimizing Performance of SQL-on-Anything Engine

Presto: Optimizing Performance of SQL-on-Anything Engine

Presto: Optimizing Performance of SQL-on-Anything Engine

Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...

Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...

Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...

Extending Twitter's Data Platform to Google Cloud

Extending Twitter's Data Platform to Google Cloud

Extending Twitter's Data Platform to Google Cloud

Event-Driven Messaging and Actions using Apache Flink and Apache NiFi

Event-Driven Messaging and Actions using Apache Flink and Apache NiFi

Event-Driven Messaging and Actions using Apache Flink and Apache NiFi

Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger

Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger

Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger

Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...

Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...

Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...

Computer Vision: Coming to a Store Near You

Computer Vision: Coming to a Store Near You

Computer Vision: Coming to a Store Near You

Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark

Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark

Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark

A Perfect Hive Query for a Perfect Meeting

2. A deal was made!

6. Martin will invite Adam and Timbuktu, my favourite Swedish artist, for a beer or coke or whatever to drink * by Martin

9. Question Answers

10. Data will tell the truth!

12. Why? by Adam

14. Introduction

30. A line where I may have a bug?!

32. Verbose and complex Java code

41. For Each Line

42. For Each Line

44. user.txt track.txt

45. stream.txt user.txt track.txt

46. expected.txt stream.txt

52. Bee test. Be happy!

59. ✓ Threshold

60. ✗ Threshold

61. ✗ Try and see

66. 2 MapReduce jobs in total

67. Runs many Map joins in a Map-Only job [HIVE-3784]

73. Runs as a single MR job [HIVE-3952]

74. 2 MapReduce jobs in total

81. My query generates a small amount of intermediate data

88. 2 months of data 50 min 2 sec 10th place ?

89. Changes are needed!

90. File Format

98. Computation

101. 8x

106. The more congested the queue/cluster, the bigger the benefits of reusing. [axis label: Time]

107. No scheduling overhead to run a new Reduce task. [axis label: Time]

108. Thinner tasks make it possible to avoid stragglers. [axis label: Time]

109. Finished within 1.5 sec. Warm!

123. 14 months of data 10 min 11 sec ?

127. That’s all!