useR tutorial part 1: Introduction to SparkR

Databricks

Presentation given at useR 2016 at http://user2016.org/tutorials/11.html

Introduction to SparkR
Shivaram Venkataraman, Hossein Falaki

Big Data & R
DataFrames
Visualization
Libraries
Data+

Big Data & R: Challenges
Data access: HDFS, Hive
Capacity: Single machine memory
Parallelism: Single thread

Apache Spark
Engine for large-scale data processing
Fast, Easy to Use
Runs Everywhere
EC2, clusters, laptop etc.

Speed
Scalable
Flexible
Statistics
Visualization
DataFrames
SparkR

Big Data & R: Patterns
Big Data, Small Learning
Partition Aggregate
Large Scale Machine Learning

1. Big Data, Small Learning
Data
Cleaning
Filtering
Aggregation
Collect
Subset
DataFrames
Visualization
Libraries

1. Big Data, Small Learning
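# Load the JSON file as a distributed SparkR DataFrame
songs <- read.df("songs.json", "json")
# Filter on the cluster, keeping only songs released after 2000
newSongs <- filter(songs, songs$year > 2000)
# Collect the smaller result into a local data.frame for plotting
ggplot(collect(newSongs))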

2. Partition Aggregate
Data
Params
Best Model
Parameter Tuning

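library(MASS)  # provides lm.ridge
params <- c(1e-3, 1e-1, 1e2)
data <- read.csv("t.csv")
# Fit one ridge regression per value of the penalty parameter
train <- function(prm) {
  lm.ridge(y ~ x + z, data = data, lambda = prm)
}
lapply(params, train)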

3. Large Scale Machine Learning
Data Featurize Learning Model

3. Large Scale Machine Learning
Data Featurize Learning Model
training <- read.csv("t.csv")
# Fit on the data frame that was just loaded
model <- glm(delay ~ Distance + Dest, family = "gaussian", data = training)
summary(model)

Big Data & R
Big Data, Small Learning
Partition Aggregate
Large Scale Machine Learning
SparkR: Unified approach

SparkR DataFrames
people <- read.df("people.json", "json")
# Compute the average age as a column expression on the DataFrame
avgAge <- select(people, avg(people$age))
head(avgAge)
Number of data sources
Column Functions, SQL
Support for R UDFs

Large Scale Machine Learning
Integration with MLlib
Key Features
R-like formulas
Model statistics
model <- glm(a ~ b + c, data = df)
summary(model)

Partition Aggregate
spark.lapply: Simple, parallel API
Ex: Parameter tuning, Model Averaging
Include existing R packages
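
The parameter-tuning use case above can be sketched in plain R. This is a minimal illustration, not the tutorial's own code: the synthetic data, the `ridge_fit`, `train` and `tune` helpers, and the closed-form ridge solver (used here in place of `MASS::lm.ridge` so the block has no package dependencies) are all assumptions. The `apply_fn` argument defaults to base `lapply`; with an active SparkR session (Spark 2.0 or later), passing `spark.lapply` instead should run the same per-parameter tasks on the cluster, provided the function and the data it closes over can be shipped to the workers.

```r
# Synthetic data (hypothetical): y depends linearly on x and z plus noise.
set.seed(42)
n <- 200
x <- rnorm(n)
z <- rnorm(n)
y <- 2 * x - 1 * z + rnorm(n, sd = 0.5)
X <- cbind(x, z)

# Hold out the last 50 rows for validation.
train_idx <- 1:150
test_idx <- 151:200

# Closed-form ridge regression without intercept: (X'X + lambda I)^-1 X'y.
ridge_fit <- function(X, y, lambda) {
  p <- ncol(X)
  solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
}

# Fit one model for a given lambda and return its validation error.
train <- function(lambda) {
  beta <- ridge_fit(X[train_idx, ], y[train_idx], lambda)
  pred <- X[test_idx, ] %*% beta
  list(lambda = lambda, mse = mean((y[test_idx] - pred)^2))
}

# Partition: one task per parameter. Aggregate: keep the best result.
tune <- function(params, apply_fn = lapply) {
  results <- apply_fn(params, train)
  mses <- sapply(results, function(r) r$mse)
  results[[which.min(mses)]]
}

params <- c(1e-3, 1e-1, 1e2)
best <- tune(params)                    # local run with base lapply
# best <- tune(params, spark.lapply)    # assumes library(SparkR) and sparkR.session()
print(best$lambda)
```

Keeping the apply function as an argument means the training logic can be checked on a laptop before it is handed to a cluster.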

SparkR Status
Open source -- Part of Apache Spark
More than 60 committers from UC Berkeley, Databricks, IBM, Intel, Alteryx, etc.
Contributions welcome!

Tutorial Outline
Part 1: Data Exploration
• ETL: Data loading, schema
• Exploration: Filter, clean, aggregate etc.
• Visualization: Integration with ggplot
Part 2: Advanced Analytics (After the break)

Tutorial Setup
Each user gets a dedicated micro cluster
• Cluster is terminated after 1 hour of inactivity
• Multiple users can collaborate on a notebook
Notebooks can be exported/imported
Examples and tutorials in R/Python/Scala
Free online service for learning Apache Spark

Tutorial Setup
Databricks Notebooks
• Interactive workspace
• Markdown + R, Python, Scala, SQL
Sign up at http://databricks.com/ce

Tutorial Setup
Fill out our survey at
tiny.cc/sparkr-user-survey

SparkR
Big data processing from R
DataFrames for ETL, data exploration
Support for advanced analytics

Tutorial Next Steps
Sign up at http://databricks.com/ce
Part 1: tiny.cc/sparkr-tutorial-part1
Fill out our survey at tiny.cc/sparkr-user-survey

These concerns have prompted calls for comprehensive regulations to ensure the safe and equitable use of AI technologies. Artificial intelligence has also played a significant role in healthcare, particularly highlighted during the COVID-19 pandemic, where it was used in drug discovery, vaccine development, and analyzing the spread of the virus. The capabilities of AI in healthcare are vast, ranging from medical diagnostics to personalized medicine, demonstrating the technology's potential to revolutionize fields beyond just technical or consumer applications. In conclusion, AI continues to be a rapidly evolving field with significant implications for various aspects of society. The development from theoretical concepts to real-world applications illustrates both the potential benefits and the challenges that come with integrating advanced technologies into everyday life. The ongoing discussion about AI ethics and regulation underscores the importance of managing these technologies responsibly to maximize their their benefits while minimizing potential harms.

Artificial Intelligence: Facts and Myths

Joaquim Jorge

Data Cloud, More than a CDP by Matt Robison

Anna Loughnan Colquhoun

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...

Neo4j

🐬 The future of MySQL is Postgres 🐘

RTylerCroy

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving

Edi Saputra

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood

Juan lago vázquez

2024: Domino Containers - The Next Step. News from the Domino Container commu...

Martijn de Jong

Boost Fertility New Invention Ups Success Rates.pdf

sudhanshuwaghmare1

presentation ICT roal in 21st century education

jfdjdjcjdnsjd

Use r tutorial part1, introduction to sparkr

1. Introduction to SparkR Shivaram Venkataraman, Hossein Falaki

2. Big Data & R DataFrames Visualization Libraries Data+

3. Big Data & R: Challenges Data access HDFS, Hive Capacity Single machine memory Parallelism Single Thread

4. Apache Spark Engine for large-scale data processing Fast, Easy to Use Runs Everywhere EC2, clusters, laptop etc.

5. Speed Scalable Flexible Statistics Visualization DataFrames SparkR

6. Big Data & R: Patterns Big Data Small Learning Partition Aggregate Large Scale Machine Learning

7. 1. Big Data, Small Learning Data Cleaning Filtering Aggregation Collect Subset DataFrames Visualization Libraries

8. 1. Big Data, Small Learning Data Cleaning Filtering Aggregation Collect Subset
songs <- read.df("songs.json", "json")
newSongs <- filter(songs, songs$year > 2000)
ggplot(collect(newSongs))

9. 2. Partition Aggregate Data Best Model Params Parameter Tuning

10. 2. Partition Aggregate Data Best Model Params
params <- c(1e-3, 1e-1, 1e2)
data <- read.csv("t.csv")
train <- function(prm) {
  # lambda must be passed by name; the third positional argument of lm.ridge is subset
  lm.ridge(y ~ x + z, data, lambda = prm)
}
lapply(params, train)

11. 3. Large Scale Machine Learning Data Featurize Learning Model

12. 3. Large Scale Machine Learning Data Featurize Learning Model
training <- read.csv("t.csv")
model <- glm(delay ~ Distance + Dest, family = "gaussian", data = training)
summary(model)

13. Big Data & R Big Data Small Learning Partition Aggregate Large Scale Machine Learning SparkR: Unified approach

14. SparkR DataFrames Number of data sources Column Functions, SQL Support for R UDFs
people <- read.df("people.json", "json")
avgAge <- select(people, avg(people$age))
head(avgAge)

15. Large Scale Machine Learning Integration with MLlib Key Features R-like formulas Model statistics
model <- glm(a ~ b + c, data = df)
summary(model)

16. Partition Aggregate spark.lapply: Simple, parallel API Ex: Parameter tuning, Model Averaging Include existing R packages
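The partition-aggregate pattern on this slide can be sketched locally as below. This is a minimal illustration rather than the tutorial's own code: the built-in mtcars data set and the formula mpg ~ wt + hp are stand-ins assumed here in place of the slides' t.csv and y ~ x + z, and model selection by GCV score is likewise an assumption. The sketch uses base lapply so that it runs without a cluster; with an active SparkR session, spark.lapply takes the same list and function and distributes the calls.

```r
library(MASS)  # provides lm.ridge

# Candidate ridge penalties, as on the parameter-tuning slide.
params <- c(1e-3, 1e-1, 1e2)

# Fit one ridge model per penalty and return its GCV score.
# mtcars and the formula are stand-ins (assumptions) for the slide's data.
train <- function(prm) {
  fit <- MASS::lm.ridge(mpg ~ wt + hp, data = mtcars, lambda = prm)
  list(lambda = prm, gcv = unname(fit$GCV))
}

# Partition: one independent fit per parameter value.
# On a cluster, SparkR::spark.lapply(params, train) would replace this call.
results <- lapply(params, train)

# Aggregate: keep the penalty with the lowest GCV score.
gcv <- vapply(results, function(r) r$gcv, numeric(1))
best <- results[[which.min(gcv)]]
print(best$lambda)
```

Calling MASS::lm.ridge with an explicit namespace inside train keeps the function self-contained, which matters when it is shipped to worker processes that have not loaded the package.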

17. SparkR Status Open source, part of Apache Spark. More than 60 committers from UC Berkeley, Databricks, IBM, Intel, Alteryx etc. Contributions welcome!

18. Tutorial Outline Part 1: Data Exploration • ETL: Data loading, schema • Exploration: Filter, clean, aggregate etc. • Visualization: Integration with ggplot Part 2: Advanced Analytics (After the break)

19. Tutorial Setup Each user gets a dedicated micro cluster • Cluster is terminated after 1 hour of inactivity • Multiple users can collaborate on a notebook Notebooks can be exported/imported Examples and tutorials in R/Python/Scala Free online service for learning Apache Spark

20. Tutorial Setup Databricks Notebooks • Interactive workspace • Markdown + R, Python, Scala, SQL Sign up at http://databricks.com/ce

21. Tutorial Setup Fill out our survey at tiny.cc/sparkr-user-survey

22. SparkR Big data processing from R DataFrames for ETL, data exploration Support for advanced analytics

23. Tutorial Next Steps Sign up at http://databricks.com/ce Part 1: tiny.cc/sparkr-tutorial-part1 Fill out our survey at tiny.cc/sparkr-user-survey
