Data science toolkit 101: set up Python, Spark, & Jupyter

Data science toolkit 101: set up Python, Spark, & Jupyter on a Mac laptop


IBM Cloud Data Services
Raj Singh, PhD
Developer Advocate: Geo | Open Data
rrsingh@us.ibm.com
http://ibm.biz/rajrsingh
twitter: @rajrsingh

Agenda
• Installation
• Python
• Spark
• Pixiedust
• Examples

IBM Analytics
Data Science Experience (DSX)

What is Spark?
• In-memory Hadoop
• Hadoop was massively scalable but slow
• “Up to 100x faster” (10x faster if memory is exhausted)
• What is Hadoop?
• HDFS: fault-tolerant storage using horizontally scalable commodity hardware
• MapReduce: programming style for distributed processing
• Presents data as an object independent of the underlying storage
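To make the "MapReduce: programming style" bullet concrete, here is a minimal word-count sketch in plain Python. It does not use Hadoop or Spark and runs on one machine; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration. It only shows the shape of the style: emit key/value pairs, group them by key, then combine each group.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: combine the values collected for each key into one count."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in grouped.items()}


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    print(word_count(["spark is fast", "hadoop is scalable"]))
```

In a real cluster the map and reduce steps run on many machines at once and the grouping happens over the network; the sketch keeps everything in one process so the three stages are easy to see.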

Spark abstracted storage
• Scala
• PySpark = (Spark + Python)
• Drivers
• File storage
• Cloudant
• dashDB
• Cassandra
• …

Python installation with miniconda
1. https://www.continuum.io/downloads (choose version 2.7)
2. Miniconda2 install into this location: /Users/<username>/miniconda2
3. bash$ conda install pandas jupyter matplotlib
4. bash$ which python
/Users/<username>/miniconda2/bin/python
https://dzone.com/refcardz/apache-spark
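Beyond `which python`, a quick way to confirm the install is to ask the interpreter itself which binary is running and whether the packages import. The sketch below is one possible check, not part of the original deck; check_packages is a hypothetical helper name, and ipykernel is included on the assumption that it was pulled in with jupyter (the kernel.json later in this deck launches it with -m ipykernel).

```python
import importlib
import sys


def check_packages(names):
    """Return {package name: True/False} depending on whether it imports."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


if __name__ == "__main__":
    # Should print a path under your miniconda2 directory.
    print(sys.executable)
    results = check_packages(["pandas", "matplotlib", "ipykernel"])
    for name, ok in sorted(results.items()):
        print("{0}: {1}".format(name, "ok" if ok else "MISSING"))
```

Run it with the same python that `which python` reports; a MISSING line means the conda install step needs to be repeated for that package.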

Spark installation
• http://spark.apache.org/downloads.html
• Spark release: 1.6.2
• package type: Pre-built for Hadoop 2.6
• mkdir dev
• cd dev
• tar xzf ~/Downloads/spark-1.6.2-bin-hadoop2.6.tgz
• ln -s spark-1.6.2-bin-hadoop2.6 spark
• mkdir ~/dev/notebooks
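After unpacking and symlinking, it can help to confirm that the files the later slides depend on are where they are expected. The following is a small sketch under that assumption; missing_spark_files is a hypothetical helper, and the three paths it checks are taken from the configuration slide in this deck (shell.py, the py4j-0.9 zip, and the spark-defaults template).

```python
import os


def missing_spark_files(spark_home):
    """Return the expected entries that are absent under spark_home."""
    expected = [
        os.path.join("python", "pyspark", "shell.py"),
        os.path.join("python", "lib", "py4j-0.9-src.zip"),
        os.path.join("conf", "spark-defaults.conf.template"),
    ]
    return [rel for rel in expected
            if not os.path.exists(os.path.join(spark_home, rel))]


if __name__ == "__main__":
    # ~/dev/spark is the symlink created with ln -s in the steps above.
    spark_home = os.path.expanduser(os.path.join("~", "dev", "spark"))
    missing = missing_spark_files(spark_home)
    if missing:
        print("Missing under {0}: {1}".format(spark_home, ", ".join(missing)))
    else:
        print("Spark layout looks complete: {0}".format(spark_home))
```

If you downloaded a different Spark release, the py4j file name will probably differ, so adjust the expected list (and the PYTHONPATH in kernel.json) to match.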

PySpark configuration
• create directory ~/.ipython/kernels/pyspark1.6/
• create file kernel.json
• cd ~/dev/spark/conf
• cp spark-defaults.conf.template spark-defaults.conf
• add to end of spark-defaults.conf:
spark.driver.extraClassPath=<HOME DIRECTORY>/data/libs/*

kernel.json:
{
  "display_name": "pySpark (Spark 1.6.2) Python 2",
  "language": "python",
  "argv": [
    "/Users/sparktest/miniconda2/bin/python",
    "-m",
    "ipykernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/Users/sparktest/dev/spark",
    "PYTHONPATH": "/Users/sparktest/dev/spark/python/:/Users/sparktest/dev/spark/python/lib/py4j-0.9-src.zip",
    "PYTHONSTARTUP": "/Users/sparktest/dev/spark/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master local[10] pyspark-shell",
    "SPARK_DRIVER_MEMORY": "10G",
    "SPARK_LOCAL_IP": "127.0.0.1"
  }
}
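The kernel.json on this slide hard-codes the home directory /Users/sparktest, so it has to be edited by hand for any other account. As an alternative, the file can be generated. The sketch below rebuilds the same settings from a home directory; build_kernel_spec and write_kernel_spec are hypothetical helper names, and the layout (miniconda2 and dev/spark under the home directory, py4j-0.9, local[10], 10G) is assumed to match the earlier slides.

```python
import json
import os


def build_kernel_spec(home, cores=10, driver_memory="10G"):
    """Build the kernel.json contents from this deck for a given home directory."""
    spark_home = os.path.join(home, "dev", "spark")
    python_dir = os.path.join(spark_home, "python")
    return {
        "display_name": "pySpark (Spark 1.6.2) Python 2",
        "language": "python",
        "argv": [
            os.path.join(home, "miniconda2", "bin", "python"),
            "-m", "ipykernel", "-f", "{connection_file}",
        ],
        "env": {
            "SPARK_HOME": spark_home,
            # PySpark sources plus the bundled py4j bridge library.
            "PYTHONPATH": os.pathsep.join([
                python_dir,
                os.path.join(python_dir, "lib", "py4j-0.9-src.zip"),
            ]),
            # shell.py runs at kernel start-up and creates the sc variable.
            "PYTHONSTARTUP": os.path.join(python_dir, "pyspark", "shell.py"),
            "PYSPARK_SUBMIT_ARGS":
                "--master local[{0}] pyspark-shell".format(cores),
            "SPARK_DRIVER_MEMORY": driver_memory,
            "SPARK_LOCAL_IP": "127.0.0.1",
        },
    }


def write_kernel_spec(home, kernel_dir=None):
    """Write kernel.json into kernel_dir (default: the directory from this deck)."""
    if kernel_dir is None:
        kernel_dir = os.path.join(home, ".ipython", "kernels", "pyspark1.6")
    if not os.path.isdir(kernel_dir):
        os.makedirs(kernel_dir)
    path = os.path.join(kernel_dir, "kernel.json")
    with open(path, "w") as handle:
        json.dump(build_kernel_spec(home), handle, indent=2)
    return path


if __name__ == "__main__":
    # Preview only; call write_kernel_spec(...) to actually create the file.
    print(json.dumps(build_kernel_spec(os.path.expanduser("~")), indent=2))
```

Calling write_kernel_spec(os.path.expanduser("~")) creates the directory and file in one step. The cores and driver_memory values are the ones from the slide; lower them on a laptop with fewer cores or less RAM.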

PySpark test
• bash$ cd ~/dev
• bash$ jupyter notebook
• in the upper right of the Jupyter screen, click New and choose pySpark (Spark 1.6.2) Python 2 (or whatever name you specified in your kernel.json file)
• in the notebook's first cell, enter sc.version and click the >| button to run it (or hit CTRL + Enter)
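Printing sc.version only shows that the SparkContext exists. A slightly stronger check is to run a tiny job through it. The cell below is a sketch of that idea; smoke_test is a hypothetical helper name, and sc is assumed to be the SparkContext that pyspark's shell.py creates when the kernel starts (via the PYTHONSTARTUP setting in kernel.json).

```python
def smoke_test(sc):
    """Run a tiny job on an existing SparkContext and return (version, total)."""
    # Distribute the numbers 1..100 across the local worker threads.
    numbers = sc.parallelize(range(1, 101))
    # Double each value on the workers, then add the results together.
    total = numbers.map(lambda x: x * 2).sum()
    return sc.version, total
```

In the notebook, define the function in one cell and run smoke_test(sc) in the next. With the setup from this deck the result should be the Spark version string paired with 10100 (twice the sum of 1 through 100).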

Pixiedust installation
• cd ~/dev
• git clone https://github.com/ibm-cds-labs/pixiedust.git
• pip install --user --upgrade --no-deps -e /Users/sparktest/dev/pixiedust
• pip install maven-artifact
• pip install mpld3

Examples
• Pixiedust
• https://github.com/ibm-cds-labs/pixiedust
• Demographic analyses
• http://ibm-cds-labs.github.io/open-data/samples/
• or https://github.com/ibm-cds-labs/open-data/tree/master/samples

Raj Singh
Developer Advocate: Geo | Open Data
rrsingh@us.ibm.com
http://ibm.biz/rajrsingh
Twitter: @rajrsingh
LinkedIn: rajrsingh
Thanks
