Analyzing Larger Raster Data in a Jupyter Notebook with GeoPySpark on AWS - FOSS4G 2017 Workshop

This document outlines a presentation on analyzing large raster data in a Jupyter notebook with GeoPySpark on AWS. The presentation covers introductory material, exercises on working with land cover and Landsat imagery data, combining data layers to detect crop cycles, and combining different data types to create maps. It discusses where the notebooks are running, data sources, and GeoPySpark capabilities like working with space-time raster data. Attendees are encouraged to tweet maps created during the exercises.

Rob Emanuele @lossyrob
ANALYZING LARGE RASTER DATA
IN A JUPYTER NOTEBOOK
WITH GEOPYSPARK
ON AWS

Connect to the WIFI
Network: Harvard University
http://getonline.harvard.edu
Click “I am a guest”
Credentials:
U: foss4g2017@gmail.com
P: 7RFQU3rm
FIRST:
Find your Jupyter Notebook URL
https://git.io/v77lh
(lowercase L)
visit the URL next to your name
Log in to the Jupyter Hub
U: hadoop
P: hadoop

OUTLINE
8:00 - 8:30 Intro and Background
8:30 - 9:10 Section 1: Land Cover data
9:10 - 10:00 Section 2: Landsat 8 data
10:00 - 10:10 BREAK
10:10 - 10:30 Deployment and Ingestion
10:30 - 11:10 Section 3: Combining data layers
11:10 - 12:00 Section 4: Making Cool Maps

rdd.map(lambda x: x + 1)
Source: http://silverpond.com.au/2016/10/06/balancing-spark.ht
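As a rough local analogy for the call above, the sketch below applies the same lambda to each partition of a list independently, which is approximately how Spark evaluates `map`. The `partitions` data and the `map_partitions` helper are hypothetical stand-ins written for this illustration; they are not part of the PySpark API.

```python
# Plain-Python stand-in for rdd.map(lambda x: x + 1); no Spark required.
# `partitions` is made-up data: each inner list plays the role of one RDD partition.
partitions = [[1, 2, 3], [4, 5], [6]]


def map_partitions(func, parts):
    """Apply func to every element, one partition at a time (hypothetical helper)."""
    return [[func(x) for x in part] for part in parts]


# Each partition is transformed on its own; no element needs data from another partition.
mapped = map_partitions(lambda x: x + 1, partitions)
print(mapped)
```

Because each element is handled in isolation, a map of this kind needs no data movement between partitions.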

(0, 0) (1, 0) (2, 0)
(0, 1) (1, 1) (2, 1)
(0, 2) (1, 2) (2, 2)

(0, 0) (1, 0) (2, 0)
(0, 1) (1, 1) (2, 1)
(0, 2) (1, 2) (2, 2)
Node 1
Node 2
Node 3
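One way to picture tile keys being spread over three nodes is sketched below. The slides do not say how keys are assigned, so the `assign_node` partitioner (row-major index modulo the node count) is purely an assumption for illustration and is not how Spark or GeoTrellis necessarily partitions data.

```python
from collections import defaultdict

# The 3 x 3 grid of (col, row) tile keys from the slide, in row-major order.
keys = [(col, row) for row in range(3) for col in range(3)]

NUM_NODES = 3  # the slide shows three nodes


def assign_node(key, num_nodes=NUM_NODES):
    """Hypothetical partitioner: row-major index modulo the node count."""
    col, row = key
    return (row * 3 + col) % num_nodes


# Collect the keys that land on each node.
nodes = defaultdict(list)
for key in keys:
    nodes[assign_node(key)].append(key)

for node_id in sorted(nodes):
    print("Node", node_id + 1, nodes[node_id])
```

Under this made-up assignment, horizontally adjacent tiles such as (0, 1), (1, 1) and (2, 1) end up on three different nodes, so any operation that needs a tile's neighbors has to move data between nodes.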

(0, 1) (1, 1) (2, 1)
Node 1
Node 2
Node 3

(0, 1) (1, 1) (2, 1)
Node 1
Node 2
Node 3
rdd.bufferTiles(…)

Interactive and Batch Processing
of large raster data
Web-Speed Processing
of small to medium sized raster data

GeoTrellis Ecosystem
Raster Foundry by
Spark SQL and Spark ML support
Raster Frames by
Spark SQL and Spark ML support
GeoPySpark
Python bindings
Vector Pipes
Vector Tiles on Spark
PDAL integration
Point Clouds on Spark

GeoPySpark
Started December 2016
Follows PySpark’s model of communication
between the Java Virtual Machine and Python
Provides access to GeoTrellis functionality through Python
and integrates with your favorite Python raster
tools (numpy + friends).
0.2 is released!

EXERCISE 2:
WORKING WITH LANDSAT IMAGERY
AND NDVI THROUGH TIME

(SpaceTimeKey, Tile)
(SpaceTimeKey, Tile)
(SpaceTimeKey, Tile)
…
SpaceTimeKey ≈ (col, row, instant)
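The key shape on this slide can be sketched with plain named tuples. GeoPySpark provides its own key types; the definitions below are local stand-ins that only mirror the (col, row, instant) fields shown above, and the `to_spatial` helper and the example date are made up for illustration.

```python
from collections import namedtuple
from datetime import datetime

# Local stand-ins mirroring the fields on the slide (not imported from GeoPySpark).
SpatialKey = namedtuple("SpatialKey", ["col", "row"])
SpaceTimeKey = namedtuple("SpaceTimeKey", ["col", "row", "instant"])

# A tile at column 1, row 2 of the layout, captured at a made-up instant.
key = SpaceTimeKey(col=1, row=2, instant=datetime(2017, 8, 14))


def to_spatial(stk):
    """Drop the time component, keeping only the position in the tile grid."""
    return SpatialKey(stk.col, stk.row)


print(key)
print(to_spatial(key))
```

Two tiles covering the same ground at different times share the same `col` and `row` and differ only in `instant`.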

(SpaceTimeKey, Tile)
(SpaceTimeKey, Tile)
(SpaceTimeKey, Tile)
…
lambda
lambda
lambda
(SpatialKey, (DateTime, Tile))
(SpatialKey, (DateTime, Tile))
(SpatialKey, (DateTime, Tile))
…

…
(SpatialKey, [(DateTime, Tile), (DateTime, Tile)])
(SpatialKey, (DateTime, Tile))
(SpatialKey, (DateTime, Tile))
(SpatialKey, (DateTime, Tile))
(SpatialKey, [(DateTime, Tile)])
…

(SpatialKey, [(DateTime, Tile), (DateTime, Tile)])
(SpatialKey, [(DateTime, Tile)])
…
mosaic
(SpatialKey, Tile)
(SpatialKey, Tile)
…
mosaic
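The three steps on the preceding slides (re-key by space, group by spatial key, mosaic through time) can be sketched in plain Python. Everything here is an illustrative assumption: tiles are small nested lists, `None` stands in for NoData, and the mosaic rule (earliest date first, later dates only fill gaps) is a guess, since the slides do not state which rule the workshop used.

```python
from collections import defaultdict
from datetime import datetime

NODATA = None  # stand-in for a tile's NoData value

# Made-up records shaped like (SpaceTimeKey, Tile): ((col, row, instant), tile).
records = [
    ((0, 0, datetime(2017, 6, 1)), [[1, NODATA], [NODATA, NODATA]]),
    ((0, 0, datetime(2017, 7, 1)), [[9, 2], [NODATA, 4]]),
    ((1, 0, datetime(2017, 6, 1)), [[5, 6], [7, 8]]),
]

# Step 1 (the "lambda" arrows): re-key to (SpatialKey, (DateTime, Tile)).
by_space = [((col, row), (instant, tile)) for (col, row, instant), tile in records]

# Step 2: group all dated tiles that share a spatial key.
grouped = defaultdict(list)
for spatial_key, dated_tile in by_space:
    grouped[spatial_key].append(dated_tile)


def mosaic(dated_tiles):
    """Merge tiles in date order; a cell keeps the first non-NoData value seen."""
    ordered = sorted(dated_tiles, key=lambda pair: pair[0])
    rows, cols = len(ordered[0][1]), len(ordered[0][1][0])
    out = [[NODATA] * cols for _ in range(rows)]
    for _, tile in ordered:
        for r in range(rows):
            for c in range(cols):
                if out[r][c] is NODATA and tile[r][c] is not NODATA:
                    out[r][c] = tile[r][c]
    return out


# Step 3: one mosaicked tile per spatial key, i.e. (SpatialKey, Tile).
mosaicked = {key: mosaic(tiles) for key, tiles in grouped.items()}
print(mosaicked)
```

In a Spark job the grouping step is where data moves between nodes, because all tiles for one spatial key must be brought together before they can be merged.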

WHERE AND HOW ARE THESE
NOTEBOOKS RUNNING?

EXERCISE 3:
COMBINING LAND COVER AND NDVI TO
DETECT CROP CYCLES

(SpaceTimeKey, Tile)
(SpaceTimeKey, Tile)
(SpaceTimeKey, Tile)
…

(SpaceTimeKey, Tile)
(SpaceTimeKey, Tile)
(SpaceTimeKey, Tile)
…
map_to_spatial
(SpatialKey, (STK, Tile))
(SpatialKey, (STK, Tile))
(SpatialKey, (STK, Tile))
…
map_to_spatial
map_to_spatial
STK = SpaceTimeKey

(SpatialKey, (STK, Tile))
(SpatialKey, (STK, Tile))
(SpatialKey, (STK, Tile))
…
(SpatialKey, Tile)
(SpatialKey, Tile)
…
ndwi_rdd
nlcd_layer.to_numpy_rdd()
(SpatialKey, ((STK, Tile), Tile))
(SpatialKey, ((STK, Tile), Tile))
(SpatialKey, ((STK, Tile), Tile))
…

mask_ndwi
mask_ndwi
mask_ndwi
(SpaceTimeKey, Tile)
(SpaceTimeKey, Tile)
(SpaceTimeKey, Tile)
…
(SpatialKey, ((STK, Tile), Tile))
(SpatialKey, ((STK, Tile), Tile))
(SpatialKey, ((STK, Tile), Tile))
…
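The data flow on these slides (re-key by space, join with the land cover tiles, mask, return to space-time keys) can be sketched in plain Python. The names `map_to_spatial` and `mask_ndwi` are taken from the slide labels, but their bodies here are guesses written for illustration. The input data is made up, `None` stands in for NoData, and keeping only NLCD class 82 (Cultivated Crops) is an assumption about what the mask selects.

```python
from datetime import datetime

NODATA = None      # stand-in for a tile's NoData value
CROP_CLASS = 82    # NLCD "Cultivated Crops"; using it as the mask is an assumption

# Made-up (SpaceTimeKey, Tile) records: ((col, row, instant), index tile).
ndvi_records = [
    ((0, 0, datetime(2017, 6, 1)), [[0.2, 0.8], [0.5, 0.1]]),
    ((0, 0, datetime(2017, 7, 1)), [[0.3, 0.9], [0.6, 0.2]]),
]

# Made-up (SpatialKey, Tile) land cover layer: one class code per cell.
land_cover = {(0, 0): [[82, 41], [82, 11]]}


def map_to_spatial(record):
    """(STK, Tile) -> (SpatialKey, (STK, Tile)); the full key is kept alongside the tile."""
    stk, tile = record
    col, row, _ = stk
    return ((col, row), (stk, tile))


def mask_ndwi(joined):
    """(SpatialKey, ((STK, Tile), Tile)) -> (STK, Tile) with non-crop cells set to NoData."""
    _, ((stk, index_tile), lc_tile) = joined
    masked = [
        [v if lc == CROP_CLASS else NODATA for v, lc in zip(v_row, lc_row)]
        for v_row, lc_row in zip(index_tile, lc_tile)
    ]
    return (stk, masked)


keyed = [map_to_spatial(r) for r in ndvi_records]
# Inner join on the spatial key, pairing every dated tile with its land cover tile.
joined = [(k, (v, land_cover[k])) for k, v in keyed if k in land_cover]
masked_records = [mask_ndwi(j) for j in joined]
print(masked_records)
```

Carrying the SpaceTimeKey inside the value during the join is what lets the final step restore the original (SpaceTimeKey, Tile) shape.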

EXERCISE 4:
COMBINING IMAGERY, ELEVATION AND
LAND COVER DATA
TO MAKE A COOL LOOKING MAP

TWEET YOUR SWEET MAP SCREENSHOTS WITH
#GEOPYSPARK #FOSS4G!


Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...

Drew Madelung

Evaluating the top large language models.pdf

ChristopherTHyatt

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke

Product Anonymous

Strategies for Landing an Oracle DBA Job as a Fresher

Remote DBA Services

Abhishek Deb(1), Mr Abdul Kalam(2) M. Des (UX) , School of Design, DIT University , Dehradun. This paper explores the future potential of AI-enabled smartphone processors, aiming to investigate the advancements, capabilities, and implications of integrating artificial intelligence (AI) into smartphone technology. The research study goals consist of evaluating the development of AI in mobile phone processors, analyzing the existing state as well as abilities of AI-enabled cpus determining future patterns as well as chances together with reviewing obstacles as well as factors to consider for more growth.

Exploring the Future Potential of AI-Enabled Smartphone Processors

debabhi2

[2024]Digital Global Overview Report 2024 Meltwater.pdf

hans926745

Presentation on how to chat with PDF using ChatGPT code interpreter

naman860154

Tech Trends Report 2024 Future Today Institute.pdf

hans926745

GenCyber Cyber Security Day Presentation

Michael W. Hawkins

Recently uploaded (20)

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...

2024: Domino Containers - The Next Step. News from the Domino Container commu...

IAC 2024 - IA Fast Track to Search Focused AI Solutions

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf

Artificial Intelligence: Facts and Myths

Driving Behavioral Change for Information Management through Data-Driven Gree...

Boost PC performance: How more available memory can improve productivity

Finology Group – Insurtech Innovation Award 2024

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx

GenAI Risks & Security Meetup 01052024.pdf

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...

Evaluating the top large language models.pdf

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke

Strategies for Landing an Oracle DBA Job as a Fresher

Exploring the Future Potential of AI-Enabled Smartphone Processors

[2024]Digital Global Overview Report 2024 Meltwater.pdf

Presentation on how to chat with PDF using ChatGPT code interpreter

Tech Trends Report 2024 Future Today Institute.pdf

GenCyber Cyber Security Day Presentation

Analyzing Larger RasterData in a Jupyter Notebook with GeoPySpark on AWS - FOSS4G 2017 Workshop

1. Rob Emanuele @lossyrob ANALYZING LARGE RASTER DATA IN A JUPYTER NOTEBOOK WITH GEOPYSPARK ON AWS

2. Connect to the WIFI Network: Harvard University http://getonline.harvard.edu Click “I am a guest” Credentials: U: foss4g2017@gmail.com P: 7RFQU3rm FIRST: Find your Jupyter Notebook URL https://git.io/v77lh (lowercase L) visit the URL next to your name Log in to the Jupyter Hub U: hadoop P: hadoop

3. OUTLINE 8:00 - 8:30 Intro and Background 8:30 - 9:10 Section 1: Land Cover data 9:10 - 10:00 Section 2: Landsat 8 data 10:00 - 10:10 BREAK 10:10 - 10:30 Deployment and Ingestion 10:30 - 11:10 Section 3: Combining data layers 11:10 - 12:00 Section 4: Making Cool Maps

4. NOW: A MOTIVATING EXAMPLE

8. BY

10. rdd.map(lambda x: x + 1) Source: http://silverpond.com.au/2016/10/06/balancing-spark.ht
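The `rdd.map(lambda x: x + 1)` call on slide 10 applies a function to each element independently, which is what lets Spark run it one partition at a time. A minimal pure-Python sketch of that idea follows. The list-of-lists `partitions` and the helper `map_partitions` are hypothetical stand-ins for illustration, not Spark APIs.

```python
# Stand-in for an RDD: each inner list plays the role of one partition.
# (Hypothetical structure for illustration; a real RDD is managed by Spark.)
partitions = [[1, 2, 3], [4, 5], [6]]


def map_partitions(parts, fn):
    """Apply fn to every element, one partition at a time."""
    return [[fn(x) for x in part] for part in parts]


incremented = map_partitions(partitions, lambda x: x + 1)
print(incremented)
```

Because no element depends on another, each partition could be processed on a different node without any data moving between them.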

11.

12.

13.

14.

15. (1, 1) (2, 1)(0, 1) (0, 0) (1, 0) (2, 0) (1, 2) (2, 2)(0, 2)

16. (1, 1) (2, 1)(0, 1) (0, 0) (1, 0) (2, 0) (1, 2) (2, 2)(0, 2) Node 1 Node 2 Node 3

17. (1, 1) (2, 1)(0, 1) (0, 0) (1, 0) (2, 0) (1, 2) (2, 2)(0, 2) Node 1 Node 2 Node 3

18. (1, 1) (2, 1)(0, 1) (0, 0) (1, 0) (2, 0) (1, 2) (2, 2)(0, 2) Node 1 Node 2 Node 3
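Slides 15 to 18 show a 3x3 grid of tile keys being spread across three nodes. The sketch below shows one way such an assignment could look. It assumes a simple row-based rule purely for illustration (the slides do not state the rule, and Spark's default partitioner hashes keys instead). `assign_to_nodes` is a hypothetical helper.

```python
from collections import defaultdict

# The 3x3 grid of (col, row) tile keys shown on the slides.
keys = [(col, row) for row in range(3) for col in range(3)]


def assign_to_nodes(tile_keys, num_nodes):
    """Group tile keys by node using a row-based rule (an assumption)."""
    nodes = defaultdict(list)
    for col, row in tile_keys:
        nodes[row % num_nodes].append((col, row))
    return dict(nodes)


layout = assign_to_nodes(keys, 3)
for node, node_keys in sorted(layout.items()):
    print("Node", node + 1, node_keys)
```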

19. (1, 1) (2, 1)(0, 1) Node 1 Node 2 Node 3

20. (1, 1) (2, 1)(0, 1) Node 1 Node 2 Node 3 rdd.bufferTiles(…)
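A focal operation on one tile needs pixels from the edges of adjacent tiles, and those tiles may live on other nodes. The sketch below only works out which neighbouring keys a tile would need, assuming the 3x3 grid from the slides. `neighbor_keys` is a hypothetical helper and does not reproduce how `bufferTiles` is implemented.

```python
def neighbor_keys(key, cols, rows):
    """Return the keys of the up-to-8 tiles adjacent to `key` within the grid."""
    col, row = key
    result = []
    for d_row in (-1, 0, 1):
        for d_col in (-1, 0, 1):
            if d_col == 0 and d_row == 0:
                continue  # skip the tile itself
            n_col, n_row = col + d_col, row + d_row
            if 0 <= n_col < cols and 0 <= n_row < rows:
                result.append((n_col, n_row))
    return result


center = neighbor_keys((1, 1), cols=3, rows=3)
corner = neighbor_keys((0, 0), cols=3, rows=3)
print(len(center), len(corner))
```

The centre tile depends on all eight neighbours while a corner tile depends on three, so buffering generally implies moving edge data between nodes.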

21. + + Interactive and Batch Processing of large raster data Web-Speed Processing of small to medium sized raster data

22. GeoTrellis Ecosystem Raster Foundry by Spark SQL and Spark ML support Raster Frames by Spark SQL and Spark ML support GeoPySpark Python bindings Vector Pipes Vector Tiles on Spark PDAL integration Point Clouds on Spark

23. GeoPySpark

24. GeoPySpark: Started December 2016. Follows PySpark’s model of communication between the Java Virtual Machine and Python. Access GeoTrellis functionality through Python, and integrate with your favorite Python raster tools (numpy + friends). 0.2 is released!

25.

26.

27.

28. EXERCISE 1: ANALYZING LAND COVER DATA

29. EXERCISE 2: WORKING WITH LANDSAT IMAGERY AND NDVI THROUGH TIME
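For reference, NDVI is computed per pixel as (NIR - Red) / (NIR + Red). The sketch below assumes plain Python floats as inputs. The function name and the choice to return `None` when the denominator is zero are illustrative, not taken from the workshop notebooks.

```python
def ndvi(red, nir):
    """Normalized Difference Vegetation Index for one pixel; None if undefined."""
    total = nir + red
    if total == 0:
        return None  # avoid dividing by zero on empty pixels
    return (nir - red) / total


# On Landsat 8, red is band 4 and near-infrared is band 5.
print(ndvi(red=0.1, nir=0.5))
```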

30.

31. (SpaceTimeKey, Tile) (SpaceTimeKey, Tile) (SpaceTimeKey, Tile) … SpaceTimeKey ≈ (col, row, instant)
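The key shape on slide 31 can be sketched with plain named tuples. These are local stand-ins defined here to mirror the slide, not imports from GeoPySpark, and `to_spatial` is a hypothetical helper.

```python
from collections import namedtuple
from datetime import datetime

# Local stand-ins mirroring the shape on the slide; not imported from GeoPySpark.
SpatialKey = namedtuple("SpatialKey", ["col", "row"])
SpaceTimeKey = namedtuple("SpaceTimeKey", ["col", "row", "instant"])

key = SpaceTimeKey(col=1, row=2, instant=datetime(2017, 8, 14))


def to_spatial(stk):
    """Split a space-time key into its spatial key and its instant."""
    return SpatialKey(stk.col, stk.row), stk.instant


spatial_key, instant = to_spatial(key)
print(spatial_key, instant)
```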

32. (SpaceTimeKey, Tile) (SpaceTimeKey, Tile) (SpaceTimeKey, Tile) … lambda lambda lambda (SpatialKey, (DateTime, Tile)) (SpatialKey, (DateTime, Tile)) (SpatialKey, (DateTime, Tile)) …

33. … (SpatialKey, [(DateTime, Tile) (DateTime, Tile)]) (SpatialKey, (DateTime, Tile)) (SpatialKey, (DateTime, Tile)) (SpatialKey, (DateTime, Tile)) (SpatialKey, [(DateTime, Tile)]) …

34. … (SpatialKey, [(DateTime, Tile) (DateTime, Tile)]) (SpatialKey, (DateTime, Tile)) (SpatialKey, (DateTime, Tile)) (SpatialKey, (DateTime, Tile)) (SpatialKey, [(DateTime, Tile)]) (Shuffle) …

35. (SpatialKey, [(DateTime, Tile) (DateTime, Tile)]) (SpatialKey, [(DateTime, Tile)]) … mosaic (SpatialKey, Tile) (SpatialKey, Tile) … mosaic
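Slides 32 to 35 describe three steps: rekey each record by space, group the dated tiles that share a spatial key (the shuffle), then mosaic each group into one tile. Below is a pure-Python sketch of that flow on toy data. The list-of-lists tiles, the `None` NoData marker, and the "newest valid value wins" mosaic rule are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Toy records: ((col, row, instant), tile); tiles are 1x2 grids, None = NoData.
records = [
    ((0, 0, datetime(2017, 6, 1)), [[1, None]]),
    ((0, 0, datetime(2017, 7, 1)), [[None, 5]]),
    ((1, 0, datetime(2017, 6, 1)), [[7, 8]]),
]

# Step 1 (the per-record lambda): rekey by space, keep the date with the tile.
rekeyed = [((col, row), (instant, tile)) for (col, row, instant), tile in records]

# Step 2 (the shuffle): gather every dated tile that shares a spatial key.
grouped = defaultdict(list)
for spatial_key, dated_tile in rekeyed:
    grouped[spatial_key].append(dated_tile)


def mosaic(dated_tiles):
    """Merge tiles oldest-first so newer non-NoData cells overwrite older ones."""
    ordered = sorted(dated_tiles, key=lambda pair: pair[0])
    merged = [list(tile_row) for tile_row in ordered[0][1]]
    for _, tile in ordered[1:]:
        for r, tile_row in enumerate(tile):
            for c, value in enumerate(tile_row):
                if value is not None:
                    merged[r][c] = value
    return merged


# Step 3: one mosaicked tile per spatial key.
mosaicked = {key: mosaic(tiles) for key, tiles in grouped.items()}
print(mosaicked)
```

Only step 2 needs data from several records at once, which is why the slides mark it as the shuffle.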

36. BREAK!

37. WHERE AND HOW ARE THESE NOTEBOOKS RUNNING?

38.

39.

40.

41.

42. WHERE’S THIS DATA COMING FROM?

43. Supported Backends

44. EXERCISE 3: COMBINING LAND COVER AND NDVI TO DETECT CROP CYCLES

45.

46. (SpaceTimeKey, Tile) (SpaceTimeKey, Tile) (SpaceTimeKey, Tile) …

47. (SpaceTimeKey, Tile) (SpaceTimeKey, Tile) (SpaceTimeKey, Tile) … map_to_spatial (SpatialKey, (STK, Tile)) (SpatialKey, (STK, Tile)) (SpatialKey, (STK, Tile)) … map_to_spatial map_to_spatial STK = SpaceTimeKey

48. (SpatialKey, (STK, Tile)) (SpatialKey, (STK, Tile)) (SpatialKey, (STK, Tile)) … (SpatialKey, Tile) (SpatialKey, Tile) … ndwi_rdd nlcd_layer.to_numpy_rdd() (SpatialKey, ((STK, Tile), Tile)) (SpatialKey, ((STK, Tile), Tile)) (SpatialKey, ((STK, Tile),Tile)) …

49. (SpatialKey, (STK, Tile)) (SpatialKey, (STK, Tile)) (SpatialKey, (STK, Tile)) … (SpatialKey, Tile) (SpatialKey, Tile) … ndwi_rdd nlcd_layer.to_numpy_rdd() (SpatialKey, ((STK, Tile), Tile)) (SpatialKey, ((STK, Tile), Tile)) (SpatialKey, ((STK, Tile),Tile)) … (Shuffle)

50. mask_ndwi mask_ndwi mask_ndwi (SpaceTimeKey, Tile) (SpaceTimeKey, Tile) (SpaceTimeKey, Tile) … (SpatialKey, ((STK, Tile), Tile)) (SpatialKey, ((STK, Tile), Tile)) (SpatialKey, ((STK, Tile),Tile)) …
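Slides 46 to 50 rekey the time series by space, join it with the land cover layer on the spatial key, then mask each tile and restore its space-time key. The pure-Python sketch below follows that flow on toy data. The tile representation, the string instants, and the rule of keeping only NLCD class 82 (Cultivated Crops) are assumptions for illustration, and this `mask_ndwi` is a guess at the shape of the function named on the slide, not the workshop's code.

```python
# Toy inputs. Tiles are 1x2 grids; None marks NoData.
# ndwi: keyed by (col, row, instant); nlcd: keyed by (col, row).
ndwi = [((0, 0, "2017-06"), [[0.2, 0.4]]), ((0, 0, "2017-07"), [[0.3, 0.1]])]
nlcd = {(0, 0): [[82, 41]]}

CROP_CLASS = 82  # NLCD "Cultivated Crops"; choosing this class is an assumption

# map_to_spatial: rekey by (col, row), carrying the original key with the tile.
by_space = [((col, row), ((col, row, t), tile)) for (col, row, t), tile in ndwi]

# Join: pair each dated tile with the land cover tile at the same spatial key.
joined = [(sk, (stk_tile, nlcd[sk])) for sk, stk_tile in by_space if sk in nlcd]


def mask_ndwi(record):
    """Keep index values only where land cover is CROP_CLASS; restore the STK."""
    _, ((stk, tile), cover) = record
    masked = [
        [v if c == CROP_CLASS else None for v, c in zip(tile_row, cover_row)]
        for tile_row, cover_row in zip(tile, cover)
    ]
    return stk, masked


masked_layer = [mask_ndwi(rec) for rec in joined]
print(masked_layer)
```

Carrying the full space-time key through the join is what allows the result to become a (SpaceTimeKey, Tile) layer again after masking.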

51. EXERCISE 4: COMBINING IMAGERY, ELEVATION AND LAND COVER DATA TO MAKE A COOL LOOKING MAP

52. EXERCISE 4: COMBINING IMAGERY, ELEVATION AND LAND COVER DATA TO MAKE A COOL LOOKING MAP TWEET YOUR SWEET MAP SCREENSHOTS WITH #GEOPYSPARK #FOSS4G!

53. FINAL QUESTIONS?

54. Thank you!
