3. Cloud & Big Data terms that you should know
Source: https://www.iowafarmbureau.com/f/4b37b785-c3c3-4773-9727-327e14e56d6b/segment1
‘Big Data has been providing a useful tool to ensure that each year we are improving our production plan. Small increases in adopting change on the farm can lead to significant long-term success […] Every producer enters spring with the best plan for their farm based on the information they have available […] Increase value derived from traditional on-farm data sources: leverage knowledge from planting, fertility, and yield maps to make better input decisions’
Cloud computing is Internet-based computing that largely offers on-demand access to computing resources.
A cluster is a collection of commodity computers connected by a high-speed network.
Grid computing combines computers from multiple administrative domains to reach a common goal or solve a single task; the grid may then disappear just as quickly as it formed.
A DFS (distributed file system) is any file system that allows access to files from multiple hosts and may include replication and fault tolerance.
In sum, cloud computing provides users with real-time computation, data access, and storage without requiring them to know or worry about the physical location and configuration of the system that delivers the services.
Bit: binary digit (0 or 1)
Byte: one character (e.g. "Hello World" -> 11 bytes)
Kilobyte: A paragraph (e.g. low resolution photo -> 100KB )
Megabyte: A short novel (e.g. A high-resolution photograph -> 2MB)
Gigabyte: 7 minutes of HD-TV video (e.g. A library floor of academic journals -> 100 GB)
Terabyte: 300 hours of good quality video (e.g. The printed collection of the entire Library of Congress -> 10TB)
Petabyte: 500 billion pages of standard printed text (e.g. The amount of data processed by Google daily -> 20 PB)
Exabyte: 2 million personal computers (e.g. Total data held by Google -> 15 EB)
Zettabyte: 250 billion DVDs
Yottabyte: Size of the entire World Wide Web (It would take approximately 11 trillion years to download a Yottabyte file from the internet using high-powered broadband.)
Brontobyte: 1 followed by 27 zeros!
Geopbyte: No one knows why this term was created. It is highly doubtful that anyone alive today will EVER see a Geopbyte hard drive.
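The scale above can be made concrete with a small helper that converts a raw byte count into a human-readable unit. This is a minimal sketch; it uses binary (base-1024) units, while storage vendors often use base-1000, and the function name is made up for this illustration:

```python
# Convert a raw byte count into a human-readable string.
# Uses binary (base-1024) steps between units.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes: float) -> str:
    for unit in UNITS:
        # Stop once the value fits in the current unit (or we run out of units).
        if num_bytes < 1024 or unit == UNITS[-1]:
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1024

print(human_readable(11))            # the "Hello World" string -> 11.0 B
print(human_readable(20 * 1024**5))  # roughly Google's daily volume -> 20.0 PB
```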
Volume: how much data we have. Data production was projected to be 44 times larger in 2020 than in 2009.
Velocity: the speed at which data is generated and becomes accessible.
Variety: data comes in many forms (structured, semi-structured, unstructured).
Metadata is data that describes other data
A data model refers to the logical inter-relationships and data flow between different data elements involved in the information world.
A shared data model makes it easier for end users, city databases, IoT (Internet of Things) platforms, and third-party apps to exchange information.
Exchanging data together with an ontology allows both users and AI systems to interpret its meaning.
Variability refers to data whose meaning is constantly changing.
Veracity: the messy, noisy nature of Big Data, and the amount of work that goes into producing an accurate dataset before analysis can even begin.
Visualization: how to show the data
Value of Data: the final step; after all of the above, you want to extract value from your data.
Agricultural companies, governments, organizations, and researchers (from academia and industry) generate, maintain, and use huge amounts of data related to agricultural production, weather and climate, insurance, marketing, supply chains, packaging, distribution, etc.
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams (a bounded stream is finite, e.g. a stored dataset; an unbounded stream has no defined end).
Apache Spark is a unified analytics engine for large-scale data processing.
Apache Zeppelin is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents.
Apache Kafka is used for building real-time data pipelines and streaming apps.
Apache Storm is a distributed stream-processing computation framework.
Apache NiFi is a framework to automate the flow of data between software systems.
Apache Cassandra is a free and open-source distributed wide column store NoSQL database management system.
Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation in batch processing.
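Hadoop's batch model (MapReduce) can be illustrated without a cluster: a map step emits key-value pairs and a reduce step aggregates them per key. Below is a minimal single-machine sketch in plain Python; the function names and sample documents are invented for this illustration, and a real Hadoop job would distribute both phases across many machines:

```python
from collections import defaultdict

def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in the input."""
    for word in document.lower().split():
        yield word, 1

def reduce_phase(pairs):
    """Reduce: sum the counts for each distinct word (the key)."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["big data big cloud", "cloud computing"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(pairs))  # {'big': 2, 'data': 1, 'cloud': 2, 'computing': 1}
```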
OLAP: Online Analytical Processing
RDBMS: Relational DB Management System
Analytics: making inferences over large sets of data.
Descriptive Analytics: insight into the past (what happened)
Diagnostic Analytics: explains why it happened
Predictive Analytics: anticipates what is likely to happen in the future
Prescriptive Analytics: advises on possible actions and their likely outcomes
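The difference between descriptive and predictive analytics can be shown on a toy yearly yield series (the numbers are invented for this sketch): describing the past is an aggregate, while predicting the future is an extrapolation, here via a simple least-squares trend line, one of many possible models:

```python
# Toy yearly yields (hypothetical numbers, for illustration only).
yields = [3.0, 3.2, 3.1, 3.5, 3.6]

# Descriptive: summarize what happened.
mean_yield = sum(yields) / len(yields)

# Predictive: fit a least-squares line y = a + b*x and extrapolate one step.
n = len(yields)
xs = range(n)
x_mean = sum(xs) / n
b = sum((x - x_mean) * (y - mean_yield) for x, y in zip(xs, yields)) \
    / sum((x - x_mean) ** 2 for x in xs)
a = mean_yield - b * x_mean
forecast = a + b * n  # predicted yield for the next year

print(f"mean={mean_yield:.2f}, next-year forecast={forecast:.2f}")
# -> mean=3.28, next-year forecast=3.73
```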
Dark Data: all the data collected by companies but never processed or analyzed
Data Lake: a large repository of data kept in its raw (unprocessed, native) format
Data Mining: the activity of finding patterns and deriving insight from large sets of data
Data Scientist: a person who can make sense of huge amounts of data
Machine learning is a method of data analysis that automates analytical model building. It is a type of artificial intelligence (AI) that enables software applications to become more accurate at forecasting outcomes without being explicitly programmed to do so.
Artificial intelligence is the simulation of human intelligence processes by machines, especially computer systems.
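As a minimal illustration of "learning from data rather than explicit rules", here is a one-nearest-neighbour classifier in plain Python. The toy feature points and labels are invented for this sketch, and 1-NN is just one of the simplest possible learned models:

```python
import math

# Toy training set: (features, label). The numbers are invented.
train = [((1.0, 1.0), "low"), ((1.2, 0.9), "low"),
         ((4.0, 4.2), "high"), ((3.8, 4.0), "high")]

def predict(query):
    """1-nearest-neighbour: return the label of the closest training point.
    The model is 'trained' simply by storing the data, not by hand-written rules."""
    features, label = min(train, key=lambda pair: math.dist(pair[0], query))
    return label

print(predict((1.1, 1.0)))  # -> low
print(predict((4.1, 4.1)))  # -> high
```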