Google Certified Professional - Data Engineer
Job Role Description
A Google Certified Professional - Data Engineer enables data-driven decision making by collecting,
transforming, and visualizing data. The data engineer should be able to design, build, maintain, and
troubleshoot data processing systems with a particular emphasis on the security, reliability,
fault-tolerance, scalability, fidelity, and efficiency of such systems. The data engineer should also be able
to analyze data to gain insight into business outcomes, build statistical models to support
decision-making, and create machine learning models to automate and simplify key business processes.
Certification Exam Guide
Section 1: Designing data processing systems
1.1 Designing flexible data representations. Considerations include:
● future advances in data technology
● changes to business requirements
● awareness of current state and how to migrate the design to a future state
● data modeling
● tradeoffs
● distributed systems
● schema design
1.2 Designing data pipelines. Considerations include:
● future advances in data technology
● changes to business requirements
● awareness of current state and how to migrate the design to a future state
● data modeling
● tradeoffs
● system availability
● distributed systems
● schema design
● common sources of error (e.g., removing selection bias)
1.3 Designing data processing infrastructure. Considerations include:
● future advances in data technology
● changes to business requirements
● awareness of current state and how to migrate the design to a future state
● data modeling
● tradeoffs
● system availability
● distributed systems
● schema design
● capacity planning
● different types of architectures: message brokers, message queues, middleware,
service-oriented
Section 2: Building and maintaining data structures and databases
2.1 Building and maintaining flexible data representations
2.2 Building and maintaining pipelines. Considerations include:
● data cleansing
● batch and streaming
● transformation
● acquire and import data
● testing and quality control
● connecting to new data sources
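The considerations in 2.2 can be illustrated with a very small batch pipeline. This is only a sketch under stated assumptions: the sample data, the column names (user_id, country, amount), the function names, and the 50% drop threshold are all hypothetical and do not come from the exam guide.

```python
import csv
import io

# Hypothetical raw input; a real pipeline would read a file or a message stream.
RAW_CSV = """user_id,country,amount
1, us ,10.50
2,DE,not_a_number
,FR,3.00
3,fr,7.25
"""

def acquire(text):
    """Acquire and import: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def cleanse(rows):
    """Data cleansing: drop rows with a missing id or an unparseable amount."""
    clean = []
    for row in rows:
        user_id = row["user_id"].strip()
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # amount is not numeric
        if not user_id:
            continue  # id is missing
        clean.append({
            "user_id": int(user_id),
            "country": row["country"].strip().upper(),  # normalize the code
            "amount": amount,
        })
    return clean

def transform(rows):
    """Transformation: total amount per country."""
    totals = {}
    for row in rows:
        totals[row["country"]] = totals.get(row["country"], 0.0) + row["amount"]
    return totals

def quality_check(raw_rows, clean_rows, max_drop_ratio=0.5):
    """Quality control: pass only if the share of dropped rows is acceptable."""
    dropped = len(raw_rows) - len(clean_rows)
    return dropped / len(raw_rows) <= max_drop_ratio

raw = acquire(RAW_CSV)
clean = cleanse(raw)
totals = transform(clean)
print(totals, quality_check(raw, clean))
```

Keeping each stage as a separate function makes it possible to test the stages independently and to swap the batch source for a streaming one without touching the cleansing or transformation logic.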
2.3 Building and maintaining processing infrastructure. Considerations include:
● provisioning resources
● monitoring pipelines
● adjusting pipelines
● testing and quality control
Section 3: Analyzing data and enabling machine learning
3.1 Analyzing data. Considerations include:
● data profiling
● data correlation
● patterns and insights
● anomaly detection
● statistical models
● machine learning
● assessing the statistical relevance of conclusions
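As one possible illustration of the anomaly detection and statistical model items in 3.1, the sketch below flags values that lie far from the mean in units of standard deviation (a z-score test). The sample readings, the function name, and the threshold of 2.0 are assumptions chosen for the example. A z-score test is only one simple technique among many.

```python
import statistics

def zscore_anomalies(values, threshold=2.0):
    """Return (index, value) pairs whose absolute z-score exceeds threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values are identical, so nothing stands out
    return [
        (i, v)
        for i, v in enumerate(values)
        if abs(v - mean) / stdev > threshold
    ]

# Hypothetical sensor readings with one obvious outlier at the end.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
anomalies = zscore_anomalies(readings)
print(anomalies)
```

Note that a single extreme value inflates both the mean and the standard deviation, so on small samples a lower threshold or a median-based measure may be a better fit.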
3.2 Transforming data to enable machine learning and pattern discovery. Considerations
include:
● repeatability
● generalization
● distributed computing
● improved model accuracy
3.3 Identifying or building data visualization and reporting tools. Considerations include:
● automation
● decision support
● data summarization
● enabling patterns and insights
Section 4: Modeling business processes for analysis and optimization
4.1 Mapping business requirements to data representations. Considerations include:
● working with business users
● gathering business requirements
4.2 Optimizing data representations, data infrastructure performance and cost.
Considerations include:
● resizing and scaling resources
● data cleansing, distributed systems
● high performance algorithms
● common sources of error (e.g., removing selection bias)
Section 5: Ensuring reliability
5.1 Performing quality control. Considerations include:
● verification
● building and running test suites
● pipeline monitoring
5.2 Assessing, troubleshooting, and improving data representations and data processing
infrastructure.
5.3 Recovering data. Considerations include:
● planning (e.g., fault tolerance)
● executing (e.g., rerunning failed jobs, performing retrospective re-analysis)
● stress testing data recovery plans and processes
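Rerunning failed jobs, mentioned under 5.3, is often handled with a bounded retry loop and exponential backoff. The following is a minimal sketch of that idea, not a prescribed method: the function name run_with_retries, its parameters, and the flaky_job stand-in are hypothetical. The sleep function is injectable so that the example can run without real delays.

```python
import time

def run_with_retries(job, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run job(); on failure wait base_delay * 2**(attempt - 1) and retry.

    The last failure is re-raised once max_attempts is reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))

# Hypothetical job that fails twice before succeeding.
calls = {"count": 0}

def flaky_job():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

delays = []  # record the requested delays instead of actually sleeping
result = run_with_retries(flaky_job, sleep=delays.append)
print(result, delays)
```

Retrying is only safe when the job is idempotent, so that a rerun after a partial failure does not duplicate output.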
Section 6: Visualizing data and advocating policy
6.1 Building (or selecting) data visualization and reporting tools. Considerations include:
● automation
● decision support
● data summarization (e.g., translation up the chain, fidelity, trackability, integrity)
6.2 Advocating policies and publishing data and reports.
Section 7: Designing for security and compliance
7.1 Designing secure data infrastructure and processes. Considerations include:
● Identity and Access Management (IAM)
● data security
● penetration testing
● Separation of Duties (SoD)
● security control
7.2 Designing for legal compliance. Considerations include:
● Health Insurance Portability and Accountability Act (HIPAA), Children’s Online
Privacy Protection Act (COPPA), etc.
● audits