Summary
Data Engineering
Topic | Main Points |
---|---|
Processing Techniques vs. Traditional Roles | New processing techniques are challenging traditional technology roles. They are changing the day-to-day work of data professionals. |
Types of Data | There are two broad types of data: structured and unstructured. |
Relational Databases | In relational database systems like Microsoft SQL Server, Azure SQL Database, and Azure SQL Data Warehouse, data is defined in tables. |
Non-Relational Systems | Unstructured data is stored in non-relational or NoSQL systems. |
Data Engineers and Data Types | Data engineers work with unstructured data and various new data types, including streaming data. |
Data Extraction and Transformation | Data engineers extract raw data from structured or unstructured data pools. They transform data from the source schema to the destination schema. |
Transformation Processes | ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are two approaches to data transformation. |
Disadvantages of ETL | ETL transformation can take a long time and tie up source system resources. |
Migration to Cloud | Factors to consider when migrating from on-premises to cloud-based solutions. |
Roles in Data Industry | Development of data engineer, data scientist, and artificial intelligence engineer roles. |
Data Project Phases | High-level architecture example following five data project phases: sourcing, ingest, preparation, analysis, and consumption. |
Azure Storage Accounts | Base storage type within Azure with four configuration options: blob, files, queue, and table. |
Data Ingestion Tools | Introduction to tools for ingesting data into Azure storage. |
Considerations for Optimal Storage | Key factors to consider when choosing the optimal storage solution. |
Features of Azure Storage Accounts | Scalability, security, durability, high availability, and Azure's management of maintenance and critical issues. |
Data Encryption in Azure Storage | How Azure storage encrypts data and provides access control. |
Azure Data Lake Storage | Overview of Data Lake Storage with Hadoop-compatible data repository and compute capabilities. |
Data Ingestion and Querying | Methods for data ingestion and querying using tools like Azure Data Factory, Apache Sqoop, and U-SQL. |
Azure Cosmos DB | Globally distributed multi-model database with strengths in uptime, replication, and consistency. |
Usage and Deployment of Cosmos DB | Deployment options, data ingestion, and query capabilities of Azure Cosmos DB. |
Security Features in Cosmos DB | Data encryption, IP firewall configurations, and access control from virtual networks. |
Azure SQL Database (PaaS) | Managed relational database service supporting structured and unstructured data. |
Features of Azure SQL Database | Comprehensive security, scalability, and support for OLTP systems. |
Data Ingestion and Querying in SQL DB | Ingestion through application integration, querying using T-SQL, and security features. |
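
The difference between the ETL and ELT approaches listed above can be illustrated with a small sketch. It is only an illustration under stated assumptions: the standard-library `sqlite3` module stands in for the destination data store, and the sample rows, table names, and the `clean` helper are hypothetical rather than part of any Azure service.

```python
import sqlite3

# Hypothetical raw records from a source system (values are illustrative).
RAW_ROWS = [("  Alice ", "42.50"), ("BOB", "17.00"), ("carol", "8.25")]


def clean(name, amount):
    """Transform one source row into the destination schema."""
    return name.strip().title(), float(amount)


def run_etl(conn):
    """ETL: transform in the pipeline, then load only the finished rows."""
    conn.execute("CREATE TABLE sales_etl (customer TEXT, amount REAL)")
    transformed = [clean(name, amount) for name, amount in RAW_ROWS]
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", transformed)


def run_elt(conn):
    """ELT: load the raw rows as-is, then transform inside the destination."""
    conn.execute("CREATE TABLE sales_raw (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", RAW_ROWS)
    conn.execute(
        """
        CREATE TABLE sales_elt AS
        SELECT upper(substr(trim(customer), 1, 1))
               || lower(substr(trim(customer), 2)) AS customer,
               CAST(amount AS REAL) AS amount
        FROM sales_raw
        """
    )


conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)
etl_rows = conn.execute(
    "SELECT customer, amount FROM sales_etl ORDER BY customer"
).fetchall()
elt_rows = conn.execute(
    "SELECT customer, amount FROM sales_elt ORDER BY customer"
).fetchall()
print(etl_rows == elt_rows)
```

Both paths end with the same cleaned rows. What differs is where the transformation work runs: in the pipeline before loading (ETL) or in the destination engine after loading (ELT).
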
Data Storage
Topic | Main Points |
---|---|
Choosing the Right Storage Solution | Considerations for selecting the optimal storage solution for various datasets. |
Data Classification | Data classification into structured, semi-structured, and unstructured categories. |
Data Serialization Languages | Use of serialization languages like JSON, XML, and YAML for semi-structured data exchange. |
Factors for Choosing Storage Solutions | Consideration of data type, operations, latency, and transactional support. |
ACID Guarantees | Explanation of ACID guarantees (Atomicity, Consistency, Isolation, Durability) in transactions. |
Azure Services for Data Storage | Introduction to Azure SQL Database, Azure Analysis Services, and Azure Cosmos DB. |
Azure Blob Storage | Use cases for Azure Blob Storage, including tiered storage and integration with CDN. |
Other Azure NoSQL Storage Options | Mention of Azure Table Storage, Apache HBase on Azure HDInsight, and Azure Cache for Redis for NoSQL data storage. |
Azure Storage Accounts | Overview of storage accounts, their settings, and considerations for their usage. |
Creating Storage Accounts | Explanation of tools for creating storage accounts, including Azure Portal and Azure CLI. |
Microsoft Azure Storage Overview | Features of Microsoft Azure storage, including durability, security, and scalability. |
Types of Azure Data Services | Description of Azure Blob Storage, Azure Files, Azure Queue Service, and Azure Table Storage. |
Azure Storage REST API | Use of the Azure Storage REST API for operating on containers and data in storage accounts. |
Client Libraries for Azure Storage | Availability of client libraries in various languages and frameworks for faster development. |
Accessing Azure Storage | Connecting apps to Azure storage accounts using access keys, REST API endpoints, and connection strings. |
Secure Access Key Management | Best practices for managing secure access keys, including rotation and shared access signatures. |
Encryption in Azure Storage | Explanation of Storage Service Encryption (SSE) and transport level security for data encryption. |
Access Control and Role-Based Access | Use of role-based access control (RBAC) with Azure Active Directory for resource and data operations. |
Storage Analytics and Logging | Introduction to Storage Analytics, real-time logs, and their filtering and search capabilities. |
Threat Protection in Azure Storage | Overview of Azure Defender for Storage, security alerts, and integrated threat mitigation. |
Security Features in Azure Data Lake | Security features in Azure Data Lake Storage, including authentication schemes and management tools. |
Blob Storage API and Usage | Overview of Blob Storage API, supported operations for blobs and containers, and data organization. |
Types of Blobs and Their Usage | Explanation of block blobs, append blobs, and page blobs, and their appropriate use cases. |
Designing Data Storage in Azure | Considerations for organizing data across storage accounts, containers, and blobs. |
Preparing for Certification | Encouragement to take a practice exam for certification preparation. |
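
The atomicity part of the ACID guarantees mentioned above can be shown with a minimal sketch. It assumes a local `sqlite3` in-memory database in place of a cloud database; the `accounts` table and the `transfer` helper are hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move funds atomically: both updates commit, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a name -> balance dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok = transfer(conn, "a", "b", 30)       # both updates are applied
failed = transfer(conn, "a", "b", 500)  # debit violates the CHECK constraint
print(ok, failed, balances(conn))
```

In the second call the credit to `b` runs first and the debit from `a` then fails, so the whole transaction is rolled back and neither balance changes.
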
Data Integration with Microsoft Azure Data Factory
Topic | Main Points |
---|---|
Evolving World of Data | Changes in the data world, including new technologies and rule adjustments. |
Adaptation as a Data Engineer | The importance of data engineers understanding and adapting to these changes. |
ELT and ETL Processes | Explanation of ELT and ETL processes in data platforms. |
Shift towards ELT Approach | How developments in data platform technologies favor an ELT approach. |
Holistic Data Project Approach | The shift towards predictive and preemptive analytics and the need for a holistic data project view. |
Healthcare IoT Use Case | An example of an IoT deployment in healthcare and its impact on data engineers. |
Azure Data Factory Tools | Overview of Azure Data Factory and its data transformation and cleansing capabilities. |
Data Transformation in ADF | Common data transformation and cleansing activities within Azure Data Factory. |
Orchestration in Azure Data Factory | How Azure Data Factory orchestrates data movements and transformations. |
Data Factory Control Flow | Exploration of Data Factory control flow, pipelines, debugging, and parameters. |
SSIS in Azure Data Factory | The integration of SQL Server Integration Services (SSIS) with Azure Data Factory. |
Azure DevOps and GitHub Integration | How Azure Data Factory integrates with Azure DevOps and GitHub for source control and CI/CD. |
Data Integration with Azure Data Share | Simplifying data integration from multiple sources using Azure Data Share and Data Factory. |
Completing Data Integration | Overview of data integration at scale with Azure Data Factory and Azure Synapse pipelines. |
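
The idea of orchestrating activities in a parameterized pipeline can be sketched in plain Python. This is not the Azure Data Factory SDK or its pipeline JSON format; the activity names, the `PIPELINE` dictionary, and the `source` parameter are hypothetical, and the sketch only shows activities running in dependency order.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: activity name -> activities it depends on.
PIPELINE = {
    "copy_raw": set(),
    "clean": {"copy_raw"},
    "aggregate": {"clean"},
    "notify": {"aggregate"},
}


def run_pipeline(pipeline, params):
    """Run activities in dependency order and return the execution log."""
    log = []
    for activity in TopologicalSorter(pipeline).static_order():
        # A real activity would copy or transform data here.
        log.append(f"{activity}(source={params['source']})")
    return log


log = run_pipeline(PIPELINE, {"source": "sales.csv"})
print(log)
```

Passing a different `source` value reuses the same pipeline definition for another input, which is the role parameters play in a pipeline.
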
Azure Synapse Analytics
Topic | Main Points |
---|---|
Azure Synapse Analytics Overview | Introduction to Azure Synapse Analytics as a unified platform for various data-related tasks. |
Capabilities of Azure Synapse Analytics | Explanation of the capabilities, including SQL pools, Spark pools, data integration, and visualization. |
Use Cases for Azure Synapse Analytics | Common use cases for Azure Synapse Analytics, such as data warehousing, analytics, and integration. |
Components of Azure Synapse Analytics | Introduction to components like Azure Synapse Workspace, data warehouse, and data virtualization. |
Data Warehousing in Azure Synapse | Overview of data warehousing, its role, and key points related to data extraction and transformation. |
Apache Spark in Azure Synapse | Explanation of Apache Spark usage within Azure Synapse Analytics via Spark pools. |
Azure Synapse Pipelines | Introduction to Azure Synapse Pipelines for cloud-based ETL and data integration workflows. |
Azure Synapse Studio Overview | Overview of Azure Synapse Studio as a web UI for data exploration, development, and management. |
Hubs in Azure Synapse Studio | Explanation of different hubs within Azure Synapse Studio for various data-related tasks. |
Analytical Processes in Azure Synapse | Overview of the analytical processes, data ingestion, preparation, and data shaping in Synapse. |
Building a Modern Data Warehouse | Explanation of modern data warehouse architecture, including staging areas and data formats. |
Staging Area in Data Warehousing | Reasons for adding a staging area, including reducing dependencies and handling source systems. |
Data Storage in a Data Warehouse | Recommendations for data format (Parquet) and best practices for Azure Data Lake Storage usage. |
Azure Synapse Apache Spark Pools
Topic | Main Points |
---|---|
Building a Modern Data Warehouse | Explanation of the process, including data ingestion, preparation, and data accessibility. |
Best Practices for Azure Data Lake Storage | Considerations when working with Azure Data Lake Storage, including data structure and file sizes. |
Star Schema Design | Designing a star schema and distinguishing dimension and fact tables. |
Data Loading in Azure Synapse Analytics | Importance of data loading, minimizing impact, and tools for loading data into Synapse Analytics. |
Managing Data Workloads | Managing resource availability, workload importance, and optimizing performance in Synapse Analytics. |
Table Distribution and Indexing | Impact of table distribution on data load and query performance, and indexing strategies. |
Materialized Views | Improving query performance with materialized views and read committed snapshot isolation levels. |
Query Optimization Techniques | Techniques like result set caching, approximate execution, and stored procedures for optimization. |
Compute Resource Management | Pausing and resuming compute resources to reduce costs and utilizing Azure Advisor recommendations. |
Columnstore Index and Materialized Views | Exploring the benefits of columnstore indexes and materialized views in Synapse SQL pools. |
Logged Operations and Efficiency | Differentiating fully logged and minimally logged operations for performance and efficiency. |
Security and Authentication | Configuring authentication, network security, and securing keys using Azure Key Vault. |
Authorization and Row-Level Security | Managing authorization through column and row-level security in Azure Synapse Analytics. |
Encryption and Transparent Data Encryption | Implementing encryption with Transparent Data Encryption (TDE) to protect Synapse Analytics. |
Big Data Engineering with Spark | Introduction to Apache Spark and its role in Azure Synapse Analytics. What Apache Spark pools are, their purpose, and benefits. How Apache Spark applications work within Azure Synapse Analytics. Creation and management of Spark pools in Azure Synapse Analytics Studio. Use cases for Spark pools in various Azure services. |
Query Pools and Workload Management | Integration of SQL and Apache Spark pools in Azure Synapse Analytics. The Azure Synapse Apache Spark to Synapse SQL connector and its capabilities. Benefits of interoperability between Apache Spark and SQL in data exploration, loading, and sharing. |
Workload Monitoring | Monitoring Spark pools and Azure Synapse Analytics using the Monitor Hub. Identifying poorly performing Spark pool runs and areas for optimization. Optimization strategies, including data format, caching, memory efficiency, bucketing, and job execution. |
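
The effect of table distribution on data placement can be sketched as follows. The sketch assumes 60 distributions, as in a Synapse dedicated SQL pool, but the hash function is a stand-in (SHA-256 from the standard library), not the one Synapse uses, and the customer keys are made up.

```python
import hashlib
from collections import Counter

DISTRIBUTIONS = 60  # assumed number of distributions per table


def hash_distribution(key):
    """Map a distribution-column value to a distribution (illustrative hash)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % DISTRIBUTIONS


def round_robin(row_index):
    """Assign rows to distributions in turn, ignoring their content."""
    return row_index % DISTRIBUTIONS


rows = [f"customer-{i % 500}" for i in range(6000)]
hash_counts = Counter(hash_distribution(key) for key in rows)
rr_counts = Counter(round_robin(i) for i in range(len(rows)))

# The same key always lands on the same distribution under hashing.
same_key_same_place = len({hash_distribution("customer-7") for _ in range(3)}) == 1
print(same_key_same_place, min(rr_counts.values()), max(rr_counts.values()))
```

Round-robin spreads rows perfectly evenly, which suits fast loading. Hash distribution keeps rows with the same key together, which can help joins and aggregations on that key but risks skew if a few keys dominate.
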
Operational Analytics with Azure Synapse Analytics
Explanation of how Azure Cosmos DB uses Hybrid Transactional and Analytical Processing (HTAP). Azure Synapse Analytics features for querying the analytical store with SQL and Apache Spark. Introduction to Azure Synapse Link for Azure Cosmos DB as a cloud-native HTAP capability.
Benefits of Azure Synapse Link for Azure Cosmos DB, with use cases such as supply chain analytics and IoT.
Topic | Main Points |
---|---|
Data Partitioning and Querying | Separation of transactional and analytical stores in Azure Cosmos DB. Partitioning data based on partition keys for efficient querying. Embedding entities within an array to optimize transactional data models. |
Configuring and Enabling Azure Synapse Link | Steps to configure and enable Azure Synapse Link for Azure Cosmos DB. |
Querying Azure Cosmos DB | Performing analytics and queries using Azure Synapse Link and serverless SQL pools. Aggregations, cross-container queries, complex JSON queries, windowing functions, and data visualization. |
Writing Data Back to Cosmos DB | Writing data back to the Azure Cosmos DB transactional store. Reading data from the transactional store. |
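
Partitioning by key and embedding entities within an array can be illustrated with plain Python dictionaries. This is a sketch of the data model only and does not use the Azure Cosmos DB SDK; the order documents and the choice of `customerId` as the partition key are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical order documents; each embeds its line items in an array.
orders = [
    {"id": "o1", "customerId": "c1",
     "items": [{"sku": "pen", "qty": 2}, {"sku": "ink", "qty": 1}]},
    {"id": "o2", "customerId": "c2",
     "items": [{"sku": "pad", "qty": 5}]},
    {"id": "o3", "customerId": "c1",
     "items": [{"sku": "pen", "qty": 1}]},
]


def group_by_partition(docs, key):
    """Group documents into logical partitions by their partition key value."""
    partitions = defaultdict(list)
    for doc in docs:
        partitions[doc[key]].append(doc)
    return dict(partitions)


partitions = group_by_partition(orders, "customerId")

# A query filtered on the partition key only needs one logical partition,
# and the embedded items arrive with each order in a single read.
c1_qty = sum(item["qty"] for doc in partitions["c1"] for item in doc["items"])
print(c1_qty)
```
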
Azure Databricks
Azure Databricks is a notebook-oriented Apache Spark service in Azure. It provides a single platform for cluster management and interactive data exploration. Databricks improves Spark performance and reduces costs when running on Azure.
Topic | Main Points |
---|---|
Azure Databricks Architecture | Apache Spark functions through parallelism and clusters managed by a cluster manager. Azure Databricks supports data handling tasks like reading, writing, and querying. |
Data Formats and Parquet Files | Reading and writing data in Databricks requires knowledge of various file formats. Importance of working with Parquet files in the Databricks file system. |
Data Processing with DataFrames | DataFrames in Azure Databricks simplify data exploration and transformations. |
Performance Features in Azure Databricks | Catalyst optimizer stages: logical plan analysis, optimization, physical planning, and code generation. |
Security and Data Protection in Databricks | Databricks integrates with Azure services and offers high-level security. Access management using Azure role-based access control. |
Delta Lake for Large Data | Delta Lake usage for managing large volumes of data. Capabilities such as table creation, appending, upserting, and optimizations. |
Azure Databricks for Streaming Data | Analyzing and processing streaming data using Apache Spark Structured Streaming. |
Integration with Azure Services | Azure Data Factory for executing workloads from Databricks. Creating data architecture to integrate multiple services. |
Azure DevOps for CI/CD | Using Azure DevOps for Continuous Integration (CI) and Continuous Deployment (CD). Building release pipelines and source code repository for Databricks notebooks. |
Integration with Azure Synapse Analytics | Integration of Databricks with Azure Synapse Analytics using SQL Data Warehouse Connector. |
Best Practices for Databricks | Following best practices for workspace administration, security, tools, integration, and more. |
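
The upsert capability listed for Delta Lake can be illustrated with a plain-Python sketch of the semantics: rows whose key matches are updated, and the rest are inserted. This is not the Delta Lake API; the `upsert` function and the sample rows are hypothetical, and a real table would use a merge operation on the table itself.

```python
def upsert(target, updates, key):
    """Merge updates into target: update matching keys, insert new ones."""
    merged = {row[key]: dict(row) for row in target}
    for row in updates:
        # Existing columns are kept unless the update supplies a new value.
        merged[row[key]] = {**merged.get(row[key], {}), **row}
    return sorted(merged.values(), key=lambda r: r[key])


target = [
    {"id": 1, "name": "Ann", "city": "Oslo"},
    {"id": 2, "name": "Ben", "city": "Rome"},
]
updates = [
    {"id": 2, "city": "Turin"},               # matches id 2: update
    {"id": 3, "name": "Cy", "city": "Lima"},  # no match: insert
]
result = upsert(target, updates, "id")
print(result)
```
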
Azure Data Lake Storage Gen2 and Data Streaming Solution
Azure Data Lake Storage is a scalable and cost-effective data lake solution for big data analytics in Azure. It provides a repository for storing large amounts of unstructured data for high-performance analytics.
Topic | Main Points |
---|---|
Benefits of Data Lake Storage Gen2 | Benefits include Hadoop-compatible access, security, performance, and data redundancy. |
Creating Azure Storage Account | Creating an Azure storage account using the Azure portal. |
Azure Blob Storage vs. Data Lake Storage Gen2 | Comparing Azure Blob Storage and Azure Data Lake Storage Gen2 and finding compatibility. |
Common Stages in Big Data Processing | Overview of the four common stages: ingestion, storage, preparation and training, and modeling and serving. |
Supported Open Source Platforms | Supported platforms using Azure Data Lake Storage Gen2 for real-time analytical solutions. |
Security Features of Data Lake Storage Gen2 | Exploring security features, including enterprise-class security, access control, and encryption. |
Data Streams and Event Processing | Explanation of data streams, event processing, and the need for time-based computations. |
Processing Events with Azure Stream Analytics | Learning how to process events using Azure Stream Analytics. |
Azure Event Hubs | Understanding the role of Azure Event Hubs in managing event streams. |
Configuring and Evaluating Azure Event Hubs | Creating and configuring event hubs and evaluating performance using the Azure portal. |
Stream Processing | Overview of stream processing and its role in analyzing and deriving insights from data streams. |
Azure Stream Analytics | Learning to process streaming data with Azure Stream Analytics, including operational aspects. |
Windowing Functionality | Introduction to the five main types of windowing functionality for data aggregation. |
Visualizing Aggregated Data in Power BI | How to visualize the results of aggregated data in Power BI at the end of the stream processing pipeline. |
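
Time-based windowing can be sketched with a tumbling window, which is one of the windowing types: fixed-size, non-overlapping intervals over the event stream. The code below is plain Python rather than the Azure Stream Analytics query language, the events are made up, and it simplifies window boundaries by assigning each event to the window whose start is at or before its timestamp.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per fixed, non-overlapping window, keyed by window start."""
    counts = defaultdict(int)
    for timestamp, _payload in events:
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))


# Hypothetical (timestamp_in_seconds, payload) events.
events = [(1, "a"), (4, "b"), (9, "c"), (10, "d"), (12, "e"), (25, "f")]
counts = tumbling_window_counts(events, 10)
print(counts)
```

Each event falls into exactly one window, so the per-window counts add up to the total number of events. Overlapping window types would count some events more than once.
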