Azure Synapse - first contact
Although I've spent the last two years working with the Microsoft Azure cloud tech stack, I hadn't had a chance to work with Azure Synapse yet. I've finally met this guy, and the first impression is very good.
Azure Synapse Workspace
To get started with Azure Synapse you need to create an Azure Synapse Workspace. If you are already familiar with Azure, you may know the Azure Log Analytics Workspace used to examine logs and the operation of your applications and resources. It's the same idea here. The workspace is created within an Azure resource group and becomes the landing page for its settings, analytics pools, security, and monitoring aspects. The Azure Synapse Workspace requires a storage account to be created or selected under the hood.
The storage account is used to build Azure Data Lake Storage Gen2 for the files you manage yourself (sample data for processing) or that Azure Synapse manages for you (e.g. PySpark-trained models are saved here as well).
The better place to work with Azure Synapse is Synapse Studio. It's a dedicated online portal, more specialized than the Azure Synapse Workspace view in the Azure portal. Here you can define your analytics pools, design pipelines, manage connections, and even write code.
Development
Everyone familiar with Python and Jupyter will appreciate the notebook as the fundamental place to test and develop code. The supported languages are PySpark (Python), Spark (Scala), .NET Spark (C#), and Spark SQL. Before running the code, a notebook has to be attached to a managed pool, which represents a scalable runtime.
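To give a feeling of how it looks, a minimal PySpark notebook cell might be something like the sketch below. I'm assuming a sample CSV uploaded to the workspace's primary Data Lake Storage Gen2 account; the storage account name, container, and path are placeholders.

```python
# Runs inside a Synapse notebook attached to a Spark pool,
# where the `spark` session is already provided by the runtime.
# The storage account, container, and path below are hypothetical placeholders.
adls_path = "abfss://data@mystorageaccount.dfs.core.windows.net/samples/people.csv"

df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(adls_path)
)

df.printSchema()
display(df.limit(10))  # display() is available in Synapse notebooks
```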
You can find your analytics pools within the Azure Synapse Workspace. At least one SQL pool will already be created (as part of the Azure Synapse Workspace creation). It's an SQL database, but it's serverless, so you pay only for actual usage.
To run your scripts you may need to create a Spark pool, which represents several preconfigured nodes (machines) with preinstalled runtimes for running your notebook scripts. Luckily, we are charged only when we use those machines, so if we don't use them the pool is automatically paused (a 15-minute idle timeout is enabled by default).
If you prefer to use SQL scripts, Apache Spark job definitions, or to play with Data Flows using the graphical interface, you can do that too. The notebooks work like a charm, especially when you want to write and immediately test the application using "Jupyter" within Azure Synapse Studio. You may want to store the actual code within an Apache Spark job definition (as PySpark/Spark/.NET Spark script files persisted in Azure Storage), to extract the code from the notebook and make it testable (unit tests) and maintainable (Git repo) in a classical manner, as sketched below.
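As a rough illustration of what that extraction could look like, here is a sketch of a standalone PySpark script that could back an Apache Spark job definition. The file name, paths, and transformation are made up; the point is the explicit SparkSession and a small function that can be unit-tested without Azure.

```python
# process_people.py - a hypothetical script referenced by an Apache Spark job definition.
# Unlike a notebook, the SparkSession has to be created explicitly here.
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F


def count_by_country(df: DataFrame) -> DataFrame:
    """Pure transformation that can be unit-tested without any Azure dependency."""
    return df.groupBy("country").agg(F.count("*").alias("people"))


def main() -> None:
    spark = SparkSession.builder.appName("process-people").getOrCreate()

    # Input/output paths are placeholders for files in the workspace data lake.
    source = "abfss://data@mystorageaccount.dfs.core.windows.net/samples/people.csv"
    target = "abfss://data@mystorageaccount.dfs.core.windows.net/curated/people_by_country"

    df = spark.read.option("header", "true").csv(source)
    count_by_country(df).write.mode("overwrite").parquet(target)


if __name__ == "__main__":
    main()
```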
Git integration with Synapse
Azure Synapse likes to manage Git branches on its own. You may be surprised if you define your SDLC process up front, or if it's already defined by company policies. The feature branches and the master (collaboration) branch have clearly defined roles: you can commit code changes to a feature branch, but you can only publish from master. Once something is published, it's live (Live mode).
The publishing process collects information about pipelines, datasets, data flows, integration runtimes, linked services, triggers, notebooks, Spark job definitions, and SQL scripts.
"By default, Synapse Studio generates the workspace templates and saves them into a branch called workspace_publish. To configure a custom publish branch, add a publish_config.json file to the root folder in the collaboration branch. When publishing, Synapse Studio reads this file, looks for the field publishBranch, and saves workspace template files to the specified location. If the branch doesn't exist, Synapse Studio will automatically create it. "
I still need to understand better how to combine this approach with a classical release process (using Azure DevOps pipelines), multiple environments (DEV, SIT, UAT... PRD), and approval flows. We may want to treat the workspace templates as the actual deployment artifacts. To configure linked services we may need custom parameters depending on the target environment (see: Azure Key Vault). It sounds like an area where I'll struggle a bit with the predefined flow.
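One piece that does look straightforward is reading environment-specific secrets from Azure Key Vault inside a notebook via mssparkutils. A minimal sketch, assuming one Key Vault per environment and a hypothetical linked service pointing at it:

```python
# Inside a Synapse notebook; mssparkutils is provided by the Synapse runtime.
from notebookutils import mssparkutils

# Hypothetical names - one Key Vault per environment (DEV/SIT/UAT/PRD),
# referenced through a linked service, so only the vault name changes per stage.
key_vault_name = "kv-myproject-dev"
linked_service = "ls_keyvault"

sql_password = mssparkutils.credentials.getSecret(
    key_vault_name, "sql-admin-password", linked_service
)
```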
Faced issues
I'm still a bit confused by the Git integration mechanisms and the promotion cycle. It shouldn't be an issue for someone with experience working with Power BI (publishing to a shared workspace) or Qlik Sense.
Azure Synapse is pretty new (even if it uses well-known pieces like Azure Data Factory), and in some cases there are integration gaps. One of them is the SQL Spark connector driver, which doesn't support Azure Synapse yet. The appropriate changes have already been placed in the project's master branch, but the latest version of the library was released in October 2020. Nevertheless, it's not a problem with Azure Synapse itself.
"There are many ways to skin a cat".
How much does it cost?
Given that you can use the serverless analytics pools, it might be relatively cheap: a couple of dollars per hour. Using Azure Synapse for my private training, I used to spend ~$10 per day, which I didn't find dirt cheap (the overall Azure usage quota for a VS Professional subscription is only $50), but there's good news...
...There's a limited-time free quantities offer for Azure Synapse Analytics (until July 31, 2021).
"*Free quantities apply only to the following subscription types: Pay-as-you-go, Microsoft Azure Enterprise, Microsoft Azure Plan, Azure in CSP, or Enterprise Dev or Test. These included free quantities aggregate at the enrollment level for enterprise agreements and the subscription level for pay-as-you-go subscriptions."
https://azure.microsoft.com/en-us/blog/limited-time-free-quantities-offer-for-azure-synapse-analytics/
Summary
I'm extremely excited about Azure Synapse and eager to learn more. Next time I will try to share my experience using more tangible examples with some PySpark code.