The real-life requirement
Disclaimer: I assume, dear Reader, that you are more than familiar with the general concepts of partitioning and star schema modeling. The intended audience is people who used to be called BI developers in the past (with a good amount of experience), though nowadays they have all sorts of different titles that I can’t keep up with… I won’t provide a full Visual Studio solution that you can download and just run without any changes or configuration, but I will give you code that can be used after parameterizing it for your own environment.
So, with that out of the way, let’s start with some nostalgia: who wouldn’t remember all the nice and challenging partitioning exercises for OLAP cubes? 🙂 If you had a huge fact table with hundreds of millions of rows, doing a full process on the measure group every time was inefficient at best, and more often simply out of the question.
In this example, I have a fact table with 500M+ rows that is updated hourly, and I created monthly partitions. It is a neat solution: the full processing takes about 3–4 minutes every hour, mostly because of some big degenerate dimensions I couldn’t push out of scope. The measure group processing itself is usually 1–2 minutes and mostly involves 1–3 partitions.
I know OLAP is not dead (so it is said), but it is not really alive either. One thing is for sure: it is not available as PaaS (Platform as a Service) in Azure. So, if you want SSAS in the cloud, that’s tabular. I assume migration/redesign from on-premises OLAP cubes to Azure tabular models is not uncommon. In the case of a huge table with an implemented partitioning solution, that solution should be ported as well.
Where Visual Studio provided a decent GUI for partitioning in the OLAP world, that’s not the case for tabular. It feels like a beta development environment that has been mostly abandoned because the focus has shifted to other products (guesses are welcome; I’d say it’s Power BI, but I often find the Microsoft roadmap confusing, especially with how intensely Azure is expanding and claiming an ever-growing chunk of Microsoft’s income).
In short: let’s move that dynamic partitioning solution from OLAP into Azure Tabular!
Goal
The partitioning solution should accommodate the following requirements:
The process of Dynamic Partitioning
Used technology
My solution consists of the below components:
C# scripts inside SSIS utilizing TOM (Tabular Object Model) – used in this solution
No, the second one is not Jerry 🙂 I am not sure the two methods would get on well in that cat-and-mouse relationship…
Let’s get to it, going through the steps from the diagram one-by-one!
Overview
The below objects are used in the solution.
| Object name | Type | Functionality |
| --- | --- | --- |
| ETL_Tabular_Partition_Config | Table | Stores metadata for partitions that is used when defining the new ones |
| ETL_Tabular_Partition_Grain_Mapping | Table | A simple mapping table between conceptual partition periods (e.g. Fiscal Month) and the corresponding Dim_Date column (e.g. Fiscal_Month_Code); this allows partitioning periods to be tuned dynamically |
| Dim_Date | Table | A fairly standard, pre-populated date table |
| ETL_Tabular_Partitions_Required | Table | The master list of changes for partitions, including all that need to be created / deleted / processed (updated) |
| pr_InsertTabularPartitionsRequired | Stored procedure | The heart of the SQL side of dynamic partitioning (details below) |
| ETL_Tabular_Partitions_Existing | Table | A simple list of partitions that currently exist in the deployed database |
| pr_InsertTabularPartitionsExisting | Stored procedure | A simple procedure that inserts a row into ETL_Tabular_Partitions_Existing; it is called from a C# enumerator that loops through the existing partitions of the tabular database |
| Tabular_Partition.dtsx | SSIS package | Orchestrates the different components of the project. In this first step the pr_InsertTabularPartitionsRequired stored procedure is called |
Date configuration
For the date configuration, I use the ETL_Tabular_Partition_Config, ETL_Tabular_Partition_Grain_Mapping and Dim_Date tables. A simplified version for demo purposes:
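The original definitions are not reproduced here; a minimal sketch of what such demo tables might look like (the table names match the list above, but the exact columns are my assumption):

```sql
-- Hypothetical, simplified definitions for demo purposes.
-- All column names are assumptions illustrating the idea.
CREATE TABLE ETL_Tabular_Partition_Config (
    Tabular_Table_Name    nvarchar(128) NOT NULL, -- tabular table to partition
    Partition_Grain       nvarchar(50)  NOT NULL, -- e.g. 'Fiscal Month'
    Partitions_To_Keep    int           NOT NULL, -- how many historical partitions to retain
    Partitions_To_Process int           NOT NULL  -- how many recent partitions to refresh
);

CREATE TABLE ETL_Tabular_Partition_Grain_Mapping (
    Partition_Grain      nvarchar(50)  NOT NULL, -- conceptual period, e.g. 'Fiscal Month'
    Dim_Date_Column_Name nvarchar(128) NOT NULL  -- matching Dim_Date column, e.g. 'Fiscal_Month_Code'
);

-- Dim_Date is a standard pre-populated date dimension with one row per day
-- and the period columns (e.g. Fiscal_Month_Code) referenced by the mapping.
```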
TOM – Tabular Object Model
I chose C# as this script’s language, and the TOM (Tabular Object Model) objects are required to interact with tabular servers and their objects. To use them, some additional references are needed on the server; they are part of the SQL Server 2016 Feature Pack (if you use ADF and the SSIS IR in the cloud, these are available according to the Microsoft ADF team). You can find more info about how to install it here:
And the official TOM Microsoft reference documentation can be very handy:
The part that is related specifically to the partitions:
Variables
The below variables need to be passed from the package to the script:
Make sure you include the above variables, so they can be used in the script later on:
The syntax for referencing them (as it’s not that obvious) is documented here:
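For reference, inside an SSIS script task the read-only variables are accessed through the Dts.Variables collection. A short sketch (the variable names here are just examples following this solution’s naming, not the exact ones from the package):

```csharp
// Read package variables inside the SSIS script task.
// "User::Tabular_Database_Name" and "User::Tabular_ConnStr" are example
// names; use whatever variables you configured in the previous step.
string Tabular_Database_Name = Dts.Variables["User::Tabular_Database_Name"].Value.ToString();
string Tabular_ConnStr = Dts.Variables["User::Tabular_ConnStr"].Value.ToString();
```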
Main functionality
The script itself does nothing else but loop through all existing partitions, calling a stored procedure row by row that inserts the details of each partition into a SQL table.
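A sketch of that enumerator, assuming a Tabular_Table object obtained as shown later in the post, a SQL connection string in Sql_ConnStr, and stored procedure parameter names that are my assumption:

```csharp
using System.Data;
using System.Data.SqlClient;
using Microsoft.AnalysisServices.Tabular;

// Loop through the existing partitions of the tabular table and record
// each one in ETL_Tabular_Partitions_Existing via the stored procedure.
foreach (Partition Existing_Partition in Tabular_Table.Partitions)
{
    using (var Sql_Conn = new SqlConnection(Sql_ConnStr))
    using (var Cmd = new SqlCommand("pr_InsertTabularPartitionsExisting", Sql_Conn))
    {
        Cmd.CommandType = CommandType.StoredProcedure;
        // Parameter names are assumptions for illustration.
        Cmd.Parameters.AddWithValue("@Table_Name", Tabular_Table.Name);
        Cmd.Parameters.AddWithValue("@Partition_Name", Existing_Partition.Name);
        Sql_Conn.Open();
        Cmd.ExecuteNonQuery();
    }
}
```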
All this logic is coded into pr_InsertTabularPartitionsRequired (feel free to use a better name if you dislike this one), and at a high level it does the following:
Gray means T-SQL, white is C# (see the previous section), and dark gray is putting everything together.
Here is the code of my procedure; it works assuming you have the three tables defined previously and have configured the values according to your own databases / tables / columns.
It is mostly self-explanatory, and the inline comments can guide you as well. Some additional comments:
Additionally, a WHERE clause is defined for each partition, which can be used later when it is time to actually create them.
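For example, with a monthly grain mapped to a hypothetical Fiscal_Month_Code column, the rows produced by the procedure might carry filters like these (all names here are assumptions for illustration):

```sql
-- Hypothetical content of ETL_Tabular_Partitions_Required after the
-- procedure runs; Partition_Name and Where_Clause are assumed columns.
--
--   Partition_Name           Where_Clause
--   -----------------------  --------------------------------
--   Fact_Sales - 202003      WHERE Fiscal_Month_Code = 202003
--   Fact_Sales - 202004      WHERE Fiscal_Month_Code = 202004
SELECT Partition_Name, Where_Clause
FROM ETL_Tabular_Partitions_Required;
```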
Again, back to the C# realm.
Code Confusion
One particular inconsistency caught me: I had to spend half an hour figuring out why removing a partition has a different syntax than processing one. It might be totally straightforward for people with a .NET background, but it is different from how T-SQL conceptually works.
Tabular_Table.Partitions.Remove(Convert.ToString(Partition["Partition_Name"]));
Tabular_Table.Partitions[Convert.ToString(Partition["Partition_Name"])].RequestRefresh(RefreshType.Full);
Conceptually
Source query for new partitions
How do you assign the right query to each partition? Yes, we have the WHERE conditions in the ETL_Tabular_Partitions_Required table, with the date filtering that ensures there are no overlapping partitions, but the other part of the query is still missing. For that I use a trick (I am sure you can think of other ways, but I found this one easy to implement and maintain): I have a pattern partition in the solution itself, under source control. It has to be in line with the up-to-date view/table definitions, otherwise the solution can’t be deployed, as the query would be incorrect. I just need to make sure it always stays empty; for that, a WHERE condition like 1=2 is sufficient (as long as the basic arithmetic laws don’t change). It is named “table name – pattern”.
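A pattern partition’s source query might look like this (the view name is a placeholder; only the 1=2 trick is the point):

```sql
-- The pattern partition always returns zero rows (1 = 2 is never true),
-- but it keeps the column list in sync with the underlying view/table,
-- so deployment fails fast if the definitions drift apart.
SELECT *
FROM dbo.vw_Fact_Sales  -- placeholder source view
WHERE 1 = 2;
```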
Then I look for that partition (see the details in the code at the end of the section), extract its source query, strip off the WHERE condition, and then, when looping through the new partitions, I just append the WHERE clause from the ETL_Tabular_Partitions_Required table.
string Tabular_Table_Name = "your table name";
string Tabular_Partition_Pattern_Name = Tabular_Table_Name + " - pattern";

// connect to the tabular model
var Tabular_Server = new Server();
string Tabular_ConnStr = "your connection string";
Tabular_Server.Connect(Tabular_ConnStr);
Database Tabular_Db = Tabular_Server.Databases[Tabular_Database_Name];
Model Tabular_Model = Tabular_Db.Model;
Table Tabular_Table = Tabular_Model.Tables[Tabular_Table_Name];
Partition Pattern_Partition = Tabular_Table.Partitions.Find(Tabular_Partition_Pattern_Name);
Note: I use SQL queries, not M ones, in my sources, but here’s the code that helps you get both types from the tabular database’s partition using .NET, once you have identified the proper partition that contains the pattern:
For SQL
string Partition_Pattern_Query_SQL = ((Microsoft.AnalysisServices.Tabular.QueryPartitionSource)Pattern_Partition.Source).Query;
For M
string Partition_Pattern_Query_M = ((Microsoft.AnalysisServices.Tabular.MPartitionSource)Pattern_Partition.Source).Expression;
Script steps
Now that I have the first half of the SQL query, I have all the building blocks for the last step of the partitioning process:
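The original loop is not reproduced here, but conceptually it could be sketched like this, assuming the objects from the previous snippets, a DataTable Required_Partitions read from ETL_Tabular_Partitions_Required, a Partition_Pattern_Query_Stripped string (the pattern query without its WHERE clause), and Change_Type / Where_Clause columns that are my assumption:

```csharp
using System.Data;
using Microsoft.AnalysisServices.Tabular;

// One iteration per row of ETL_Tabular_Partitions_Required: delete,
// create, or process (refresh) the partition named in the row.
foreach (DataRow Partition_Row in Required_Partitions.Rows)
{
    string Partition_Name = Partition_Row["Partition_Name"].ToString();
    string Change_Type = Partition_Row["Change_Type"].ToString(); // assumed column

    switch (Change_Type)
    {
        case "Delete":
            Tabular_Table.Partitions.Remove(Partition_Name);
            break;
        case "Create":
            var New_Partition = new Partition
            {
                Name = Partition_Name,
                Source = new QueryPartitionSource
                {
                    // reuse the data source of the pattern partition
                    DataSource = ((QueryPartitionSource)Pattern_Partition.Source).DataSource,
                    // pattern query without WHERE + this partition's own filter
                    Query = Partition_Pattern_Query_Stripped + " " + Partition_Row["Where_Clause"]
                }
            };
            Tabular_Table.Partitions.Add(New_Partition);
            Tabular_Table.Partitions[Partition_Name].RequestRefresh(RefreshType.Full);
            break;
        case "Process":
            Tabular_Table.Partitions[Partition_Name].RequestRefresh(RefreshType.Full);
            break;
    }
}
```

Nothing is sent to the server yet at this point; the requests are only queued up on the model object.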
Don’t forget that after the loop the tabular model must be saved; that is when all the previously issued commands are actually executed, at the same time:
Tabular_Model.SaveChanges();
The code bits that you can customize to use in your own environment:
So, by now you should have an understanding of how partitioning works in tabular Azure Analysis Services: not just how the processing can be automated, but also how partitions are created and removed based on configuration data (instead of just defining all the partitions beforehand, e.g. until 2030 for all months).
The scripts, as I said at the beginning, cannot be used exactly as they are, due to the complexity of the Azure environment and because the solution includes more than just a bunch of SQL tables and queries: .NET scripts and Azure Analysis Services as well.
I aimed to use generic and descriptive variable and column names, but it could easily happen that I skipped explaining something that became obvious to me during the development of this solution. In that case, please feel free to get in touch with me using the comments section or by sending an email to
Thanks for reading!