sqllineage, Release 1.5.3
3.1.2 Broader Use Cases
SQL lineage, or data lineage if we include generic non-SQL jobs, is the jewel in the data engineering crown. A well
maintained lineage service can greatly ease the pains that different roles in a data team suffer from.
Here’s a few classical use cases for lineage:
• For Data Producer
– Dependency Recommendation: recommending dependency for jobs, detecting missing or unneces-
sary dependency to avoid potential issues.
– Impact analysis: notifying downstream customer when data quality issue happens, tracing back to
upstream for root cause analysis, understanding how much impact it would be when changing a ta-
ble/column.
– Development Standard Enforcement: detecting anti-pattern like one job producing multiple pro-
duction tables, or temporary tables created without being accessed later.
– Job ETA Prediction and Alert: predicting table delay and potential SLA miss with job running
information.
• For Data Governor
– Table/Column Lifecycle: identifying unused tables/columns, and retiring it.
– GDPR Compliance: propagating the PII columns tag along the lineage path, easing the manually
tagging process. Foundation for later PII encryption and GDPR deletion.
• For Data Consumer
– Understanding data flow: discovering table, understanding the table from table flow perspective.
– Verify business logic: making sure the information I use is sourcing from the correct dataset with
correct transformation.
3.2 How Does SQLLineage Work
Basically a sql parser will parse the SQL statement(s) into AST (Abstract Syntax Tree), which according to wikipedia,
is a tree representation of the abstract syntactic structure of source code (in our case, SQL code, of course). This is
where SQLLineage takes over.
With AST generated, SQLLineage will traverse through this tree and apply some pre-defined rules to extract the parts
we’re interested in. With that being said, SQLLineage is an AST application, while there’s actually more you can do
with AST:
• born duty of AST: the starting point for optimization. In compiler world, machine code, or optionally IR
(Intermediate Representation), will be generated based on the AST, and then code optimization, resulting in
an optimized machine code. In data world, it’s basically the same thing with different words, and different
optimization target. AST will be converted to query execution plan for query execution optimization. Using
strategy like RBO(Rule Based Optimization) or CBO(Cost Based Optimization), the database/data warehouse
query engine outputs an optimized physical plan for execution.
• linter: quoting wikipedia, linter is a static code analysis tool used to flag programming errors, bugs, stylistic
errors and suspicious constructs. Oftentimes the name linter is used interchangeably with code formatter. Famous
tools like flake8 for Python, ESLint for JavaScript are example of real life linters. Golang even provide an official
gofmt program in their standard library. Meanwhile, although not yet widely adopted in data world, we can also
lint SQL code. sqlfluff is such an great tool. Guess how it works to detect a smelly “SELECT *” or a mixture of
leading and trailing commas. The answer is AST!
16 Chapter 3. Behind the scene