Data Today: Matillion

Matillion's Data Productivity Cloud, explained for builders

2026-06-07T00:00:00Z

If you have used Matillion before, you probably picture the old ETL tool that ran on a VM you had to size, patch, and babysit. The Data Productivity Cloud (DPC) is the rebuilt, cloud-native successor, and it changes enough about how pipelines run and bill that it is worth a proper walkthrough. This guide is the starting point for our studio: what the platform actually is, how a pipeline executes, and where the money goes.

Matillion's pitch is that one platform should let three different kinds of builder work on the same pipeline: a low-code analyst dragging components in the Designer, an engineer writing dbt, SQL, or Python, and, increasingly, an AI agent under Matillion's Maia brand. The interesting question for a builder is not the marketing claim but the plumbing underneath: what runs where, what you pay for, and where the platform helps versus gets in the way.

What is the Data Productivity Cloud, concretely?

The DPC is a fully managed, browser-based environment for data integration. You do not run a Matillion instance anymore. Instead, Matillion hosts the control plane and you connect it to your cloud data warehouse, typically Snowflake, Databricks, Amazon Redshift, Google BigQuery, or Microsoft Fabric.

Two kinds of work happen in a DPC project:

Ingestion pulls data from sources (databases, SaaS APIs, files) into your warehouse using prebuilt connectors and change data capture.
Transformation reshapes that data once it lands, and this is the part that matters most for cost.

The key architectural fact is pushdown. When you build a transformation pipeline in the Designer, Matillion does not move rows through its own engine. It compiles your pipeline into SQL and pushes that SQL down to your warehouse, which does the heavy lifting. That single design choice explains most of the platform's cost behaviour, which we will come back to.
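To make pushdown concrete, here is a minimal Python sketch of the idea. It is not Matillion's compiler: the `Filter` and `Aggregate` components, the `compile_to_sql` function, and the table and column names are all hypothetical. The only point is that a pipeline definition collapses into one SQL statement for the warehouse to execute, so no rows pass through the tool that built it.

```python
from dataclasses import dataclass


@dataclass
class Filter:
    condition: str  # SQL predicate, e.g. "order_date >= '2026-01-01'"


@dataclass
class Aggregate:
    group_by: list[str]
    measures: dict[str, str]  # output column name -> SQL aggregate expression


def compile_to_sql(source_table: str, steps: list) -> str:
    """Fold a list of hypothetical components into a single SELECT statement."""
    where = [s.condition for s in steps if isinstance(s, Filter)]
    aggs = [s for s in steps if isinstance(s, Aggregate)]
    if aggs:
        agg = aggs[-1]  # this sketch honours only the last aggregate step
        columns = agg.group_by + [f"{expr} AS {name}" for name, expr in agg.measures.items()]
    else:
        columns = ["*"]
    sql = f"SELECT {', '.join(columns)} FROM {source_table}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    if aggs:
        sql += " GROUP BY " + ", ".join(aggs[-1].group_by)
    return sql


pipeline = [
    Filter("order_date >= '2026-01-01'"),
    Aggregate(group_by=["region"], measures={"net_revenue": "SUM(amount)"}),
]
# The printed string is what would be sent to the warehouse.
print(compile_to_sql("sales.orders", pipeline))
```

Everything expensive happens when the warehouse runs that string, which is why the quality of the generated SQL matters more than anything the design tool does on its own side.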

How does a pipeline actually run?

A DPC project separates two pipeline types on purpose, and mixing them up is the most common beginner mistake.

Transformation pipelines run SQL against your warehouse. They have no orchestration logic; they just transform.
Orchestration pipelines are the conductor. They run ingestion jobs, call transformation pipelines, branch on success or failure, loop with iterators, and handle scheduling.

A healthy project keeps a thin orchestration layer that calls many focused transformation pipelines, rather than one giant pipeline that tries to do everything. The same discipline you would apply to functions in code applies here: small, named, reusable units.
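The functions analogy can be taken literally. The sketch below is plain Python, not a Matillion API: each hypothetical "transformation" is a small named function, and the "orchestration" is a thin loop that runs them in order and stops on the first failure.

```python
from typing import Callable


# Each transformation is a small, named unit. Here they only record that they ran.
def stage_orders(log: list[str]) -> None:
    log.append("stage_orders")


def build_daily_revenue(log: list[str]) -> None:
    log.append("build_daily_revenue")


def run_orchestration(steps: list[Callable[[list[str]], None]]) -> list[str]:
    """Thin conductor: run steps in order, record and stop on the first failure."""
    log: list[str] = []
    for step in steps:
        try:
            step(log)
        except Exception as exc:
            log.append(f"failed:{step.__name__}:{exc}")
            break
    return log


print(run_orchestration([stage_orders, build_daily_revenue]))
```

The conductor knows nothing about what each step does, which is the property worth preserving: you can reorder, reuse, or replace a transformation without touching the control flow.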

The chart below shows where the run minutes of a typical project go. The bulk of the time, roughly 60 percent, is transformation SQL executing inside your warehouse, around 30 percent is ingestion, and the remaining 10 percent is orchestration overhead. The exact mix varies, but the shape holds: your warehouse, not Matillion, is doing most of the work.

Illustrative: a typical split of pipeline run minutes across transformation pushdown, ingestion, and orchestration overhead. Data Today.

Where does the cost actually go?

This is the question that decides whether a DPC project stays affordable, and the answer has two halves that you pay separately.

Cost layer	What you pay for	Who bills you
Matillion credits	Pipeline runs and platform usage, metered by Matillion	Matillion
Warehouse compute	The SQL pushdown that transformations execute	Your cloud warehouse
Ingestion	Rows or connectors moved, depending on plan	Matillion

Because transformation runs as pushdown SQL, a badly written pipeline does not just burn Matillion credits; it also runs an expensive query on your warehouse and shows up on a second bill. The most common cost surprise is a transformation that scans far more data than it needs, often because someone left a full reload where an incremental load belonged. Optimizing the SQL your pipeline generates is therefore a warehouse-cost exercise as much as a Matillion one, which is exactly why we treat cost as its own section.
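The difference between a full reload and an incremental load is often one predicate. The following Python sketch shows the shape of it under stated assumptions: `build_load_sql`, the table names, and the `updated_at` watermark column are hypothetical, and string interpolation is used only for readability (real code should use bind parameters).

```python
from typing import Optional


def build_load_sql(source: str, target: str, watermark_column: str,
                   last_loaded: Optional[str]) -> str:
    """Return an INSERT ... SELECT, filtered on the watermark when one is known."""
    sql = f"INSERT INTO {target} SELECT * FROM {source}"
    if last_loaded is not None:
        # Incremental load: only rows newer than the last successful load qualify.
        # Whether the warehouse also scans less depends on how the table is
        # clustered or partitioned on this column.
        sql += f" WHERE {watermark_column} > '{last_loaded}'"
    return sql


# Full reload: every run rescans the whole source table.
print(build_load_sql("raw.orders", "staging.orders", "updated_at", None))
# Incremental: each run touches only what changed since the stored watermark.
print(build_load_sql("raw.orders", "staging.orders", "updated_at", "2026-06-06 00:00:00"))
```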

What about the runners?

The DPC runs your pipelines on compute called runners. Matillion offers fully hosted runners so you do not manage infrastructure, and self-hosted or cloud-hosted runner options for teams that need pipelines to execute inside their own network, for example to reach a private database without exposing it to the internet.

The trade-off is the usual managed-versus-self-hosted one:

Hosted runners are zero-maintenance and the fastest way to start, but they run in Matillion's environment.
Self-hosted runners keep execution and credentials inside your perimeter, at the cost of you owning the compute and its upkeep.

If you handle regulated data or sit behind strict network controls, the runner choice is the first architectural decision to get right, well before you build a single pipeline.

Where does Maia, the AI layer, fit?

Maia is Matillion's name for the AI features layered across the DPC: copilots that help build and explain pipelines, agents that can take actions through API endpoints, and assistance for tasks like root cause analysis when a pipeline fails. The honest read is that this is the fastest-moving and least settled part of the platform, which is precisely why it deserves close, sceptical coverage rather than hype.

For a builder, the practical stance is to let AI accelerate the boring parts (generating boilerplate transformations, suggesting fixes, explaining an unfamiliar pipeline) while keeping a human reviewing anything that touches production data. We will track each Maia capability as it ships and judge whether it is genuinely production-ready or still a demo.

What should you do with this?

If you are evaluating or adopting the Data Productivity Cloud, a few principles travel well:

Treat your warehouse as the engine. Most of your cost and performance lives in the pushdown SQL, not in Matillion. Profile the queries your pipelines generate.
Keep orchestration thin and transformations small. Reusable, well-named pipelines age far better than monoliths.
Decide runners early. Network and compliance constraints shape the whole project.
Adopt Maia deliberately. Use it where review is cheap; gate it where mistakes are expensive.

This guide is the foundation. From here, the guides go deeper on each piece: ingestion connectors and change data capture, transformation patterns in the Designer, orchestration controls like iterators and scheduling, the FinOps habits that keep credits in check, and the Maia AI features as they land. The platform is moving quickly, and the goal here is the same as everywhere on Data Today: tell you what actually changed and what it means for the thing you are building.


Matillion Context Engine grounds Maia agent work

2026-06-07T00:00:00Z

A data agent that can build pipelines is useful. A data agent that knows which tables matter, which pipelines already touch them, and when to stop and ask you before firing a tool call is the version you can let near production.

Matillion Context Engine is the new public preview layer that gives Maia AI Agents knowledge graphs, crawlers, and task context, with Mission Control adding a 10-task in-progress cap around agent work.

Matillion shipped Context Engine and Mission Control together because the old problem with AI assistants in data engineering is not syntax. It is context. Maia can already build orchestration and transformation pipelines, query warehouse data, sample pipeline data, manage files, and commit or push changes inside the Data Productivity Cloud, according to Matillion's Maia AI Agents overview. Context Engine gives those agents a living map of your warehouse metadata, pipeline execution history, business language, and project scope. Mission Control gives you a kanban board where that work becomes task-shaped instead of chat-shaped.

If you are still getting oriented around the platform, start with our guide to Matillion's Data Productivity Cloud for builders. This piece assumes you already build in Maia or Designer and now need to decide where Context Engine belongs in your operating model.

What did Matillion actually ship in Context Engine?

Context Engine is in public preview, and the important object is the knowledge graph. Matillion describes it as a way to capture the structure, relationships, and meaning of your data, then let Maia use that graph when it works on a task in Mission Control or chat. In plain builder terms: it is metadata grounding for Maia, scoped to projects and fed by crawlers, not another Markdown rules file with a nicer name.

The Context Engine dashboard sits under the AI Agents icon in the left navigation. It lists knowledge graphs you can access, lets you filter by project, and supports search by name or description. From there, an Admin or Super Admin can add a knowledge graph, choose whether it is Public or Restricted, and give it a name and description, as Matillion's Context Engine documentation spells out.

That Public or Restricted choice is not cosmetic. A public knowledge graph is available in all projects. A restricted graph is available only to selected projects. Matillion explicitly warns that public graphs should not ingest sensitive data that should not be available to all projects. That is the kind of sentence you should read twice before turning a sales operations graph into a company-wide default.

Once a graph exists, it has three important tabs:

Crawlers, where you add and monitor crawlers that populate the graph.
Projects, where you manage which projects can use a restricted graph.
Access, where you add users who can manage the graph.

The crawler model is refreshingly concrete. Matillion supports 2 crawler types: Warehouse data crawlers and Pipeline execution crawlers. Warehouse data crawlers harvest warehouses and structured sources supported through connectors. Pipeline execution crawlers harvest executions for a chosen project and environment, giving Maia operational context about how work actually flows.

Matillion documentation lists 2 Context Engine crawler types, 2 knowledge graph availability modes, 4 Mission Control board columns, and a 10-task in-progress limit.

The launch is not just a new button. Those numbers matter because they turn agent context into something you can scope, schedule, and govern.

Crawler setup has a few warehouse-specific details. For Warehouse data crawlers, the data selection step depends on the target: Snowflake uses databases and schemas, Databricks uses catalog and schemas, and Amazon Redshift uses schemas. You can schedule crawler runs with Standard settings or Advanced schedule settings, including a cron expression. You can also run a crawler on demand with Run now.

The crawler status model gives you the minimum you need to operate it: Successful, Initializing, Extracting, Pending, Paused, and Failed. You can inspect the latest crawl or crawl history, including status, start time, end time, and duration. That is not observability nirvana, but it is enough to answer the first operational question: did the graph refresh before Maia used it?
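That first operational question can be expressed as a small check. The sketch below assumes you have copied crawl history into a list of dictionaries by some means of your own; the `status` and `end_time` keys and the `graph_is_fresh` function are hypothetical, and only the status names come from the list above.

```python
from datetime import datetime, timedelta, timezone


def graph_is_fresh(crawl_history: list[dict], task_start: datetime,
                   max_age: timedelta) -> bool:
    """True if the latest Successful crawl ended within max_age before task_start."""
    successful_ends = [
        crawl["end_time"]
        for crawl in crawl_history
        if crawl["status"] == "Successful" and crawl["end_time"] <= task_start
    ]
    if not successful_ends:
        return False
    return task_start - max(successful_ends) <= max_age


history = [
    {"status": "Successful", "end_time": datetime(2026, 6, 6, 2, 0, tzinfo=timezone.utc)},
    {"status": "Failed", "end_time": datetime(2026, 6, 7, 2, 0, tzinfo=timezone.utc)},
]
task_start = datetime(2026, 6, 7, 9, 0, tzinfo=timezone.utc)

# The last successful crawl ended 31 hours before the task started.
print(graph_is_fresh(history, task_start, timedelta(hours=24)))  # False
print(graph_is_fresh(history, task_start, timedelta(hours=48)))  # True
```

The failed crawl on the second day is the case to watch: the schedule ran, but the graph Maia reads is still a day older than you think.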

How does Maia use the graph when it starts a task?

The graph becomes useful when you attach it to Maia's work. In Mission Control, the New task dialog includes Project, Branch, Environment, Knowledge Graph, and Prompt fields. The Knowledge Graph drop-down selects the graph Maia will use to inform that task, according to Matillion's Mission Control guide.

The same idea appears in the Agent Tasks API as graphId. That is the cleanest signal that Context Engine is not just a UI feature. You can pass graph context into programmatic agent work.

curl --request POST \
  --url "$BASE_URL/v1/ai/agents/tasks" \
  --header "Authorization: Bearer $MATILLION_TOKEN" \
  --header "Content-Type: application/json" \
  --data '{
    "message": "Plan a pipeline for daily net revenue by region using our governed sales model.",
    "agentConfig": {
      "name": "data_engineer_agent",
      "mode": "PLAN",
      "projectId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "sourceBranchName": "main",
      "environmentName": "development",
      "targetBranchName": "feature/revenue-region-plan",
      "graphId": "finance-analytics-graph"
    }
  }'

Matillion's Agent Tasks API guide defines name as currently always data_engineer_agent, mode as either ACT or PLAN, and graphId as the ID of a Knowledge Layer service graph. Use PLAN when you want the graph to shape a design without letting Maia make changes yet. Use ACT only when you are comfortable with the branch, environment, and permissions.
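If you would rather assemble that request in code than in a shell, a minimal Python sketch using only the standard library might look like the following. The field names mirror the curl example above; `build_agent_task_request`, its parameter names, and the placeholder base URL are assumptions for illustration, and the request is built but deliberately not sent.

```python
import json
import urllib.request

ALLOWED_MODES = {"ACT", "PLAN"}


def build_agent_task_request(base_url: str, token: str, message: str, mode: str,
                             project_id: str, source_branch: str, environment: str,
                             target_branch: str, graph_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST request shaped like the curl example."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"mode must be one of {sorted(ALLOWED_MODES)}, got {mode!r}")
    payload = {
        "message": message,
        "agentConfig": {
            "name": "data_engineer_agent",
            "mode": mode,
            "projectId": project_id,
            "sourceBranchName": source_branch,
            "environmentName": environment,
            "targetBranchName": target_branch,
            "graphId": graph_id,
        },
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/ai/agents/tasks",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    )


req = build_agent_task_request(
    base_url="https://example.invalid",  # placeholder, not a real Matillion host
    token="REPLACE_ME",
    message="Plan a pipeline for daily net revenue by region using our governed sales model.",
    mode="PLAN",
    project_id="a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    source_branch="main",
    environment="development",
    target_branch="feature/revenue-region-plan",
    graph_id="finance-analytics-graph",
)
print(req.get_method(), req.full_url)
```

Rejecting anything other than ACT or PLAN before the request leaves your machine is a cheap guard: a typo in the mode should fail locally, not get interpreted by a service that can change your branch.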

That branch point matters. Each task works in isolation on its own branch when you create work through the API. Matillion also says Agent Tasks API work runs under the identity of the user associated with the API key, and changes are not automatically visible to other project users unless Maia commits and pushes to the target branch. Good. Agent work should leave a diff, not a mystery.

There is one practical catch: the docs say these endpoints currently work only with Matillion-hosted and GitHub projects. If your team is standardized on another Git provider, test the path before you design a team process around it.

How is Context Engine different from context files?

Context files still matter. They are Markdown files that Maia always reads from .matillion/maia/rules/, and Matillion enforces a 12,000-character limit across all Markdown context files in that directory. They are the right place for rules: naming conventions, design standards, business glossary shortcuts, and team preferences.

Context Engine is for the map.

Here is the split that should guide your setup:

Mechanism	Best use	Concrete limit or setting
Context files	Always applied rules for a project	Stored under `.matillion/maia/rules/` with a 12,000 character total limit
Additional project files	Detailed standards Maia can reference when instructed	Stored outside `.matillion/...` and referenced from a context file
Context Engine knowledge graphs	Warehouse metadata, relationships, semantics, and execution context	Fed by Warehouse data and Pipeline execution crawlers

The best setup uses both. Put hard rules in context files: table prefixes, environment constraints, source of truth definitions, and review expectations. Put metadata and process reality in Context Engine: schemas, columns, tags, warehouse structures, and pipeline execution history.

Do not stuff everything into the graph because it feels newer. If a rule is small, stable, and mandatory, put it in a context file. If the context changes as pipelines run and schemas evolve, put it in the graph.
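Because the context-file budget is a hard 12,000 characters, it is worth checking before a commit rather than discovering it later. This is a minimal sketch, assuming the limit counts characters in `*.md` files directly inside `.matillion/maia/rules/`; whether subdirectories count is not something this guide can confirm, and `context_chars` and `within_limit` are hypothetical helper names.

```python
from pathlib import Path

RULES_DIR = Path(".matillion/maia/rules")
CHAR_LIMIT = 12_000


def context_chars(rules_dir: Path) -> int:
    """Total characters across Markdown files directly inside rules_dir."""
    return sum(len(p.read_text(encoding="utf-8")) for p in sorted(rules_dir.glob("*.md")))


def within_limit(rules_dir: Path, limit: int = CHAR_LIMIT) -> bool:
    """True if the combined Markdown context fits inside the character limit."""
    return context_chars(rules_dir) <= limit


# Only report when run from a project root that actually has the directory.
if RULES_DIR.is_dir():
    used = context_chars(RULES_DIR)
    print(f"{used}/{CHAR_LIMIT} characters used in {RULES_DIR}")
```

A check like this also keeps the split honest: when the rules directory keeps hitting the ceiling, that is usually a sign that changing metadata is being written down as static rules.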

What does it cost, and where can it surprise the bill?

Matillion's public docs do not publish a separate Context Engine credit rate in the pages reviewed for this guide. That absence is its own buying signal. You should not treat public preview as free forever, and you should not assume every action has a visible line item until you validate it with your account data.

There are 3 cost surfaces to watch.

First, crawler frequency. A Warehouse data crawler connects to data warehouse structures such as Snowflake databases and schemas, Databricks catalogs and schemas, or Redshift schemas. Even if Matillion does not show a separate Context Engine meter in the public docs, that crawler is still operating against systems you pay to run. Start with a low-frequency schedule, then increase it only for domains where schema drift or metadata freshness affects real delivery.

Second, agent work. Maia can validate and run pipelines, query the data warehouse, sample component output, commit changes, and push branches. Mission Control lets up to 10 tasks sit in the In progress column at once. Ten autonomous tasks pointed at a development warehouse can be a productivity win. Ten autonomous tasks repeatedly sampling, running, and revising pipelines can also turn a quiet sandbox into a noisy bill.

Third, preapproved tools. The Agent Tasks API supports grantedPermissions, and the Mission Control UI has Ask permission and Bypass permissions modes. Matillion recommends Ask permission as the default and Bypass permissions only for trusted hands-off runs in scoped environments. That is not conservative vendor boilerplate. It is the right default for anyone who has ever watched a retry loop discover money.

For cost monitoring, Matillion's MCP server exposes Consumption tools, including get-consumption for flat-rated products and get-consumption-etl-users for ETL users, and it can help analyze credit consumption patterns through an AI assistant, according to the MCP server documentation. Use that for investigation, but do not let an assistant be your only FinOps control. Pull consumption on a schedule, tag task branches clearly, and compare before and after you enable crawler schedules.
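The before-and-after comparison does not need tooling to get started. The Python sketch below assumes you already have daily credit totals as plain numbers; the figures are invented for illustration, `average_daily_change` is a hypothetical helper, and nothing here assumes the shape of what the consumption tools return.

```python
from statistics import mean


def average_daily_change(before: list[float], after: list[float]) -> float:
    """Relative change in mean daily credits; 0.25 means 25 percent higher."""
    baseline = mean(before)
    if baseline == 0:
        raise ValueError("baseline average is zero; relative change is undefined")
    return (mean(after) - baseline) / baseline


# Invented daily credit totals for the days before and after enabling crawler schedules.
before_crawlers = [10.0, 12.0, 11.0, 11.0]
after_crawlers = [13.0, 14.0, 14.0, 14.0]

change = average_daily_change(before_crawlers, after_crawlers)
print(f"Average daily credits changed by {change:.0%}")
```

Run the same comparison against your warehouse bill for the same windows, since a crawler's cost can land there rather than on the Matillion side.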

A sane rollout looks like this:

Create one restricted knowledge graph for a single analytics domain.
Add one Warehouse data crawler and one Pipeline execution crawler.
Schedule crawls outside peak warehouse windows.
Run Maia tasks in PLAN mode first.
Keep Ask permission on until you know which tools Maia calls repeatedly.
Review consumption and warehouse activity after one week.

The boring version wins.

When should you use Mission Control with Context Engine?

Use Mission Control when the work has a deliverable. A chat is fine for asking Maia to explain a Table Input component or suggest a naming convention. A Mission Control task is better when you want Maia to design a pipeline, modify project files, create a connector, analyze a failure, or work from a PDF spec.

The task board has 4 columns: Backlog, In progress, Needs attention, and Completed. That sounds simple because it is. The useful part is that each task has its own chat interface, and you can switch between tasks without losing context. For a data team, that maps better to real work than one endless assistant thread.

Mission Control also adds attachments. You can attach images and PDFs to a task prompt, then reference them with @filename. Matillion says images can include diagrams, screenshots, mockups, and whiteboard photos, while PDFs can include specs, requirements documents, and reports. Text files are not supported as attachments because Maia can already read project text files.

Use that for pipeline triage. A screenshot of a broken canvas plus a pipeline execution crawler is exactly the kind of mixed context that a human engineer would ask for. The difference is Maia can now carry that context into a branch and produce work you can review.

Do not use Mission Control as a merge gate. Matillion says completing a task does not make changes visible on other branches. You still need to commit, push, and merge. Maia cannot merge changes, which is a good boundary. Keep code review, pipeline tests, and environment promotion in your normal process.

What would I do first in a real Matillion team?

Start with one graph per domain, not one graph per company. Finance, customer, product, and operations usually have different semantics and different blast radiuses. A public all-company graph sounds convenient until it contains restricted HR tables or ambiguous definitions that poison every prompt.

Then tune freshness by risk. Operational pipeline execution context can go stale quickly if you are actively refactoring. Warehouse schema metadata may not need hourly crawls if your governed marts change weekly. Context Engine supports any number of Warehouse data and Pipeline execution crawlers on a graph, so separate them by source and schedule rather than creating one giant crawler that nobody wants to touch.

Use PLAN mode as your default for the first few tasks. Ask Maia to propose pipeline structure, sources, joins, variable usage, and tests while grounded in the selected graph. Once the plans look sane, let a narrow ACT task implement on a feature branch.

Most of all, measure whether Context Engine reduces clarification loops. The win is not that Maia can produce more pipeline files. The win is fewer wrong assumptions about customer_id, fewer duplicated staging models, fewer prompts that repeat your business glossary, and fewer reviews that start with: who told the agent to use that table?

If Context Engine does that, it earns a place in the stack. If it becomes another metadata garden that nobody prunes, Maia will learn your mess at machine speed.