The first three articles established what MatrixOne's git4data is, how to use it, and how it works underneath. Before moving into real-world practice, there's one more thing to settle:
"Version control for data" is not something only MatrixOne does. lakeFS, Dolt, Nessie, Snowflake, Neon, DVC — a crowd of products fly the "git for data" / "version control for data" banner. But what they all mean by it is not the same thing.
This article maps the comparison landscape of "git for data" in full: a framework first, then four categories grouped by "which layer the versioning lives at" (each with an architecture diagram), then a git-primitive completeness matrix that aligns everyone's git semantics — same word "git," but how much git, exactly — and finally MatrixOne's exact coordinates and its honest boundaries.
A framework first
To compare "data version control," you can't just ask "does it have diff/merge?" What actually separates these products is five questions:
- What does it version — file bytes? objects in object storage? table snapshots? or data rows?
- At which layer does it sit — inside the code repo? on top of object storage? on top of a table format? or inside the database kernel?
- How fine is the granularity — a whole file? a whole object? a snapshot? or a single row / cell?
- How complete are the git semantics — just "go back in time" (time-travel)? or branch, diff, three-way merge, cherry-pick — the full set?
- How are conflicts adjudicated — no merge at all? merge but only whole-file all-or-nothing? or row-level true/false conflict detection?
Of these, #2 (which layer) decides which category a tool belongs to, while #3 and #4 (granularity + git completeness) decide its standing within that category. So we'll group everyone into four categories by layer.
One running scenario
To make the differences visible, fix a concrete scenario and make every approach answer the same four actions:
You maintain a data asset — picture a 50-million-row user-feature table (or a 100GB multimodal training set of millions of files, whichever fits the category below). The team repeatedly does four things:
- Action A | See the change: after editing a batch of data, know precisely which rows changed, and to what (git diff).
- Action B | Parallel collaboration: several people each open a branch, edit in parallel, merge back to mainline, with conflicts adjudicated (git branch + merge).
- Action C | Cherry-pick / pull data: lift one change onto another branch (cherry-pick), or pull a batch of data directly from a specific version.
- Action D | Incident recovery: a dropped table or a botched bulk edit — bring it back in seconds (git reset / restore).
Hold these four actions, and the details of each approach come into focus.
Category 1: Git-native file versioning — DVC / Git LFS
Approach: treat data as "large files," store pointers in Git and bytes in a remote cache (S3, etc.). What you see in Git is a tens-of-bytes .dvc pointer file; the real data sits in a content-addressed cache. Version control is entirely borrowed from Git — what you commit / branch / merge is the pointer files.
- Git LFS: purely solves "don't blow up the Git repo with big files," handling storage only — it doesn't understand file contents, so diffing two versions compares the pointers, not the data.
- DVC: a stronger take on the idea; its strength is binding data + model + code + pipeline DAG together (
dvc add,dvc repro,dvc exp run) to make an ML run reproducible.
In the scenario:
- Action A (see the change):
dvc difftells you "train.parquetchanged" — and that's it. Which 310 rows? Invisible.
- Action B (parallel merge): no data-aware merge; you fall back to Git's text merge of the pointer files. Two people editing the same data file is one binary conflict.
- Action C (cherry-pick / pull data): no row-level cherry-pick; pulling data means
dvc checkoutthe whole file locally, then feed pandas / Spark.
- Action D (incident recovery):
git checkout <old commit>+dvc checkout, swapping files back — at file granularity.
Key difference from MatrixOne: DVC's git semantics are borrowed from Git and act on files; MatrixOne's are kernel-native and act on rows.
Category 2: git-for-data over object storage — lakeFS / Pachyderm
Approach: a layer over object storage (S3/OSS) providing git-style commit / branch / merge / revert, acting on all objects in the whole repository. It needs a standing lakeFS server + a metadata KV (typically PostgreSQL / DynamoDB); the data bytes stay in your object store, lakeFS owns only the metadata; to actually compute, you still need an external engine.
lakeFS is this category's flagship, and a product MatrixOne is frequently asked to be compared against. Its git semantics are actually quite complete (branch / commit / merge / revert) — but the granularity is on an entirely different level.
In the scenario (using the 100GB multimodal set):
- Action A (see the change): the same batch changed 310 annotation rows, and
lakectl diff lakefs://repo/main lakefs://repo/featurereports "6 objects changed" — its diff granularity is the object (whole file), it can't show which 310 rows. For row-level, layer Iceberg + Spark on top (lakeFS now ships arefs_data_diffSpark SQL function — but computed in Spark, not in lakeFS).
- Action B (parallel merge): lakeFS merge is an object-level three-way merge — different objects merge cleanly; both branches touching the same parquet file is a conflict, and it can't fuse the two batches of rows inside the file — only
source-wins/dest-wins.
- Action C (cherry-pick / pull data): it can cherry-pick, but the unit is still the object; pulling data means reading objects out into an external engine (Spark / Trino / DuckDB).
- Action D (incident recovery):
lakectl branch revertto a historical commit — at object granularity, whole-repo consistent.
Where it's strong: scope is the whole repository — one commit / branch / merge naturally covers all files, making multi-file, cross-format atomic consistency effortless; content-level versioning of massive unstructured bytes (images / video / audio / weights) is its home turf.
Key difference from MatrixOne: lakeFS's git is complete but stops at the object granularity; MatrixOne brings the same git semantics down to the row. The former is "Object Git over object storage," the latter "Git grown inside the database" — lakeFS owns the bytes, MatrixOne owns the catalog and labels — complementary, and a later article covers how they combine.
Same layer: Pachyderm — a data-driven pipeline that auto-triggers on data change, plus lineage; same file / commit granularity.
Category 3: open table format + git branches — Iceberg / Delta + Nessie
Approach: open table formats like Iceberg / Delta Lake / Hudi represent a table as a chain of immutable snapshots (each write = a new snapshot), with built-in time travel; layer a catalog like Nessie on top and tables gain git-style branches / tags / merge — and a single commit can span multiple tables. Querying is delegated entirely to external engines.
-- Iceberg: read history by snapshot id or timestamp
SELECT * FROM db.t FOR SYSTEM_VERSION AS OF 10963874102873;
SELECT * FROM db.t FOR SYSTEM_TIME AS OF '2026-06-10 01:21:00';
-- Nessie: cross-table branch & merge
CREATE BRANCH etl IN nessie;
MERGE BRANCH etl INTO main IN nessie;In the scenario:
- Action A (see the change): versioning is at the snapshot level. Which rows differ between two snapshots, the format doesn't directly tell you — an external engine has to diff the two snapshots.
- Action B (parallel merge): Nessie's merge is a real merge, but at table-snapshot granularity — a conflict means "the same table was changed on both branches" (optimistic concurrency, content-key conflict), not a row-level three-way merge. It delegates rows to the table format.
- Action C (cherry-pick / pull data): Nessie supports commit-level cherry-pick (still table/snapshot level); pulling data is Spark / Trino / Flink / Dremio.
- Action D (incident recovery):
RESTORE TABLE t TO VERSION AS OF 123(Delta) / roll back to a snapshot.
Key difference from MatrixOne: this path's strength is open ecosystem, multi-engine interoperability, lakehouse scale — one dataset that Spark reads, Trino reads, Flink writes. The cost is git semantics that stop at the table-snapshot level and no built-in compute engine.
Category 4: in-database version control — Dolt, Snowflake, Neon, and MatrixOne
Approach: don't add a layer outside the database — let the database kernel itself manage versions, understanding the semantics of every row and storing changes as immutable increments. MatrixOne belongs here, alongside Dolt, Snowflake, and Neon. But even within "a database doing version control", their git completeness differs widely.
Dolt: the git workflow made into a database
A Merkle / prolly-tree storage layer makes cell-level diff/merge a natural byproduct, and it brings a developer's full git over:
SELECT * FROM dolt_diff_orders WHERE to_commit = HASHOF('HEAD') AND from_commit = HASHOF('HEAD^');
CALL DOLT_MERGE('feature'); -- cell-level three-way merge, conflicts in dolt_conflicts_orders
SELECT * FROM orders AS OF 'HEAD~20'; -- time travelPlus row-level dolt blame / history, remote clone / push / pull, and the hosted DoltHub — the "deepest" git of all: even the distributed collaboration workflow is there. But Dolt is a single-node database with no OLAP support, so its applicable scenarios are relatively limited.
Snowflake / BigQuery / Neon: zero-copy clone + time travel
CREATE TABLE t_clone CLONE t; -- zero-copy clone (seconds)
SELECT * FROM t AT(OFFSET => -300); -- how it looked 5 minutes ago
UNDROP TABLE t; -- rescue a dropThey can "branch" and "go back in time," which feels great, but their git semantics only do the first half:
- Snowflake: zero-copy clone + time travel (1 day default; Standard caps at 1 day, Enterprise reaches 90, plus a 7-day Fail-safe). But no merge — clones drift apart, nothing to merge back; to diff, write
MINUS/EXCEPTyourself. - Neon (serverless Postgres, acquired by Databricks in 2025): copy-on-write database branches, scale-to-zero, a natural fit for branch-per-PR. But no merge at all — the only parent↔child sync is reset (one-way overwrite), and its "diff" compares schema only, not data rows.
MatrixOne: the full set of row-level git semantics
MatrixOne runs on a distributed, MySQL-compatible engine and implements these git primitives at the row level. SNAPSHOT and PITR span table / database / account / cluster scopes; DATA BRANCH CREATE works at table or database scope, while DIFF / MERGE / PICK are table-to-table operations:
CREATE SNAPSHOT s FOR TABLE db orders; -- snapshot / tag
DATA BRANCH CREATE TABLE db.orders_feature FROM db.orders{SNAPSHOT='s'}; -- branch (a new table)
DATA BRANCH DIFF db.orders_feature AGAINST db.orders{SNAPSHOT='s'} OUTPUT SUMMARY; -- row-level diff
DATA BRANCH MERGE db.orders_feature INTO db.orders WHEN CONFLICT FAIL; -- row-level three-way merge + conflict policy
DATA BRANCH PICK db.orders_feature INTO db.orders KEYS (101, 102) WHEN CONFLICT FAIL; -- cherry-pick (needs a primary key)
RESTORE TABLE db.orders {SNAPSHOT='s'}; -- restore
CREATE PITR p FOR DATABASE db RANGE 1 'd'; -- point-in-time recoveryIn the scenario (within one category, how the three styles each answer):
- Action A (see the change): Dolt and MatrixOne give you a per-row changelist (Dolt down to the cell); Snowflake / Neon have no native row-level diff (write
EXCEPTyourself; Neon compares schema only).
- Action B (parallel merge): Dolt at cell level, MatrixOne at row level — both three-way merge with conflict adjudication; Snowflake has no merge (clones drift), Neon only one-way reset.
- Action C (cherry-pick / pull data): all three pull data via direct SQL; but cherry-pick exists only in Dolt (commit-level) and MatrixOne (row-level
PICK), not Snowflake / Neon.
- Action D (incident recovery): all can go back in time — Dolt
resetto a historical commit, Snowflake time travel +UNDROP, Neon PITR, MatrixOneRESTORE+ PITR.
The differences within this category: only Dolt and MatrixOne bring git down to the row level; Snowflake / Neon have just "clone + time travel," missing merge-with-conflict. And between the two — Dolt has cell-level diff/conflict and a deeper distributed workflow (clone / push / pull / DoltHub, row-level blame); MatrixOne has row-level diff / merge / PICK with conflict policy, and its SNAPSHOT / PITR additionally span table → cluster.
As for "can you run SQL on a version" — every database in this category can, so it isn't a differentiator within the category; this article's focus is git-semantic completeness.
The git-primitive completeness matrix: same word "git," but how much git?
Flatten the four categories and check them off, git primitive by git primitive — this table is the heart of the article:
| Tool | snapshot / time-travel | branch | diff (granularity) | merge (+ conflict) | cherry-pick | restore / PITR | distributed git workflow |
|---|---|---|---|---|---|---|---|
| DVC / Git LFS | via Git | via Git (pointers) | file-level | pointer text merge | ✗ | git checkout | ✓ (Git remote, pointers) |
| lakeFS | ✓ | ✓ | object-level | ✓ object-level (source/dest) | ✓ (object) | revert | partial |
| Iceberg/Delta + Nessie | ✓ | ✓ | snapshot-level (via engine) | ✓ table-snapshot level | ✓ (commit-level) | RESTORE | ✗ |
| Snowflake / BigQuery | ✓ | clone (not a true branch) | ✗ (write EXCEPT) | ✗ | ✗ | UNDROP / clone | ✗ |
| Neon | ✓ (PITR) | ✓ (CoW) | ✗ (schema only) | ✗ (reset only) | ✗ | PITR / reset | ✗ |
| Dolt | ✓ | ✓ | cell-level | ✓ cell 3-way + conflict | ✓ | reset | ✓ (push/pull/DoltHub) |
| MatrixOne | ✓ | ✓ | row-level | ✓ row 3-way + FAIL/SKIP/ACCEPT | ✓ (PICK) | ✓ RESTORE + PITR | ✗ |
Three takeaways:
- Only Dolt and MatrixOne take git down to the row/cell level. The rest stop at coarse granularities — file (DVC), object (lakeFS), snapshot (Iceberg) — or are simply missing merge (Snowflake / Neon).
- Snowflake / Neon's "version control" is really only the first half — clone and go back in time, but no merge-with-conflict, so not complete git.
- Dolt is deep on workflow, MatrixOne is broad on primitives: Dolt adds distributed push/pull/blame; MatrixOne adds conflict policy + cherry-pick + PITR + multi-granularity. The two are the most git-complete pair on this track.
The same edit, seen at wildly different resolutions — the visual version of the matrix's first row (diff granularity):
One overview table
| Category | Representatives | At which layer | Granularity | git-semantic completeness |
|---|---|---|---|---|
| Git-native files | DVC / Git LFS | beside the code repo | file | borrowed from Git, acts on pointers |
| git over object store | lakeFS / Pachyderm | over object storage | object | complete, but stops at object level |
| table format + git | Iceberg/Delta + Nessie | table format + catalog | snapshot | complete, but stops at table-snapshot |
| in-database | Dolt | database kernel | row / cell | most complete at row level (+ distributed workflow) |
| in-database | Snowflake / Neon | database | table / db | only clone + time travel, no merge |
| in-database | MatrixOne | database kernel | row | most complete at row level (+ conflict policy / PICK / PITR; snapshot/PITR span table→cluster) |
The same landscape, as a positioning map — finer granularity to the right, more complete git semantics toward the top. The top-right corner (row-level + complete git) holds Dolt and MatrixOne — exactly the "version control inside the database kernel" category:
MatrixOne's positioning
Condense the map into one position:
MatrixOne ≈ "Dolt's row-level git completeness + a warehouse's zero-copy clone/time-travel + Neon's database branching," with row-level merge-with-conflict, cherry-pick, and PITR all filled in (snapshot/PITR spanning table to cluster), inside one open-source, MySQL-compatible cloud-native HTAP database.
Against the four actions: A see the change (row-level DIFF), B parallel collaboration (row-level three-way merge + conflict policy), C cherry-pick / pull data (PICK + direct SQL), D incident recovery (snapshot + PITR) — MatrixOne is one of a few systems that supports all four both at the row level and in a single engine.
But on the other hand, it's worth being just as clear about what it isn't:
- It doesn't replace DVC: none of the "data+model+code" triad reproduction of
dvc repro/exp— for pure ML pipeline versioning, DVC is still handier.
- It doesn't replace lakeFS: massive byte-level unstructured versioning and cross-format whole-repo atomic commits are lakeFS's turf; MatrixOne versions only the file reference. → best used together.
- Its git workflow trails Dolt's: no distributed git workflow (remotes / push-pull / DoltHub / per-cell blame).
Knowing these differences is what tells a developer when to use MatrixOne, when to use something else, and when to combine them.
📎 Runnable SQL: github.com/matrixorigin/git4data-tutorial | Source & community: github.com/matrixorigin/matrixone


