Columnar Storage is Normalization • Buttondown

I didn’t understand for some time that the process of transforming row-oriented data into column-oriented data is not a completely specific, foreign concept to the field of databases. This is still of relational abstraction. Or maybe.

As an example, let’s say we have this data:

data = [
    { "name": "Smudge", "colour": "black" },
    { "name": "Sissel", "colour": "grey" },
    { "name": "Hamlet", "colour": "black" }
]

It represents a table in a relational database. let’s assume this Was To access a table and a particular piece of data in a relational database we had to do all kinds of disk-accesses. This presentation has some good qualities.

Adding a new row is easy: we can just create a row:

{ "name": "Petee", "colour": "black" }

And add it to the end of our already existing list. On disk, we probably only need to touch a few pages to do this. And if our row were really wide, with a whole bunch of columns in it, it wouldn’t really change. It will still have that nice property.

this is also true looking up One line. Since all the columns of a row are stored next to each other, it is very fast to retrieve that row from wherever it is stored.

In contrast, if we, say, wanted to calculate histograms of the colors of different pets, we would have to read a lot of data to do that that we don’t care about.

this is one row-oriented Representation of data. A column-oriented The representation would look something like this:

data = {
    "name": [
        "Smudge",
        "Sissel",
        "Hamlet"
    ],
    "colour": [
        "black",
        "grey",
        "black"
    ],
}

It has all the opposite tradeoffs of row-oriented design: if we only cared about colourwe can read very effectively Only That data. We don’t need to read their names. But it becomes difficult to modify the data, or read a specific row. We have to go everywhere to do both of them. If we want a second row, we have to go to the second index in each column to reconstruct the original row.

So, one way to think about this size of data is that it is encoding level. It resides at a level of abstraction firmly below the data model: a SQL engine on top of that cannot distinguish between the two, except for the performance characteristics of different queries.

A different way to think about columnarization like this is that it is similar to a very extreme type of database normalization.

Instead of a single detailed table represented by a set of vectors of data, you can think of columnar data as a set of tables, all of which have a primary key and an additional attribute:

Denormalized Table:

+----+------+-----+
| id | name | age |
+----+------+-----+
| 12 | Bob  |  30 |
| 93 | Tom  |  35 |
| 27 | Kim  |  28 |
+----+------+-----+

Generalized Tables:

Name

+----+------+
| id | name |
+----+------+
| 12 | Bob  |
| 93 | Tom  |
| 27 | Kim  |
+----+------+

age

+----+-----+
| id | age |
+----+-----+
| 12 |  30 |
| 93 |  35 |
| 27 |  28 |
+----+-----+

We can easily reconstruct the original table by adding id Pillar.

In the context of a column-stored table, you can think of a primary key as the ordinal position of a given piece of data.

Our basic data:

data = {
    "name": [
        "Smudge",
        "Sissel",
        "Hamlet"
    ],
    "colour": [
        "black",
        "grey",
        "black"
    ],
}

It seems like:

+----+--------+
| id |   name |
+----+--------+
|  0 | Smudge |
|  1 | Sissel |
|  2 | Hamlet |
+----+--------+

+----+--------+
| id | colour |
+----+--------+
|  0 |  black |
|  1 |   grey |
|  2 |  black |
+----+--------+

however, id Columns are contained only by position in arrays:

+--------+
|   name |
+--------+
| Smudge |
| Sissel |
| Hamlet |
+--------+

+--------+
| colour |
+--------+
|  black |
|   grey |
|  black |
+--------+

I think the value of this perspective is that it integrates many traditional query-processing operations, such as projections and joins, with manipulation of data formats. some, many, most of the times, maybe you Needed Think of data formats like this as an implementation detail, in that queries are logically blind, but it’s a useful mental model to understand that “reconstructing a row from column storage” is not just lookalike A combined display, this Is A connection.



<a href

Leave a Comment