ClickHouse vs Row-Oriented Databases: A Comparative Analysis

This series of articles help me (and maybe others) save some time and brain when coding.

All notes here were learnt by myself when working in Data field. It may not be 100% correct, feel free to leave some comments.

What is ClickHouse?

ClickHouse is a columnar database, and because it stores all data in columnar form, reading data is very fast.

It has its own SQL language, which is somewhat different from other SQL languages in terms of Creating, Updating and Deleting databases, schemas and tables.

Storing data strategy: Row-Oriented Database vs Columnar Database

Visualization of how data is read in Row-Oriented Database and Columnar Database:

Data reading in Row-Oriented Database

Data reading in Columnar Database

More about ClickHouse Basic: https://dev.to/hoptical/clickhouse-basic-tutorial-an-introduction-52il

In my experience, integrating this columnar database with other data tools can be very difficult due to the different translations of its data types, along with many other unknown issues.

1. Auto-increment ID for ClickHouse Primary Key

It does not automatically add an auto-incremental function to the primary key column, unlike other SQL languages. There is no auto-increment function available.

It supports generateUUIDv4: UUID | ClickHouse Docs. The trick is adding generateUUIDv4() function as the default value of the primary key column. It will generate a unique value every time a new row is added.

The drawback of generating uuid is that it is difficult to keep the same id value when manually importing initial data for reference tables, which can happen multiple times during creating database sessions. Each import will generate a different id value and make the referred id in other tables irrelevant. If possible, saving the generated id value for initial data should be done, or a clean database recreation is feasible.

Code samples:

Generate UUID

CREATE OR REPLACE TABLE test_db.transactions (
trans_id VARCHAR(69) DEFAULT generateUUIDv4(),
trans_date_crea UInt64 DEFAULT toUnixTimestamp64Milli(now64(3, 'UTC'))
)
ENGINE = MergeTree
PRIMARY KEY (trans_id)
SETTINGS index_granularity=8192;

This code does not generate uuid or incremental id

CREATE OR REPLACE TABLE test_db.transactions (
trans_id UInt64,
trans_date_crea UInt64 DEFAULT toUnixTimestamp64Milli(now64(3, 'UTC'))
)
ENGINE = MergeTree
PRIMARY KEY (midpoint_id)
SETTINGS index_granularity=8192;

2. ClickHouse on SQLalchemy

SQLalchemy is a Python library with functions that help reading data from SQL databases in a friendly Python coding style. In other words, SQLalchemy generates SQL queries through writing Python code. SQLalchemy helps utilizing the advantages of Python logic in SQL, especially loop.

SQLalchemy 2, as of February 2024, supports ClickHouse SQL through the clickhouse-sqlalchemy v0.3.0 library. This library is a dialect that translates SQLalchemy ‘language’ to the ‘language’ used by ClickHouse.

Using older versions of clickhouse-sqlalchemy with SQLalchemy 2 can lead to incorrect query generation

3. ClickHouse Data Types Problems

A. Datetime64

Problem: Datetime64 ClickHouse data type is not supported in SQLalchemy.

Solution: All datetime columns must be in unix time (e.g.: 1699328745321)

Example: trans_date_crea UInt64 DEFAULT toUnixTimestamp64Milli(now64(3, ‘UTC’))

B. String

Problem: Trino translates ClickHouse string to binary string by default. GraphQL does not support binary string data types.

ClickHouse connector — Trino 436 Documentation

ClickHouse vs Trino data type mapping

Solution: Change config to force Trino translate ClickHouse string to varchar:

clickhouse.map-string-as-varchar=true

Or GraphQL connects directly to ClickHouse.

4. Trino – ClickHouse No Table Modification

Trino does not support modifying (update, delete) ClickHouse tables.

When trying to modify the data, it will throw error about permission. Unfortunately, Trino document about ClickHouse Connector does not mention anything about modifying ClickHouse tables.

ClickHouse connector — Trino 436 Documentation

However, Trino still supports inserting data to ClickHouse tables.

5. Create ClickHouse Distributed Tables

Disclaimer: Hereby is what I know about ClickHouse Distributed Table. The knowledge might be shallow and inaccurate. There will be updates when I learn something new.

Distributed table allows flexible server resources scaling. Data can be saved in multiple nodes and all nodes can be accessed and perform queries at the same time, which will speed up queries.

Minimize image

Edit image

Delete image

Don’t do that

Clickhouse Distributed and Replicated tables architecture

Requirement for creating distributed tables:

ClickHouse Cluster must be the form that ClickHouse server is deployed in.
ZooKeeper or ClickHouse Keeper must be deployed in ClickHouse server.

Steps of creating 1 distributed table:

Step 1: Create 1 replicated table.

Replicated table is the entity that stores data on physical storage. All CRUD (Create, Read, Update, and Delete) actions on data will happen in a replicated table.

CREATE OR REPLACE TABLE test_db.transactions_replicated ON CLUSTER 'default'
(
	trans_id VARCHAR(69) DEFAULT generateUUIDv4(),
	trans_date_crea UInt64 DEFAULT toUnixTimestamp64Milli(now64(3, 'UTC'))
)
Engine=ReplicatedMergeTree
PRIMARY KEY (trans_id)
SETTINGS index_granularity=8192
;

Difference from MergeTree tables:

“ON CLUSTER” clause defines which cluster to save the table. Syntax: ON CLUSTER ‘<cluster name>’. Cluster names can be retrieved from query:

SELECT * FROM system.clusters

“Engine=ReplicatedMergeTree” clause defines the table path in the Keeper. ReplicatedMergeTree usually contains arguments, e.g.: ReplicatedMergeTree(‘/clickhouse/tables/{any name or path}’, ‘{replica}’). However, if no arguments are defined, the default arguments will be assigned. As a result, most of the time these arguments should not be a problem, especially when the clusters are internally managed by you.

Step 2: Create 1 distributed table from a replicated table.

Distributed table is a consolidated “gateway” that allows queries to read data on all of the replicas of the table.

CREATE TABLE test_db.transactions ON CLUSTER 'default' AS test_db.transactions_replicated

Engine=Distributed('default', test_db, transactions_replicated, rand())

;

6. Query ClickHouse Distributed Tables

Replicated and Distributed tables will appear and be available for querying in ClickHouse, e.g.: transactions_replicated and transactions tables will appear in test_db schema.

When querying on the Distributed version of the table, the query will access all data of the replicas of the table.

When querying on the Replicated version of the table, the query will access randomly data of 1 replica of the table.

7. Mutate ClickHouse Distributed Tables

Inserting new rows to the table can be done directly through the Distributed version of the table.

Other mutation operations, such as UPDATE, DELETE, DROP, require performing though the Replicated version of the table.

To UPDATE/DELETE/DROP the table, must add a clause: ON CLUSTER ‘{cluster_name}’ (e.g.: ON CLUSTER ‘default’) to UPDATE/DELETE/DROP all replicas of the table. If not using the ON CLUSTER clause, only 1 random replica will be mutated.

IMPORTANT: DROP distributed or replicated table without clause “ON CLUSTER” will remove the table in the schema, but will not delete completely all replicas and the path of storing the table. The path is managed by the Keeper, which will block the path from being reused when recreating the table using the same SQL code. I still have not known how to delete the path in the Keeper yet.

ALTER TABLE {clickhouse_schema}.{replicated_table} ON CLUSTER 'default'

UPDATE {update_clause}

WHERE {where_clause}

More about Data replication and distributed queries: https://posthog.com/handbook/engineering/clickhouse/replication

8. ClickHouse on DBeaver

DBeaver (as v23.3.5) has not yet supported editing ClickHouse tables, including: change column data type, add new column, delete column, change column names….

DBeaver still allows CRUD ClickHouse tables.

Would you like to read more articles by Tekos’s Team? Everything’s here.

Author

Young Duong

Duong is a Senior Data Analyst/Engineer at Tekos Interactive. With enthusiasm for Data Systems and Artificial Intelligence (A.I.) development, he enjoys designing data patterns for data models, training machine learning models and building A.I. applications. He is eager to share with the community about his A.I and data journey.

Coding Note 2: ClickHouse