Continue Coding Note 1: Kafka
This series of articles help me (and maybe others) save some time and brain when coding.
All notes here were learnt by myself when working in Data field. It may not be 100% correct, feel free to leave some comments.
What is ClickHouse?
ClickHouse is a columnar database, and because it stores all data in columnar form, reading data is very fast.
It has its own SQL language, which is somewhat different from other SQL languages in terms of Creating, Updating and Deleting databases, schemas and tables.
Read more: https://clickhouse.com/docs/en/sql-reference/statements
Storing data strategy: Row-Oriented Database vs Columnar Database
Visualization of how data is read in Row-Oriented Database and Columnar Database:
Data reading in Row-Oriented Database
Data reading in Columnar Database
More about ClickHouse Basic: https://dev.to/hoptical/clickhouse-basic-tutorial-an-introduction-52il
In my experience, integrating this columnar database with other data tools can be very difficult due to the different translations of its data types, along with many other unknown issues.
1. Auto-increment ID for ClickHouse Primary Key
It does not automatically add an auto-incremental function to the primary key column, unlike other SQL languages. There is no auto-increment function available.
It supports generateUUIDv4: UUID | ClickHouse Docs. The trick is adding generateUUIDv4() function as the default value of the primary key column. It will generate a unique value every time a new row is added.
The drawback of generating uuid is that it is difficult to keep the same id value when manually importing initial data for reference tables, which can happen multiple times during creating database sessions. Each import will generate a different id value and make the referred id in other tables irrelevant. If possible, saving the generated id value for initial data should be done, or a clean database recreation is feasible.
Code samples:
- Generate UUID
CREATE OR REPLACE TABLE test_db.transactions (
trans_id VARCHAR(69) DEFAULT generateUUIDv4(),
trans_date_crea UInt64 DEFAULT toUnixTimestamp64Milli(now64(3, 'UTC'))
)
ENGINE = MergeTree
PRIMARY KEY (trans_id)
SETTINGS index_granularity=8192;
- This code does not generate uuid or incremental id
CREATE OR REPLACE TABLE test_db.transactions (
trans_id UInt64,
trans_date_crea UInt64 DEFAULT toUnixTimestamp64Milli(now64(3, 'UTC'))
)
ENGINE = MergeTree
PRIMARY KEY (midpoint_id)
SETTINGS index_granularity=8192;
2. ClickHouse on SQLalchemy
SQLalchemy is a Python library with functions that help reading data from SQL databases in a friendly Python coding style. In other words, SQLalchemy generates SQL queries through writing Python code. SQLalchemy helps utilizing the advantages of Python logic in SQL, especially loop.
SQLalchemy 2, as of February 2024, supports ClickHouse SQL through the clickhouse-sqlalchemy v0.3.0 library. This library is a dialect that translates SQLalchemy ‘language’ to the ‘language’ used by ClickHouse.
Using older versions of clickhouse-sqlalchemy with SQLalchemy 2 can lead to incorrect query generation
3. ClickHouse Data Types Problems
A. Datetime64
Problem: Datetime64 ClickHouse data type is not supported in SQLalchemy.
Solution: All datetime columns must be in unix time (e.g.: 1699328745321)
Example: trans_date_crea UInt64 DEFAULT toUnixTimestamp64Milli(now64(3, ‘UTC’))
B. String
Problem: Trino translates ClickHouse string to binary string by default. GraphQL does not support binary string data types.
ClickHouse connector — Trino 436 Documentation
ClickHouse vs Trino data type mapping
Solution: Change config to force Trino translate ClickHouse string to varchar:
clickhouse.map-string-as-varchar=true
Or GraphQL connects directly to ClickHouse.
4. Trino – ClickHouse No Table Modification
Trino does not support modifying (update, delete) ClickHouse tables.
When trying to modify the data, it will throw error about permission. Unfortunately, Trino document about ClickHouse Connector does not mention anything about modifying ClickHouse tables.
ClickHouse connector — Trino 436 Documentation
However, Trino still supports inserting data to ClickHouse tables.
5. Create ClickHouse Distributed Tables
Disclaimer: Hereby is what I know about ClickHouse Distributed Table. The knowledge might be shallow and inaccurate. There will be updates when I learn something new.
Distributed table allows flexible server resources scaling. Data can be saved in multiple nodes and all nodes can be accessed and perform queries at the same time, which will speed up queries.
Minimize image
Edit image
Delete image
Clickhouse Distributed and Replicated tables architecture
Requirement for creating distributed tables:
- ClickHouse Cluster must be the form that ClickHouse server is deployed in.
- ZooKeeper or ClickHouse Keeper must be deployed in ClickHouse server.
Steps of creating 1 distributed table:
Step 1: Create 1 replicated table.
Replicated table is the entity that stores data on physical storage. All CRUD (Create, Read, Update, and Delete) actions on data will happen in a replicated table.
CREATE OR REPLACE TABLE test_db.transactions_replicated ON CLUSTER 'default'
(
trans_id VARCHAR(69) DEFAULT generateUUIDv4(),
trans_date_crea UInt64 DEFAULT toUnixTimestamp64Milli(now64(3, 'UTC'))
)
Engine=ReplicatedMergeTree
PRIMARY KEY (trans_id)
SETTINGS index_granularity=8192
;
Difference from MergeTree tables:
- “ON CLUSTER” clause defines which cluster to save the table. Syntax: ON CLUSTER ‘<cluster name>’. Cluster names can be retrieved from query:
SELECT * FROM system.clusters
- “Engine=ReplicatedMergeTree” clause defines the table path in the Keeper. ReplicatedMergeTree usually contains arguments, e.g.: ReplicatedMergeTree(‘/clickhouse/tables/{any name or path}’, ‘{replica}’). However, if no arguments are defined, the default arguments will be assigned. As a result, most of the time these arguments should not be a problem, especially when the clusters are internally managed by you.
Step 2: Create 1 distributed table from a replicated table.
Distributed table is a consolidated “gateway” that allows queries to read data on all of the replicas of the table.
CREATE TABLE test_db.transactions ON CLUSTER 'default' AS test_db.transactions_replicated
Engine=Distributed('default', test_db, transactions_replicated, rand())
;
6. Query ClickHouse Distributed Tables
Replicated and Distributed tables will appear and be available for querying in ClickHouse, e.g.: transactions_replicated and transactions tables will appear in test_db schema.
When querying on the Distributed version of the table, the query will access all data of the replicas of the table.
When querying on the Replicated version of the table, the query will access randomly data of 1 replica of the table.
7. Mutate ClickHouse Distributed Tables
Inserting new rows to the table can be done directly through the Distributed version of the table.
Other mutation operations, such as UPDATE, DELETE, DROP, require performing though the Replicated version of the table.
To UPDATE/DELETE/DROP the table, must add a clause: ON CLUSTER ‘{cluster_name}’ (e.g.: ON CLUSTER ‘default’) to UPDATE/DELETE/DROP all replicas of the table. If not using the ON CLUSTER clause, only 1 random replica will be mutated.
IMPORTANT: DROP distributed or replicated table without clause “ON CLUSTER” will remove the table in the schema, but will not delete completely all replicas and the path of storing the table. The path is managed by the Keeper, which will block the path from being reused when recreating the table using the same SQL code. I still have not known how to delete the path in the Keeper yet.
ALTER TABLE {clickhouse_schema}.{replicated_table} ON CLUSTER 'default'
UPDATE {update_clause}
WHERE {where_clause}
More about Data replication and distributed queries: https://posthog.com/handbook/engineering/clickhouse/replication
8. ClickHouse on DBeaver
DBeaver (as v23.3.5) has not yet supported editing ClickHouse tables, including: change column data type, add new column, delete column, change column names….
DBeaver still allows CRUD ClickHouse tables.
Would you like to read more articles by Tekos’s Team? Everything’s here.