Use JSON where appropriate
ClickHouse now offers a native JSON column type designed for semi-structured and dynamic data. It's important to clarify that this is a column type, not a data format—you can insert JSON into ClickHouse as a string or via supported formats like JSONEachRow, but that does not imply using the JSON column type. Users should only use the JSON type when the structure of their data is dynamic, not when they simply happen to store JSON.
When to use the JSON type
Use the JSON type when your data:
- Has unpredictable keys that can change over time.
- Contains values with varying types (e.g., a path might sometimes contain a string, sometimes a number).
- Requires schema flexibility where strict typing isn't viable.
If your data structure is known and consistent, there is rarely a need for the JSON type, even if your data is in JSON format. Specifically, if your data has:
- A flat structure with known keys: use standard column types e.g. String.
- Predictable nesting: use Tuple, Array, or Nested types for these structures.
- Predictable structure with varying types: consider Dynamic or Variant types instead.
You can also mix approaches - for example, use static columns for predictable top-level fields and a single JSON column for a dynamic section of the payload.
Considerations and tips for using JSON
The JSON type enables efficient columnar storage by flattening paths into subcolumns. But with flexibility comes responsibility. To use it effectively:
- Specify path types using hints in the column definition to specify types for known sub columns, avoiding unnecessary type inference.
- Skip paths if you don't need the values, with SKIP and SKIP REGEXP to reduce storage and improve performance.
- Avoid setting max_dynamic_pathstoo high - large values increase resource consumption and reduce efficiency. As a rule of thumb, keep it below 10,000.
Type hits offer more than just a way to avoid unnecessary type inference - they eliminate storage and processing indirection entirely. JSON paths with type hints are always stored just like traditional columns, bypassing the need for discriminator columns or dynamic resolution during query time. This means that with well-defined type hints, nested JSON fields achieve the same performance and efficiency as if they were modeled as top-level fields from the outset. As a result, for datasets that are mostly consistent but still benefit from the flexibility of JSON, type hints provide a convenient way to preserve performance without needing to restructure your schema or ingest pipeline.
Advanced features
- JSON columns can be used in primary keys like any other columns. Codecs cannot be specified for a sub-column.
- They support introspection via functions like JSONAllPathsWithTypes()andJSONDynamicPaths().
- You can read nested sub-objects using the .^syntax.
- Query syntax may differ from standard SQL and may require special casting or operators for nested fields.
For additional guidance, see ClickHouse JSON documentation or explore our blog post A New Powerful JSON Data Type for ClickHouse.
Examples
Consider the following JSON sample, representing a row from the Python PyPI dataset:
Lets assume this schema is static and the types can be well defined. Even if the data is in NDJSON format (JSON row per line), there is no need to use the JSON type for such a schema. Simply define the schema with classic types.
and insert JSON rows:
Consider the arXiv dataset containing 2.5m scholarly papers. Each row in this dataset, distributed as NDJSON, represents a published academic paper. An example row is shown below:
While the JSON here is complex, with nested structures, it is predictable. The number and type of the fields will not change. While we could use the JSON type for this example, we can also just define the structure explicitly using Tuples and Nested types:
Again we can insert the data as JSON:
Suppose another column called tags is added. If this was simply a list of strings we could model as an Array(String), but let's assume users can add arbitrary tag structures with mixed types (notice score is a string or integer). Our modified JSON document:
In this case, we could model the arXiv documents as either all JSON or simply add a JSON tags column. We provide both examples below:
We provide a type hint for the update_date column in the JSON definition, as we use it in the ordering/primary key. This helps ClickHouse to know that this column won't be null and ensures it knows which update_date sub-column to use (there may be multiple for each type, so this is ambiguous otherwise).
We can insert into this table and view the subsequently inferred schema using the JSONAllPathsWithTypes function and PrettyJSONEachRow output format:
Alternatively, we could model this using our earlier schema and a JSON tags column. This is generally preferred, minimizing the inference required by ClickHouse:
We can now infer the types of the sub column tags.
