New York Taxi Data
The New York taxi data sample consists of 3+ billion taxi and for-hire vehicle (Uber, Lyft, etc.) trips originating in New York City since 2009. This getting started guide uses a 3m row sample.
The full dataset can be obtained in a couple of ways:
- insert the data directly into ClickHouse Cloud from S3 or GCS
- download prepared partitions
- Alternatively users can query the full dataset in our demo environment at sql.clickhouse.com.
The example queries below were executed on a Production instance of ClickHouse Cloud. For more information see "Playground specifications".
Create the table trips
Start by creating a table for the taxi rides:
Load the Data directly from Object Storage
Users' can grab a small subset of the data (3 million rows) for getting familiar with it. The data is in TSV files in object storage, which is easily streamed into
ClickHouse Cloud using the s3 table function.
The same data is stored in both S3 and GCS; choose either tab.
- S3
- GCS
The following command streams three files from an S3 bucket into the trips_small table (the {0..2} syntax is a wildcard for the values 0, 1, and 2):
The following command streams three files from a GCS bucket into the trips table (the {0..2} syntax is a wildcard for the values 0, 1, and 2):
Sample Queries
The following queries are executed on the sample described above. Users can run the sample queries on the full dataset in sql.clickhouse.com, modifying the queries below to use the table nyc_taxi.trips.
Let's see how many rows were inserted:
Each TSV file has about 1M rows, and the three files have 3,000,317 rows. Let's look at a few rows:
Notice there are columns for the pickup and dropoff dates, geo coordinates, fare details, New York neighborhoods, and more.
Let's run a few queries. This query shows us the top 10 neighborhoods that have the most frequent pickups:
This query shows the average fare based on the number of passengers:
Here's a correlation between the number of passengers and the distance of the trip:
Download of Prepared Partitions
The following steps provide information about the original dataset, and a method for loading prepared partitions into a self-managed ClickHouse server environment.
See https://github.com/toddwschneider/nyc-taxi-data and http://tech.marksblogg.com/billion-nyc-taxi-rides-redshift.html for the description of a dataset and instructions for downloading.
Downloading will result in about 227 GB of uncompressed data in CSV files. The download takes about an hour over a 1 Gbit connection (parallel downloading from s3.amazonaws.com recovers at least half of a 1 Gbit channel). Some of the files might not download fully. Check the file sizes and re-download any that seem doubtful.
If you will run the queries described below, you have to use the full table name, datasets.trips_mergetree.
Results on Single Server
Q1:
0.490 seconds.
Q2:
1.224 seconds.
Q3:
2.104 seconds.
Q4:
3.593 seconds.
The following server was used:
Two Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz, 16 physical cores total, 128 GiB RAM, 8x6 TB HD on hardware RAID-5
Execution time is the best of three runs. But starting from the second run, queries read data from the file system cache. No further caching occurs: the data is read out and processed in each run.
Creating a table on three servers:
On each server:
On the source server:
The following query redistributes data:
This takes 2454 seconds.
On three servers:
Q1: 0.212 seconds. Q2: 0.438 seconds. Q3: 0.733 seconds. Q4: 1.241 seconds.
No surprises here, since the queries are scaled linearly.
We also have the results from a cluster of 140 servers:
Q1: 0.028 sec. Q2: 0.043 sec. Q3: 0.051 sec. Q4: 0.072 sec.
In this case, the query processing time is determined above all by network latency. We ran queries using a client located in a different datacenter than where the cluster was located, which added about 20 ms of latency.
Summary
| servers | Q1 | Q2 | Q3 | Q4 | 
|---|---|---|---|---|
| 1, E5-2650v2 | 0.490 | 1.224 | 2.104 | 3.593 | 
| 3, E5-2650v2 | 0.212 | 0.438 | 0.733 | 1.241 | 
| 1, AWS c5n.4xlarge | 0.249 | 1.279 | 1.738 | 3.527 | 
| 1, AWS c5n.9xlarge | 0.130 | 0.584 | 0.777 | 1.811 | 
| 3, AWS c5n.9xlarge | 0.057 | 0.231 | 0.285 | 0.641 | 
| 140, E5-2650v2 | 0.028 | 0.043 | 0.051 | 0.072 | 
