MySQL vs Postgres

Pros and Cons of MySQL over Postgres

MySQL

– more popular and widely used
– more learning resources
– uses threads instead of processes for connections
– better indexing: secondary indexes reference the primary key rather than physical row locations
– see Uber's blog post on why they switched from Postgres to MySQL
– has SHOW CREATE TABLE, which Postgres lacks (see the workaround sketch after this list)

Postgres

– materialized views (although they don't auto-refresh)
– multiple index types (B-Tree, Hash, BRIN, GIN, GiST)
– custom types
– arrays
– extensions (plugins)
– stored procedures can be written in multiple languages
– generally accepted to have more bells and whistles than MySQL
– the developer community seems friendlier
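For example, to get the DDL of a table out of Postgres, the closest equivalent I know of to MySQL's SHOW CREATE TABLE is pg_dump (a sketch; the table and database names are placeholders):

MySQL (from the mysql client):

SHOW CREATE TABLE mytable\G

Postgres (from the shell, since psql has no direct equivalent command):

pg_dump --schema-only --table=mytable mydb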

Performance-wise I think the two are close. There are popular tools like sysbench to compare performance, but I feel the test cases are a bit contrived and don't reflect real-world usage. Also, the winner might change from year to year, like the car rankings in Consumer Reports.

What do you think? Which one do you prefer? The elephant or the dolphin? Let me know.


Pros and Cons of Cloud SQL vs. self-managed MySQL or Postgres

If you want to run a MySQL or Postgres server in GCP you have two options: you can either use the managed service provided by Google or you can provision a VM and install MySQL or Postgres yourself (I call this self-managed). Here I summarize the pros and cons of Cloud SQL vs. self-managed instance of MySQL or Postgres:

Pros

– easy setup through the UI or the gcloud command line (see the example command after this list)
– automated backups and replication
– storage autoscales; you never have to worry about adding more storage
– built-in monitoring and dashboards

Cons

– Cloud SQL costs roughly twice as much as a self-managed instance
– there is no way to SSH into a Cloud SQL server, so you are limited in what you can do; the only access is through the mysql client or psql
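To illustrate the setup path, here is a minimal sketch of creating a MySQL instance with gcloud (the instance name, machine tier and region are placeholders, not from the original post):

gcloud sql instances create my-instance \
  --database-version=MYSQL_8_0 \
  --tier=db-custom-2-8192 \
  --region=us-central1 \
  --storage-auto-increase

The --storage-auto-increase flag is what gives you the "never worry about adding storage" behavior mentioned above.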

Let me know what you think and which option you prefer.


Swift vs. C# Performance

I am always a sucker for performance. Recently I migrated a .NET GDI+ app to Swift. There were several reasons for it:

  • GDI+ is no longer supported on macOS with .NET 7.0
  • I thought using the native macOS graphics libraries would be faster than running .NET on macOS
  • I just wanted to learn Swift and its graphics API

To my surprise, the C# code runs way faster than the Swift code. The difference is not because of GDI+ vs. Swift's CoreGraphics but because of more basic, non-graphics operations. When I profiled the code, the performance of GDI+ vs. CoreGraphics was the same; where the performance differed was in executing plain non-graphics code. I wasn't using any libraries at all in this code – just the basic statements that come with the programming language.

I kept digging further down the rabbit hole. What do we find? Let's create a 2D array of bools and initialize it to random values. Below is the code in C# and Swift.

C#

public static int test(int width, int height) {
    var rand = new Random();
    var mask = new bool[height][];
    for (int i = 0; i < mask.Length; i++) {
        mask[i] = new bool[width];
    }
    int count = 0;
    for (int row = 0; row < height; row++) {
        for (int col = 0; col < width; col++) {
            if (rand.Next(0, 2) == 1) {
                mask[row][col] = true;
                count++;
            }
        }
    }
    return count;
}

Swift

static func test(_ width: Int, _ height: Int) -> UInt {
    var mask = Array(repeating: Array(repeating: false, count: width), count: height)
    var count: UInt = 0
    for row in 0..<height {
        for col in 0..<width {
            if Bool.random() {
                mask[row][col] = true
                count += 1
            }
        }
    }
    return count
}

When I ran 100 iterations of the above code with width = height = 3840 on an M1 Mac mini with 8 GB RAM running macOS 13.1 (Ventura), here are the runtimes I got:

C# (.NET 6.0): 20.41 s (reported as 00:00:20.41)

Swift (version 5.7.2): 157.7 s

The difference is staggering. C# is almost 8x faster than Swift! Let me know what you think.
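One factor I have not ruled out (an assumption on my part, not something verified in the measurements above) is the random-number generator itself: Bool.random() draws from Swift's default SystemRandomNumberGenerator, while rand.Next(0, 2) in C# uses a fast, non-cryptographic PRNG. A quick way to test that hypothesis is to pass an explicit, cheap generator to Bool.random(using:); the SplitMix64 generator below is a stand-in I wrote for illustration, not part of the original benchmark:

struct SplitMix64: RandomNumberGenerator {
    var state: UInt64
    mutating func next() -> UInt64 {
        state &+= 0x9E3779B97F4A7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58476D1CE4E5B9
        z = (z ^ (z >> 27)) &* 0x94D049BB133111EB
        return z ^ (z >> 31)
    }
}

static func testFastRNG(_ width: Int, _ height: Int) -> UInt {
    var rng = SplitMix64(state: 0x1234_5678)
    var mask = Array(repeating: Array(repeating: false, count: width), count: height)
    var count: UInt = 0
    for row in 0..<height {
        for col in 0..<width {
            if Bool.random(using: &rng) {   // same loop as before, cheaper RNG
                mask[row][col] = true
                count += 1
            }
        }
    }
    return count
}

If the gap shrinks with this change, the RNG is the culprit; if not, my next suspect would be the nested Array-of-Array representation, which, unlike the C# jagged array, is a copy-on-write value type.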

For reference below is C# code to do 100 iterations:

Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();
int width = Int32.Parse(args[0]);
int height = Int32.Parse(args[1]);
int n = Int32.Parse(args[2]);
for (int i = 0; i < n; i++) {
    Array2D.test(width, height);
}
stopWatch.Stop();
// Get the elapsed time as a TimeSpan value.
TimeSpan ts = stopWatch.Elapsed;
Console.WriteLine("completed {0} iterations", n);
// Format and display the TimeSpan value.
string elapsedTime = String.Format("{0:00}:{1:00}:{2:00}.{3:00}",
    ts.Hours, ts.Minutes, ts.Seconds,
    ts.Milliseconds / 10);
Console.WriteLine("RunTime " + elapsedTime);

and corresponding Swift code:

let width = Int(CommandLine.arguments[1])!
let height = Int(CommandLine.arguments[2])!
let n = Int(CommandLine.arguments[3])!
print("running \(n) loops of \(width)x\(height) 2d array test...")
let clock = SuspendingClock()
let t1 = clock.now
for i in 1...n {
    // print("iteration \(i) of \(n)")
    test(width, height)
}
let t2 = clock.now
print("\(t2-t1)")

Deleting Entities in Bulk from Google Datastore

The easiest way to do this seems to be Dataflow. Here is a sample Dataflow job that deletes all entities of kind foo in namespace bar:

gcloud dataflow jobs run delete-all-entities \
 --gcs-location gs://dataflow-templates-us-central1/latest/Firestore_to_Firestore_Delete \
 --region us-central1 \
 --staging-location gs://my-gcs-location/temp \
 --project=my-gcp-project \
 --parameters firestoreReadGqlQuery='select __key__ from foo',firestoreReadNamespace='bar',firestoreReadProjectId=my-gcp-project,firestoreDeleteProjectId=my-gcp-project

As an example, a job to delete 44,951,022 entities with default autoscaling took 1 hr 39 min, or about 7,567 entities / sec. Here are the complete stats:

One of the problems with Dataflow is that Google does not report the cost of a job. However, you can estimate it yourself using the pricing information on their site. In this particular case, the job consumed the following resource metrics:

These metrics are staggering, by the way! How much do you think the job would have cost? Think about it. We can calculate it as:

>>> 270.165*0.056+1013.12*0.003557+6754.131*0.0000054
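which evaluates to roughly $18.77 for the whole job (15.13 + 3.60 + 0.04), assuming the three terms are vCPU-hours, memory GB-hours and storage GB-hours multiplied by their respective per-unit rates.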

Performance Benchmarking Various Databases

This post summarizes the results of performance benchmarking several databases. For our test, we use a dataset similar to the Star Schema Benchmark and evaluate the performance on the following queries:

  • what is the time to insert a line order item?
  • what is the time to fetch all the line orders for a customer?

We created a dataset with 44 M line orders. All databases (except Datastore, which is serverless) use 8 vCPU and 32 GB RAM. For MySQL this resulted in a table with data_length = 7,269,679,104 bytes. Below are the results:

Task | MySQL | AlloyDB | MongoDB | RocksDB | Google Datastore | BigTable
---- | ----- | ------- | ------- | ------- | ---------------- | --------
Load all 44 M rows | 30 min 35.524 s (24,491 rows/s) | 17 min 53.975 s (41,857 rows/s) | 10 min 5 s | 8 min 12 s | 140 min | 17 min 39 s
Time to insert new row | 5 ms | 5 ms | 1 ms | < 1 ms | 115 ms | 1000 ms
Time to fetch all line orders for a customer | 1-2 ms | 2 ms | 9 ms | 8 ms | 134 ms | 700 ms
How data was loaded | LOAD DATA LOCAL INFILE | \copy FROM ... | | | Node.js | 
How tests were done | MySQL CLI | AlloyDB CLI (Postgres) | MongoDB CLI | RocksJava | Node.js | Java
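For reference, the data_length figure quoted above can be read from information_schema; a minimal sketch (the schema and table names here are placeholders):

SELECT table_name, data_length, index_length
FROM information_schema.tables
WHERE table_schema = 'ssb' AND table_name = 'lineorder';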

Overall, MySQL, AlloyDB, RocksDB and MongoDB perform close to each other, and the numbers are too close to suggest one over the other. The results confirm the belief that B-Tree based databases are optimized to minimize read seek time, whereas Log-Structured Merge Tree (SSTable) databases are optimized for write throughput. The times for Datastore and BigTable are significantly higher, presumably because they include the overhead of a network call. However, a network call happens with MySQL as well when using the CLI (the MySQL client calls a remote MySQL server), and when we benchmarked MySQL with a Node.js application using the mysql2 driver, response times were practically the same as with the CLI.

For MySQL there is almost a 6x improvement (in time) if the records are inserted in PK order, not to mention the reduced data size (data_length). See this for example.

The results are not intended to suggest any database over another. There are many other factors to consider when making a decision.


Measuring the OLTP Performance of AlloyDB against Postgres using sysbench

AlloyDB is a new HTAP database that Google claims gives 4x better OLTP performance than Postgres. Here we describe a performance test comparing it to Postgres, as I wanted to see the claim for myself. TL;DR: for the tests we ran, the performance is essentially the same, with little to no difference. This is not to challenge Google's claim, since results vary depending on how the databases are configured, what data is stored, what queries are run, and so on.

The Setup

For our Postgres server we use a VM with 8 vCPU, 32 GB RAM and 250 GB SSD. This costs $447 / month. Refer to this.

For our AlloyDB server we use an instance with 4 vCPU, 32 GB RAM and 250 GB SSD. This costs 48.24*4 + 8.18*32 + 0.3*250 = $530 / month. Refer to this. Although an instance with 2 vCPU would have been closer in cost to $447, the GCloud console did not offer that configuration.

shared_buffers is an important setting; GCloud set it to 10704 MB for Postgres and 25690 MB for AlloyDB, and we did not customize these defaults.
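You can verify the setting from psql, for example:

sysbench=> SHOW shared_buffers;

which should report the values quoted above.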

We downloaded and installed sysbench 1.0.20 on Debian (log file). The sysbench test scripts are located under /usr/share/sysbench.

Preparing Test Table

A test table with 1B rows was created in both Postgres and AlloyDB using the following command as an example:

sysbench \
  --db-driver=pgsql \
  --table_size=1000000000 \
  --tables=1 \
  --threads=10 \
  --pgsql-host=xxx \
  --pgsql-port=5432 \
  --pgsql-user=sysbench \
  --pgsql-db=sysbench \
  --pgsql-password=xxx \
  oltp_common \
  prepare

The time to create this table was 638m53.148s for Postgres and 563m30.938s for AlloyDB – similar.

The table schema is simply:

sysbench=> \d sbtest1
                                Table "public.sbtest1"
 Column |      Type      | Collation | Nullable |               Default
--------+----------------+-----------+----------+-------------------------------------
 id     | integer        |           | not null | nextval('sbtest1_id_seq'::regclass)
 k      | integer        |           | not null | 0
 c      | character(120) |           | not null | ''::bpchar
 pad    | character(60)  |           | not null | ''::bpchar
Indexes:
    "sbtest1_pkey" PRIMARY KEY, btree (id)
    "k_1" btree (k)

Thus each row is 188 bytes in size, and 1B rows gives us roughly 188 GB of data. The actual size was measured using

sysbench=> SELECT ROUND(pg_total_relation_size (oid) / ( 1024.0 * 1024.0 ), 2) AS
       "Size in MB"
FROM   pg_class c
WHERE  relname = 'sbtest1';

and found to be 242,437.01 MB for Postgres and 242,650.06 MB for AlloyDB. They are practically the same.

A huge table was created to make sure the table size exceeds the available RAM; this pushes the databases into the red zone where the working set can no longer be served entirely from memory.

Running the tests

After this we ran all the test scripts under /usr/share/sysbench with 100 threads.

sysbench \
  --db-driver=pgsql \
  --threads=100 \
  --pgsql-host=xxx \
  --pgsql-port=5432 \
  --pgsql-user=sysbench \
  --pgsql-db=sysbench \
  --pgsql-password=xxx \
  $1 \
  run
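If the block above is saved as a script – say run_test.sh, a name used here just for illustration – then a single workload is run by passing the test name as the argument, e.g.:

./run_test.sh oltp_read_write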

This is what we found (tx/s = transactions per second):

Test | Postgres (tx/s) | AlloyDB (tx/s)
---- | --------------- | --------------
oltp_read_write | 6.63 | 9.16
oltp_delete | 2540.65 | 2532
oltp_insert | 2398 | 2417
oltp_point_select | 2547 | 2567
oltp_read_only | 152 | 160
oltp_update_index | 2551 | 2564
oltp_update_non_index | 2549 | 2566
oltp_write_only | 23 | 19
select_random_points | 2543 | 2561
select_random_ranges | 2513 | 2541

Conclusion

The results are practically the same, with little to no difference. What about MySQL? We did perform the same analysis on MySQL 8.0 (using a VM similar to the Postgres one). Here are the results:

Task | MySQL
---- | -----
create table with 1B rows | 672m18.342s
data_length | 207,806,791,680 bytes

Running the sysbench tests, however, gave the error Commands out of sync; you can't run this command now and prevented further testing. This is a bug in sysbench, fixed in this commit but not yet released at the time of this writing. I am actually surprised how people use sysbench to test MySQL when it's broken out of the box.

Further Reading


AlloyDB Test Drive

TL;DR:

  • AlloyDB can be very fast (comparable to any other state-of-the-art real-time analytics database), but only if all the columns a query depends on fit into the memory of its columnar engine.
  • This constraint does not scale: we can't keep increasing RAM to accommodate ever-growing datasets.

Background

Historically, two distinct database access patterns emerged – OLTP and OLAP – and databases were developed to cater to one or the other. OLTP databases would capture data from transactional systems (e.g., sales orders) and the data would be copied to an OLAP database or warehouse for analysis through a process known as ETL. Entire companies made (and continue to make) their livelihood developing ETL software, which is just a fancy term for copying data from one db to another. However, as data volumes grow ever larger, copying the data becomes expensive both in cost and in time. In some cases, by the time the data hits the OLAP database it is already stale (i.e., the OLAP database always lags behind the real-time OLTP systems). Also consider the emergence of user-facing analytical applications. E.g., Uber has a treasure trove of ride data. An application could be built that lets a user select a starting and ending point on a map, a radius around each point (1 km as an example), and a time period (e.g., the past month), and fetches all the rides satisfying those constraints, displaying the average time, the standard deviation, even a histogram – all in real time, on demand, using the most up-to-date data. It is not possible to pre-compute the results for all possible inputs a user may enter: the range of inputs is practically infinite.

How do we make this possible? To address the above problems and new use cases, a new breed of databases has emerged – so-called Hybrid Transactional and Analytical Processing, or HTAP, databases. These databases aim to unify OLTP and OLAP so that you don't have to maintain two systems and copy data back and forth, and can run both types of workloads from a single database. There are plenty of examples: Snowflake's Unistore, SingleStore, Amazon Aurora, MySQL HeatWave, TiDB, Citus, etc. This brings us to AlloyDB, which is Google's foray into this area.

All of these databases work by maintaining both an OLTP (row-based) store and an OLAP (columnar) store underneath, and automatically routing each query to whichever will be faster. The database also takes care of syncing the data between the two stores.

Performance Benchmarking of AlloyDB

The performance of AlloyDB was measured on the Star Schema Benchmark with 2B line order items (a 250 GB dataset). A cluster with 64 GB RAM (8 vCPUs) was provisioned and 32 GB of RAM was dedicated to the columnar engine. Because of the size limitation we were only able to load a few of the columns into the columnar engine. It actually took 6.5 hours to load the lineorder table from a CSV file (compare that to 30 min for ClickHouse). Creating the denormalized table (lineorder_flat) took another 10.5 hours, and loading the lo_quantity, lo_extendedprice, lo_discount and lo_orderdate columns into the CE (columnar engine) took the following times:

So it was not an easy process by any means. We partitioned the table by year using the following commands as an example:

CREATE TABLE lineorder
(
    row_id serial,
    LO_ORDERKEY             integer,
    LO_LINENUMBER           integer,
    LO_CUSTKEY              integer,
    LO_PARTKEY              integer,
    LO_SUPPKEY              integer,
    LO_ORDERDATE            Date,
    LO_ORDERPRIORITY        varchar(99),
    LO_SHIPPRIORITY         smallint,
    LO_QUANTITY             smallint,
    LO_EXTENDEDPRICE        integer,
    LO_ORDTOTALPRICE        integer,
    LO_DISCOUNT             smallint,
    LO_REVENUE              integer,
    LO_SUPPLYCOST           integer,
    LO_TAX                  smallint,
    LO_COMMITDATE           Date,
    LO_SHIPMODE             varchar(99),
    junk varchar(4),
    primary key (lo_orderdate, row_id)
) PARTITION BY RANGE (lo_orderdate);

CREATE TABLE lineorder_y1991 PARTITION OF lineorder FOR VALUES FROM ('1991-01-01') TO ('1992-01-01');

The command to load a column is (illustrated with an example):

SELECT google_columnar_engine_add('lineorder_flat_y1994', 'lo_orderdate');

Once the columns are loaded, we can observe the space they take as follows:

And below are the execution times on the SSB queries before and after we added the columns. The execution time of ClickHouse is shown for comparison:

Below is the query plan comparing how a query is executed when not using the CE (left) and when AlloyDB is able to use the CE (right):

Conclusion

The lesson learned here is that, yes, AlloyDB can be very fast (about 6,000x faster on query 1.2, for example: 784,337 ms vs. 127 ms of actual execution time), but only if all the columns a query depends on fit into the columnar engine's RAM (it doesn't matter whether the column appears in the WHERE clause, the SELECT clause or any other clause) – and this is not a realistic constraint.

Other Notes

PostgreSQL provides extract(year from ...) to get the year of a date column. However, when using extract(year from ...) as below:

SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM lineorder_flat
WHERE extract(year from LO_ORDERDATE) = 1993 AND LO_DISCOUNT BETWEEN 1 AND 3 AND LO_QUANTITY < 25;

PostgreSQL (and thus AlloyDB) was still searching all the partitions, because wrapping the partition key in a function prevents partition pruning. We have to rewrite the query as:

SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM lineorder_flat
WHERE LO_ORDERDATE >= '1993-01-01' and lo_orderdate <= '1993-12-31' AND LO_DISCOUNT BETWEEN 1 AND 3 AND LO_QUANTITY < 25;

for it to search only within the 1993 partition.

Similarly, query 1.3 takes a long time even though all the columns it depends on are in the CE. This is because of the extract(week from LO_ORDERDATE) predicate in the query:

SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM lineorder_flat
WHERE extract(week from LO_ORDERDATE) = 6 AND LO_ORDERDATE >= '1994-01-01' and lo_orderdate <= '1994-12-31'
  AND LO_DISCOUNT BETWEEN 5 AND 7 AND LO_QUANTITY BETWEEN 26 AND 35;

Below are the query plans before adding columns to CE and after:

We see on the right that although it does a Parallel Custom Scan (columnar scan), the Columnar cache search mode is row store, as opposed to native as in queries 1.1 and 1.2. In the end, the column store is not able to make the query faster.

The trick to make the query run fast is to rewrite it as:

SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM lineorder_flat
WHERE LO_ORDERDATE >= '1994-02-07' and lo_orderdate <= '1994-02-13'
  AND LO_DISCOUNT BETWEEN 5 AND 7 AND LO_QUANTITY BETWEEN 26 AND 35;

This query executes in just 108 ms with following query plan:

                                                                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=4233.50..4233.51 rows=1 width=8) (actual time=98.859..102.062 rows=1 loops=1)
   ->  Gather  (cost=4233.28..4233.49 rows=2 width=8) (actual time=98.559..102.048 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=3233.28..3233.29 rows=1 width=8) (actual time=90.280..90.282 rows=1 loops=3)
               ->  Parallel Append  (cost=0.00..2602.86 rows=126084 width=6) (actual time=0.895..90.266 rows=1 loops=3)
                     ->  Parallel Custom Scan (columnar scan) on lineorder_flat_y1994 lineorder_flat  (cost=20.00..2598.85 rows=126083 width=6) (actual time=0.894..90.258 rows=105834 loops=3)
                           Filter: ((lo_orderdate >= '1994-02-07'::date) AND (lo_orderdate <= '1994-02-13'::date) AND (lo_discount >= 5) AND (lo_discount <= 7) AND (lo_quantity >= 26) AND (lo_quantity <= 35))
                           Rows Removed by Columnar Filter: 100926693
                           Rows Aggregated by Columnar Scan: 37446
                           Columnar cache search mode: native
                     ->  Parallel Seq Scan on lineorder_flat_y1994 lineorder_flat  (cost=0.00..4.01 rows=1 width=6) (never executed)
                           Filter: ((lo_orderdate >= '1994-02-07'::date) AND (lo_orderdate <= '1994-02-13'::date) AND (lo_discount >= 5) AND (lo_discount <= 7) AND (lo_quantity >= 26) AND (lo_quantity <= 35))
 Planning Time: 25.348 ms
 Execution Time: 108.731 ms
(15 rows)

And finally…

The application described in the Background section was actually developed at Uber and is called Uber Movement. When it was developed in 2016, HTAP databases had not yet appeared, so what we did was discretize a city into zones with fixed boundaries. Users could not select arbitrary start and end points on the map; they could only select start and end zones. This reduced the practically infinite range of inputs to a finite number, and we then precomputed aggregates to serve the analytics to the user from the web interface. Computing the aggregates in real time, on demand, was infeasible with the tools we had at the time. You can read more about the whole process in this whitepaper.


MySQL – 10 Performance Tips

  1. Always create your own primary key. [ref]. The PK controls how your data is physically laid out.
  2. Always try to insert records in PK order [ref]. Usually this means using AUTO_INCREMENT for the PK; a UUID is a very poor choice. A commonly encountered situation is when you are ingesting data from elsewhere and the data already uses a UUID to uniquely identify records. In this case you are stuck between a rock and a hard place.
    – Option 1: Use AUTO_INCREMENT for the PK. Your writes will be fast, but you lose the data-integrity check and you won't be able to look up records by UUID efficiently (reads are slow).
    – Option 2: Use the UUID field as the PK. You are guaranteed data integrity (no duplicates), but your writes will suffer and the index will bloat significantly.
    – Option 3: Use AUTO_INCREMENT for the PK and CREATE UNIQUE INDEX on the UUID column (see the schema sketch after this list). I suspect this gives the same performance as Option 2, or worse.
    – Option 4 (not always possible): Use Option 3 but disable the index while doing inserts [ref]. This only works when you can do inserts in bulk; it is not possible if you need to insert continuously, as is the case in a lot of applications.
  3. For high-performance applications, don’t use FKs. Rely on your application to maintain data-integrity.
  4. For high performance, avoid SELECT ... FOR UPDATE and instead rely on application-level optimistic concurrency control using a version column (see the sketch after this list).
  5. The single most important setting to get right is innodb_buffer_pool_size. Set it large enough that your indexes fit in memory, but keep it below about 75% of RAM – leave some room for the OS, connections, etc. Do some research on SO etc.
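To make tips 2 and 4 concrete, here are two small sketches; the table and column names are made up for illustration. First, Option 3 from tip 2 – a surrogate AUTO_INCREMENT primary key so rows arrive in PK order, with a UNIQUE index enforcing the externally supplied UUID:

CREATE TABLE events (
    row_id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    event_id CHAR(36)        NOT NULL,   -- UUID from the upstream system
    payload  JSON            NULL,
    PRIMARY KEY (row_id),
    UNIQUE KEY uk_event_id (event_id)
) ENGINE=InnoDB;

And a sketch of tip 4, optimistic concurrency control with a version column instead of SELECT ... FOR UPDATE:

-- Read the row without locking it, remember its version (say 7), then:
UPDATE aggregates
SET    total_dollar_amount = total_dollar_amount + 100.00,
       version             = version + 1
WHERE  row_id = 42 AND version = 7;
-- If 0 rows were updated, another writer got there first: re-read and retry.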

Measuring performance of BigQuery vs. Clickhouse

This post summarizes the results of a performance test comparing BigQuery and Clickhouse – two high-performance analytics databases.

Methodology

The Star Schema Benchmark was used for the test. The dbgen tool (commit 0741e06d4c3e811bcec233378a39db2fc0be5d79) was run with a scale factor (SF) of 333 to generate the dataset. Here are the details:

Table | # of rows
----- | ---------
customer.tbl | 9,990,000
supplier.tbl | 666,000
part.tbl | 1,800,000
date.tbl | 2,556
lineorder.tbl | 1,997,997,687

All in all, this generated about 225 GB of data. Clickhouse loaded all 2B rows in an impressive 30 minutes.

$ time clickhouse-client --password --query "INSERT INTO ssb.lineorder FORMAT CSV" < /app/ssb-data/lineorder.tbl
Password for user (default):

real	29m56.227s
user	94m10.616s
sys	5m7.183s

After this, the queries described at [1] were run on BigQuery and Clickhouse. In the case of BigQuery, the SQL needs to be adapted as follows:

Clickhouse | BigQuery
---------- | --------
PARTITION BY toYear(LO_ORDERDATE) | PARTITION BY DATE_TRUNC(LO_ORDERDATE, YEAR)
ORDER BY (LO_ORDERDATE, LO_ORDERKEY) | CLUSTER BY LO_ORDERDATE, LO_ORDERKEY
toYear(LO_ORDERDATE) = 1993 | EXTRACT(YEAR FROM LO_ORDERDATE) = 1993
toISOWeek(LO_ORDERDATE) = 6 | EXTRACT(ISOWEEK FROM LO_ORDERDATE) = 6
toYYYYMM(LO_ORDERDATE) = 199712 | EXTRACT(YEAR FROM LO_ORDERDATE) = 1997 AND EXTRACT(MONTH FROM LO_ORDERDATE) = 12
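As a concrete illustration of the mapping above, the partitioned and clustered lineorder_flat table in BigQuery could be declared roughly as follows (the column list is abbreviated and the dataset name is assumed):

CREATE TABLE ssb.lineorder_flat
(
    LO_ORDERDATE     DATE,
    LO_ORDERKEY      INT64,
    LO_QUANTITY      INT64,
    LO_DISCOUNT      INT64,
    LO_EXTENDEDPRICE INT64
    -- ... remaining columns omitted
)
PARTITION BY DATE_TRUNC(LO_ORDERDATE, YEAR)
CLUSTER BY LO_ORDERDATE, LO_ORDERKEY;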

The first thing that impressed me is that the raw CSV lineorder table measured 239,234,458,925 bytes, but it shrank to 57 GB when ingested into Clickhouse and 48 GB in the case of BigQuery.

Lineorder table in BigQuery

The denormalized lineorder_flat table has the following stats in BQ:

In case of Clickhouse, running this script gave us:

┌─parts.database─┬─parts.table─────────────┬───────rows─┬─latest_modification─┬─disk_size──┬─primary_keys_size─┬─engine────┬───bytes_size─┬─compressed_size─┬─uncompressed_size─┬────────────────ratio─┐
│ ssb            │ lineorder_flat          │ 1997997687 │ 2022-10-15 01:13:59 │ 178.38 GiB │ 1.40 MiB          │ MergeTree │ 191538385255 │ 178.09 GiB      │ 323.22 GiB        │   0.5510020049576446 │
│ ssb            │ lineorder               │ 1997997687 │ 2022-10-17 05:22:17 │ 55.59 GiB  │ 1.40 MiB          │ MergeTree │  59685552796 │ 55.47 GiB       │ 80.01 GiB         │    0.693277565266494 │
│ ssb            │ customer                │    9990000 │ 2022-10-14 02:26:59 │ 382.37 MiB │ 4.79 KiB          │ MergeTree │    400943880 │ 382.03 MiB      │ 562.22 MiB        │   0.6795006124121903 │
│ system         │ trace_log               │    3652777 │ 2022-10-17 17:46:08 │ 58.67 MiB  │ 2.69 KiB          │ MergeTree │     61520620 │ 58.15 MiB       │ 1.06 GiB          │  0.05349937527433843 │
│ ssb            │ part                    │    1800000 │ 2022-10-14 02:27:57 │ 31.03 MiB  │ 892.00 B          │ MergeTree │     32541446 │ 30.96 MiB       │ 44.07 MiB         │   0.7024153306907351 │
│ system         │ asynchronous_metric_log │   78135224 │ 2022-10-17 17:46:13 │ 28.09 MiB  │ 95.36 KiB         │ MergeTree │     29458819 │ 26.76 MiB       │ 1.10 GiB          │  0.02382792149791971 │
│ system         │ metric_log              │     295379 │ 2022-10-17 17:46:09 │ 27.12 MiB  │ 696.00 B          │ MergeTree │     28442229 │ 25.86 MiB       │ 1.02 GiB          │ 0.024746974449290835 │
│ ssb            │ supplier                │     666000 │ 2022-10-14 02:28:33 │ 25.10 MiB  │ 332.00 B          │ MergeTree │     26323607 │ 25.08 MiB       │ 36.85 MiB         │   0.6807884856313037 │
│ system         │ part_log                │      35368 │ 2022-10-17 06:14:37 │ 1.36 MiB   │ 48.00 B           │ MergeTree │      1425397 │ 0.00 B          │ 0.00 B            │                  nan │
│ system         │ query_log               │        110 │ 2022-10-17 17:45:46 │ 39.87 KiB  │ 24.00 B           │ MergeTree │        40827 │ 0.00 B          │ 0.00 B            │                  nan │
└────────────────┴─────────────────────────┴────────────┴─────────────────────┴────────────┴───────────────────┴───────────┴──────────────┴─────────────────┴───────────────────┴──────────────────────┘

This is commendable and a result of column compression done by both databases.

Results

Without much ado, here is the time taken by both databases on the 13 queries in the Star Schema Benchmark.

Query | BigQuery (s) | ClickHouse (s)
----- | ------------ | --------------
Q1.1 | 1.315 | 5.991
Q1.2 | 1.251 | 0.460
Q1.3 | 1.637 | 0.108
Q2.1 | 2.507 | 33.767
Q2.2 | 2.386 | 9.579
Q2.3 | 1.728 | 8.482
Q3.1 | 2.367 | 15.779
Q3.2 | 3.421 | 3.690
Q3.3 | 4.056 | 9.466
Q3.4 | 1.777 | 0.158
Q4.1 | 2.479 | 25.276
Q4.2 | 2.094 | 4.945
Q4.3 | 2.213 | 3.874
Time taken by BigQuery and ClickHouse to execute queries in the Star Schema Benchmark
Execution time of BigQuery vs. Clickhouse

We can easily see that BigQuery is faster. I suspect this is because Clickhouse was run on a single node (8 vCPUs and 32 GB RAM). For example, here is Q2.1:

SELECT
    sum(LO_REVENUE) as revenue,
    extract(Year from LO_ORDERDATE) AS year,
    P_BRAND
FROM lineorder_flat
WHERE P_CATEGORY = "MFGR#12" AND S_REGION = "AMERICA"
GROUP BY
    year,
    P_BRAND
ORDER BY
    year,
    P_BRAND

When its job details were inspected, BigQuery showed 3 stages:

Drilling further, there were 3847 parallel inputs in Stage 1 (S00). Both BigQuery and Clickhouse did a full table scan, but BigQuery massively parallelized the work across 3847 workers (slots). It's hard to beat this. The total slot milliseconds for BigQuery came in at 311,716 ms.

It is also interesting to compare the execution details of Q1.3, where Clickhouse beats BigQuery (0.108 s for Clickhouse vs 1.6 s for BQ). The query itself is:

SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM lineorder_flat
WHERE (toISOWeek(LO_ORDERDATE) = 6) AND (toYear(LO_ORDERDATE) = 1994) AND ((LO_DISCOUNT >= 5) AND (LO_DISCOUNT <= 7)) AND ((LO_QUANTITY >= 26) AND (LO_QUANTITY <= 35))

Clickhouse was blazingly fast on this query and was seen to process only 5.89 million rows, whereas BigQuery for some reason processed 303,097,581 rows across 587 workers (which was also the total number of rows in the partition containing the year 1994, and about 50 times more than 5.89M; recall there are 52 weeks in a year).

It is instructive to compare the number of rows processed by the two. This is shown in the following graph:

It can be seen that the cases where Clickhouse is faster (Q1.2, Q1.3 and Q3.4) can be attributed to scanning a much smaller subset of the data. In all these cases the query had a WHERE filter that was more fine-grained than just the year: Q1.2 has a predicate on month, Q1.3 on ISO week, and Q3.4 on month again.

Conclusion

This brings us to the conclusion. Although we can see that BigQuery is faster, this post is not meant to endorse BigQuery over Clickhouse. Clickhouse was run on a single node – who knows what the numbers would be if it were run on a cluster? But the question is: how do you decide how big a cluster to provision? And once you have provisioned the cluster, it's static – you can't resize it, or maybe you can, but it won't be trivial and it's not something you would do often. This is where BigQuery really shines, and it is the most important lesson learned. It dynamically auto-scales the compute in response to the query. It might not do indexing as cleverly (refer to [2] and [3]), but it's hard to beat the massive parallelization it does behind the scenes. It can pretty much brute-force any query and still come out on top.


Java vs. Node.js: Performance comparison of blocking vs. non-blocking I/O

There are many online articles touting the advantages of non-blocking I/O over blocking I/O, but I wanted to see the difference for myself. To do that, I developed the same application two ways: once using Spring Web MVC + JPA, which uses blocking I/O, and once using Node.js, which uses non-blocking I/O. A MySQL database was used in both cases. For Node.js I used the Sequelize ORM; Spring JPA uses the Hibernate ORM. I wanted the application to be complex enough to mimic a real-world scenario so the test results would be meaningful. Thus our application makes use of transactions, locking, and SELECT, INSERT as well as UPDATE statements. In fact, I also made use of protocol buffers, and there is some message decoding that happens in the application before it can process a message.

Problem Description

We consider a fictitious company selling some sort of product, and what we want to do is compute the half-yearly sales of each product. E.g., imagine Toyota calculating how many Corollas were sold in H1 2022 (H1 stands for the first half of the year), or Microsoft selling Office subscriptions. For this we create 2 tables – one stores each booking or sale, and another stores the aggregate sales divided over the different fiscal periods. The details of the tables are given below but are not important. What is important is that we execute an INSERT on the bookings table for each new sale and then INSERT or UPDATE rows in the aggregates table. Further, the operations need to be wrapped in a transaction, and we also have to lock rows in the aggregates table using SELECT ... FOR UPDATE before updating them. So you can see we are not developing a trivial or toy application.

Bookings Table

+--------------+---------------+------+-----+-------------------+-------------------+
| Field        | Type          | Null | Key | Default           | Extra             |
+--------------+---------------+------+-----+-------------------+-------------------+
| row_id       | int           | NO   | PRI | NULL              | auto_increment    |
| booking_id   | varchar(255)  | NO   |     | NULL              |                   |
| created_at   | timestamp     | NO   |     | CURRENT_TIMESTAMP | DEFAULT_GENERATED |
| dollar_value | decimal(19,2) | NO   |     | NULL              |                   |
| end_date     | date          | NO   |     | NULL              |                   |
| origin_ts    | timestamp     | NO   |     | NULL              |                   |
| parent_id    | varchar(255)  | YES  |     | NULL              |                   |
| trim         | varchar(255)  | NO   |     | NULL              |                   |
| model        | varchar(255)  | NO   |     | NULL              |                   |
| stage        | varchar(255)  | YES  |     | NULL              |                   |
| start_date   | date          | NO   |     | NULL              |                   |
| type         | varchar(255)  | YES  |     | NULL              |                   |
+--------------+---------------+------+-----+-------------------+-------------------+
12 rows in set (0.00 sec)

Aggregates Table

+---------------------+---------------+------+-----+-------------------+-------------------+
| Field               | Type          | Null | Key | Default           | Extra             |
+---------------------+---------------+------+-----+-------------------+-------------------+
| row_id              | int           | NO   | PRI | NULL              | auto_increment    |
| version             | int           | NO   |     | NULL              |                   |
| last_updated        | timestamp     | NO   |     | CURRENT_TIMESTAMP | DEFAULT_GENERATED |
| model               | varchar(255)  | NO   |     | NULL              |                   |
| trim                | varchar(255)  | NO   | MUL | NULL              |                   |
| total_dollar_amount | decimal(19,2) | NO   |     | NULL              |                   |
| fiscal_period_id    | int           | NO   | MUL | NULL              |                   |
+---------------------+---------------+------+-----+-------------------+-------------------+
7 rows in set (0.00 sec)

I actually make use of a 3rd table to store data about the fiscal periods, and fiscal_period_id is an FK to a row in that table. So the data model – while not particularly rich – is hopefully complex enough for a perf test to be meaningful and mimic a real-world scenario.
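For clarity, here is a hedged sketch of the per-request statement sequence described above, expressed as raw SQL (the values are made up; the real applications issue these statements through JPA/Hibernate and Sequelize respectively):

START TRANSACTION;

-- record the individual sale
INSERT INTO bookings (booking_id, dollar_value, start_date, end_date, origin_ts, trim, model)
VALUES ('b-123', 100.00, '2022-01-01', '2022-06-30', NOW(), 'LE', 'Corolla');

-- lock the matching aggregate row (if any) before touching it
SELECT row_id, total_dollar_amount, version
FROM aggregates
WHERE model = 'Corolla' AND trim = 'LE' AND fiscal_period_id = 1
FOR UPDATE;

-- if a row came back, bump the aggregate; otherwise INSERT a new one
UPDATE aggregates
SET total_dollar_amount = total_dollar_amount + 100.00,
    version = version + 1
WHERE model = 'Corolla' AND trim = 'LE' AND fiscal_period_id = 1;

COMMIT;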

Deployment and Test Setup

Both the Java and the Node.js apps were deployed on GCP using Cloud Run. Both ran with 4 CPU cores and 4 GB RAM (--cpu=4 and --memory=4Gi). For the Java app I set the number of app instances to 1, whereas for Node.js I used 4 app instances to cover all 4 available CPUs (Node.js runs application code on a single thread per instance); the Java app uses threads to max out the available CPU.
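For reference, the deployment settings above correspond to something like the following gcloud invocation (the service, image and region names are placeholders; for the Node.js service the instance limits would be 4 instead of 1):

gcloud run deploy sales-aggregator \
  --image=gcr.io/my-project/sales-aggregator \
  --cpu=4 \
  --memory=4Gi \
  --min-instances=1 \
  --max-instances=1 \
  --region=us-central1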

For testing I used JMeter, configured to mimic 500 concurrent users (500 threads) making requests as fast as they can. In total, 50,000 requests were made (100 requests per user).

And what do we get?

Java Spring WebMVC + JPA (Blocking I/O)

The results were quite impressive. The application was able to handle the load of 500 concurrent threads and not a single request failed. It blew away my expectations. Overall I got:

582 requests per second with 831ms median response time

Node.js with Express, MySQL2 and Sequelize

842 requests per second with 432ms median response time

29 requests out of the 50,000 failed due to request timeouts.

What do I make of it?

Node.js performed better, but the difference is not what I would call staggering.

Node.js gives roughly 45% higher throughput than Java (842 vs. 582 requests per second) and confirms the rule of thumb – for I/O-bound applications Node.js will perform better with its non-blocking I/O, whereas for CPU-bound applications Java will perform better. The result is also in line with another study described here (the post that motivated me to do my own test), where it was found that Spring WebFlux + R2DBC gave 30% higher throughput than Spring Web MVC + JDBC at high concurrency. That test was conducted against a PostgreSQL database, by the way – the database used and its driver can make a huge difference in the results. I feel the roughly 45% increase we observed is significant for a mission-critical application. It might not be overwhelming enough to warrant rewriting an existing application, but it is significant enough to tilt the balance in favor of Node.js when developing a greenfield application. Of course, there may be other things you want to consider, such as library and data-structure support (Node.js comes with only two built-in collection types – arrays and dictionaries), while performing a full evaluation.

What about non-blocking I/O in Java?

Java is currently trying to add support for asynchronous, non-blocking calls to MySQL. The effort is part of a bigger movement toward reactive programming. From spring.io/reactive:

Reactive systems have certain characteristics that make them ideal for low-latency, high-throughput workloads. Project Reactor and the Spring portfolio work together to enable developers to build enterprise-grade reactive systems that are responsive, resilient, elastic, and message-driven.

There are two libraries in Java for making async calls to MySQL: jasync-r2dbc-mysql and r2dbc-mysql. A third option is vertx-mysql-client, but it is not based on R2DBC and does not integrate with Spring. I tried using all three and ran into issues with each. The r2dbc-mysql project is in fact not being actively worked on and hasn't even matured to a 1.0 release. The other library is also not used extensively in production (at least that's what it seems like to me) and is developed by an independent developer without the backing of a large enterprise; new code seems to be released without much testing. There were cases where a bug was reported and the developer quickly made a release with the bugfix, but introduced another bug in the process. In fact, the r2dbc-spi (the Service Provider Interface that defines the API drivers have to implement) is itself very recent and only reached the 1.0.0 milestone on Apr 25 of this year (2022). And this is not the end: the Java community is currently working on Project Loom, which will bring virtual threads (lightweight user-mode threads managed by the JVM, not the OS) to Java – analogous to goroutines in Go – and blur the distinction between synchronous (JDBC) and asynchronous (R2DBC) code. So the R2DBC technology of today may eventually be supplanted by another API and the programming model might change again. R2DBC might not become JDBC's successor after all. With vertx-mysql-client I ran into this issue.

To me, what this means is that for asynchronous and non-blocking I/O we are better off with a technology like Node.js.

For further reading, refer to this related article, which has a very similar setup except that it uses non-blocking I/O on the Java side and MongoDB as the database. It reports identical performance between Java and Node.js when both use non-blocking I/O with MongoDB.

Also see this website for spring-jpa vs. node.js benchmarks:

Their database is PostgreSQL in the case of spring-jpa and MySQL in the case of Node.js. The site contains many more benchmarks.
