Not Your Dad’s Database » Linux Magazine

0
7


PostgreSQL, an open source object-relational database management system known for its reliability and extensibility, offers a robust feature set. If you are new to PostgreSQL, we help you get started with some of its most useful features.

If you tracked the development of PostgreSQL in the ’90s, you may remember it as a slow, resource hungry open source database later surpassed in popularity by the faster and easier to use MySQL. If that’s the case, you might be surprised by PostgreSQL 16. While the name is the same, along with the similar syntax of familiar tools such as psql or pg_dump, the internal architecture has been vastly reworked. At the end of the day, it is basically a completely different database.

PostgreSQL is a mature, polished product that has been in development for almost 40 years. At the same time, it has developed at a rapid pace and offers many new, interesting features. In fact, it is practically impossible to describe all of these features in a single article, or even cover the most interesting features to the extent they deserve. In this article, I cover the most useful features for readers who are not currently using PostgreSQL.

The Elephant in the Room

In the past, an enterprise database management system (DBMS) was almost synonymous with Oracle, with IBM DB2 and Microsoft SQL Server (and at some point Sybase ASE and others) as competitors. While these DBMSs are still in use (mainly by large organizations who are by nature inert), there has been a move to migrate from proprietary to open source technologies. This emerging trend led Oracle to acquire MySQL from Suns Microsystems in 2010 (who acquired it from MySQL AB in 2008), which resulted in the MariaDB fork. This trend continues today, but this time PostgreSQL is the clear leader. PostgreSQL’s stability, modern features, flexibility, and extensibility make it a universal DBMS that not only fits various use cases, but excels at them.

The MongoDB story illustrates PostgreSQL’s performance capabilities. When the NoSQL movement gained in popularity, the new solutions like MongoDB were touted as superior to traditional SQL databases on many aspects, especially performance. In the end, PostgreSQL beat MongoDB with MongoDB’s own benchmarks. While MongoDB has improved and fixed its many past flaws, more recent benchmark results still show PostgreSQL as a viable alternative to MongoDB in certain scenarios [1].

Legacy databases are unattractive to modern developers for many reasons. An open source database lets you immediately spin up a disposable instance and basically develop for it in no time. With closed solutions, even if you get a limited version such as Oracle XE, it’s still quite complex compared to open source solutions (for an account of what Oracle calls “as easy as ‘ABC’,” see their blog [2] on installing Oracle XE in a Docker container to judge for yourself).

With PostgreSQL, on the other hand, you can put your relational DBMS (RDBMS) in a container and interact with it in just three steps. First, you need to get the most recent Docker image:

docker pull postgres

Then, run the server in the background:

docker run --name mypostgres -e POSTGRES_PASSWORD=mypassword-d postgres

Finally, connect to it with:

docker exec -it mypostgres psql -U postgres

That’s it! Now you have a fully functional PostgreSQL client and server setup that you can use with some of the examples later in this article.

Eating the Database World

PostgreSQL fills a range of needs. In a blog post on Medium [3], the creator of Pigsty (an open source RDBMS alternative) [4] argues that PostgreSQL-based solutions are valid not just for online transaction processing (OLTP) but also for online analytical processing (OLAP) workloads thanks to ParadeDB [5] and for a SQLite-based option, see DuckDB [6]. PostgreSQL can do even more. If you need a graph database, you can use the Apache AGE PostgreSQL extension [7]. PostgreSQL allows you to connect to basically every modern database with Foreign Data Wrappers (FDWs) [8]. If you need a vector database, the pgvector extension works out of the box and has bindings for all popular languages. PostgreSQL offers a PostGIS extension if you need a geospatial database. The TimescaleDB extension [9] has you covered for time series databases. The list goes on.

You may be wondering how PostgreSQL performs and how it compares to other databases, as well as whether it will fit your use case. While you will need to figure out some of these answers for yourself, you can use the results of the ClickBench benchmark [10] as a guide. When you dig around, you will discover that the final result largely depends on whether PostgreSQL is fine-tuned for a particular use case or not. PostgreSQL is a universal and extremely flexible database, but you need to tune it for a particular load if you need to squeeze maximum performance out of it.

Another Brick in the WAL

Before turning to examples, it’s important to understand a few key concepts related to how PostgreSQL works. Multiversion concurrency control (MVCC) is an important mechanism that provides concurrent access to the database by multiple clients without locking and without making it look inconsistent; MVCC is achieved by making multiple copies of the data [11]. Basically, MVCC enhances read performance by allowing transactions to view the database in a consistent state without locking data. This system has been refined over the years to reduce overhead and increase transaction throughput.

PostgreSQL’s Write-Ahead Log (WAL) contains the records of changes. Because the changes are first written to WAL before being committed to disk, the log ensures data integrity. Think of it as an equivalent of the journal in the filesystem world (which is why, strictly speaking, PostgreSQL doesn’t require a journaling filesystem). Because WAL contains the record of all changes in the database, it can be used to “replay” them (i.e., reconstruct the database), which can be used for backups and point-in-time recovery (PITR) where you can restore the database to any chosen point in time.

Shared memory contains the database and log cache, including the shared buffer whose main task is to minimize disk I/O operations, and the WAL buffer, which contains the content that has not yet been written to WAL. A checkpoint is a point in WAL where the data perfectly reflects the changes in WAL (think of the sync command for filesystems). You can enforce a checkpoint using the CHECKPOINT command. By default, a checkpoint happens every five minutes, but the setting can be controlled by the checkpoint_timeout parameter.

As for replication, physical replication operates at block device levels and replicates the physical data blocks. Logical replication replays the transaction in WAL on replicas.

You also need to understand the difference between database partitioning and sharding. Partitioning splits a large table into smaller, more manageable pieces, while still treating them as a single table. Sharding, on the other hand, distributes data across multiple database instances or servers and potentially across different physical locations. Sharding is suitable for very large databases where partitioning is not enough to achieve desired performance and scalability, but it comes with its own set of problems. Imagine you have a huge table containing the data of all citizens in a given country, ordered by birth date. With partitioning, you can split this huge table into many smaller ones by birth year. With sharding, you might put all citizens whose names start with an “A” in one database instance, those with “B” in another, and so on. Apart from the complexity in managing such a set up, an inherent problem is that, with time, sharded databases become unbalanced since you can’t precisely predict how each instance will grow (e.g., instance “A” might grow more rapidly than instance “B”).

PostgreSQL also supports parallel query processing (executing relevant queries on multiple cores/processors for enhanced performance), various types of indexing, foreign data wrappers, and many other useful features. I will describe some of these features in the examples that follow.



Source link