PG-Strom

An open source extension that draws out the full capability of GPUs and NVMe storage to accelerate PostgreSQL with the power of thousands of processor cores, and tackles data processing at terabyte scale and beyond.

What is PG-Strom?

PG-Strom Overview

PG-Strom is an extension for PostgreSQL, one of the most widely adopted open source database systems. It is designed for searching, summarizing and processing large data sets of dozens of terabytes and billions of rows, using the latest hardware such as GPUs and NVMe SSDs. It runs these heavy workloads on a simple single-node configuration, yet surprisingly fast.

GPU Characteristics

GPUs have thousands of processor cores and are designed to run parallel computing workloads extremely efficiently. A typical example is matrix operations, which apply a uniform operation to a large amount of homogeneous data concurrently. SQL has similar workloads, for example evaluating a WHERE clause against every record in a full table scan.
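As a rough illustration of why a full table scan parallelizes well, the sketch below applies the same predicate to every row, chunk by chunk. It is a minimal sketch and not PG-Strom's implementation: CPU threads stand in for GPU cores, and the `temperature` column, the threshold and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def predicate(row):
    # Hypothetical condition, equivalent to: WHERE temperature > 30.0
    return row["temperature"] > 30.0


def scan_chunk(chunk):
    # Every row is tested independently of the others, so chunks can be
    # processed in any order and on any worker.
    return [row for row in chunk if predicate(row)]


def parallel_scan(rows, n_workers=4, chunk_size=2):
    # Split the "table" into fixed-size chunks.
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map returns results in the same order as the input chunks.
        results = list(pool.map(scan_chunk, chunks))
    # Flatten the per-chunk results back into one list of matching rows.
    return [row for part in results for row in part]
```

Because no row depends on any other, the same pattern scales from a few threads to thousands of cores; only the chunking and the way results are gathered change.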

PG-Strom Design

Architecture that maximizes the performance of high-speed storage

The architecture is designed to maximize storage read performance, which is indispensable for large-scale data processing but is rarely addressed by database systems that support GPUs. GPU-Direct SQL, the core feature of PG-Strom, connects NVMe SSDs directly to the GPU and processes SQL at a speed close to the hardware limit by skipping redundant steps.

Before searching / summarizing... eliminating the data import step

In addition to PostgreSQL tables, PG-Strom supports the Apache Arrow format, a data format often used in IoT / M2M applications and in Python, the de facto standard language in the field of machine learning. By reading this external standard data format directly, it eliminates the "unnecessary" data import task that would otherwise precede aggregation, search or processing.

Engineers' skills and experience in PostgreSQL remain valuable

PG-Strom is an extension of PostgreSQL and runs inside it. It transparently replaces portions of the query execution plan so that part of the SQL workload executes on GPU devices. In principle, SQL syntax and drivers / libraries are unchanged, and the familiar PostgreSQL methods still apply to fundamental database functions such as heartbeating or backup. The database management and development skills that engineers have built up so far are therefore not wasted.

Extreme performance for searching / summarizing IoT/M2M log data that grows daily

In the IoT / M2M area, log data generated daily by various devices and sensors is collected, then searched and analyzed from various angles.

Time-series data tends to be huge, so we have to pay attention not only to the response time of search / aggregation queries on the log data, but also to the time required to import the data into the database system.

In addition to PostgreSQL tables, PG-Strom can directly read Apache Arrow files, which log collection services such as Fluentd can output. Search / aggregation can therefore run immediately, without importing the collected data into the database system again.

When a search query runs against log data collected in this way, the data is loaded directly into the GPU from local NVMe SSDs, or from a storage server connected over a high-speed network, using P2P RDMA technology. This draws out the full capability of the hardware; we have measured data processing performance of a billion rows per second (40GB/s) on a single standalone server.
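The two quoted figures can be related with simple arithmetic. The sketch below assumes decimal units (1GB = 10^9 bytes) and reads "billion" as 10^9; the 10TB data set size is a hypothetical example, and the resulting numbers are derived estimates rather than additional measurements.

```python
ROWS_PER_SEC = 1_000_000_000      # "billion rows per second" (assumed 10^9)
BYTES_PER_SEC = 40 * 10**9        # 40GB/s, assuming 1GB = 10^9 bytes


def bytes_per_row(bytes_per_sec, rows_per_sec):
    # Average row width implied by the two throughput figures.
    return bytes_per_sec / rows_per_sec


def scan_seconds(dataset_bytes, bytes_per_sec):
    # Time to read a data set once at a sustained throughput.
    return dataset_bytes / bytes_per_sec


avg_row = bytes_per_row(BYTES_PER_SEC, ROWS_PER_SEC)    # 40.0 bytes per row
full_scan = scan_seconds(10 * 10**12, BYTES_PER_SEC)    # 250.0 s for 10TB
```

Under these assumptions the figures imply rows of about 40 bytes on average, and a single pass over a hypothetical 10TB data set would take roughly four minutes if that throughput were sustained.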

A simplified system configuration reduces the man-hours and troubles of operation. Because it is based on PostgreSQL, with which many engineers are familiar, they can apply their skills and experience to improve the quality of business applications.

Geolocational data analytics powered by GPU

The number of devices that generate GPS-linked location information, such as mobile phones and automobiles, is continuously increasing.

Along with this, demands such as "advertisement delivery to a particular area" and "accident / congestion information delivery to nearby cars" are also increasing. On the other hand, area information is usually defined as a complex polygon, and matching it against position information expressed in latitude and longitude is not a lightweight operation.
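To give a feel for the cost of such a match, here is a minimal sketch of the classic ray-casting point-in-polygon test. It is an illustration only, not the PostGIS or PG-Strom implementation: it treats longitude and latitude as planar coordinates, does not handle points exactly on a boundary, and the function name is hypothetical.

```python
def point_in_polygon(lon, lat, polygon):
    """Return True if (lon, lat) lies inside `polygon`.

    `polygon` is a list of (lon, lat) vertices in order; the last vertex
    is implicitly connected back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            # Count only crossings to the right of the point.
            if lon < x_cross:
                inside = not inside
    return inside
```

Each test visits every edge of the polygon, so matching many positions against many complex areas multiplies quickly. The tests are independent of one another, which is what makes the workload a good fit for massively parallel hardware.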

The GPU revision of PostGIS runs these workloads efficiently on the GPU's thousands of cores. In addition, PG-Strom is the only GPU database that can narrow down a search using a geolocational index, namely the GiST (R-tree) index built in PostgreSQL.

These functions make it possible to analyze and search geolocational information at high speed, with a simple system configuration, based on real-time location data gathered from mobile phones or automobiles.

Packet capture and search in an all-in-one system

As network traffic increases, it becomes more difficult to investigate when a security incident occurs and to maintain and search audit trails.

Pcap2arrow (bundled with PG-Strom) captures packets directly from the network interface card, or reads PCAP files, and writes them out in Apache Arrow format.

Once the data is in Apache Arrow format, it can be archived for the long term and mapped into PG-Strom without importing it. This allows us to search and investigate the packet log using SQL, with flexible conditions based on packet attributes.

Business Intelligence & Reporting

Typical summarizing SQL workloads, often used for business intelligence (BI) or reporting, are well suited to parallel execution on a large number of CPU/GPU cores and require high I/O throughput from the storage system. PG-Strom is optimized for this kind of workload, and enables rapid summarizing using all the hardware resources: CPU, GPU and SSD.

It can therefore replace the legacy systems often adopted in the past, such as expensive DWH appliances or cluster-based systems, with a simple PostgreSQL-based solution.

In addition, the interface to BI tools is PostgreSQL itself, which many engineers are already familiar with, so they can leverage their skills and experience to operate the database system with a wide range of drivers and applications.

Machine-Learning / Anomaly Detection

To detect criminal transactions such as credit-card skimming or bank transfer scams, we need to check the daily transactional records for “anomalies” as frequently as possible.

PG-Strom can run anomaly detection logic based on statistical analysis algorithms directly on the transactional records stored in the database. The GPU processes this logic very fast, and because the processing is in-database, there is no need to export the data for checking.
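The text does not name a specific algorithm, so the sketch below uses a z-score check as one simple example of statistics-based anomaly detection. It runs in plain Python on the CPU and only illustrates the logic; the function name, the sample amounts and the threshold of 3 standard deviations are assumptions made for this example.

```python
from statistics import mean, stdev


def find_anomalies(amounts, threshold=3.0):
    """Return (index, amount) pairs whose z-score exceeds `threshold`."""
    mu = mean(amounts)
    sigma = stdev(amounts)          # sample standard deviation
    if sigma == 0:
        return []                   # all values identical: nothing stands out
    return [
        (i, x)
        for i, x in enumerate(amounts)
        if abs(x - mu) / sigma > threshold
    ]


# Hypothetical daily transaction amounts with one suspicious entry.
amounts = [100, 102, 98, 101, 99] * 4 + [5000]
suspicious = find_anomalies(amounts)
```

The per-record check depends only on the two summary statistics, so once the mean and standard deviation are computed, every record can be tested independently.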

For more advanced machine-learning applications, PL/Python of PostgreSQL makes it possible to call machine-learning libraries implemented in Python from in-database operations.

Product Configuration / Specifications

Hardware configuration example for all-in-one entry model

The four internal NVMe SSDs support data volumes from 10TB to 20TB, and the extreme performance of the NVIDIA A100 together with the ultra-wide throughput of PCIe 4.0 makes it possible to process large amounts of data. It is an all-in-one entry model with all the necessary elements packed into a 2U rack server.

model: Supermicro AS-2014CR-TR
CPU: AMD EPYC 7443P (24C; 2.85GHz) x1
RAM: 128GB (16GB DDR4-3200; ECC) x8
GPU: NVIDIA A100 (PCI-E; 40GB) x1
SSD: Intel D7-5510 (U.2; 3.84TB or 7.68TB) x4
OS: Red Hat Enterprise Linux 8.3 or later, or Ubuntu Linux 20.04 or later

Hardware configuration example for scalable enterprise model

A high-end configuration in which scalable NVMe-oF (NVMe over Fabrics) or SDS (Software Defined Storage) storage and multiple GPUs are directly connected over a 100Gb high-speed network, to handle ever-increasing IoT/M2M log data.

(*) A separate storage server is required.

model: Supermicro AS -4124GS-TNR
CPU: AMD EPYC 7443 (24C; 2.85GHz) x2
RAM: 256GB (16GB DDR4-3200; ECC) x16
GPU: NVIDIA A100 (PCI-E 4.0; 40GB) x4
NIC: Mellanox ConnectX-5 (100GbE; dual; PCIE 4.0) x4
OS: Red Hat Enterprise Linux 8.3 or later, or Ubuntu Linux 20.04 or later

Software Subscription

PG-Strom Enterprise Subscription (1GPU, 1year)

  • This subscription is required for each GPU device used by PG-Strom

  • Note that this is not the number of GPU devices installed on the target system.

  • This subscription includes the following:

    • License key for the extra modules

    • Technical support of the target system

    • Online updates of the software

  • The extra modules additionally support the following features:

    • Use of multiple GPU devices

    • Use of md-raid0 (striping) with GPU-Direct SQL

    • GpuJoin with GiST-Index in GPU-rev PostGIS

    • Cardinality estimation using HyperLogLog

  • This subscription has an open price. Please contact us through the contact form at the bottom of the page.