Our Technologies
In 2012, when few people had an idea of applying GPUs to database processing and nobody knew how to design the software, we released the prototype of PG-Strom that is the basis of the current version. Since then, we have gone through a lot of trial and error. Some of then didn't work well after all, and were shelved.
This page introduces our elemental technologies that have become the core of the current PG-Stom in our 10-year history of applying GPUs to database processing.
GPU-Direct SQL
Using P2P-RDMA technology, it directly transfers data blocks from high-performance storage such as NVME-SSD to GPU devices, and eliminates junk records by GPU's parallel execution of SQL workloads prior to the data arrival at the CPU / RAM. It allows extremely reduce the number of records to be processed by the host system (PostgreSQL), in the results, query response time shall become faster.
As literal, P2P-DMA (Peer-to-Peer Direct Memory Access) technology allows peer data transfer between PCI-E devices like NVME-SSD and GPU devices, bypassing the host-system's RAM. In addition, data may be transferred over high-speed network such as Infiniband; using "SSD -> NIC -> (network) -> NIC -> GPU" chain, which is called P2P-RDMA (Remote DMA). Even in that case, data can be transferred bypassing the host system buffer, and its transfer throughput close to the hardware catalog specifications can be expected.
GPU-Direct SQL technology, released in 2017, uses HeteroDB's proprietary Linux kernel driver to intermediate P2P-DMA between NVME-SSD and GPU, so that data can be delivered at a throughput equal to the hardware limit of NVME-SSD. It has achieved the data processing throughput 10GB/s per GPU, which is the fastest class in the PCI-E 3.0 generation.
The figures below plots the processing performance (number of rows processed per second and storage read throughput) in the NVME-SSD x4 and GPUx1 configuration, by the Star Schema Benchmark. Even when comparing row data (PostgreSQL Heap), GPU-Direct SQL shows 3 times faster throughput than the usual I/O via filesystem, due to the small overhead of data transfer. In case of the efficient column data (Apache Arrow), you can see that it literally demonstrates "order of magniture" processing performance.
After that, in 2021, the equivalent function as the above Linux kernel driver was integrated into CUDA Toolkit 11.4; called GPUDirect Storage from NVIDIA. PG-Strom also enhanced to adopt the GPUDirect Storage for Red Hat Enterprise Linux 8.x and Ubuntu Linux.
In the result, the GPU-Direct SQL mechanism can use not only local NVME-SSD, but also P2P-RDMA from remotely scalable storage such as NVME-oF, NFSoRDMA and SDS (Software Defined Storage). It also means we can use the extreme performance by GPU-Direct SQL and capability of storage management required for business systems together.
GPU-Direct SQL with NFS-over-RDMA
As an application of GPU-Direct SQL, it is possible to read data directly from the shared file system (NFS) to the GPU via a high-speed RDMA network such as 100Gb-Ethernet.
Unlink local filesystems such as Ext4 and Xfs, NFS is not a good place to store transactional data (eg PostgreSQL tables) that involves frequent updates. However, for the workloads of insert-only and reading a large amount of data, such as the Apache Arrow format below, we can use GPU-Direct SQL and NFS-over-RDMA together. Data can be supplied to the GPU at the data transfer rate close to the network bandwidth.
Since NFS is a shared file system, for example, a server that collects IoT / M2M log data can write them out as Apache Arrow files onto the NFS server, on the other hand, a database server equipped with a GPU may be able to read at a speed of the upper limit of the network bandwidth during search and aggregation processing.
Arrow_Fdw
Apache Arrow is one of the columnar and structured data format, and it is also a common format for data exchange among various data analysis applications. Therefore, many programming languages such as C / C ++ and Python have libraries for handling the Arrow format.
PG-Strom uses PostgreSQL's Foreign Table feature as a basis of Arrow_Fdw; that allows to read Apache Arrow format files as if they were PostgreSQL tables. It also supports direct reading powered by GPU-Direct SQL. If it is a bulk insert, it can be appended to end of the Apache Arrow file.
Why data importing is not necessary?
When we import external data to database system for analytics, or summarizing, it usually takes a considerable amount of time for the data importing process by INSERT or COPY FROM. However, Arrow_Fdw of PG-Strom only maps the Apache Arrow files created by external software as a foreign table. It does not involve fetching and rewriting a large number of records, so it is almost a simple file copy on the operating system.
Why I/O is efficient?
It is rare to reference all the columns on the table for summarizing / analytic workloads. Usually, it is enough to refer to only a part of columns, in other words, the data other than the referenced columns is "garbage data", which unnecessary squeezes the I/O bandwidth.
It is difficult to extract only the referenced column from the storage for the row data such as PostgreSQL, however, in the case of column data, we can read the referenced columns only because the data is arranged adjacent to each column.
min/max statistics and range-index
The Apache Arrow format is designed to have multiple data chunks called Record Batch inside it. RecordBatch is a collection of equal length values-array for each column. For example, an Arrow file with 1 billion rows in total may contains 1,000 RecordBatches that have 1 million rows for each.
The min / max statistics are a list of min and max values at the target column for each RecordBatch, embedded in the custom-metadata field where application can use arbitrarily. The pg2arrow and pcap2arrow commands support this feature.
For example, if a foreign-table is scanned with condition: "ymd BETWEEN '2020-01-01' AND '2020-12-31'", we can reasonably expect no records shall be fetched from the record-batches with max_value of ymd column is less than '2020-01-01', or min_value of ymd column is larger than '2020-12-31'.
When we read Apache Arrow files via Arrow_Fdw, it checks min/max statistics to skip the record-batches that should not contain any valid records towards the given condition. Originally, the Arrow format has columnar data structure to skip unreferenced columns for efficient I/O, but in addition, we are further improving the I/O efficiency by skipping unnecessary record-batches (a.k.a bunch of rows).
Pg2Arrow / Mysql2Arrow
Although many data analysis applications support Apache Arrow, it is a relatively new data format, so the existing system may be built with RDBMS, or it takes time for engineers to become familiar with it.
The tools pg2arrow and mysql2arrow, bundled with PG-Strom, throw an arbitrary query to PostgreSQL / MySQL, then they save the result as an Apache Arrow file.
Not only simple table dumps, but it also allows to write out the result of JOIN with another table, filtered by specifying conditions, or sorted by a particular key.
These commands also enable to generate Arrow format files from the existing database, not only to read Arrow format files to database via Arrow_Fdw.
Packet capture using Pcap2Arrow
As networks become faster, the size of packet captures increases, and it becomes more difficult to investigate the capture files at the event of security incidents.
The tool Pcap2arrow, bundled with PG-Strom, allows to capture packets from the network interface card directly, or to convert PCAP files into Apache Arrow format; that is possible to investigate the packet logs in standard SQL.
To minimize packet loss when pcap2arrow captures the packet directly from the NIC devices faster than 10Gb, it internally uses PF_RING driver that buffers the captured packets inside of the Linux kernel, and captures / writes the packets in multi-threads to support fast network capturing.
Once it is converted to Apache Arrow format, PG-Strom allows to search and investigate the packet log using standard SQL; including various flexible conditions. Of course, it also supports the high-speed data loading using GPU-Direct SQL, and refining the scan with min / max statistics.
GPU-revision PostGIS
When we try to extract mobile-phone devices within a particular area based on their GPU information, for example, we need to check whether the current position indicated by latitude and longitude is contained in the complex shaped area.
PostGIS defines various SQL functions and operators that can be used for these purpose, and widely accepted for the database searching using gepmetric elements like points, lines and polygons.
PG-Strom allows to run a part of PostGIS functions / operators on GPU devices. For example, the st_contains(), which determines whether a certain coordinate is contained in a specific area, also supports polygons with complicated shapes, so it is relatively computing-intensive workload if this is processed in a straightforward manner on CPU. However, response speed can be improved by executing GPU-revision of st_contains() in parallel on the GPU.
GiST-Index (R-tree) support on GpuJoin
Although there are several GPU database systems that supports GIS functions, but only PG-Strom allows to refine the search using index on the GPU.
PostGIS originaly used R-tree data structure to index geometric items such as points or polygons, and refining the search criteria and response time.
If the optimizer determines that R-tree index can be used for GpuJoin's table join conditions, PG-Strom loads the R-tree index onto the GPU device memory, and references it for more efficient search.
In principle, index search can be executed in parallel without considering conflicts with other threads, it is an ideal workloads for GPUs with large number of processor cores. In some cases, it has achieved 200 times faster response time than CPU parallel execution. This feature has a potential to cause qualitative changes, such as changeover of batch processes into real-time operations.