Customer Use Case: Near real-time processing
Nowadays, more and more companies require real-time data processing. There are many software vendors on the market offering real-time platforms and streaming solutions. But what can be done when classic OLAP (Online Analytical Processing) workloads need to be processed in real time?
For example, consider the following SQL query:
select dept.name as dept_name, emp.name as emp_name
from emp
inner join dept
on emp.dept_id = dept.dept_id;
The tables contain millions of rows, are approx. 5 GB in size, and receive hundreds of change events like updates/inserts/deletes per second. Since processes cannot run continuously for an indefinite time (e.g., due to software updates or errors), we occasionally need to restart them, and each time we do, we must read all data completely to apply the incoming changes. Additionally, these changes should be reflected as soon as possible — the expected latency is under 10 seconds. 
What technical solutions and tools can be used to meet such requirements? What are the advantages and disadvantages of the different tools/platforms? One of our customers faced exactly these challenges. Let’s look at what can be done and how a possible architecture might look.
Because the customer uses AWS, we limit ourselves to solutions that are either AWS-native or can be operated on AWS.
1. Source system description
- Based on Oracle 19.
- Typical OLTP system: highly normalized, constantly receiving changes (inserts, updates, deletes).
- Schema changes occur about four times a year.
- Changes can happen in the following ways:
- Manually via GUI
- By data loads from external systems
- By internal batch processes
- Size: approx. 20 GB
Example sizes of the tables that provide data for the real-time processing:
- RELATION – 19.6M rows, ~2 GB
- OBJECT – 19.7M rows, ~7 GB
- OBJECT_DESCRIPTION – 18.7M rows, ~2 GB
- RELATION_CREATOR – 1M rows, ~1 GB

2. Requirements
- No further development in the source system — it only serves existing processes, which may be extended/optimized. All new products must be implemented on other platforms/databases.
- At least three data layers are expected:
- An OLTP source system, which provides data to the other layers.
- Business Object Layer: this layer contains about nine business objects. Each object is a denormalization of the source tables. It also consumes data from other systems to join with the source-system data, which are provided either as
- files (batch mode), or
- on-demand queries from other databases via JDBC/ODBC.
Data in this layer must be updated in near-real-time — i.e., as soon as changes occur in the source system, they must be reflected in this layer as quickly as possible. It must be possible to read data from this layer in both FULL and DELTA modes (DELTA meaning only the changes since the last fetch). A business object could, for example, be defined by a query like this:
select r.relation_id,
       r.object1_fk, o1.name as object1_name, o1d.desc as object1_description,
       r.object2_fk, o2.name as object2_name, o2d.desc as object2_description,
       r.creator_fk, rc.name as creator_name
from relation r
left outer join object o1 on r.object1_fk = o1.object_id
left outer join object_description o1d on o1.object_id = o1d.object_id
left outer join object o2 on r.object2_fk = o2.object_id
left outer join object_description o2d on o2.object_id = o2d.object_id
left outer join relation_creator rc on r.creator_fk = rc.creator_id;
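A DELTA read from this layer could then be a simple filter on a change timestamp maintained during materialization. A minimal sketch, assuming an illustrative materialized table business_object_relation, a cdc_timestamp column, and a consumer-tracked :last_fetch_ts value (none of these names come from the actual model):
-- DELTA mode: return only the rows that changed since the consumer's last successful fetch
select *
from business_object_relation
where cdc_timestamp > :last_fetch_ts;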
- Master Data Layer: This layer uses data from the Business Object Layer as well as from other systems and joins them with the BO-layer data. Here too, data must be updated in near-real-time — i.e., when changes occur in the Business Object Layer, they must be reflected in the Master Data Layer as quickly as possible. Typical transformations: filters, joins, aggregations.
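To illustrate such a transformation, a Master Data Layer object could be derived roughly as follows; the business object relation_bo, the reference table creator_region, and all column names are assumptions for the sketch, not the customer's actual model:
-- Typical Master Data Layer step: filter the business object, join reference data, aggregate
select bo.creator_name,
       ref.region,
       count(*) as relation_count
from relation_bo bo
join creator_region ref on bo.creator_fk = ref.creator_id  -- reference data from another system
where bo.object1_description is not null
group by bo.creator_name, ref.region;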
- In addition, all changes to the business objects (and to the underlying source tables) must be saved and queryable via SQL.
- Latency must be under 1 minute (from the time of the change in the source system until it is reflected in the Master Data Layer). Preferably, latency should be under 30 seconds.
- Target platforms and tools are not strictly defined.
- Oracle is not intended as a target platform (only the source system remains on Oracle). In the future it will be migrated to Postgres, so the solution must be designed to work with other RDBMS (e.g., Postgres instead of Oracle).
- The selected tools/platforms must be managed or serverless, because the customer does not want to install and operate tools/platforms themselves.
- The solution should be implemented in SQL. If that is not possible, Java may also be used.
At a high level, the following flow is expected: changes in the OLTP source system are propagated in near-real-time into the Business Object Layer and from there into the Master Data Layer.

3. Possible tools and approaches
- CDC/CDF (Change Data Capture / Change Data Feed): CDC events can be extracted from Oracle and processed in downstream systems, layers, or tools. This way the change history itself is also captured and can be analyzed.
Possible tools: AWS DMS, Debezium (as a standalone server or as a plugin for other tools such as Kafka).
Advantages of CDC: low load on the source system.
Disadvantages: CDC events are not easy to correlate/join with each other (join complexity).
- Kafka and AWS Kinesis: These platforms can be used to distribute data (e.g., CDC events) in a publish–subscribe mode. Data can be consumed by other systems and additionally stored on S3. Although both platforms can be used as durable storage (Kinesis up to 365 days; Kafka unlimited), this is not always efficient. Usually, events are retained only for a limited period (e.g., one week). Kinesis supports at-least-once delivery; Kafka supports at-most-once, at-least-once, and exactly-once. Kafka integrates well with Debezium and AWS DMS. Kinesis is primarily supported via AWS DMS (although other integrations are possible with additional development effort).
- Spark: Spark is a framework that enables efficient data processing via the DataFrame API (Java, Scala, Python) and via SQL. It supports both batch and streaming modes and can read data from Kafka and Kinesis. On AWS, Spark can be used in EMR or Glue; alternatively Databricks may be used.
- Flink: Flink is a framework for efficient stream processing, but it also supports batch processing. It can read data from Kafka and is available in AWS as part of EMR or as a managed Apache Flink service. Ververica is an alternative distribution.
- StarRocks: An in-memory OLAP platform that supports asynchronous materialized views (i.e., when source tables change the views are updated automatically). However, the platform does not support CDC/CDF.
- Materialize and RisingWave: Both platforms are so-called streaming OLAP databases. They can read data from various streaming systems (and can themselves be fed by CDC processes running on different RDBMS), compare changes to the existing state on the fly and apply updates, and persist those data adjustments as CDF into other systems. Roughly speaking, they work like Oracle materialized views in incremental mode, but much faster.
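To make the materialized-view idea concrete: in StarRocks, for example, a business object could be declared as an asynchronous materialized view roughly as follows (syntax simplified; distribution, refresh schedule, and other properties are omitted, so treat this as a sketch rather than complete DDL):
-- Asynchronous materialized view: the platform refreshes it automatically when base tables change
create materialized view relation_bo
refresh async
as
select r.relation_id,
       o1.name as object1_name,
       o2.name as object2_name,
       rc.name as creator_name
from relation r
left join object o1 on r.object1_fk = o1.object_id
left join object o2 on r.object2_fk = o2.object_id
left join relation_creator rc on r.creator_fk = rc.creator_id;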
4. Analysis of challenges
- Join strategy: Full data sets must always be joined because changes are not limited by time or group; potentially any row can be modified.
- Batch mode: Batch processing does not allow true near-real-time latencies. However, with powerful instances and platforms such as Postgres, Redshift, Databricks, or StarRocks, fast materializations in loops are possible.
- Oracle option: In Oracle, so-called materialized views in an “incrementally refreshing” mode could be used to meet the requirements. With Oracle’s In-Memory option, such views can also be made much faster. Since Oracle is not intended as the target platform, this solution is unsuitable.
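Such an Oracle setup would look roughly like this (simplified to an inner join of two tables; fast refresh on commit additionally requires materialized view logs with ROWID on every base table, the ROWIDs in the select list, and further query restrictions):
-- Materialized view logs so Oracle can track row-level changes on the base tables
create materialized view log on relation with rowid;
create materialized view log on object with rowid;

-- Incrementally ("fast") refreshing materialized view, updated on every commit
create materialized view relation_bo
refresh fast on commit
as
select r.rowid as r_rid, o.rowid as o_rid,
       r.relation_id, o.name as object_name
from relation r, object o
where r.object1_fk = o.object_id;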
- Alternative to Oracle: Another platform could be used that offers functionality similar to Oracle’s materialized views and can keep data in memory. StarRocks would be such an alternative. However, the platform does not support CDC/CDF, which means changes cannot be logged.
- Spark — pros and cons in streaming mode:
1. Spark supports streaming but cannot read CDC events directly from source systems. Therefore, an additional platform like Kafka is required. Since Kafka is planned anyway, this is not a problem.
2. Spark does not automatically recognize CDC events. Before a join in streaming mode, you therefore always have to compute the latest version per key:
select pk_id, attr1, attr2, …
from tab
where (pk_id, cdc_timestamp) in (
select pk_id, max(cdc_timestamp)
from tab
group by pk_id
);
This impairs performance and causes high resource consumption because the records must be grouped before the join.
3. In AWS Glue it is not possible to start two Spark streaming processes within a single job. Possible solutions:
- Use vanilla Spark: From Glue version 5 this becomes complex due to library dependencies. You can use Glue v4, but then you do not have a current Spark version.
- Use AWS EMR instead of Glue. However, this may incur higher costs.
4. The data objects must be materialized in a way that makes them usable in downstream layers both in near-real-time and in batch mode. Possible approaches:
- Kafka topic: fast storage but append-only. For FULL extracts or joins in Spark you must recompute the latest version per key.
- External database (e.g., Postgres) in append-only mode: slower than Kafka but with the same disadvantages.
- External database in CDC-apply mode: in Spark, materializations are not idempotent, so you have to implement the CDC-apply logic yourself using foreachBatch or foreach (see the sketch after this list).
- S3 with lakehouse formats such as Iceberg, Delta, or Hudi.
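The CDC-apply logic executed from foreachBatch typically boils down to merging the latest event per key of a micro-batch into the target table. A sketch in Postgres/Delta-style MERGE syntax, with illustrative names (relation_target, relation_batch_latest) and an assumed cdc_op column carrying I/U/D flags:
-- Upsert the latest version per key and remove keys whose last event is a delete
merge into relation_target t
using relation_batch_latest s   -- latest CDC event per relation_id within the micro-batch
on t.relation_id = s.relation_id
when matched and s.cdc_op = 'D' then
  delete
when matched then
  update set object1_fk = s.object1_fk,
             object2_fk = s.object2_fk,
             creator_fk = s.creator_fk
when not matched and s.cdc_op <> 'D' then
  insert (relation_id, object1_fk, object2_fk, creator_fk)
  values (s.relation_id, s.object1_fk, s.object2_fk, s.creator_fk);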
5. If multiple streaming processes (e.g., for RELATION and OBJECT tables) start asynchronously, it may happen that at the time of a join the required rows in one table (e.g., OBJECT) are not yet loaded. The result is NULL values in the LEFT OUTER JOIN. For a short time, an inconsistent state occurs, which complicates historical analyses and audits. Spark solution: a targeted startup delay between processes.
6. Flink — advantages over Spark:
a) Flink can read CDC events directly from the source (e.g., via Debezium), so Kafka is optional.
b) Flink automatically recognizes CDC events. That means it is not necessary to compute the latest version per key before joins → better performance.
c) In Flink, too, data objects must be materialized. The options are identical to Spark:
- Kafka topic: append-only, but unlike Spark, Flink does not require grouping to obtain the latest versions.
- External database (append-only): slower, same disadvantages as with Spark.
- External database (CDC-apply): Flink supports idempotent materialization, so no manual CDC implementation is required.
- S3 with lakehouse formats (Iceberg, Delta, Hudi).
d) Asynchronous start of multiple streams:
As with Spark, this can lead to inconsistencies. A solution with startup delays is theoretically possible, but in Flink it is more complex due to the continuous streaming nature. An alternative solution: Flink supports a two-stage startup:
- Snapshot phase: Flink reads the current state of the data.
- CDC phase: Flink processes all changes since the snapshot. This allows starting streams sequentially (e.g., start RELATION only after the OBJECT snapshot has been taken).
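With Flink SQL and the Flink CDC connector for Oracle, points a) to c) look roughly as follows. The connection options are abbreviated and should be checked against the connector version in use; relation_src, object_src, and relation_target are illustrative names:
-- CDC source: Flink takes an initial snapshot, then switches to reading redo-log changes
create table relation_src (
  relation_id bigint,
  object1_fk  bigint,
  object2_fk  bigint,
  creator_fk  bigint,
  primary key (relation_id) not enforced
) with (
  'connector'     = 'oracle-cdc',
  'hostname'      = '...',
  'port'          = '1521',
  'username'      = '...',
  'password'      = '...',
  'database-name' = '...',
  'schema-name'   = '...',
  'table-name'    = 'RELATION'
);

-- Joins consume the changelog directly; no per-key deduplication is needed before the join
create view relation_bo as
select r.relation_id, o.name as object_name
from relation_src r
left join object_src o on r.object1_fk = o.object_id;   -- object_src defined analogously

-- Idempotent materialization: a JDBC sink table with a primary key is written in upsert mode
create table relation_target (
  relation_id bigint,
  object_name string,
  primary key (relation_id) not enforced
) with (
  'connector'  = 'jdbc',
  'url'        = 'jdbc:postgresql://.../targetdb',
  'table-name' = 'relation_target'
);

insert into relation_target
select relation_id, object_name from relation_bo;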
7. Kafka — additional challenges:
- Kafka itself has no processing logic, so additional engines such as Spark, Flink, or KSQL are required.
- If all records are required starting from the oldest offset:
- Either: store all events permanently, which requires large Kafka clusters and high costs. As the number of events grows, read performance decreases.
- Or: use topic compaction, where only the latest event per key is kept. However, compaction does not happen instantly. With many updates it would need to run frequently, which contradicts Kafka’s design (tools that read from Kafka will likely need a topic retention minimum of one week).
- Additionally, all data should be stored on S3, regardless of whether Kafka is used fully, or topic compaction is applied — because manual analysis of historical events directly in Kafka is complex and inflexible.
- Serialization/deserialization (particularly JSON) can be slow. AVRO is recommended, but the AVRO schema registry must be supported by all consuming tools. Example: Flink currently supports only the Confluent Schema Registry.
- To join tables correctly, full records are required — the latest state is not sufficient. Example: if there is a change in the OBJECT table but none in RELATION, there will be no join match. Therefore, it is necessary to read from the oldest offset, which leads to growing data volumes (compaction is a possible but not ideal solution).
8. Materialize and RisingWave
- Although both platforms can read data from Kafka, they do not support direct CDC from Oracle (as of Summer 2025). This means that to consume CDF from Oracle you must use tools such as Debezium or AWS DMS together with Kafka to prepare the data for Materialize and RisingWave.
- Both support Confluent Schema Registry, but support for other registries (such as AWS Glue Schema Registry) is limited (as of Summer 2025).
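Conceptually, the ingestion then looks like the following Materialize-style SQL. The connection objects and format clauses are simplified, and the exact syntax differs between Materialize and RisingWave and across versions, so treat this purely as a sketch:
-- Ingest the Debezium changelog of the RELATION table from a Kafka topic
create source relation_src
  from kafka connection kafka_conn (topic 'oracle.app.RELATION')
  format avro using confluent schema registry connection csr_conn
  envelope debezium;

-- The business object is an incrementally maintained materialized view over such sources
create materialized view relation_bo as
select r.relation_id, o.name as object_name
from relation_src r
left join object_src o on r.object1_fk = o.object_id;   -- object_src ingested analogously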
5. Decision and coarse architecture
After analyzing pros and cons, the decision was to use a CDF/CDC-based approach: extract all changes via CDC from the source system and forward them to downstream platforms/tools.
Selected components:
- Kafka (AWS MSK) as distribution platform for CDC/CDF
- Debezium with MSK Connect as CDC extraction tool
This combination is viewed as the most flexible solution.
Subscribe to the Apps Associates blog for more in-depth technical architecture discussions. Connect with us on LinkedIn to ask questions and chat with our experts. For more information on Apps Associates, contact us today.
