By Yuvaraj Mahendran, Principal Engineer, IAS
At Integral Ad Science we constantly experiment with technologies to process massive datasets and surface meaningful performance metrics for our customers. One of our major initiatives over the coming quarters is to introduce streaming into our multi-billion-events-per-hour data ingestion layer and provide real-time metrics to our customers. Without careful planning, introducing streaming into a pipeline of this scale could easily consume multiple quarters before delivering any benefit. This blog covers our phased plan to introduce streaming into our system and highlights the tracers we added to automatically test data consistency in the streaming pipeline.
Batch processing pipeline
The current log processing pipeline is a batch-oriented process that relies heavily on event logs generated by hundreds of edge applications deployed across the globe. A horizontally scalable log transfer process ensures that logs from all edge applications make it to the central datastore (S3 at this point). The logs are transferred into buckets corresponding to ten- or twenty-minute time chunks and persisted for downstream applications to process. The processors for a given time chunk are triggered once our home-grown application, Gatekeeper, can assert that logs from all of the relevant edge applications have made it to the central S3 location.
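To make the triggering logic concrete, here is a minimal, hypothetical sketch of the kind of completeness check Gatekeeper performs. It assumes log objects are keyed by time chunk and edge-application ID (the key layout and names are illustrative, not our actual schema): a time chunk is released for processing only when every expected edge application has delivered its log.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative only: the real Gatekeeper works against S3 listings, not in-memory lists.
public class ChunkCompletenessCheck {

    /**
     * A time chunk is ready for downstream processing only when every expected
     * edge application has delivered a log object for that chunk.
     *
     * @param expectedEdgeIds all edge applications that should report for this chunk
     * @param presentKeys     object keys found under the chunk prefix,
     *                        e.g. "logs/2021-01-06T10:00/edge-42.log.gz"
     */
    public static boolean isChunkComplete(Set<String> expectedEdgeIds, List<String> presentKeys) {
        Set<String> reportedEdgeIds = presentKeys.stream()
                .map(key -> key.substring(key.lastIndexOf('/') + 1))   // "edge-42.log.gz"
                .map(name -> name.substring(0, name.indexOf('.')))     // "edge-42"
                .collect(Collectors.toSet());
        return reportedEdgeIds.containsAll(expectedEdgeIds);
    }

    public static void main(String[] args) {
        Set<String> expected = Set.of("edge-1", "edge-2");
        List<String> keys = List.of(
                "logs/2021-01-06T10:00/edge-1.log.gz",
                "logs/2021-01-06T10:00/edge-2.log.gz");
        System.out.println(isChunkComplete(expected, keys)); // true -> trigger processors
    }
}
```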

Advantages of batch processing pipeline
- Clear time boundaries: When log files are corrupt, recovery is straightforward; we simply re-fetch or regenerate data from the edge applications for the specific time chunks where the data is corrupt.
- Reprocessing logs for specific time chunks is relatively easy: If and when there are issues with the output for a certain time period, we can reprocess just those logs and regenerate the output for the period in question.
- More resilient to data delays and problems with edge applications: Batch-oriented systems are designed to start processing only after all the required logs have made it to the central datastore.
- Data completeness is much easier to validate: Since the entire process is log based, it suffices to ensure that the log files are available in their entirety.
Disadvantages
- Delayed start: Processing for a given time chunk begins only when all the logs for that chunk are available.
- Process delays: Given the wait times introduced between processes in the batch-based system, the overall time to deliver our output and customer-facing reports is pushed back by at least a few hours.
Stream Processing
While converting our entire ETL layer to be 100% stream-based could easily be a multi-quarter undertaking, our goal is to introduce streaming in parts of the system, allowing us to evaluate different technologies, work through shortcomings, and proceed or pivot appropriately. Hence, as a first step, our plan is to convert the data ingestion pipeline to streaming.
Each edge server currently has two main applications running:
- Client-facing application: Receives billions of requests hourly and generates log files locally on the servers themselves.
- Stellar, a companion backend application: Takes care of ingesting the logs into S3.
We are enhancing Stellar to stream log records into Kafka. We recently did a case study comparing Kafka and Pulsar and decided to stick with Kafka.
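As a rough illustration of the enhancement, here is a minimal Kafka producer sketch of the kind of change we are making in Stellar. The topic name, key, and payload below are hypothetical, and the real application handles batching, serialization, and error handling far more carefully.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class LogRecordStreamer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-broker:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Idempotent producer: the broker deduplicates retries, a building block of exactly-once.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by edge-server id keeps one server's records ordered within a partition.
            producer.send(new ProducerRecord<>("edge-log-records",
                    "edge-42", "{\"event\":\"impression\",\"ts\":1609929600000}"));
        } // try-with-resources closes the producer, flushing any buffered records.
    }
}
```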
Ensuring data consistency
In addition to enabling exactly-once semantics when producing messages to Kafka, we added two mechanisms to monitor data consistency:
- Health pings: In addition to streaming log records, Stellar instances produce health-ping messages every 5ms (configurable). A separate application (Gatekeeper) is responsible for consuming these messages and ensuring Kafka connectivity and infrastructure stability by verifying that the right number of health pings are present (see the producer sketch after this list).
- High-level counts: As Stellar streams data into Kafka, it also gathers high-level metrics on some key entities, aggregates them to minutely granularity, and persists these stats in Cassandra (stay tuned for a blog entry on our performance analysis of Cassandra vs. Aerospike vs. DynamoDB for this use case). Consumer applications can then verify the data for the appropriate time chunks to ensure 100% of the data has been consumed (a counting sketch follows as well).
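Below is a minimal sketch of what the health-ping side could look like, again with hypothetical topic and property names: an idempotent producer (one building block of Kafka's exactly-once support) plus a scheduled task that emits a ping at a configurable interval, so that Gatekeeper can count pings per time window on the consuming side.

```java
import java.util.Properties;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class HealthPinger {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-broker:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        // Ping interval is configurable; the default here mirrors the 5ms mentioned above.
        long pingIntervalMs = Long.getLong("stellar.ping.interval.ms", 5L);

        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            String ping = String.format("{\"edgeId\":\"edge-42\",\"ts\":%d}", System.currentTimeMillis());
            producer.send(new ProducerRecord<>("stellar-health-pings", "edge-42", ping));
        }, 0, pingIntervalMs, TimeUnit.MILLISECONDS);
        // Gatekeeper consumes "stellar-health-pings" and checks that each edge instance
        // produced the expected number of pings per window.
    }
}
```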

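And a sketch of the high-level counts tracer: per-entity record counts are accumulated into minute buckets in memory and periodically flushed to the stats store (the Cassandra write itself is omitted here); consumers compute the same per-minute counts on their side and compare. Names and structure are illustrative.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical sketch of producer-side tracer counts, aggregated to minute buckets. */
public class MinutelyCounts {
    // (minuteBucket, entity) -> count of records streamed for that entity in that minute.
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    public void record(String entity, long eventTimeMillis) {
        Instant minute = Instant.ofEpochMilli(eventTimeMillis).truncatedTo(ChronoUnit.MINUTES);
        counts.computeIfAbsent(minute + "|" + entity, k -> new LongAdder()).increment();
    }

    // A flush job would periodically write these rows to Cassandra keyed by
    // (minute, entity); downstream consumers compare their own per-minute counts
    // against these rows to confirm 100% of the data was consumed.
    public Map<String, LongAdder> snapshot() {
        return Map.copyOf(counts);
    }
}
```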
It’s that simple, right?
Not quite. While stream processing can improve the performance of our data processing, it does not come for free. Here are some of the caveats:
- Increased system complexity: Streaming introduces multiple new components, including new infrastructure, into our ecosystem, all of which have to be monitored, maintained, and enhanced. In our case we wanted to ensure the benefits outweigh the added complexity before we convert all of the ETL applications to streaming.
- Handling data delays: In batch processing, data delays could potentially have a domino effect on the aggregation pipeline, but this can be offset by processing input data in parallel. In the streaming world, this is not an easy task. Data processing frameworks like Flink do offer options for handling late-arriving data (see the sketch after this list), but they require more memory during processing, trading memory for disk space to stay within the SLA. Exploring more options in this area is still a work in progress for us.
- Cost of reprocessing: It is common to have a persistent cluster processing streaming data. Hence, when the need arises to reprocess old data, we invariably have to either launch a new cluster or build a way to pump old data into the live streaming pipes, introducing more complexity into the codebase and/or infrastructure.
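For context on the late-data point above, this is roughly what the Flink option looks like: a bounded-out-of-orderness watermark plus allowedLateness keeps window state alive longer (which is where the extra memory goes), and anything later still is diverted to a side output rather than silently dropped. The source, types, and durations below are illustrative, not our production job.

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

public class LateDataSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical source of (entityId, eventTimeMillis, count) records; in reality
        // this would be a Kafka source carrying the streamed log records.
        DataStream<Tuple3<String, Long, Long>> events =
                env.fromElements(Tuple3.of("edge-42", System.currentTimeMillis(), 1L));

        OutputTag<Tuple3<String, Long, Long>> lateTag =
                new OutputTag<Tuple3<String, Long, Long>>("late-records") {};

        SingleOutputStreamOperator<Tuple3<String, Long, Long>> perMinuteCounts = events
                // Tolerate records arriving up to 30 seconds out of order.
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Tuple3<String, Long, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                                .withTimestampAssigner((event, ts) -> event.f1))
                .keyBy(event -> event.f0)
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                // Keep window state for 5 extra minutes so late records can still update results...
                .allowedLateness(Time.minutes(5))
                // ...and divert anything later than that to a side output instead of dropping it.
                .sideOutputLateData(lateTag)
                .sum(2);

        perMinuteCounts.print("on-time");
        perMinuteCounts.getSideOutput(lateTag).print("too-late");

        env.execute("late-data-sketch");
    }
}
```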
Conclusion
At Integral Ad Science, while we continue to ingest logs via S3, we are also introducing a Kafka-based streaming pipeline from which some of our non-revenue-sensitive applications consume. After a few sprints of exposure to both the benefits and the hurdles of this approach, we may convert other parts of our system as well.
Article originally published on January 6, 2021.