In a production network, logs are generated by many security and networking components: intrusion prevention and detection systems (IPS/IDS), firewalls, web servers, application servers, database servers, routers, switches, endpoints, and load balancers. If all of these logs are indiscriminately piped into a Security Information and Event Management (SIEM) system, the result is an overload of redundant data, increased storage costs, and degraded performance.
What Logs Should Be Piped to the SIEM?
To optimize SIEM performance, only logs that provide meaningful security insight should be ingested. Edge security devices such as IPS, IDS, firewalls, and web application firewalls (WAFs) should send logs for denied connections, blocked threats, anomalous behavior such as port scanning or excessive failed authentication attempts, and critical rule violations. Authentication and identity systems, including Active Directory and IAM solutions, should contribute logs covering successful and failed login attempts, privilege escalations, permission changes, and account lockouts.
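As a rough illustration of this kind of source-side selection, the sketch below applies a forwarding predicate to parsed firewall and authentication events. The field names, event categories, and threshold are assumptions for illustration, not any vendor's actual schema.

```python
# Minimal sketch of source-side log selection (hypothetical field names).
# Only events matching security-relevant criteria are forwarded to the SIEM.

SECURITY_RELEVANT_ACTIONS = {
    "connection_denied",      # firewall/IPS blocks
    "threat_blocked",         # WAF/IDS signature hits
    "port_scan_detected",     # anomalous scanning behavior
    "privilege_escalation",   # IAM / Active Directory changes
    "permission_change",
    "account_lockout",
}

FAILED_LOGIN_THRESHOLD = 5    # illustrative: forward only repeated failures


def should_forward(event: dict) -> bool:
    """Return True if a parsed event is worth sending to the SIEM."""
    action = event.get("action")
    if action in SECURITY_RELEVANT_ACTIONS:
        return True
    # Forward failed logins only once they become excessive for a user.
    if action == "login_failed" and event.get("recent_failures", 0) >= FAILED_LOGIN_THRESHOLD:
        return True
    return False


events = [
    {"action": "connection_allowed", "src_ip": "10.0.0.5"},
    {"action": "connection_denied", "src_ip": "203.0.113.9"},
    {"action": "login_failed", "user": "alice", "recent_failures": 7},
]
to_siem = [e for e in events if should_forward(e)]   # drops the routine allow
```

In practice this predicate would live in the log shipper or aggregation tier, but the decision logic is the same: routine allows and successes stay out of the SIEM, while denials, blocks, and abuse patterns go in.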
Web, application, and database servers should focus on unauthorized access attempts, application-layer attacks such as SQL injection and cross-site scripting, and unexpected privilege escalations or configuration changes. Endpoint detection and response (EDR) solutions and antivirus software should provide logs on malware detections, remediation actions, behavioral anomalies that indicate possible compromise, and unusual command execution. Network infrastructure logs from routers, switches, and VPNs should capture significant configuration changes, unexpected topology changes, new device detections, and VPN authentication activity. These network logs should also be structured to give clear visibility into an attacker's traversal path, so analysts can trace the movement of threats without wading through redundant entries. Correlating network logs with endpoint and authentication data provides a holistic view of an attacker's movement through the environment. Finally, cloud security and API logs should highlight unauthorized API calls, security group misconfigurations, and large data transfers or suspicious access patterns.
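To make the correlation idea concrete, here is a small sketch that groups network, endpoint, and authentication events by a shared IP address to approximate an attacker's traversal path. The event structures, field names, and sample values are hypothetical.

```python
from collections import defaultdict

# Hypothetical parsed events from three different log sources.
network_events = [
    {"src_ip": "198.51.100.7", "dst_ip": "10.0.0.20", "event": "vpn_login", "ts": "09:01"},
    {"src_ip": "10.0.0.20", "dst_ip": "10.0.0.30", "event": "new_smb_session", "ts": "09:05"},
]
endpoint_events = [
    {"host_ip": "10.0.0.20", "event": "unusual_command_execution", "ts": "09:03"},
]
auth_events = [
    {"src_ip": "10.0.0.20", "user": "svc_backup", "event": "privilege_escalation", "ts": "09:04"},
]


def build_timeline():
    """Correlate events by IP so analysts can trace movement host to host."""
    timeline = defaultdict(list)
    for e in network_events:
        timeline[e["src_ip"]].append((e["ts"], "network", e["event"]))
    for e in endpoint_events:
        timeline[e["host_ip"]].append((e["ts"], "endpoint", e["event"]))
    for e in auth_events:
        timeline[e["src_ip"]].append((e["ts"], "auth", e["event"]))
    # Sort each host's activity chronologically to expose the traversal path.
    return {ip: sorted(entries) for ip, entries in timeline.items()}


for ip, entries in build_timeline().items():
    print(ip, entries)
```

A SIEM correlation rule does essentially this join continuously and at scale; the point of structuring network logs well is to make sure the join keys (IPs, hostnames, usernames) are present and consistent.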
Reducing Redundancy in SIEM Log Ingestion
To avoid redundancy, organizations should filter logs at the source so that only security-relevant data is forwarded. Similar logs generated by multiple sources should be de-duplicated to prevent unnecessary storage and processing. Log aggregators such as Fluentd and Logstash can preprocess and normalize logs before they reach the SIEM. Additionally, tiered retention and routing policies can keep verbose logs in cheaper storage while only high-priority events are sent to the SIEM.
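The de-duplication step can be as simple as fingerprinting each event on its stable fields and dropping repeats seen within a time window, as the sketch below illustrates. The choice of key fields and the window length are assumptions and would be tuned per log source.

```python
import hashlib
import json
import time

SEEN: dict[str, float] = {}          # fingerprint -> last time seen
DEDUP_WINDOW_SECONDS = 300           # illustrative 5-minute window


def fingerprint(event: dict) -> str:
    """Hash only the fields that identify 'the same' event across sources."""
    key_fields = {k: event.get(k) for k in ("action", "src_ip", "dst_ip", "user")}
    return hashlib.sha256(json.dumps(key_fields, sort_keys=True).encode()).hexdigest()


def is_duplicate(event: dict, now: float | None = None) -> bool:
    """Return True if an equivalent event was already forwarded recently."""
    now = now or time.time()
    fp = fingerprint(event)
    last = SEEN.get(fp)
    SEEN[fp] = now
    return last is not None and (now - last) < DEDUP_WINDOW_SECONDS
```

Log aggregators generally offer comparable fingerprinting and filtering capabilities, so in practice this logic often lives in the aggregation tier rather than in custom code.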
Leveraging a Data Lake Before Sending Logs to the SIEM
Instead of sending all logs directly to the SIEM, a data lake can serve as the central repository for raw logs, with the SIEM selectively ingesting events based on analytics and security correlation requirements. This approach has several advantages. Cost efficiency is a key benefit: SIEM licensing is often priced by log volume, so landing raw logs in a data lake first keeps ingestion under control. Data lakes also offer better long-term retention for compliance and forensic investigations than SIEM storage. Machine learning models can be applied to the data lake to identify patterns before logs are forwarded to the SIEM, and events can be enriched with threat intelligence and contextual information before ingestion, improving detection accuracy. Furthermore, data lakes provide an excellent foundation for cyber threat hunting, allowing security analysts to conduct deep-dive investigations and surface anomalies that may never trigger an immediate SIEM alert. By leveraging historical data and correlating different data sources, analysts can proactively detect sophisticated attack campaigns.
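As a rough sketch of selective ingestion, the snippet below reads raw events (as parsed dictionaries) from the lake, tags those that match a threat intelligence list, and forwards only enriched or high-severity events to the SIEM. The intel feed, severity categories, and send_to_siem function are all placeholders.

```python
# Sketch of enrichment and selective forwarding from a data lake to the SIEM.
# The intel feed, severity set, and send_to_siem() are placeholders.

KNOWN_BAD_IPS = {"203.0.113.9", "198.51.100.7"}   # stand-in threat intel feed
HIGH_SEVERITY = {"privilege_escalation", "malware_detected"}


def send_to_siem(event: dict) -> None:
    print("forwarding to SIEM:", event)            # replace with the real forwarder


def enrich(event: dict) -> dict:
    """Attach threat-intel context before the event reaches the SIEM."""
    if event.get("src_ip") in KNOWN_BAD_IPS:
        event["threat_intel_match"] = True
    return event


def ingest_from_lake(raw_events):
    """Forward only enriched or high-severity events; leave the rest in the lake."""
    for event in map(enrich, raw_events):
        if event.get("threat_intel_match") or event.get("action") in HIGH_SEVERITY:
            send_to_siem(event)


ingest_from_lake([
    {"action": "connection_denied", "src_ip": "203.0.113.9"},
    {"action": "connection_allowed", "src_ip": "10.0.0.5"},   # stays in the lake
    {"action": "privilege_escalation", "user": "svc_backup"},
])
```

The same pattern scales up to scheduled queries or streaming jobs over the lake; the essential idea is that enrichment and triage happen before an event ever counts against SIEM ingestion.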
Despite these benefits, using a data lake presents challenges. Setting up a data lake requires expertise in data engineering to properly structure and manage the system. Real-time threat detection may be delayed if logs are first sent to a data lake before reaching the SIEM. While data lake storage is generally cheaper than SIEM storage, excessive data accumulation can still lead to significant costs.
Moreover, implementing a data lake can be difficult if logs are already being sent directly to the SIEM. A transition typically requires logs to be shipped concurrently to both the SIEM and the data lake before a full cutover, which touches every log source because each must be reconfigured. This work may not be well received by the teams that own those systems, as it adds workload and risks disruption. To ensure a smooth transition, organizations may need top-down support from management to mandate the necessary changes and allocate resources for implementation.
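During such a transition, the forwarding layer can simply ship each event to both destinations until the cutover is complete. The sketch below shows the shape of that dual-write under the assumption that both sinks are placeholder functions and a simple flag controls the cutover.

```python
# Sketch of dual-shipping during a SIEM-to-data-lake transition.
# Both sinks are placeholders; a feature flag controls the cutover.

SEND_TO_SIEM_DIRECTLY = True   # flip to False once the lake pipeline is trusted


def write_to_lake(event: dict) -> None:
    print("lake:", event)       # placeholder for the data lake writer


def write_to_siem(event: dict) -> None:
    print("siem:", event)       # placeholder for the direct SIEM forwarder


def forward(event: dict) -> None:
    """Ship every event to the lake; keep the direct SIEM path during migration."""
    write_to_lake(event)
    if SEND_TO_SIEM_DIRECTLY:
        write_to_siem(event)
```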
Conclusion
To optimize security log management, organizations should adopt a hybrid approach. Critical security logs should be piped directly to the SIEM for real-time monitoring and alerting, while a data lake is used for bulk log storage, with only useful logs selectively ingested into the SIEM. Implementing log filtering and de-duplication ensures the SIEM remains efficient and avoids unnecessary overhead. Network logs should be structured to provide clear attack-path visualization while keeping log volume manageable. Transitioning to a data lake requires careful planning, coordination, and management support to ensure a successful implementation. By integrating cyber threat hunting capabilities, organizations can proactively identify security risks and enhance their overall defense posture. This strategy improves threat detection while keeping operational and storage costs under control.
