Baselining and anomaly detection are security concepts that have been around for quite a while; recently, however, both have received renewed interest. This new attention stems from increased regulatory focus on incident response and from the recognition that, in today’s cybersecurity world, a breach is no longer a question of “if” but “when.” Cyber-attacks have evolved to the point where they can pass through technical defenses, blend into an environment and remain undetected for as long as necessary to carry out their malicious objective. Detecting these covert assaults requires vigilant monitoring and a thorough understanding of what is considered “normal” behavior in an IT environment, including both systems and users. Identifying normal vs. abnormal behavior is the foundation of Security Information and Event Management (SIEM), which has created an entire industry within the security profession. It’s worth noting that SIEM systems are only effective when they are properly configured and have visibility into complete and accurate system information over a period of time. Without an accurate historical record of normal IT operating activity, it’s next to impossible to detect minute variations in performance or behavior that may subtly signal malicious activity.
For many IT organizations, defining “normal” is a difficult task, even for those with advanced monitoring systems. Such difficulty can stem from incomplete or inaccurate audit logs, misconfigured systems or simply a lack of system reporting across the environment. For example, some organizations may have extensive network performance information yet be missing critical host information, and are therefore unable to determine whether increased network traffic is the result of a misconfigured or unauthorized device on the network. For those that do have this data, the next challenge is how to collect and analyze it properly to provide a fairly accurate representation of “normal” (the baseline) so that any deviations (anomalies) can be quickly and easily recognized. Again, for many organizations this can be a difficult task, since it can involve obtaining and processing large volumes of data from many different sources. As is the case with many large projects, breaking it down into small components can be helpful. Here are some fundamental steps to help get started establishing a baseline in your environment.
At a high level, baselining means tracking many attributes at multiple levels across all systems at various times to know what is happening, or should be happening, at any given time. As noted in the example above, this goes beyond network monitoring; it also includes host, application and user behavior monitoring. For example, if a particular device is supposed to interact only with a specific thing (device, user, etc.) at a specific place and time, and do only a specific thing, then any unexpected deviation should trigger an investigation. Attributes of even the most precise systems usually show some level of deviation at various points, so establishing a range of tolerability is also key to baselining. The only way to have “certainty” is to have a complete understanding of the requirements and a record of past behavior, including known deviations. In simple terms, it’s all about collecting, retaining and analyzing data. The following are key ingredients and tasks that must be performed to establish a meaningful baseline useful for anomaly detection.
- Asset Inventory – The first step is to know your environment inside and out, which means having a current, comprehensive, and accurate IT asset inventory and network topology. This should include all hardware, software, internal, and external network connections, configurations, etc.
- Perform a risk assessment to determine which assets are most critical, have the highest risk and the most activity. This will help focus the effort and reduce information overload during the data collection process.
- Logging – The second critical piece is logging. Logs are fundamental and an absolute necessity for baselining and anomaly detection, because they are where the majority of system and user activity information is obtained. The hard part is knowing what to log and how much. In its simplest form, a log is a collection of messages generated in response to some action. For a log to be useful it must:
- Include a synchronized time stamp for each event and data source;
- Contain sufficient detail to identify a wide range of system activity;
- Be archived; and
- Include type identifiers: informational, debug, warning, error, and alert.
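The requirements above can be checked automatically before a log source is trusted as baseline input. The following is a minimal sketch, assuming each log entry has already been parsed into a dictionary; the field names (`timestamp`, `type`, `source`, `message`) and the exact spelling of the type identifiers are illustrative assumptions, not a standard format.

```python
from datetime import datetime

# Type identifiers from the list above; the exact spelling is an assumption.
ALLOWED_TYPES = {"informational", "debug", "warning", "error", "alert"}


def check_record(record):
    """Return a list of problems found in one parsed log record (empty if usable)."""
    problems = []

    # A timestamp without a UTC offset cannot be lined up with other sources.
    try:
        parsed = datetime.fromisoformat(record.get("timestamp"))
        if parsed.tzinfo is None:
            problems.append("timestamp has no UTC offset")
    except (TypeError, ValueError):
        problems.append("missing or unparseable timestamp")

    if str(record.get("type", "")).lower() not in ALLOWED_TYPES:
        problems.append("missing or unknown type identifier")
    if not record.get("source"):
        problems.append("missing data source")
    if not record.get("message"):
        problems.append("missing message detail")
    return problems


# Hypothetical records for illustration only.
good = {"timestamp": "2024-01-15T08:30:00+00:00", "type": "Warning",
        "source": "fw01", "message": "connection denied"}
bad = {"timestamp": "2024-01-15 08:30:00", "type": "notice",
       "source": "fw01", "message": "connection denied"}

print(check_record(good))
print(check_record(bad))
```

Running a check like this against a sample from each log source gives an early indication of which sources need configuration work before their data is used for baselining.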
Log sourcing depends on the IT assets in place, their use and risk rating. Below is a list of common log sources and types:
- Application log: these are events logged by applications.
- Security log: this log contains records of valid and invalid logon attempts and events related to resource use, such as creating, opening, or deleting files or other objects. This log is also customizable.
- System logs contain system event information (e.g. driver failures and hardware issues).
- Domain controllers have directory service logs that record events from Active Directory and related services.
- File Replication Service logs contain Windows File Replication Service events. SYSVOL changes are recorded in the file replication log.
- DNS servers also store DNS events in the logs.
- System User Logon/Logoff.
- Local Account Logon/Logoff.
- USB Connections and other removable media inserted into Workstations.
- End-User Desktop programs.
- Network Infrastructure devices (Routers, Switches, Firewalls, and other Security solutions)
- Logins and Logouts.
- Connection established to the service.
- Bytes transferred in and out.
- Configuration changes.
- Gathering – Once you have determined which logs you need to accurately report the behavior of your systems, the next step is collection and maintenance. Log files should be retained, preferably on a centrally managed, dedicated and highly restricted server. These logs serve as historical records that are critical for establishing baselines and for potential forensic analysis. The method for transferring logs to such a server depends on the particular device and its native OS; a few common transfer protocols are listed below.
- Windows Event log (Microsoft proprietary)
- Analysis – Now the fun part, and really where the process of baselining begins. You must know your objective: for each system (e.g., server, network device, application), you must determine what the purpose of the system is, which key attributes to monitor, and which key indicators demonstrate that the system is functioning normally. For example, does a web server provide the correct page at the right time with the right information, based upon business requirements? If not, something is obviously wrong, but what? Looking at the logs to find out is easier said than done unless the following are in place:
- Filtering – Remove the noise (false positives) and unnecessary log entries to get to the meaningful messages.
- Data verification – Compare recent log entries to history to look for patterns and determine accuracy of logs.
- Automate – To the extent possible, automate processes to compare and review (e.g. scripting tools, etc.).
- Correlation – Look across the environment for similar occurrences or patterns of “unusual” behavior.
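As an illustration of the data verification and automation steps, the sketch below derives a range of tolerability from historical values and compares recent values against it. It assumes the metric (here, hourly login counts) has already been extracted and filtered from the logs. The band of mean plus or minus three standard deviations is only one common choice, not a prescribed method, and the sample numbers are made up.

```python
from statistics import mean, stdev


def build_baseline(history, k=3.0):
    """Return a (low, high) tolerance band: mean +/- k sample standard deviations."""
    centre = mean(history)
    spread = stdev(history)
    return centre - k * spread, centre + k * spread


def find_deviations(recent, band):
    """Return the (label, value) pairs from `recent` that fall outside the band."""
    low, high = band
    return [(label, value) for label, value in recent if not low <= value <= high]


# Hypothetical hourly login counts collected during a known-good period.
history = [40, 42, 38, 41, 39, 43, 40, 37]
band = build_baseline(history)

# Hypothetical counts from the period under review.
recent = [("09:00", 41), ("10:00", 44), ("11:00", 95), ("12:00", 12)]
for label, value in find_deviations(recent, band):
    print(f"{label}: {value} is outside the baseline band {band}")
```

In practice, the history would cover a much longer period and would likely be split by time of day or day of week, since “normal” at 3 a.m. differs from “normal” at noon. The multiplier `k` controls how wide the range of tolerability is.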
Anomalies (deviations from baseline)
Now that you have a baseline established from a period of historical records of activity and are fairly confident that “normal” has been defined, deviations can be identified and used to determine if malicious or unauthorized activity is occurring. Below are some examples of common anomaly indicators.
- Never-before-seen internal IPs, accounts, devices, etc.
- Large deviations in metrics up or down (e.g. network traffic, log in attempts, helpdesk tickets, etc.).
- Multiple Antivirus failures.
- Remote logins outside of “normal” hours, or unusually large data transmissions.
- Multiple invalid access attempts to restricted resources (e.g. file shares).
- IPs as a source with a high number of unique destinations (e.g. sweep or scan).
- Your database server suddenly has a lot of outbound traffic.
- One of your workstations is trying to connect to too many other hosts at once.
- Your file server is very busy in the middle of the night when it is normally idle.
- You are suddenly getting a lot of firewall hits from a country in which you do not conduct business.
- Active Directory is reporting multiple invalid login attempts from an unknown user.
- Multiple login failures followed by a successful login.
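Several of the indicators above translate into simple checks once the relevant events have been extracted from the logs. The following is a minimal sketch of three of them: repeated login failures followed by a success, a source IP contacting many unique destinations, and entities that never appeared in the baseline. The event formats, function names and thresholds are hypothetical and would need tuning against your own baseline.

```python
from collections import defaultdict


def failures_then_success(events, threshold=3):
    """Flag accounts with at least `threshold` consecutive failures right before a success.

    `events` is a time-ordered iterable of (account, outcome) tuples,
    where outcome is "fail" or "ok" (an assumed format).
    """
    streak = defaultdict(int)
    flagged = []
    for account, outcome in events:
        if outcome == "fail":
            streak[account] += 1
        else:
            if streak[account] >= threshold:
                flagged.append(account)
            streak[account] = 0
    return flagged


def possible_scanners(connections, max_destinations=20):
    """Return source IPs that contacted more than `max_destinations` unique destinations."""
    seen = defaultdict(set)
    for src, dst in connections:
        seen[src].add(dst)
    return sorted(src for src, dsts in seen.items() if len(dsts) > max_destinations)


def never_seen(baseline_entities, observed):
    """Return observed IPs, accounts or devices that are absent from the baseline."""
    return sorted(set(observed) - set(baseline_entities))


# Hypothetical sample data for illustration only.
logins = [("alice", "fail"), ("bob", "fail"), ("alice", "fail"),
          ("alice", "fail"), ("alice", "ok"), ("bob", "ok")]
flows = [("10.0.0.5", f"10.0.1.{i}") for i in range(25)]
flows += [("10.0.0.6", "10.0.1.1"), ("10.0.0.6", "10.0.1.2")]

print(failures_then_success(logins))
print(possible_scanners(flows))
print(never_seen({"10.0.0.5", "10.0.0.6"}, {"10.0.0.5", "10.0.0.99"}))
```

Checks like these only produce candidates for investigation. Whether a flagged account or host is actually malicious still depends on comparing it with what the baseline says is normal for that system.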
IT production environments change every day, and understanding “normal” is certainly difficult. However, by obtaining complete and accurate information from as many key systems as possible, baselines can be established that provide critical references for recognizing anomalies. Time must be invested in gathering, analyzing and tracking data to determine how often and to what extent changes in the IT environment occur, so that you are able to say “yes, we have seen this before” or “no, this is not normal and we should look into it.” In today’s cyber threat environment, baselining is no longer a “nice to have” but a “must have” for making fast, educated decisions about production incidents.
Below is a high-level pictorial representation of Baselining and Anomaly detection processes.
Authored by: Stan Skwarlo