X15 is a machine data platform that has been designed from the ground up to store massive quantities of semistructured data, and allow interactive search and complex analysis over it. X15 uses multi-level parallelism to produce query results as quickly as possible. In this post, we will look at the benefits of multi-level parallelism and its impact on performance and scalability.
X15 Enterprise™ Architecture
Before jumping into the details of parallel query processing in X15 Enterprise, let us quickly take a look at its architecture. X15 Enterprise is deployed in a Hadoop cluster, and uses HDFS for storing data as well as indexes. The X15 server runs as a single process on each machine in the cluster, and is capable of performing the following operations:
- Data parsing
- Query evaluation
- Client Connectivity through REST/Websocket and JDBC/ODBC
X15 Enterprise uses a peer-to-peer architecture to ensure that all the above operations are automatically handled in a balanced, fault-tolerant manner by all the nodes in the cluster. This makes X15 Enterprise extremely easy to make operational without the need for system administrators to manually load balance different kinds of processes that play specialized roles such as Splunk indexers and search heads.
Every X15 node in the cluster is capable of accepting client connections over every protocol (REST, WebSockets, JDBC, ODBC). Internally, each X15 server automatically load-balances the work offered, thus avoiding hotspots.
Of specific importance to the topic of this post is the ability of each X15 server to index and store data and also to evaluate queries. This design allows X15 to parallelize every part of a query, which directly translates to low response times (performance) and linear scalability as nodes are added to the cluster.
Machine Data Analysis = Search + Analysis
Analyzing machine data usually begins with search, where the user narrows down the data to the subset of interest using simple keywords. For example, imagine we have a server that runs sshd that generates security logs, like the snippet shown below.
Every failed login is captured in this log to include the IP Address of the host that tried to login to our server. While some of these entries reflect honest mistakes made by valid users while typing in their password, a majority of failed logins are due to dictionary attacks where attackers are trying to log in by guessing usernames and passwords. An analyst may want to count how many failed logins were found in the log for each remote IP address since attackers are likely to generate a large number of failed logins from the same IP address. Then the analyst might want to map each IP address to the country of origin.
First, the analyst would start by isolating the records that indicate that a login failure occurred. This can be achieved by performing a simple search operation for “failed password”. As the next step, the analyst would have to extract the offending IP Address from this log record. Here we assume that such field extraction was not performed upfront (X15 allows for field extraction while the data is being loaded into the system, or afterwards when the data is being analyzed. This is a topic for another post). The next step after the records have been identified by the search and the IP address has been extracted, is to perform the count for each IP address. Remember that on a popular public server, the number of unique IP addresses will be extremely large. As the last step, the analyst would use a Geolocation database that maps IP address blocks to their location to look up the location of each IP address. This lookup process entails a range search within blocks of IP addresses and X15 implements some innovative algorithms to perform this lookup very efficiently. Again, this is the topic of another blog post.
The example shown above, while being simplistic, still drives home the point that search is an important component in analyzing log data. However, simply being able to search is hardly sufficient to solve the challenges facing security analysts. It is imperative that a machine data platform be able to perform aggregations and correlation with other data sets during analysis. In the rest of this post we will focus on parallelism and its impact on scalability for this simple example.
X15 Enterprise™ Intra-Query Parallelism
Intra-query parallelism refers to the ability to decompose a single query into fine-grained tasks so that each task can be performed in parallel, thus reducing the total time to completion. X15 uses the power of the entire cluster to parallelize every step of the query. While collectively, the entire cluster stores all the log entries in the security dataset, any one node is responsible for only a part of that dataset. When a search is performed on this data, each node performs the same search on the part of the data that it owns. IP address extraction is performed by each node only for the records that the search step returned on that node. The next step of counting the number of occurrences of each IP address on each node separately yields the partial result of the counting operation, and needs to be aggregated across all machines to get the final counts. However, all the steps listed so far are performed on each node, in parallel, taking N nodes 1/Nth of the time it would have taken a single machine to perform the same steps. So far this is no different from Splunk functionality (ElasticSearch and SolR cannot do this). The difference that X15 brings to this problem is in the parallelization of the next steps in the query.
The same IP address may occur on multiple nodes in the cluster since we only counted the number of times an IP address appeared in the data that each machine was responsible for. An aggregation step of all the partial counts is required to compute the final counts for each IP address. Each X15 node splits the partial results into smaller parts so that the IP address and its partial count are sent to all the nodes in the cluster. This step is performed in such a way that every node sends the same IP Address to the same node, in essence making each node responsible for computing the final answer for a subset of the IP Address values. Finally, each node performs the geo-location mapping for the subset of the IP Addresses it was responsible for, again performing this step in parallel. Note that the final aggregation step and the geo-location mapping steps are performed in parallel, taking N nodes only 1/Nth of the time it would take a single node to compute the same result. Such an architecture fully utilizes the underlying network and distributes work evenly to the nodes, avoiding any hotspots while processing the query.
The parallelization strategy described here is very similar to that of an MPP database. In contrast, search-based technologies would perform the final aggregation and the expensive geo-location lookup steps at a single node, as shown below, unable to exploit the scalability of the entire cluster. This creates a hotspot at the search head — the single node that does all the operations after the initial parallel step. The problem becomes more acute as the number of unique IP addresses increases, because of the sequential last steps.
We will now see the impact of X15 multi-level parallelism on the query scalability. But first, we will take a tour of Amdahl’s law; the law that governs the speedup achieved by a task, as we add more processing nodes.
Amdahl’s Law and its Speedup Consequence
Amdahl’s Law states that the maximum theoretical speedup (change in time to completion, as we add more processing resources) is determined by what fraction of the task is parallelizable. Specifically, as we add more processors, only that part of the task that can be parallelized gets faster. The sequential part of the task still takes as long, in spite of the number of processors in the system. The graph below shows the impact of this law on time to completion of a query as the parallel portion of the query changes. The blue line, for example, shows that if half of a task is parallelizable, then using two processors makes the task only 33% faster. Using three processors makes the task 50% faster than using a single processor. The best speed improvement to this task we can hope for is no more than 100% by using close to 16 processors or more.
Amdahl’s Law underscores the importance of making every part of a system as parallel as possible to achieve the best speedup as more processors are used. Since X15 parallelizes every step of the example query, the query shows true linear speedup as more nodes are added to the cluster. On the other hand, the architecture that utilizes a single search head for the final steps, moves closer to the blue line in the graph, showing very little improvement in query completion times as more nodes are added to the cluster.