« Back to All Blog Posts

The Power of JOIN For Large-Scale Searches

In this post, we introduce X15 basics that we will discuss in future posts—taking it from the very top.

X15 is a data management system that looks a lot like a relational database but has many new tricks up its sleeve to address the challenges of event or log data (we will use the term event data from now on). We also define X15 as an event data warehouse because it shares many similarities with what can be described as a traditional data warehouse, such as SQL queries, open interfaces, interoperability with well-established BI tools and custom applications. However, X15 is designed from the ground up to operate on streaming, machine-generated event data.

While search-based tools have dealt with event data for a while, they are designed for DevOps IT professionals and are quite difficult for traditional business users to master. Consumers of the results provided by search tools are often limited to a single user interface or require development of custom Web applications, which requires expertise and can cause delays in providing business value. X15 strives to apply a familiar and proven paradigm to jumpstart event data analysis by presenting the user with databases, tables, views and ANSI SQL 2003 as the language to query the data.

Event data comes in many file formats. Some formats are self-describing, such as CSV, JSON, XML or AVRO. These formats, in addition to containing actual data, house the underlying structure of the data, which can be used to derive a schema. Other formats are in unstructured text form and through application of some intelligent process can be parsed to extract fields to construct a schema. We will discuss how X15 derives schemas from all these formats in a future post.

One issue arises in parsing fields from unstructured text: Are you really getting all the information you need from the resulting fields? Are you missing something? Well, you may miss something unless you store the original unaltered record in a field alongside the extracted fields—which is what X15 does. We call this all-encompassing field the _raw field. X15 also takes care of optimizing redundant data by applying intelligent compression, which results in an average of 10x storage savings against event data.

The _raw field is tokenized by way of a text index, enabling searches against that field to essentially search the entire table. The next question you may ask is “How do you do a search on this _raw field with SQL? I don’t see it.” Well, you are right in assuming that a LIKE expression in a WHERE clause is not going to cut it. You need a little X15 secret sauce to do searches on _raw. Let’s look at how we implemented the ability to search within a SQL statement.

X15 chose the Lucene query syntax to enable search queries to be used as a filtering mechanism for a SQL statement. We chose Lucene because of its pervasiveness and prevalence in the search space. To integrate a Lucene search query, we had to teach SQL a new trick. Enter the @@ operator.

This is how the @@ operator works: You write a SQL statement and filter the statement with a WHERE clause. Translation: You want some, or all, fields returned from a table where the records meet a certain criteria. Now let’s build criteria based on an embedded Lucene search query, as shown:

SELECT * FROM events WHERE _raw @@ '"Mozilla Gecko"'

We’re requesting all columns in the events table where the _raw field contains a phrase with two words: Mozilla and Gecko. If those two words are not right next to each other and in the right sequence, no records are returned. So let’s change the criteria by using a Lucene proximity search:

SELECT * FROM events WHERE _raw @@ '"Mozilla Gecko"~9'

Now we’re looking for records with both Mozilla and Gecko in the _raw field where those two words are, at most, nine words apart, regardless of the order.

The preceding example only scratches the surface. Take a look at the Lucene parser syntax for more advanced search capability: http://lucene.apache.org/core/5_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description

Adding Lucene search query syntax, while interesting, is not the most amazing part of integrating search and SQL. The following example demonstrates the merits of a strong integration of both technologies.

First, let’s build our example by starting from the beginning with a simple use case. We’d like to get all the records in a table where the following IP address is present:

If we’re not using X15 and don’t have access to something like the _raw field to search on, we would need to look for the IP address in the columns resulting from the parsing process executed at load time. But what if there are hundreds of fields to choose from in the events table? Which one(s) should we use to search against? Are we sure the records are correctly parsed so that the IP address ends up in the appropriate field?

As we’ve established earlier, X15 solves the problem associated with parsing by providing the _raw field where we can be sure to find an IP address if it’s in any of the events. One single field to rule them all.

Because every number in an IP address is individually tokenized in the text index against the _raw field, we express the IP address as a phrase in Lucene terms. In X15, the IP address should be enclosed in double quotation marks for use in a SQL statement. The Lucene search term itself must be enclosed in single quotes, because literals are passed as a string. The complete query syntax in X15 is as follows:

SELECT * FROM events WHERE _raw @@ '""'

But what if we have thousands of IP addresses to search for? I don’t think any of us would have the patience to run thousands of individual queries one by one. (I’m being facetious here …)

Some search technologies can search against a list of IP addresses at the time of ingestion, but what about searching older events—possibly going back many months— against a recently downloaded list of potential harmful IP addresses? Enter the power of the SQL JOIN.

Our first table, the events table, contains the _raw field. A second table, called bad_ips, contains a field named ipv4 with 7,000 unique IP addresses. Here’s how we would logically write the query to search for those 7,000 IP addresses against the _raw field in events:

SELECT * from events JOIN bad_ips on _raw @@ ipv4

Unless the ipv4 field already encloses the IP addresses in double quotation marks, we must use double quotation marks for the query to work properly. Remember, those numbers are individual tokens and we’re looking for a phrase. Here is the final query that properly executes in X15:

SELECT * from events JOIN bad_ips on _raw @@ '"' + ipv4 + '"'

There you have it: a simple way to express a search across unstructured text to find records that contain values stored in a different table. We can search for a recent value in older events or for older values in recent events, independently of when they were ingested (parsed and indexed).

In future posts, we will introduce additional concepts such as “schema on ingest,” dynamic table structures, how to connect to X15 with any query tool that supports PostgreSQL and much more. Stay tuned.