Indexing data in X15 boosts performance significantly, reducing the latency of many queries that use indexed fields by several orders of magnitude.
events table with the fields
starttime (timestamp) and
Let’s say we want to find the entries where the title of the event has a word starting with “blog,” such as “blog,” “blogger” or “blogosphere.” In the absence of indexes, we can issue the query:
select * from events where lower(title) like '% blog%' or lower(title) like 'blog%';
The problem with this query is that it scans the entire
events table, which can contain many billions of records, looking for titles where the “blog” letter sequence appears. Even if only a few number of records match, the query engine must still perform a full scan because it does not have another way to get to those records faster.
Indexes are a way to accelerate finding the answers for these kinds of queries. In our example, we could create a text index on the
title field. To handle text indexes, X15 comes with a set of predefined tokenizers as well as the ability to create custom ones. Here, we use the Simple tokenizer type, which splits text at non-letter characters and converts them to lower case.
Having a text index enables use of the text search operator
@@, which can take the Lucene search query ‘blog*’:
select * from events where title @@ 'blog*';
This SQL query does work that is proportional to the number of matching records and avoids scanning the entire table. The following figure intuitively describes the difference between the execution plan (a) that reads and filters all the records in the table versus the plan (b) that probes an index to find pointers for the matches and then retrieves the records from storage.
X15 also supports indexing field values as value indexes that are used to speed up equality and comparison filters over those fields. They are similar to indexes in a relational database and support most primitive types, such as integer, string and timestamp. With value indexes, we can also use range filters, which are not supported by text indexes.
For instance, if there is a value index on
starttime, a query that asks for events with a starting time between 1 pm and 2 pm (GMT) on May 8, 2015:
select * from events where starttime >= '2015-05-08T13:00:00+00:00'::timestamp and starttime < '2015-05-08T14:00:00+00:00'::timestamp;
can use that index to efficiently retrieve the records in that range.
Suppose now that the
events table—instead of being loaded in X15—resides in a legacy Hive database, stored in HDFS using the ORC file format. Can we reap the benefits of indexing without moving or duplicating the data?
The solution comes under the form of a new X15 feature, unmanaged tables, that is presented in the following figure:
An unmanaged table is defined by pointing to the Hive metastore (the hive connection) and specifying a database and a table name. X15 can query these tables, and we can declare text and value indexes on them, as we do for managed tables. In the previous example, we create the same text index on
title that helps in answering the
title @@ 'blog*' query. The index is used by the query engine in the same way a managed index is, except that the records are read from Hive storage.