Introduction to Elasticsearch

md-devdays.de 2016

Oliver Zeigermann / @DJCordhose

http://bit.ly/1TtgKNx

Contents

  • Elasticsearch
    • Search and Analytics Engine
    • Indexes and stores structured or unstructured data
    • Offers query language to search or aggregate data
  • Logstash
    • process data and store into Elasticsearch
    • Ruby based import description
  • Kibana
    • interactive querying
    • visualization (in dashboards)

Objectives of this talk

  • What is Elasticsearch? How does it relate to a relational DB?
  • How can you import data into Elasticsearch?
  • How can you use Kibana to analyse and visualize your data in Elasticsearch?

Sample project

  • Have complete data of all US domestic flights from 2001
  • Process and store into Elasticsearch using Logstash
  • Visualize data using Kibana
  • Read slices of data to fuel standalone browser application

Elasticsearch

Kopf Plugin

  • Elasticsearch allows for plugins to enhance core functionality
  • Kopf allows for easy administration and introspection
  • Installation
    • Guide
    • sudo bin/plugin install lmenezes/elasticsearch-kopf/2.x
    • open http://localhost:9200/_plugin/kopf

Introspecting Elasticsearch using Kopf

Looking at physical entities

Cluster

nodes grouped under cluster name

There is always one master node

Node

Single Elasticsearch instance

Coordinates access to shards

Index

logical grouping over shards

Shard

part of an index

can be distributed over many nodes for failover or performance

technically a Lucene index

Unhappy state

number of replicas can be set per index

1 for our example

if you have only one node, Elasticsearch is not happy: a replica is never allocated on the same node as its primary, so cluster health stays yellow

Recovering

you need at least two nodes

replicated shards initializing properly
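
The relationship between nodes, replicas and the unhappy state can be sketched in a few lines of Python. This is a deliberately simplified model, not Elasticsearch's actual allocation logic: it only assumes that each copy of a shard (the primary plus its replicas) has to live on a different node. The function name and the shard count in the usage lines are illustrative assumptions.

```python
from typing import Tuple


def cluster_health(nodes: int, primaries: int, replicas: int) -> Tuple[str, int]:
    """Return (health colour, unassigned shard copies) under a simplified model.

    Assumption: every copy of a shard must be placed on a different node.
    """
    copies_per_shard = 1 + replicas
    if nodes < 1:
        # Without any node not even the primaries can be placed.
        return ("red", primaries * copies_per_shard)
    # At most one copy of a given shard fits on each node.
    placeable = min(copies_per_shard, nodes)
    unassigned = primaries * (copies_per_shard - placeable)
    return ("green" if unassigned == 0 else "yellow", unassigned)


# Example values: 5 primary shards, 1 replica each (shard count is an assumption).
one_node = cluster_health(nodes=1, primaries=5, replicas=1)
two_nodes = cluster_health(nodes=2, primaries=5, replicas=1)
```

With one node the five replicas stay unassigned (yellow); adding a second node lets all of them be placed (green).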

Logical structures

Comparing to relational DB

Relational Database   Databases   Tables   Rows        Columns
Elasticsearch         Indices     Types    Documents   Fields

Example


PUT /megacorp/employee/1
{
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25
}
            
  • index: megacorp
  • type: employee
  • document: <json body>
  • field(s): first_name, last_name, age
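
The same request can be built from code. The sketch below uses only the Python standard library and assumes Elasticsearch listens on http://localhost:9200 (the address used elsewhere in this talk); the helper name `build_index_request` is made up for illustration. It only constructs the request; sending it is left as a comment.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_type, doc_id, document):
    """Build a PUT request that stores `document` under /index/type/id."""
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_index_request(
    "http://localhost:9200",  # assumed local node
    "megacorp",
    "employee",
    1,
    {"first_name": "John", "last_name": "Smith", "age": 25},
)
# To actually send it (requires a running Elasticsearch):
# urllib.request.urlopen(request)
```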

Sense - Making low level queries

Looking at our example using REST calls returning JSON

GET expo2009_airline/flight/_search
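
Outside of Sense, the same search is a plain HTTP call. A minimal sketch, assuming a node on http://localhost:9200 and a `match_all` query with a small page size; the helper names are hypothetical, and the sample response used to show `extract_documents` is hand-written, not real output.

```python
import json
import urllib.request


def build_search_request(base_url, index, doc_type, size=10):
    """Build a search request that returns the first `size` documents."""
    url = f"{base_url}/{index}/{doc_type}/_search"
    body = {"query": {"match_all": {}}, "size": size}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def extract_documents(response):
    """Pull the stored JSON documents out of a parsed search response."""
    return [hit["_source"] for hit in response["hits"]["hits"]]


search = build_search_request("http://localhost:9200", "expo2009_airline", "flight", size=5)
# With a running Elasticsearch:
# with urllib.request.urlopen(search) as reply:
#     documents = extract_documents(json.load(reply))

# Hand-written response fragment, for illustration only.
sample = {"hits": {"hits": [{"_source": {"Origin": "LAX", "Dest": "DEN"}}]}}
documents = extract_documents(sample)
```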

Excursion: Can Elasticsearch replace my relational database?

  • Kibana uses Elasticsearch in that way
  • You might not always need full text indices
  • Writing (indexing) takes a lot of time
  • Relational databases are great when you need joins
  • Elasticsearch might lose data
    • network partitions into two intersecting components
    • two nodes failing around the same time
    • even more scenarios
  • What Elastic says
    • Search
    • Analytics

Logstash

How to get data into Elasticsearch

  • you configure all this in a single import file
  • you run logstash using that file
  • e.g. ./bin/logstash -f expo2009_airline.conf
  • alternatives:
    • Pandas
    • Graylog
    • custom code

Looking at the conf file

code/expo2009_airline.conf

Logstash - Input


input {
    file {
        path => "ml/raw_data/expo2009_airline/2001.csv"
        type => "flight"
        start_position => "beginning"
        codec => plain {
            charset => "ISO-8859-1"
        }
    }
}
            

https://www.elastic.co/guide/en/logstash/current/plugins-inputs-file.html

Logstash - Output


output {
    elasticsearch {
        action => "index"
        hosts => "localhost:9200"
        index => "expo2009_airline"
        workers => 1
    }
}
            

https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html

Filtering - CSV


filter {
    csv {
        columns => ["Year","Month","DayofMonth","DayOfWeek",
                "DepTime","CRSDepTime","ArrTime","CRSArrTime",
                "UniqueCarrier","FlightNum","TailNum","ActualElapsedTime",
                "CRSElapsedTime","AirTime","ArrDelay","DepDelay","Origin",
                "Dest","Distance","TaxiIn","TaxiOut","Cancelled",
                "CancellationCode","Diverted","CarrierDelay","WeatherDelay",
                "NASDelay","SecurityDelay","LateAircraftDelay"]
        separator => ","
    }
}
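
What the csv filter does to each line can be approximated in Python: split the line and pair the values with the configured column names. This is only a sketch of the idea, not Logstash's implementation; the column list is shortened and the sample line is made up.

```python
import csv
import io


def parse_csv_line(line, columns, separator=","):
    """Turn one CSV line into a dict keyed by the given column names."""
    values = next(csv.reader(io.StringIO(line), delimiter=separator))
    return dict(zip(columns, values))


# Shortened column list and invented sample line, for illustration only.
columns = ["Year", "Month", "DayofMonth", "DayOfWeek", "DepTime"]
event = parse_csv_line("2001,9,10,1,1112", columns)
```

All values are still strings at this point, which is why a later step converts some of them to numbers and booleans.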
            

Filtering CSV

Filtering - Adding a Timestamp

Having a timestamp (field: @timestamp) makes data especially useful for Elasticsearch


filter {
    mutate {
        add_field => ["timestamp", "%{Year}-%{Month}-%{DayofMonth};%{CRSDepTime}"]
    }
    date {
        match => ["timestamp", "YYYY-MM-dd;HHmm"]
        target => "@timestamp"
    }
}
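
The idea behind this filter, combining the date columns and the scheduled departure time into one timestamp, looks like this in Python. It is a sketch under the assumption that CRSDepTime holds the time as digits in hour-minute order and may lack leading zeros; the function name is hypothetical and time zones are ignored.

```python
from datetime import datetime


def scheduled_departure(event):
    """Combine Year, Month, DayofMonth and CRSDepTime into one datetime."""
    # Pad to four digits so that e.g. "830" is read as 08:30 (assumption).
    hhmm = str(event["CRSDepTime"]).zfill(4)
    return datetime(
        int(event["Year"]),
        int(event["Month"]),
        int(event["DayofMonth"]),
        int(hhmm[:2]),
        int(hhmm[2:]),
    )


# Invented sample event, for illustration only.
timestamp = scheduled_departure(
    {"Year": "2001", "Month": "9", "DayofMonth": "10", "CRSDepTime": "830"}
)
```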
            

Filtering - Add types (optional)

Adding types makes querying faster and gives additional info for queries


mutate {
    convert => {
        "ActualElapsedTime" => "integer"
        "CRSElapsedTime" => "integer"
        "ArrDelay" => "integer"
        "DepDelay" => "integer"
        "AirTime" => "integer"
        "Distance" => "integer"
        "TaxiIn" => "integer"
        "TaxiOut" => "integer"
        "Cancelled" => "boolean"
        "Diverted" => "boolean"
    }
}
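
For comparison, here is the same conversion step as a small Python function. It is a sketch, not a port of Logstash: mapping values that are not numeric to `None` is a choice made here and may differ from what Logstash does, and the sample event is invented.

```python
INTEGER_FIELDS = (
    "ActualElapsedTime", "CRSElapsedTime", "ArrDelay", "DepDelay",
    "AirTime", "Distance", "TaxiIn", "TaxiOut",
)
BOOLEAN_FIELDS = ("Cancelled", "Diverted")
TRUE_VALUES = {"true", "t", "yes", "y", "1"}
FALSE_VALUES = {"false", "f", "no", "n", "0"}


def convert_types(event):
    """Return a copy of `event` with numeric and boolean fields converted."""
    converted = dict(event)
    for field in INTEGER_FIELDS:
        if field in converted:
            try:
                converted[field] = int(converted[field])
            except (TypeError, ValueError):
                # Design choice of this sketch: unparsable values become None.
                converted[field] = None
    for field in BOOLEAN_FIELDS:
        if field in converted:
            value = str(converted[field]).strip().lower()
            if value in TRUE_VALUES:
                converted[field] = True
            elif value in FALSE_VALUES:
                converted[field] = False
    return converted


# Invented sample event, for illustration only.
typed = convert_types(
    {"Distance": "2475", "ArrDelay": "NA", "Cancelled": "0", "Diverted": "1"}
)
```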
            

Kibana

  • generic frontend
  • browser based
  • allows for dashboards
  • also allows arbitrary ad hoc queries

Discover Data

using ad hoc queries

Demo

Flights from LA or Newyark to Denver or Clinton between September 10th 2001, 11:12 and 15:38?
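
A question like this maps to a Lucene-style query string in Kibana's search bar, with the time window chosen in the time picker. The sketch below builds such a string; the field names `Origin` and `Dest` come from the CSV columns, while the airport codes are placeholders chosen for illustration and are not taken from the demo.

```python
def any_of(field, values):
    """Match documents whose `field` equals any of `values`."""
    return f"{field}:({' OR '.join(values)})"


def flights_query(origins, destinations):
    """Flights from any of `origins` to any of `destinations`."""
    return f"{any_of('Origin', origins)} AND {any_of('Dest', destinations)}"


# Placeholder airport codes, for illustration only.
query = flights_query(["LAX", "EWR"], ["DEN"])
```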

Flights Dashboard #1

Clicks trigger requests, responses update graphics

Demo

Adding a Departure Delay Barchart

Benefits

  • works on any size of data
  • with a little practice very easy to do
  • provides widgets out of the box

Drawbacks

  • limited in layout and widgets
  • each update takes time
  • smooth and continuous interaction limited

Flights Dashboard #2

Dumping 400,000 records

All data offline, filtered in browser

Accessing Elasticsearch via REST

Problem: Data calls from a web page to a different origin are blocked by the browser's Same-Origin Policy (SOP)

Option 1: Use a (relay) web server to make calls to Elasticsearch

Option 2: enable Cross-Origin Resource Sharing (CORS) to allow direct access from browser


http.cors.enabled: true
http.cors.allow-origin: '*'
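
Why the browser treats the application and Elasticsearch as different origins can be shown with a small check: two URLs share an origin only if scheme, host and port all match. A minimal sketch; the application URL on port 8080 is a hypothetical example.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return the (scheme, host, port) triple that defines a URL's origin."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def is_same_origin(first_url, second_url):
    """True if both URLs have the same scheme, host and port."""
    return origin(first_url) == origin(second_url)


# Hypothetical page on port 8080 calling Elasticsearch on port 9200.
blocked = not is_same_origin(
    "http://localhost:8080/dashboard.html",
    "http://localhost:9200/expo2009_airline/flight/_search",
)
```

Here the host is the same but the port differs, so the call counts as cross-origin and needs either the relay server or the CORS settings.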
            

Network, Timing, Data Sizes

Wrap-Up

  • Elasticsearch is a search engine based on Lucene
  • interface is REST calls returning JSON
  • works in a cluster for fault tolerance and performance
  • especially good as secondary datastore
  • very good with timelines
  • Logstash can help importing data into Elasticsearch
  • Kibana can be used to explore and visualize your data
  • Kibana allows for generic dashboards
  • you can create your own (visual) applications using data from Elasticsearch

Thank you!

Questions / Discussion
