[번역] Solr vs ElasticSearch Part 4 - Faceting

Jul 17th, 2013

Solr vs ElasticSearch: Part 4 – Faceting

October 30, 2012 by Rafał Kuć

Solr 4 (aka SolrCloud) has just been released, so it’s the perfect time to continue our ElasticSearch vs. Solr series. In the last three parts of the ElasticSearch vs. Solr series we gave a general overview of the two search engines, about data handling, and about their full text search capabilities. In this part we look at how these two engines handle faceting.

Solr 4 (또는 SolrCloud) 가 릴리즈되었습니다. ElasticSearch vs. Solr 시리즈를 계속하기에는 완벽한 시기라고 말할 수 있겠죠. ElasticSearch vs. Solr 시리즈의 이전 3편에서는 두가지 검색 엔진 전체의 개요, 데이터 취급, 전문검색능력에 대해서 이야기했습니다. 이번 파트에서는 두 엔진이 어떻게 Faceting 을 다루는지를 알아봅니다.

Faceting

When it comes to faceting, both Solr and ElasticSearch have some faceting methods that other search engine does not. Both search engines allow you to calculate facets for a given field, numerical range, or date range. The key differences are in the details, of course – in the control of how exactly the facets are calculated, in the memory footprint, and whether we can change the calculation method. In most cases ElasticSearch allows more control over faceting, however Solr has some serious advantages, too. Lets get into details of each of the methods.

faceting 에 대해서는 Solr 와 ElasticSearch 양쪽이 다른 검색엔진이 가지지 않은 몇가지 faceting 메소드를 가집니다. 두 검색엔진 모두가 주어진 필드, 수치범위, 날짜범위의 Faceting 을 계산할 수 있습니다. 주요한 차이는 실제로 Faceting 을 어떻게 계산할 것인가하는 컨트롤, 메모리풋프린트, 그리고 계산방법을 변경할 수 있는가 입니다. 대부분의 경우 ElasticSearch 는 faceting 에 있어서 다채로운 조작이 가능합니다. 그러나 Solr 도 몇가지 중요한 우위점을 가지고 있습니다. 각각의 메소드의 자세한 내용에 들어가 보겠습니다.

Term Faceting

This method of faceting allows one to get information about the number of term occurrences in a certain field.

이 faceting 메소드를 통해 어떤 필드안에 어떤 용어가 몇개인가 하는 정보를 얻을 수 있습니다.

Solr

Solr let’s you control how many facets are returned, how they are sorted, the minimum quantity required, and so on. In addition to that, in Solr field faceting, you can choose between a couple of different methods for computing facets. One of these method should be used for fields with a high number of distinct terms, while the second method is best used in the opposite scenario – when you expect relatively few distinct terms in a field being faceted on.

Solr 는 Facet 이 몇개 반환되는가, 그것들이 어떻게 저장되는 가, 최소필요수등을 컨트롤할 수 있습니다. 그리고 Solr 의 필드 faceting 에서는 Facet 계산법을 두세가지의 다른 방법 중에서 선택할 수 있습니다. 이런 메소드 중 하나가 가장 수가 많은 용어로 사용되어, 두번째 방법은 역 시나리오에 적합합니다. 필드 안에 상대적으로 적은 용어가 Facet 되기를 바라는 경우죠.

ElasticSearch

On the other side we have ElasticSearch which allows us to do all that Solr can do (in terms of faceting calculation, not the calculation methods), but in addition it also let’s us exclude specific terms we are not interested in and use regular expressions to define which terms will be included in faceting results. In addition to that we can combine term faceting results from different field automatically or just use scripts to modify the fields values before the calculation process steps in

반면에 ElasticSearch 에서는 Solr 가 가능한 모든 것이 가능합니다. (Facet 계산에 관해서이며 계산방법은 그렇지 않습니다) 그리고 흥미없는 특정 용어를 배제할 수 있으며, Facet 결과에 포함된 용어의 정의에 정규표현을 사용할 수 있습니다. 또 다른 필드의 용어 faceting 의 결과를 자동적으로 연결하거나, 계산프로세스에 들어가기 전에 필드의 값을 변경하는 스크립트를 사용하거나 할 수 있습니다.

Query Faceting

Both Solr and ElasticSearch allow calculating faceting for arbitrary query results. In both cases queries can be expressed in the query API of the search engine which we use. For example, in ElasticSearch you can use the whole query DSL to calculate faceting results on them.

Solr 와 ElasticSearch 양 쪽 모두 임의의 쿼리의 결과에 대해서 Facet 을 계산할 수 있습니다. 양쪽 모두 쿼리는 검색엔진의 쿼리 API 로 표현할 수 있습니다. 예를들어 ElasticSearch 에서는 쿼리DSL 전체를 그 결과의 Facet 의 산출에 사용할 수 있습니다.

Range Faceting

Range faceting lets you get the number of documents that match the given range in a field. Both engines allow for range faceting although in different fashion.

Range Faceting 은 어떤 필드에 주어진 범위에 일치하는 도큐먼트의 수를 반환합니다. 두 엔진 모두 Range Faceting 이 가능하지만 서로 다른 방법으로 동작합니다.

Solr

Apache Solr lets you define the start value, end value, and the gap (with some adjustments like inclusion of values at the end of the ranges) and calculate all the ranges defined by that.

Apache Solr 는 시작과 끝의 값, 갭(과 몇가지 조정할 것, 예를들어 범위의 끝의 값은 포함되는가하는 것 이외 기타 등등)을 지정하고 그에 의해 정의된 모든 범위를 계산합니다.

ElasticSearch

ElasticSearch takes a different approach – it lets you specify set of ranges and returns document counts as well as aggregated data. In addition to that, ElasticSearch let’s you specify a different field to check if a document falls into a given range and a different field for the aggregated data calculation. Furthermore, you can modify the field and aggregated data with a script. And that’s not all – in addition to the above method of range faceting ElasticSearch also supports the so called histogram calculation. This is similar to the Apache Solr approach – for a given field you can get a histogram of values. However, ElasticSearch doesn’t let you control the start and end like Solr does, but only the gap.

ElasticSearch 는 다른 접근방법을 취합니다. 범위의 집합을 지정해서 도큐먼트의 카운트와 집약된 데이터를 반환합니다. 그리고 ElasticSearch 는 다른 필드를 지정해서, 도큐먼트가 지정된 범위에 해당하는지 체크하는 것 이외에 집약데이터의 계산에 다른 필드를 지정할 수 있습니다. 또한, 필드와 집약데이터를 스크립트로 변경할 수 있습니다. 그것뿐만 아니라 위의 Range Faceting 에 이어 ElasticSearch Histogram 계산이라고 불리는 것도 이용할 수 있습니다. 이는 Apache Solr 의 접근방법과 비슷하게 주어진 필드의 값의 히스토그램을 얻을 수 있습니다. 그러나 ElasticSearch 는 Solr 처럼 시작값과 마지막값을 컨트롤할 수 없으며, 갭만 가능합니다.

Date Faceting

Again, both search engines support faceting on date based fields.

마찬가지로 두 검색엔진은 필드에 기반한 날짜 Faceting 을 지원합니다.

Solr

Date faceting in Apache Solr is quite similar to the range faceting, although it is calculated on fields of solr.DateField type. You have the same control and use similar parameters as withing the range faceting so I’ll omit describing it.

Apache Solr 의 Date Faceting 은 Range Faceting 과 완전히 같지만 solr.DateField 타입의 필드상에서 계산됩니다. Range Faceting 과 같은 컨트롤이 가능하고, 같은 파라메터를 사용할 수 있습니다. 그때문에 여기에서는 자세한 서술은 생략합니다.

ElasticSearch

On the other hand, we have ElasticSearch with its date faceting which is an enhancement over the standard histogram faceting. It supports interval specification with date specific parameters like for example: year, month, or day. In addition to that, ElasticSearch lets you specify the time zone to be used in computation and of course manipulate the calculation with the use of a script.

반면 ElasticSearch 에서 Date Faceting 은 표준 Histogram Faceting 의 확장으로 동작합니다. 날짜를 특정하는 year, month, day 같은 파라메터와 간격을 지정할 수 있습니다. 그리고 ElasticSearch 는 연산에 사용되는 타임존 지정이 가능하며, 물론 스크립트를 이용해서 계산을 컨트롤할 수 도 있습니다.

Decision Tree Faceting – Solr Only

One of the things that ElasticSearch lacks and that is present in Solr is the pivot faceting aka decision tree faceting. It basically lets you calculate facets inside a parents facet. For example, this is what pivot faceting results look like in Solr (n.b. this example is trimmed for this post) :

현시점에서 Solr 에 존재하지만 ElasticSearch 에서는 빠진 것 중 하나가 pivot Faceting 또는 결정트리 Faceting 이라고 불리는 것입니다. 이것은 부모 Faceting 내부에서 Faceting 을 계산할 수 있습니다. 아래가 Solr 에서 Pivot Faceting 이 어떻게 동작하는 지에 관한 예제입니다(이 예제는 이 글을 위해서 일부 삭제하였습니다).

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">1</int>
  <lst name="params">
    <str name="facet">true</str>
    <str name="indent">true</str>
    <str name="facet.pivot">inStock,cat</str>
    <str name="q">*:*</str>
    <str name="rows">0</str>
  </lst>
</lst>
<result name="response" numFound="32" start="0">
</result>
<lst name="facet_counts">
  <lst name="facet_queries"/>
  <lst name="facet_fields"/>
  <lst name="facet_dates"/>
  <lst name="facet_ranges"/>
  <lst name="facet_pivot">
    <arr name="inStock,cat">
      <lst>
        <str name="field">inStock</str>
        <bool name="value">true</bool>
        <int name="count">17</int>
        <arr name="pivot">
          <lst>
            <str name="field">cat</str>
            <str name="value">electronics</str>
            <int name="count">10</int>
          </lst>
          <lst>
            <str name="field">cat</str>
            <str name="value">currency</str>
            <int name="count">4</int>
          </lst>
          .
          .
          .
        </arr>
      </lst>
      <lst>
        <str name="field">inStock</str>
        <bool name="value">false</bool>
        <int name="count">4</int>
        <arr name="pivot">
          <lst>
            <str name="field">cat</str>
            <str name="value">electronics</str>
            <int name="count">4</int>
          </lst>
          <lst>
            <str name="field">cat</str>
            <str name="value">connector</str>
            <int name="count">2</int>
          </lst>
          .
          .
          .
        </arr>
      </lst>
    </arr>
  </lst>
</lst>
</response>

Statistical Faceting

Both ElasticSearch and Apache Solr can compute statistical data on numeric fields – values like count, total, minimal value, maximum value, average, etc. can be computed.

ElasticSearch 와 Apache Solr 양쪽 모두가 통계 데이터를 수치 필드 위에서 계산할 수 있습니다. 예를 들어, 카운트, 총계, 최소값, 최대값, 평균값 등을 계산할 수 있습니다.

Solr

In Apache Solr the functionality that enables you to calculate statistics for a numeric field is called Stats Component. It returns the above mentioned values as a part of the query result, in a separate list, just as faceting results.

Apache Solr 에서는 수치필드에서 통계를 계산하는 기능은 stats 컴포넌트라고 불립니다. 위에서 적은 값을 쿼리의 결과의 일부로 별도의 리스트 안에 Faceting 결과로 반환합니다.

ElasticSearch

In ElasticSearch this functionality is called Statistical Facet. You should keep in mind thought that, as usual, ElasticSearch allows us to calculate this information for values returned by a script or combined for multiple fields, which is very nice if you need combined information for two or more fields or you want to do additional processing before getting the data returned by ElasticSearch.

ElasticSearch 에서는 이 기능은 statistical Faceting 이라고 부릅니다. 기억해두어야 할 것은 ElasticSearch 는 이 정보를 스크립트의 반환값에 대해서 계산하거나, 여러 필드의 결합 시에 계산할 수 있다는 것입니다. 2개 이상의 필드의 결합정보가 필요하거나, ElasticSearch 에서 반환된 값을 얻기 전에 추가적인 처리를 실행하고자 하는 경우에 매우 편리합니다.

Geodistance Faceting

(Geo)spatial search is quite popular nowadays where we try to provide the best search results we can and we considering multiple pieces of information and conditions. Of course both Apache Solr and ElasticSearch provide spatial search capabilities, but we are not talking about searching – we are talking about faceting. Sometimes there is a need to return a distance from a given point, just to show that in our application – and we can do that both in ElasticSearch and Solr.

(Geo)spatial 검색은 요즘 매우 인기가 있습니다. 우리는 가능한한 수치의 정보와 조건에 의한 최고의 검색결과를 얻을 수 있도록 노력하고 있습니다. 물론 Apache Solr 와 ElasticSearch 양쪽 모두 spatial 검색이 가능합니다. 그러나 지금은 검색이야기가 아니라 Faceting 이야기죠. 우리는 어플리케이션에서 때때로 주어진 지점에서 거리를 반환할 필요가 있습니다. ElasticSearch 와 Solr 모두 이런 것이 가능합니다.

Solr

In Solr to be able to facet by distance from a given point we would have to use facet.query parameter and use frange or geofilt, for example like this:

Solr 에서는 주어진 지점에서 거리로 Faceting 할 수 있습니다. facet.query 파라메터를 사용할 필요가 있으며 frange, geofilt 를 아래처럼 사용합니다.

q=*:*&sfield=location&pt=10.10,11.11&facet=true&facet.query={!geofilt d=10 key=d10}

This would return the number of document within 10 kilometers from the defined point.

이것은 정의된 지점에서 10km 이내의 도큐먼트의 수를 반환합니다.

ElasticSearch

ElasticSearch exposes dedicated geo_distance faceting type that lets us pass the point and the array of ranges we want the distance to be calculated for. An example query might look like this:

ElasticSearch 는 geo_distance Faceting 타입이 공개되어 있고, 이것을 사용해서 위치와 계산대상의 거리 범위를 배열로 건넵니다. 쿼리의 예제는 아래와 같습니다.

{
 "query" : {
  "match_all" : {}
 },
 "facets" : {
  "d10" : {
   "geo_distance" : {
    "doc.location" : {
     "lat" : 10.10,
     "lon" : 11.11
    },
    "ranges" : [
     { "to" : 10 }
    ]
   }
  }
 }
}

In addition to that, we can specify the units to be used in distance calculations (kilometers and miles) and the distance calculation type – arc for better precision and plane for faster execution.

그리고 거리의 계산에 사용하는 단위(Km 또는 마일)와 거리의 계산타입을 지정할 수 있습니다. arc 는 보다 나은 정밀도, 그리고 plane 은 계산이 빠릅니다.

Solr, LocalParams and Faceting

One of the good things about faceting in Solr is that it allows the use of local params. For example, you can remove some filters from the faceting results. Imagine you have a query that gets all results for a term ‘flower’ and you only get results that fall into ‘cloth’ category and ‘shirt’ subcategory, but you would like to have faceting for tags field not narrowed to any filter. With the help of local params this query may look like this:

Solr 의 Faceting 에 있어서 좋은 점 중 하나는 로컬파라메터를 사용할 수 있다는 것입니다. 예를들어 Faceting 의 결과에서 몇가지 필터를 삭제할 수 있습니다. flower 라는 용어에 대해서 모든 결과를 얻는 쿼리가 있을 경우 cloth 의 카테고리에서 shirt 의 서브카테고리만의 결과를 얻어야 한다고 합시다. 하지만 tags 필드에 대해서는 어느 필터에서도 조건을 취합하고 싶지는 않습니다. 로컬파라메터의 도움으로 다음과 같은 쿼리를 만들 수 있습니다.

q=flower&fq={!tag=facet_cat}category:cloth&fq={!tag=facet_sub}subcategory:shirt&facet=true&facet.field={!ex=facet_cat,facet_sub}tags

ElasticSearch Faceting Scope and Filters

By default ElasticSearch facets are restricted to the scope of a given query, which is understandable. However, ElasticSearch also lets us change the scope of faceting to global and thus calculate the faceting for the whole data set, and not just for a given result set. In addition to that we can calculate facets for different nested objects by defining the scope matching the name of the nested object. This can come in handy in many situations, for example when optimizing memory usage on faceting on multivalued fields with many unique terms. In addition to that with ElasticSearch we can narrow down the subset of the documents on which faceting will be applied by using filters. We can define filters inside faceting (just please remember that filters that narrow down query results are not restricting faceting) and choose which documents should be taken into consideration when calculating facets. Of course, as you may expect, filters for faceting may be defined in the same way as filters for queries.

기본적으로는 ElasticSearch 의 Faceting 은 주어진 쿼리의 범위에 한정됩니다. 이해할 수는 있지만 ElasticSearch 는 Faceting 의 범위를 글로벌하게 변경가능하고, 주어진 결과집합만이 아닌 모든 데이터셋에 대해서 Faceting 을 계산할 수 있습니다. 그리고 다른 중첩오브젝트에 대해서 Faceting 을 계산할 수도 있습니다. 중첩 오브젝트의 이름에 일치하는 범위를 정의합니다. 이는 대부분의 경우에 편리합니다. 예를 들어 많고 다양한 용어와 여러값을 가진 필드에 대해서 Faceting 을 할 시, 메모리를 최적화해야할 경우입니다. ElasticSearch 에서는 필터를 사용함으로써 Faceting 을 적용할 도큐먼트의 부분집합을 뽑아낼 수 있습니다. Faceting 의 내부에서 필터를 설정가능하고(쿼리 결과를 묶는 필터는 Faceting 만으로 한정되지 않는 것을 기억해주세요) 어느 도큐먼트가 Faceting 을 계산할 시에 고려해야할 것인가를 선택할 수 있습니다. 물론 예상되겠지만, Faceting 에 대한 필터는 쿼리에 대해서 필터처럼 정의할 수 있습니다.

정리

In this part of the Apache Solr vs ElasticSearch posts series we talked about the ability to calculate facet information and only about this. Of course, this is only a look at the surface of faceting, because both Apache Solr and ElasticSearch provide some additional parameters and features that we couldn’t cover without turning this post into a tl;dr monster. However, we hope this post gives you some general ideas about what you can expect from each of these search engines. In the next part of the series we will focus on other search features, such as geospatial search and the administration API. If you are going to the upcoming ApacheCon EU and are interested in hearing more about how ElasticSearch and Apache Solr compare, please come to my talk titled “Battle of the giants: Apache Solr 4.0 vs ElasticSearch“. See you there!

Apache Solr vs. ElasticSearach 시리즈의 이번 파트에서는 Faceting 정보의 계산에 대해서만 이야기했습니다. 물론 이는 Faceting 의 표면만을 다룬 것에 지나지 않습니다. Apache Solr 와 ElasticSearch 양쪽에 좀 더 추가적인 파라메터와 기능이 더 많이 존재하지만 모든 것을 커버하기에는 현재로는 어렵네요. 하지만 이 글이 각 검색엔진에 무엇을 기대할 것인가 하는 일반적인 생각을 얻을 수 있을 것입니다. 다음편에서는 별도의 검색기능인 geospatial 검색과 관리용 API 에 대해서 다룰 예정입니다. 만약 이번 ApacheCon EU 에 오고 ElasticSearch 와 Apache Solr 비교에 대해서 보다 자세히 듣고 싶다면 내 강연(타이틀은 “Battle of the giants : Apache Solr 4.0 vs ElasticSearch”)를 들으러 오세요. 거기서 만나요!

@kucrafal, @sematext

[번역] Solr vs. ElasticSearch - Part 3 : Searching

Jul 16th, 2013

Solr vs ElasticSearch: Part 3 – Searching

October 1, 2012 by Rafał Kuć

In the last two parts of the series we looked at the general architecture and how data can be handled in both Apache Solr 4 (aka SolrCloud) and ElasticSearch and what the language handling capabilities of both enterprise search engines are like. In today’s post we will discuss one of the key parts of any search engine – the ability to match queries to documents and retrieve them.

이 시리즈의 앞의 2회에서 Apache Solr 4 (SolrCloud라고도 불립니다）와 ElasticSearch 의 전체 아키텍쳐와 데이터가 어떻게 다뤄지는 지, 그리고 두 검색엔진의 언어 취급능력이 어떤 지를 알아봤습니다. 이번 글에서는 검색 엔진의 키가 되는 부분 중 하나인 도큐먼트에 대한 쿼리를 매치해서 결과를 받는 능력에 대해서 의논해보도록 합니다.

전체적인 어프로치

Both search engines expose their search APIs via HTTP. If you are not familiar with Solr or ElasticSearch, here are a few simple examples of what Apache Solr and ElasticSearch queries look like:

두 검색엔진 모두 HTTP 경유로 검색 API 를 공개하고 있습니다. Solr 와 ElasticSearch 를 잘 모르는 분을 위해, 여기에서 몇가지 Apache Solr 와 ElasticSearch 의 쿼리가 어떤지를 간단한 예로 보여드립니다.

Solr

curl -XGET 'http://localhost:8983/solr/sematext/select?q=post_date:[2012-09-10T12:00:00Z+TO+2012-09-10T15:00:00Z]'

ElasticSearch

curl -XGET http://localhost:9200/sematext/_search?pretty=true -d '{
    "query" : {
        "range" : {
            "post_date" : {
                "from" : "2012-09-10T12:00:00",
                "to" : "2012-09-10T15:00:00"
            }
        }
    }
}'

As you can see the ElasticSearch query is more structured allowing for more precise control of what you are trying to get – similar to Lucene queries. Solr on the other hand uses a query parser to parse your query out of the textual value of the “q” URL parameter (n.b. you can use query parser in ElasticSearch too). But as one can see on Solr mailing lists, many new users have problems because of such approach – they are overwhelmed with all the options and parameters. At the same time, Solr does make simple queries with boosting and extended dismax parser very easy to do, although that comes at a price. If you want to have a higher degree of control over your query, you are (in most cases) forced to use local params that, while powerful, can be quite hard for users not familiar with its cryptic syntax.

보시는 대로, ElasticSearch의 쿼리는 보다 구조화되어 원하는 결과를 얻기위해 정확한 컨트를을 가능하게 합니다. Lucene 쿼리와 비슷합니다. 반면 Solr 는 쿼리파서를 URL 파라메터의 q 값으로 넣은 텍스트 값을 쿼리로 파싱합니다. (주의: ElasticSearch 에서도 쿼리파서를 이용할 수 있습니다) 하지만, Solr 의 메일링리스트에서 확인할 수 있듯이 많은 유저가 그런 방법으로 인해 문제를 겪고 있습니다. 그들은 모든 옵션과 파라메터에 매우 난감해하고 있습니다. 동시에 Solr 는 간단한 쿼리를 boosting 과 확장된 dismax 파서로 간단하게 실행할 수 있습니다. 하지만 그에 상응하는 부담을 지고 있습니다. 만약 고도의 컨트롤을 쿼리로 할 경우에는 (대부분의 경우) 로컬 파라메터를 사용하지 않으면 안되며, (파워풀하지만) 그 암호같은 문법에 익숙하지 않은 유저에게는 매우 큰일이 아닐 수 없습니다.

To sum up our short introduction – both search engines give you a similar degree of control when it comes to querying, although if you want to create your queries from scratch and control every aspect of them, just as you would when using Lucene directly, ElasticSearch is the way to go, not because Solr doesn’t let you, but the structured JSON way of querying ElasticSearch is a better fit in that case and feels more intuitive.

우리들의 짧은 소개를 정리하면, 두 검색엔진이 같은 정도의 컨트롤을 쿼리로 제공합니다. 그러나 스크래치부터 쿼리를 만들고, Lucene 을 직접 이용하도록 모든 국면을 컨트롤하고 싶은 경우, ElasticSearch 를 선택해야 합니다. Solr 에서는 불가능한 것은 아니지만, ElasticSearch 가 제공하는 구조화된 JSON 이 그런 케이스에 보다 안성맞춤이고 보다 직감적으로 느낄 수 있기 때문입니다.

전문검색

In this section we try to compare search capabilities of both both Apache Solr and ElasticSearch. This is by no means a comprehensive tutorial of all the features that both search engines expose, but rather a simple comparison of similarities and difference of them.

이 섹션에서는 Apache Solr 와 ElasticSearch 모두 검색능력을 비교해 봅니다. 물론 포괄적으로 두 검색엔진이 가지는 모든 기능의 튜토리얼이 아닙니다. 콕 찝어서 말하자면 간단한 유사점과 차이점의 비교가 아닐까 하네요.

검색

Of course, both Apache Solr and ElasticSearch enable you to run standard queries such as Boolean queries, phrase queries, fuzzy queries, wildcard queries, etc. You can combine them into multiple Boolean phrases using Boolean operators. In addition to that, both engine let one specify query-time boosts and control how score is calculated during search execution.

물론 Apahe Solr 와 ElasticSearch 양쪽 모두 표준적인 검색이 가능합니다. 불리언 쿼리, 프레이즈 쿼리, 퍼지 쿼리, 와일드 카드 쿼리 같은 것들 말이죠. 불리언연산자를 사용해서 이것들을 결합해서 여러 불리언 프레이즈에 결합할 수도 있습니다. 물론 두 검색엔진은 쿼리 마다 boost 를 지정하고, 검색실행 중에 어떻게 스코어가 계산되는 지를 컨트롤할 수 있습니다.

Span Queries

If you are not familiar with span queries here is a one-sentence description: Lucene provides span queries in order to enable searching documents with position requirements, but not necessarily appearing one after another like in the phrase query. And now the comparison:

스팬쿼리를 모르는 사람을 위해 간단하게 설명하자면, Lucene 이 제공하는 스팬쿼리는 위치를 필요로하는 도큐먼트 검색을 가능하게 합니다. 그러나 프레이즈쿼리처럼 연속해서 나올 필요는 없습니다. 한번 비교해볼까요?

Solr

Update: As Erik Hatcher noticed support for span queries is already there in Apache Solr (SOLR-2703). We can use span queries by using the surround query parser.

업데이트： Erik Hatcher에 의해서 스팬쿼리는 이미 Apache Solr 에 들어갔습니다.(SOLR-2703) 쿼리파서로 감싸서 스팬쿼리를 사용할 수 있습니다.

ElasticSearch

ElasticSearch has the support for Lucene SpanNearQuery, SpanFirstQuery, SpanTermQuery, SpanOrQuery and SpanNotQuery. With these queries you can construct different span queries similar to what you can do with Lucene.

ElasticSearch는 Lucene 의 SpanNearQuery, SpanFristQuery, SpanTermQuery, SpanOrQuery, 그리고 SpanNotQuery를 지원합니다. 이 쿼리들을 사용해서 Lucene 에서 이뤄지는 다양한 스팬쿼리를 구성할 수 있습니다.

More Like This

“More like this” (aka MLT) functionality lets you to get documents similar to a given query according to some assumptions and parameters used to find documents that are similar to one another. Both search engines have the ability to run MLT queries. In Solr, MLT query is implemented as a search component. On the other hand there is ElasticSearch where more like this is just another type of query one can construct using JSON. When comparing parameters available in both search servers it seems that ElasticSearch provides slightly more control over more like this functionality with features like specifying a set of words that shouldn’t be taken into consideration and the percentage of terms to match on.

“More like this”(또는 MLT) 란, 쿼리에서 부여된 도큐먼트에 몇가지 파라메터를 이용해 비슷한 도큐먼트를 얻을 수 있는 기능입니다. 차이가 있는 도큐먼트를 찾기 위해서 파라메터가 사용됩니다. 두 검색엔진 모두 MLT 쿼리를 사용할 수 있습니다. Solr 에서는 MLT 쿼리는 검색 컴포넌트로 구현되어 있습니다. 반면, ElasticSearch 에서는 JSON 을 사용해서 구축하는 한 종류의 쿼리가 됩니다. 두 검색 서버에 존재하는 파라메터를 비교하면 ElasticSearch 는 조금 더 많은 컨트롤을 제공합니다. 예를들어 고려하지 않은 단어의 집합을 지정하거나, 매치하는 항목의 분할을 지정할 수 있습니다.

Did You Mean

“Did you mean” (aka DYM) functionality makes it possible to correct users’ query typos and spelling mistakes and suggest corrected queries. For example, for a misspelled phrase “saerch problems” our Researcher module on http://search-lucene.com (which is a kind of a did you mean module) works like this:

“Did you mean”（또는 DYM） 이란 유저의 쿼리의 타이핑오류나 스펠링오류를 고치고 정정된 쿼리를 제안하는 기능입니다. 예를들어 “saerch problems” 라는 잘못된 스펠 구문에서 http://search-lucene.com 에 존재하는 우리 연구자 모듈(Did you mean 모듈과 같은 것입니다) 은 이런 동작을 하게 됩니다.

Let’s see what Solr and ElasticSearch have to offer here.

Solr 와 ElasticSearch 에 어떤 기능이 제공되는지 확인해봅시다.

Solr

Solr exposes spell check component API, which is built on top of Lucene spell checker module. Before Solr 4.0 the spell checker required its own index that, while built automatically by Solr, was another moving piece and potential inconvenience. Now there is a DirectSolrSpellchecker implementation available which can give spell checker suggestion based on the index you are using for search instead of relying on the side-car spell checker index. Solr spell checker supports distributed search and has numerous parameters which allow control over its behavior, like number of suggestion, collation properties, accuracy, etc.

Solr 는 스펠링체크 컴포넌트 API 를 제공합니다. 이는 Lucene 의 스펠체커 모듈 위에 구축되어 있습니다. Solr 4.0 이전에는 스펠체커 자체가 인덱스를 필요로 하고, Solr 에 의해 자동적으로 구축되었지만, 별도로 동작하는 부품으로는 조금 불편한 것이었습니다. 지금은 DirectSolrSpellchecker 구현이 존재하며, 검색에 사용하고 있는 인덱스에 기초하여 스펠체커가 제안할 수 있습니다. 스펠체커용의 인덱스를 필요로하지 않습니다. Solr 의 스펠체커는 분산검색을 지원하고, 많은 파라메터를 가진 것으로 그 동작을 컨트롤할 수 있습니다. 예를들어 제안 수나 조합 속성이나 정확성등입니다.

ElasticSearch

Unfortunately, ElasticSearch doesn’t offer did you mean functionality out of the box. There is issue #911 currently open, so we can expect that module in one of the future releases. Although we’ll be talking about plug-ins in the last part of the Solr vs ElasticSearch series, if you need did you mean functionality in ElasticSearch you can use the Suggest Plugin developed by @spinscale (https://github.com/spinscale/elasticsearch-suggest-plugin).

아쉽게도 ElasticSearch는 “Did you mean” 기능을 그대로 제공하지 않습니다. Issue#911이 현재 오픈되어 있고, 앞으로의 릴리즈에서 그 모듈을 기대할 수 있겠습니다. Solr ElasticSearch 의 마지막 파트에서 플러그인에 대해서 다룰 예정이지만, 만약 “Did you mean” 기능을 ElasticSearch 에서 필요로 한다면 @spinscale 이 개발한 Suggest 플러그인을 이용할 수 있습니다.(https://github.com/spinscale/elasticsearch-suggest-plugin).

Nested Queries

As we already wrote, ElasticSearch supports indexing of nested document which Solr doesn’t support. In order to query nested documents ElasticSearch exposes nested query type. This query is run against nested documents, but as the result we get the root documents. In addition to that, you can also set how scoring of the root document is affected.

이미 다룬 적이 있지만, ElasticSearch 는 중첩 도큐먼트의 인덱싱을 지원합니다. Solr 는 지원하지 않습니다. 중첩도큐먼트를 쿼리하기 위해서는 ElasticSearch 는 중첩쿼리타입을 공개하고 있습니다. 이 쿼리는 중첩도큐먼트에 대해서 실행되지만 결과로 루트 도큐먼트를 얻게 됩니다. 그리고 루트도큐먼트의 스코어링에 어떻게 영향받는지도 설정할 수 있습니다.

Parent – Child Relationship Queries

Solr

In Apache Solr there is no functionality called parent – child, instead of that we have the possibility to use joins. Solr joins are specified in local params format and look like this:

Apache Solr 에서는 부모자식이라고 불리는 기능은 없습니다. 대신에 join 을 사용하는 방법이 있습니다. Solr 의 join 은 로컬파라메터 형식으로 지정되어 아래처럼 사용할 수 있습니다.

q={!join from=parent to=id}color:Yellow

The above query says that we want to get all parent documents that have child documents that have the Yellow term in the color field. The join should be done on parent field in the children to the id field in the parent document.

위 쿼리에서 color 필드에 Yellow 를 가진 자식 도큐먼트를 가진 모든 부모 도큐먼트를 얻습니다. join 은 자식 도큐먼트의 parent 필드에서 부모 도큐먼트의 id필드상에서 이뤄집니다.

ElasticSearch

ElasticSearch lets you use two type of queries – has_children and top_children queries to operate on child documents. The first query accepts a query expressed in ElasticSearch Query DSL as well as the child type and it results in all parent documents that have children matching the given query. The second type of query is run against a set number of children documents and then they are aggregated into parent documents. We are also allowed to choose score calculation for the second query type.

ElasticSearch는 두가지 타입의 쿼리를 사용할 수 있습니다. has_children 과 top_children 쿼리는 자식 도큐먼트 위에서 명령됩니다. 최초의 쿼리는 ElasticSearch Query DSL 또는 child타입으로 표현된 쿼리를 받아서, 부여된 쿼리에 매치하는 도큐먼트를 가지는 모든 부모 도큐먼트를 결과로 합니다. 두번째 타입의 쿼리는 몇가지 자식 도큐먼트의 집합에 대해서 실행되고 그것들은 부모 도큐먼트 안에서 집약됩니다. 두번째의 쿼리타입에 대해서도 스코어 연산을 선택할 수 있습니다.

Filtering And Caching Control

Solr

Of course Solr lets you to narrow results of your query execution with filters. You can filter documents based on a single value, Boolean expression, query, field existence, geographical location and many, many more. In addition to that you can use local params and construct complicated queries like:

물론 Solr 는 쿼리의 실행결과를 필터로 제한할 수 있습니다. 단일 값, 불리언식, 필드의 존재, 지리상의 위치, 그리고 가장 많은 것을 기준으로 해서 도큐먼트를 필터링할 수 있습니다. 그리고 로컬파라메터를 사용해서 복잡한 쿼리를 구성할 수 있습니다.

예:

fq={!frange l=10 u=30}if(exists(promotionPrice),sum(promotionPrice,dailyPrice),sum(price,dailyPrice))

ElasticSearch

ElasticSearch, similar to Solr, lets you use many filter types, which are similar to filters, so we’ll skip mentioning them all. However, in addition to similarities with Solr, there are also some differences like supports for filters run against nested documents and children documents. ElasticSearch can also use scripts to filter documents with the script filter.

ElasticSearch는 Solr 와 비슷하게 많은 필터타입을 사용할 수 있습니다. 필터와 마찬가지이기에 따로 말할 필요도 없습니다. 그러나 Solr 에 대한 유사성 이외에 몇가지 차이점이 존재합니다. 예를들어 중첩 도큐먼트나 자식 도큐먼트에 대해 실행하는 필터를 지원하는 것입니다. 또한 ElasticSearch 는 스크립트 필터를 사용해서 도큐먼트의 필터링에 스크립트를 사용할 수도 있습니다.

Filter Cache Control

Both ElasticSearch and Apache Solr can control if the filter should or shouldn’t be cached, but in addition to that Solr lets you control the order of filters execution (the non cached ones). Its a great feature of Solr, because if you know that one of your filters is a performance killer, you can set its execution after all other filters and that way it’ll only work on the subset of the original result set.

ElasticSearch 와 Apache Solr 양쪽 모두 필터가 캐쉬되어야하는지 그렇지 않는 지를 설정할 수 있습니다. Solr 는 필터의 실행순서를 조정할 수 있습니다(캐쉬되지 않은 것만). 이는 Solr 의 뛰어난 기능으로 만약 당신이 필터 하나가 퍼포먼스에 나쁜 영향을 주는 것을 이미 알고 있다면 그 필터의 실행의 순서를 뒤로 미루도록 설정할 수 있습니다. 그렇게 함으로써 그 필터는 오리지널의 실행결과의 부분집합만을 대상으로 실행됩니다.

Score Calculation Control

In both engines we are more or less allowed to control how scores for documents are calculated. In Solr this is mostly done by using function queries and different boosts and queries made using local params. In ElasticSearch we can use different query types which allow us to give specific scores to some of the documents (for example ones matching a certain filter) or calculate score on the basis of used script.

두 엔진에 있어서 많든 적든 도큐먼트에 대해서 스코어가 어떻게 계산할 것인가를 컨트롤할 수 있습니다. Solr 에서는 대부분의 경우에 있어서 Function Queries 가 복수의 boost 를 사용해서 이뤄지고, 쿼리는 로컬파라메터를 사용해서 만들어집니다. ElasticSearch 에서는 다른 쿼리타입을 사용해서 어떤 도큐먼트에 대해서 특정 스코어를 줄 수도 있습니다(예를들어 몇가지 필터에 매치하는 도큐먼트). 또는 사용한 스크립트에 따라 스코어를 계산할 수도 있습니다.

Real Time Get

Real time get allows us to retrieve a document using its identifier as soon as it was sent for indexing even if it hasn’t yet been hard committed. Both ElasticSearch and Apache Solr return the newest document, even if it wasn’t indexed. But lets go into specifics.

Real Time Get 은 도큐먼트 ID 를 사용해서 인덱싱을 위해서 보내졌을 때, 설령 하드커밋이 이뤄지지 않았어도 도큐먼트 회수를 가능하게 합니다. ElasticSearch 와 Apache Solr 모두 최신 도큐먼트를 인덱스되지 않아도 반환 할 수 있습니다. 이에 대해서 좀 더 자세히 볼까요.

Solr

Introduction of so called transaction log in Solr 4.0 allowed for the real time get functionality. Basically, the real time get looks for the newest version of the document in the transaction log first and returns it as a result of such call (if it is found, of course). If it is not found the real time get handler gets the document using the latest opened searcher available. Keep in mind that in order to return the newest version of the document in near real time manner Solr doesn’t need to reopen the index after indexing, so this functionality is useful even if you don’t reopen your searcher every second.

Solr4.0에서 트랜잭션 로그라고 불리는 기능으로 Real Time Get 을 이용할 수 있게 되었습니다. 기본적으로는 Real Time Get 은 도큐먼트의 최신 버젼을 트랜잭션 로그에서 맨처음 찾아서 결과로 반환합니다(물론 찾은 경우에). 혹시라도 못 찾았다면 Real Time Get 핸들러는 가장 최근에 열린 searcher 가 존재하는 도큐먼트를 반환합니다. 기억해두었으면 하는 것은 거의 리얼타임 의 최신 버젼의 도큐먼트를 반환하기 위해 Solr 는 인덱싱 뒤에 인덱스를 다시 열 필요는 없습니다. 따라서 이 기능은 searcher 를 매초 다시 열지 않아도 유효합니다.

ElasticSearch

ElasticSearch also uses transaction log and because of that the real time get is not affected by the refresh rate of your indices. In addition to returning the document itself ElasticSearch exposes a few other API parameter that allow you to specify if the request should go to the primary or local shard (or even a custom one). You can also use routing with real time get to route the request to one specific shard if you know which shard should have the appropriate document. The real time get API of ElasticSearch also allows to check if the document exists using HTTP HEAD method, for example:

ElasticSearch 도 트랜잭션로그를 사용하기 때문에 Real Time Get 은 인덱스의 리프레쉬율에 영향을 받지 않습니다. 도큐먼트를 반환하기 윙해 ElasticSearch 는 몇가지 다른 API 파라메터를 공개하고 그것을 사용 함으로써 질의가 프라이머리인지 또는 로컬의 Shard 로 (또는 커스텀 Shard 에만) 보낼지에 대한 것을 지정할 수 있습니다. 또 라우팅을 Real Time Get 과 함께 사용함으로 어느 shard 가 적절하게 도큐먼트를 가지고 있는지 알고 있는 경우에 질의를 하나의 특정 shard 로 유도할 수 있습니다. ElasticSearch 의 Real Time Get API 는 도큐먼트가 존재하는지 어떤지를 HTTP HEAD 메소드로 확인할 수 있습니다.

curl -XHEAD 'http://localhost:9200/sematext/blog/123456789'

Aliasing

One of the things introduced in Apache Solr 4.0 and no available in ElasticSearch right now is the ability to transform result documents. First of all Solr allows you to alias returned fields, so for example you can return field price_usd or price_eur as price depending on your needs. The second thing is the ability to return values returned by functions as a (pseudo) field in the result (or fields). Solr also has the ability to return fields which start with a given prefix (for example all fields starting with price). Apart from the ability to get a function value as a field added to matched documents on the fly other functionalities are not ground breaking, though they can be handy in some cases.

Apache Solr 4.0 에서 소개된 ElasticSearch 에 현시점에서 아직 존재하지 않는 기능으로 결과의 도큐먼트 변환이 있습니다. 맨처음 Solr 는 반환값의 필드에 별명을 붙일 수 있습니다. 그 때문에 필요에 따라서 price 필드로 price_usd 나 price_eur 을 반환할 수 있습니다. 두번째로 함수의 반환값을 (겉보기의) 필드로써 결과(또는 필드)로 반환할 수 있습니다. Solr 는 부여된 접두어로 시작하는 필드(예를들어 price 로 시작하는 모든 필드)를 반환하는 기능도 있습니다.

그 외

One of the things we always mention when talking about the differences between Apache Solr and ElasticSearch, at least when it comes to query handling, is the possibility of specifying the analyzer during query time. But lets start from the beginning. In Solr, you have to create the schema.xml file which holds the information about the index structure as well as query and index-time analyzers for fields. Similarly, in ElasticSearch, you can create mappings and define analyzers. At query-time Solr will choose the right analyzer for each field and use it. ElasticSearch will do the same with one major difference. In ElasticSearch you can change the analyzer and specify the analyzer you want to use for analysis at query-time. For example, this is very useful when you know the language of the query because then you can choose the most language-appropriate analyzer on the fly. We have made use of this in combination with our Language Identifier.

Apache Solr 와 ElasticSearch 의 차이에 대해서 우리들이 항상 전달하고 싶은 것중 하나는, 적어도 쿼리이 취급에 관해서는 쿼리하는 도중에 해석기를 지정할 가능성입니다. Solr 에서는 인덱스구조, 거기에 쿼리와 필드의 인덱싱의 해석기에 관한 정보를 가지는 schema.xml 파일을 만들 필요가 있습니다. ElasticSearch 에서는 맵핑을 작성해서 해석기를 정의합니다. 쿼리시에 Solr 는 각 필드에 대해서 적절한 해석기를 선택해서 사용합니다. ElasticSearch 도 같은 일을 하지만 큰 차이가 하나 있습니다. ElasticSearch 에서는 해석기 변경과 해석에 필요한 해석기의 지정이 쿼리시에 이뤄집니다. 예를 들어 쿼리대상의 언어를 알고 있는 경우 매우 편리하겠죠. 언어에 가장 적절한 해석기를 동적으로 선택할 수 있기 때문입니다. 우리들은 이 기능을 직접 만든 언어특정기와 함께 사용하고 있습니다.。

정리

As you can see, both ElasticSearch and Apache Solr expose lots of functionality when it comes to handling your search queries, and we barely scratched the surface here. Of course, each of them has some features that the other one doesn’t have, but Solr and ElasticSearch are competing for mind and market share, and are both rapidly evolving and improving, so we can expect more features from both of them in the future. In the next, fourth part of the series we will concentrate on the faceting capabilities of Apache Solr and ElasticSearch. Stay tuned. In the mean time, you can follow @sematext and tell us what you want us to cover.

보시는 대로, ElasticSearch 와 Apache Solr 모두 많은 기능을 공개하고 있습니다. 그리고 이 글에서는 겉을 살짝 핥은 것에 지나지 않습니다. 물론 각각 다른 쪽이 가지지 않은 기능을 가지고 있고, Solr 와 ElasticSearch 는 사고방식과 시장 마켓셰어 에서 경쟁을 하고 있습니다. 그리고 양쪽모두 급속하게 진화하고 있지요. 따라서 앞으로 더욱 많은 기능을 기대할 수 있을 것입니다.

다음편에서는 Apache Solr 와 ElasticSearch 의 Faceting 기능에 대해서 다루도록 합니다. 이어서 계속 지켜봐주세요. 그때까지는 @sematext 를 팔로해서 무엇을 다뤄주었으면 좋은 지 알려주세요.

[번역] Solr vs. ElasticSearch: Part 2 - Indexing and Language Handling

Jul 16th, 2013

Solr vs. ElasticSearch: Part 2 – Indexing and Language Handling

September 4, 2012 by Rafał Kuć

In the previous part of Solr vs. ElasticSearch series we talked about general architecture of these two great search engines based on Apache Lucene. Today, we will look at their ability to handle your data and perform indexing and language analysis.

전 편의 “Solr vs. ElasticSearch” Part1 기사에서 Apache Lucene 을 기반으로 하는 두가지 검색엔진의 전반적인 아키텍쳐에 대해서 이야기 했습니다. 오늘은 데이터의 취급과 인덱싱, 그리고 언어의 해석능력을 알아보도록 합니다.

데이터 인덱싱

Apart from using Java API exposed both by ElasticSearch and Apache Solr, you can index data using an HTTP call. To index data in ElasticSearch you need to prepare your data in JSON format. Solr also allows that, but in addition to that, it lets you to use other formats like the default XML or CSV. Importantly, indexing data in different formats has different performance characteristics, but that comes with some limitations. For example, indexing documents in CSV format is considered to be the fastest, but you can’t use field value boosting while using that format. Of course, one will usually use some kind of a library or Java API to index data as one doesn’t typically store data in a way that allows indexing of data straight into the search engine (at least in most cases that’s true).

ElasticSearch와 Apache Solr 모두 공개되어 있는 Java API 가 아닌 HTTP 호출을 사용해서 데이터 인덱싱을 할 수 있습니다. ElasticSearch 에서 데이터를 인덱싱하기 위해서는 데이터를 JSON 포맷으로 준비할 필요가 있습니다. Solr 도 가능하지만, 그에 덧붙여 기본적으로는 XML 이나 CSV 등 다른 형식을 이용할 수 있습니다. 중요한 것은 다른 형식의 데이터를 인덱싱 하는 경우 다른 퍼포먼스 특성이 존재하며 이에 대해서는 몇 가지 제약이 따릅니다. 예를 들어 CSV 형식의 도큐먼트를 인덱싱하는 경우가 가장 빠르다고 볼 수 있는 데, 이 경우 필드값의 boosting 을 이용할 수 없습니다. 물론 일반적으로는 어떠한 종류의 라이브러리나 Java API 를 사용해서 데이터를 인덱싱할 것입니다. 일반적으로는 검색 엔진에 직접 들어가는 데이터의 인덱스를 만들면서 데이터를 저장하는 일은 없습니다. (적어도 대부분의 경우 그렇습니다)

ElasticSearch에 대해서 보충설명

It is worth noting that ElasticSearch supports two additional things, that Solr does not – nested documents and multiple document types inside a single index.

Solr 가 지원하지 않는 ElasticSearch 의 두가지 추가기능에 대해서 다뤄볼 가치가 있는 데 그것은 중첩 도큐먼트와 단일 도큐먼 안에서의 복수도큐먼트 타입니다.

The nested documents functionality lets you create more than a flat document structure. For example, imagine you index documents that are bound to some group of users. In addition to document contents, you would like to store which users can access that document. And this is were we run into a little problem – this data changes over time. If you were to store document content and users inside a single index document, you would have to reindex the whole document every time the list of users who can access it changes in any way. Luckily, with ElasticSearch you don’t have to do that – you can use nested document types and then use appropriate queries for matching. In this example, a nested document would hold a lists of users with document access rights. Internally, nested documents are indexed as separate index documents stored inside the same index. ElasticSearch ensures they are indexed in a way that allows it to use fast join operations to get them. In addition to that, these documents are not shown when using standard queries and you have to use nested query to get them, a very handy feature.

중첩 도큐먼트 기능은 평탄한 도큐먼트 구조를 넘은 것을 만들 수 있습니다. 이를테면 어떤 유저 그룹에 연결된 도큐먼트를 인덱싱한다고 생각해봅시다. 도큐먼트의 내용 뿐만 아니라 어느 유저가 그 문서에 접근할 수 있는 지를 넣고 싶다고 해둡니다. 만약 단일 인스턴스의 도큐먼트 안에서 도큐먼트의 내용과 유저를 넣을 경우, 그 도큐먼트에 접근할 수 있는 유저의 리스트를 변경할 때마다 도큐먼트 전체를 다시 인덱싱해야 합니다. 다행히도 ElasticSearch 를 사용하면 그런 일은 없습니다. 중첩 도큐먼트 타입을 사용해서 적절한 쿼리를 매칭에 사용할 수 있습니다. 이 예제에서는 중첩 도큐먼트는 유저 리스트와 도큐먼트의 접근권한을 가집니다. 내부적으로는 중첩 도큐먼트는 같은 인덱스 안에서 놓여진 분리된 인덱스 문서로 색인됩니다.

ElasticSearch는 빠른 JOIN 명령을 사용해서 인덱싱할 수 있습니다. 덧붙여 이 도큐먼트들은 표준 쿼리를 사용한 경우는 보이지 않으며, 중첩 쿼리를 얻어내기 위해서는 사용할 수 있어 매우 편리한 기능입니다.

Multiple types of documents per index allow just what the name says – you can index different types of documents inside the same index. This is not possible with Solr, as you have only one schema in Solr per core. In ElasticSearch you can filter, query, or facet on document types. You can make queries against all document types or just choose a single document type (both with Java API and REST).

인덱스마다 복수타입의 도큐먼트는 이름 그대로의 행위를 할 수 있습니다.. 다른 타입의 도큐먼트를 같은 인덱스 안에서 인덱싱할 수 있는데, 이것은 Solr 에서는 코어당 하나의 스키마만을 가질 수 있기 때문에 불가능합니다.

ElasticSearch에서는 도큐먼트타입에 의해 필터, 쿼리, 퍼셋(Facet) 을 할 수 있습니다. 모든 도큐먼트 타입에 대해서, 또는 하나의 도큐먼트 타입을 선택해서 쿼리를 날릴 수 있습니다. (Java API 와 REST 양쪽 모두 가능합니다)

인덱스 다루기

Let’s look at the ability to manage your indices/collections using the HTTP API of both Apache Solr and ElasticSearch.

Apache Solr 와 ElasticSearch 에서의 인덱스/콜렉션을 HTTP API 를 사용해서 관리하는 방법을 보도록 합니다.

Solr

Solr let’s you control all cores that live inside your cluster with the CoreAdmin API – you can create cores, rename, reload, or even merge them into another core. In addition to the CoreAdmin API Solr enables you to use the collections API to create, delete or reload a collection. The collections API uses CoreAdmin API under the hood, but it’s a simpler way to control your collections. Remember that you need to have your configuration pushed into ZooKeeper ensemble in order to create a collection with a new configuration.

Solr 는 운용중인 클래스 안에서 모든 코어를 CoreAdmin API 로 관리할 수 있습니다. 코어 작성, 이름 변경, 리로드, 또는 여러 코어들을 하나의 코어로 머지할 수 있습니다. 또한 Solr 는 Collections API 를 사용해서 콜렉션의 작성, 삭제, 그리고 리로드를 할 수 있습니다. Collections API 는 뒷단에서 CoreAdmin API 를 사용하며 간단하게 콜렉션을 다룰 수 있습니다. 새로운 설정의 콜렉션을 작성하기 위해서는 ZooKeeper ensemble 에 설정을 넣어 둘 필요가 있습니다.

When it comes to Solr, there is additional functionality that is in early stages of work, although it’s functional – the ability to split your shards. After applying the patch available in SOLR-3755 you can use a SPLIT action to split your index and write it to two separate cores. If you look at the mentioned JIRA issue, you’ll see that once this is commited Solr will have the ability not only to create new replicas, but also to dynamically re-shard the indices. This is huge!

Solr 에는 아직 미성숙하지만 유효한 추가기능이 있습니다. 바로 Shard 의 분할입니다. SOLR-3755에 있는 패치를 적용한 다음에 인덱스를 분할해서 두개로 분할한 코어에 넣는 것이 SPLIT 액션을 사용해서 가능해집니다. 그 JIRA 이슈를 읽으면 이것이 커밋되면 Solr 는 새로운 레플리카를 작서아는 것뿐만 아니라, 동적으로 인덱스를 re-shard 하는 기능을 가지게 됩니다. 이것은 매우 큰 이점입니다!

ElasticSearch

One of the great things in ElasticSearch is the ability to control your indices using HTTP API. We will take about it extensively in the last part of the series, but I have to mention it ere, too. In ElasticSearch you can create indices on the live cluster and delete them. During creation you can specify the number of shards an index should have and you can decrease and increase the number of replicas without anything more than a single API call. You cannot change the number of shards yet. Of course, you can also define mappings and analyzers during index creation, so you have all the control you need to index a new type of data into you cluster.

ElasticSearch의 굉장한 기능 중 하나로 HTTP API를 사용해서 인덱스를 다루는 능력이 있습니다. 이 시리즈의 마지막 파트에서 이에 대해서 광범위하게 다룰 예정이지만, 일단 여기에서도 간단히 다룰 필요가 있을 것 같습니다. ElasticSearch에서는 운용중의 클러스터 위에서 인덱싱을 할 수 있고, 삭제할 수 있습니다. 인덱싱 하는 동안, 인덱스를 가져야할 Shard 의 수를 지정할 수 있으며, 단일 API 호출만으로 레플리카 수의 증감을 조정할 수 있습니다. Shard 수는 아직 변경할 수 없습니다. 물론 맵핑과 해석기를 인덱싱 시에 설정가능하고, 클러스터 안에서 새로운 형태의 데이터 인덱싱에 필요한 모든 컨트롤을 가지게 됩니다.

도큐먼트의 부분갱신

Both search engines support partial document update. This is not the true partial document update that everyone has been after for years – this is really just normal document reindexing, but performed on the search engine side, so it feels like a real update.

두 검색엔진 모두 도큐먼트의 부분갱신을 지원합니다. 이것은 모두가 기대한 그런 도큐먼트 부분갱신이 아닙니다. 단지 일반적인 도큐먼트 의 재인덱싱일 뿐입니다. 그러나 검색엔진에서 실행되기에 진짜 갱신처럼 보이기도 합니다.

Solr

Let’s start from the requirements – because this functionality reconstructs the document on the server side you need to have your fields set as stored and you have to have the version field available in your index structure. Then you can update a document with a simple API call, for example:

요건부터 시작하겠습니다. 이 기능은 서버측의 도큐먼트를 재구성하기에 필드의 집합이 격납되었고 version필드가 인덱스 구성에서 존재하지 않으면 안됩니다. 이렇게 하면 단일 API 호출에 의해 도큐먼트를 갱신할 수 있습니다.

예：

curl 'localhost:8983/solr/update' -H 'Content-type:application/json' -d '[{"id":"1","price":{"set":100}}]'

ElasticSearch

In case of ElasticSearch you need to have the source field enabled for the partial update functionality to work. This source is a special ElasticSearch field that stores the original JSON document. Theis functionality doesn’t have add/set/delete command, but instead lets you use script to modify a document. For example, the following command updates the same document that we updated with the above Solr request:

ElasticSearch 의 경우, 부분갱신 기능이 제대로 동작하기 위해서는 source 필드가 허가되어 있을 필요가 있습니다. 이 source는 특별한 ElasticSearch 필드에서 오리지널 JSON 도큐먼트를 저장합니다. 이 기능에는 add/set/delete 명령은 존재하지 않습니다. 그 대신에 도큐먼트를 변경하는 스크립트의 사용을 허가합니다. 예를 들어 다음의 명령어를 위의 Solr 리퀘스트에서 갱신한 것과 같은 문서를 갱신한다고 합니다.

curl -XPOST 'localhost:9200/sematext/doc/1/_update'-d '{
    "script" : "ctx._source.price = price",
    "params" : {
        "price" : 100
    }
}'

다국어데이터의 취급

As we mentioned previously, and as you probably know, both ElasticSearch and Solr use Apache Lucene to index and search data. But, of course, each search engine has its own Java implementation that interacts with Lucene. This is also the case when it comes to language handling. Apache Solr 4.0 beta has the advantage over ElasticSearch because it can handle more languages out of the box. For example, my native language Polish is supported by Solr out of the box (with two different filters for stemming), but not by ElasticSearch. On the other hand, there are many plugins for ElasticSearch that enable support for languages not supported by default, though still not as many as we can find supported in Solr out of the box. It’s also worth mentioning there are commercial analyzers that plug into Solr (and Lucene), but none that we are aware of work with ElasticSearch…. yet.

이전 언급한 대로, ElasticSearch와 Solr 는 Apache Lucene 을 인덱싱 및 데이터의 검색에 이용하고 있습니다. 각 검색엔진은 그 자체가 Java 구현을 가지며 Lucene 과 상호작용합니다. 이것이 여러 나라의 언어취급에 있어서도 마찬가지 입니다. 이 부분에 있어서는 Apache Solr 4.0β가 ElasticSearch 에 대해서 우위입니다. Solr 그 자체로도 많은 언어를 다룰 수 있기 때문입니다. 예를 들어 저자의 네이티브 언어인 폴란드어는 Solr 에서는 처음부터 지원되었습니다. (두가지 다른 stemming용 필터를 포함한 채) 그러나 ElasticSearch는 그렇지 않습니다. ElasticSearch에서는 기본적으로는 지원하지 않는 언어를 지원하는 플러그인이 다수 존재합니다. 그래도 그 수는 Solr 가 처음부터 지원하고 있는 수만큼은 아닙니다. 또 하나 알아둘 가치가 있는 것은 Solr(와 Lucene)에는 상업해석기가 있습니다. 그러나 ElasticSearch 용은 아직 있는 지 모르겠습니다.

지원되는 자연언어

For the full list of languages supported by those two search engine please refer to the following pages:

두 검색엔진에서 지원되는 언어의 완전한 리스트에 대해서는 다음 버젼을 참고해주세요.

Apache Solr

<a href="http://wiki.apache.org/solr/LanguageAnalysis" >http://wiki.apache.org/solr/LanguageAnalysis</a>

ElasticSearch

Analyzers: http://www.elasticsearch.org/guide/reference/index-modules/analysis/lang-analyzer.html Stemming: http://www.elasticsearch.org/guide/reference/index-modules/analysis/stemmer-tokenfilter.html, http://www.elasticsearch.org/guide/reference/index-modules/analysis/snowball-tokenfilter.html 그리고 http://www.elasticsearch.org/guide/reference/index-modules/analysis/kstem-tokenfilter.html

해석체인의 정의

Of course, both Apache Solr and ElasticSearch allow you to define a custom analysis chain by specifying your own analyzer/tokenizer and list of filters that should be used to process your data. However, the difference between ElasticSearch and Solr is not only in the list of supported languages. ElasticSearch allows one to specify the analyzer per document and per query. So, if you need to use a different analyzer for each document in the index you can do that in ElasticSearch. The same applies to queries – each query can use a different analyzer.

물론, Apache Solr 와 ElasticSearch 양쪽 모두 커스텀 해석기 체인을 정의할 수 있습니다. 당신의 데이터 처리에 사용되는 해석기 및 토크나이저, 필터의 리스트를 지정합니다. 그러나, ElasticSearch 와 Solr 의 차이는 지원되는 언어의 리스트 뿐만이 아닙니다. ElasticSearch 는 도큐먼트 마다, 쿼리마다 해석기를 지정할 수 있습니다. 때문에 인덱스 안의 각각의 도큐먼트에 다른 해석기의 사용이 필요하다면 ElasticSearch 에서는 가능합니다. 쿼리도 마찬가지입니다. 각 쿼리는 다른 해석기를 사용할 수 있습니다.

결과의 그룹핑

One of the most requested features for Apache Solr was result grouping. It was highly anticipated for Solr and it is still anticipated for ElasticSearch, which doesn’t yet have field grouping as of this writing. You can see the number of +1 votes in the following issue: https://github.com/elasticsearch/elasticsearch/issues/256. You can expect grouping to be supported in ElasticSearch after changes introduced in 0.20. If you are not familiar with results grouping – it allows you to group results based on the value of a field, value of a query, or a function and return matching documents as groups. You can imagine grouping results of restaurants on the value of the city field and returning only five restaurants for each city. A feature like this may be handy in some situations. Currently, for the search engines we are talking about, only Apache Solr supports results grouping out of the box.

Apache Solr 에 가장 많이 요청된 기능 중 하나가 결과의 그룹핑입니다. Solr 에 있어서 가장 기대되어 온 기능이며, ElasticSearch 에 있어서도 지금도 기대를 머금고 있습니다. 이 글을 쓰고 있는 시점에서는 ElasticSearch 는 필드의 그룹핑을 가지고 있지 않습니다. 다음 이슈를 보면 이에 대한 여러 요청이 있는 것을 알 수 있을 것입니다. https://github.com/elasticsearch/elasticsearch/issues/256

ElasticSearch에서는 0.20 이 릴리즈 된 때에 지원될 것이라고 기대하고 있습니다. 결과의 그룹핑을 모르는 사람을 위해 설명하자면, 필드나 쿼리 또는 함수의 값에 의해 결과를 그룹으로 나눌 수 있습니다. 그리고 매치된 도큐먼트를 그룹으로 반환합니다. 예를들어 레스토랑의 결과를 마을의 필드로 그룹핑해서 각 거리의 다섯가지 레스토랑만 반환합니다. 이런 기능은 여러 상황에서 편리하겠죠. 지금은 Apache Solr 만이 기본적으로 결과의 그룹핑을 할 수 있습니다.

Prospective Search

One thing Apache Solr completely lacks when comparing to ElasticSearch is functionality called Percolator in ElasticSearch. Imagine a search engine that, instead of storing documents in the index, stores queries and lets you check which stored/indexed queries match each new document being indexed. Sound handy, right? For example, this is useful when people want to watch out for any new documents (think Social Media, News, etc.) matching their topics of interest, as described through queries. This functionality is also called Prospective Search, some call it Pub-Sub as well as Stored Searches. At Sematext we’ve implemented this a few times for our clients using Solr, but ElasticSearch has this functionality built-in. If you want to know more about ElasticSearch Percolator see http://www.elasticsearch.org/blog/2011/02/08/percolator.html.

Apache Solr 가 ElasticSearch와 비교해서 완전하게 결여된 것 중 하나는 ElasticSearch 에서 퍼컬레이터 라고 부르는 기능입니다. 예를들어 인덱스에 도큐먼트를 넣는 것이 아니라, 쿼리를 넣고 새로운 도큐먼트의 인덱싱 때마다 인덱스된 쿼리가 매치하는 지 어떤 지를 체크하는 것이죠. 좋지 않나요? 편리한 점은 유저가 모든 새로운 도큐먼트(소셜미디어나 뉴스 같은 것을 생각해보세요)가 유저가 흥미를 가지는 토픽에 해당되는 지, 앞서 정의한 쿼리를 통해서 체크하고 싶은 경우에 유용합니다. 이 기능은 또 Prospective Search 라고도 불립니다. 또 어떤 사람은 Pub-Sub라고도 부르고, stored search 라고도 부릅니다. Sematext 에서는 Solr를 사용하는 고객을 위해서 이것을 몇 번인가 구현한 적이 있습니다. 하지만 ElasticSearch는 기본적으로 이 기능을 가지고 있죠. 만약 ElasticSearch의 퍼컬레이터에 대해서 좀 더 알아보고 싶으시다면 다음 URL 을 참고하세요.

http://www.elasticsearch.org/blog/2011/02/08/percolator.html

다음회 예고

In the next part of the series we will focus on comparing the ability to query your indices and leverage the full text search capabilities of Apache Solr and ElasticSearch. We will also look at the possibility to influence Lucene scoring algorithms during query time. Till next time :)

다음회에서는 Apache Solr 와 ElasticSearch 상의 인덱스에 대해서 쿼리의 능력과 전문검색의 능력에 대해서 알아보도록 합니다. 또 쿼리시에 Lucene 의 스코어링 알고리즘을 다루는 방법에 대해서도 알아볼 예정입니다. 그럼 다음회에서 봐요.

@kucrafal, @sematext

← Older Blog Archives Newer →