Elasticsearch – Part 1 – Why we chose it?

Understanding our searches and listings.

At the end of last year, we started working on ways to understand our visitors. One of the things we were interested in was what they searched for and how often they found what they were looking for. On the other side, we also wanted to know what happens to the listings our sellers post: how often they show up, and in which search queries. Then there were questions like which car makes and models are the most popular, in which regions, and other patterns we might find. In the end, we wanted to correlate all of this to understand what is happening on our sites, from individual listings to the bigger picture. Eventually this will help us improve our systems, and later, integrating this data back into the sites may provide a better experience for both buyers and sellers.

We started exploring ways to store this data. Getting the data was the easy part, since everything a user searches for comes through our internal API. The challenge was storing it in a manner that could later be used not only to understand our traffic but also to integrate back into our systems.

There were two important questions we asked ourselves: where do we store the data, and how do we make it meaningful?

 

So how do we store this data?

As we were already using Apache Solr as the search engine for our sites, our first thought was to somehow enable logging in Solr and get those logs into a format we could analyze.


While searching for something that did this, we came upon the ELK (Elasticsearch, Logstash, Kibana) stack, which sounded almost like what we wanted.

Logstash would take the Solr logs and dump them into Elasticsearch, Elasticsearch would let us query them, and Kibana would use Elasticsearch to graph them. Elasticsearch is a search server based on Lucene that provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. It is written in Java and is open source under the Apache License.
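To give a feel for that HTTP interface, here is a minimal sketch (in Python with the requests library, against an assumed local node on port 9200, using 1.x-era syntax and made-up index and field names) of indexing a schema-free JSON document and searching it:

import json
import requests

ES = "http://localhost:9200"  # assumed local Elasticsearch node

# Index a schema-free JSON document over plain HTTP (index, type and fields are illustrative)
doc = {"query_text": "toyota vios", "region": "kuala-lumpur", "results_count": 42}
requests.post(ES + "/demo-logs/search_log", data=json.dumps(doc),
              headers={"Content-Type": "application/json"})

# Make the document visible to search, then run a full-text query against it
requests.post(ES + "/demo-logs/_refresh")
query = {"query": {"match": {"query_text": "vios"}}}
resp = requests.post(ES + "/demo-logs/_search", data=json.dumps(query),
                     headers={"Content-Type": "application/json"}).json()
print(resp["hits"]["total"])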

We did a trial run of ELK using the Solr logs. It worked, but then came the inflexibility. The logs contained only what the user searched for, not where the query was generated from or the other data we might need. The other problem was how to capture what the visitor actually saw as a result of the search; that would require post-processing, as we would have to pick up the search query and fill in the results at a later time.

We put ELK aside for the moment and started looking for alternatives, ruling out an RDBMS and exploring some of the NoSQL databases and other post-processing technologies. We went through MongoDB, Hadoop, Hive, Cassandra, and VoltDB.


One of the solutions we worked out involved Cassandra. Cassandra's benchmarks were the best, with a high number of writes in less time, and it compresses data on storage and therefore requires less disk space, so it seemed almost exactly what we needed.

We first created a basic schema for Cassandra, creating collections, and wrote an API endpoint to write some dummy data into it. Then we ran basic load tests using JMeter while storing the data. The writes were great, and the disk space taken by Cassandra was low. But while implementing this, we kept reworking the schema and rethinking what we wanted to store, then changing the implementation. The thing that bothered us a bit was the post-processing we would have to do if we chose Cassandra. As the data would be in raw format, we would have to turn it into usable data, working on and changing the schema until we got to the desired result, and then still do post-processing on top of it. Since we were only beginning to explore how the data could be used, we needed something that would require less processing and would bring our data into a format we could query for aggregations and run analytical queries on.
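For reference, the Cassandra trial looked roughly like the sketch below. This is not our actual schema; it assumes a local Cassandra node, the DataStax Python driver, and made-up keyspace, table, and column names:

import datetime
import uuid
from cassandra.cluster import Cluster

# Assumed local node; the keyspace/table below are illustrative, not our production schema
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS search_logs
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS search_logs.queries (
        id uuid PRIMARY KEY,
        query_text text,
        region text,
        created_at timestamp
    )
""")

# One dummy search event, the kind of row our JMeter load tests kept inserting
session.execute(
    "INSERT INTO search_logs.queries (id, query_text, region, created_at) VALUES (%s, %s, %s, %s)",
    (uuid.uuid4(), "toyota vios", "kuala-lumpur", datetime.datetime.utcnow()),
)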

Elasticsearch


We then went back to the ELK stack and decided we didn't need Logstash. So we started a process similar to the one above for testing, but using standalone Elasticsearch for now. From our experience with Cassandra, we knew the first thing to do was to finalize which fields from the search query we wanted to save, and what data from the search result we wanted to record for each individual search query. All of this is anonymous data, but having it decided early on was a plus.
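To make that concrete, here is a rough sketch of the kind of index this leads to, with the search results stored as nested objects inside each search document. The field names are illustrative, not our final schema, and the syntax is for the Elasticsearch 1.x versions we were testing:

import json
import requests

ES = "http://localhost:9200"  # assumed local node

mapping = {
    "mappings": {
        "search_log": {
            "properties": {
                "query_text": {"type": "string"},
                "make":       {"type": "string", "index": "not_analyzed"},
                "model":      {"type": "string", "index": "not_analyzed"},
                "region":     {"type": "string", "index": "not_analyzed"},
                "timestamp":  {"type": "date"},
                # each result shown to the visitor is stored alongside the query
                "results": {
                    "type": "nested",
                    "properties": {
                        "listing_id": {"type": "long"},
                        "position":   {"type": "integer"}
                    }
                }
            }
        }
    }
}
requests.put(ES + "/search-logs", data=json.dumps(mapping),
             headers={"Content-Type": "application/json"})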

We did a rough calculation using the numbers from New Relic and Google Analytics and came up with an estimate of the requests we were anticipating. We then wrote scripts to populate dummy data into Elasticsearch and see how much space it takes to store the documents (each document containing an average number of search parameters and one search result). That gave us an estimated data size, but what about the load? We started by sending concurrent write requests to the server, initially with JMeter, and were able to test it with limited success. Unlike Cassandra, which was on our local server, we had deployed Elasticsearch on an AWS machine, so we ran into a bandwidth bottleneck while testing. We therefore decided to run the benchmark from another AWS machine on the same network as the Elasticsearch machine, and during this time we moved from JMeter to Apache Benchmark for the concurrent tests. This is when we decided to go with Elasticsearch: it easily managed the number of writes we estimated, and the data was easy to query. The only concern was disk size. Our initial assessment showed 2 TB of data for 3 months (with one search query and one search result per document), and the real figure would be much higher, since a search normally returns around 10 results.
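The population scripts were along these lines: a rough sketch that pushes batches of dummy documents through the bulk API (into the illustrative search-logs index sketched above) and then checks how much disk the index takes:

import json
import random
import requests

ES = "http://localhost:9200"  # assumed local node; index/type from the sketch above

def bulk_body(n):
    # Build n dummy documents in the newline-delimited _bulk format
    lines = []
    for _ in range(n):
        lines.append(json.dumps({"index": {"_index": "search-logs", "_type": "search_log"}}))
        lines.append(json.dumps({
            "query_text": "toyota vios",
            "region": random.choice(["kuala-lumpur", "penang", "johor"]),
            "results": [{"listing_id": random.randint(1, 10 ** 6), "position": 1}],
        }))
    return "\n".join(lines) + "\n"

for _ in range(100):  # 100 batches x 1,000 docs = 100,000 dummy documents
    requests.post(ES + "/_bulk", data=bulk_body(1000),
                  headers={"Content-Type": "application/x-ndjson"})

# Rough on-disk size of the index after the load
print(requests.get(ES + "/_cat/indices/search-logs?v").text)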


Then came the question of how to get the data into Elasticsearch. Of course, having the API write directly to Elasticsearch was the easiest solution, but there were three concerns with this.

First, we didn't want our API endpoints doing extra work and slowing down. Second, in case of a delay in the Elasticsearch write, we didn't want our API endpoint to slow down. Third, in case of an error in Elasticsearch, we didn't want it affecting the original endpoint, and we also wanted some kind of retry mechanism.

To avoid all of this, we decided to let our queue server (Fresque) do the writing to Elasticsearch. Our API would just create a job with the search query and forget about it, without having to do anything else. It is the job's task to generate the Solr results, process the search query parameters, do any post-processing work, and then save it into Elasticsearch. This ensures the site functions as before, with the load shifting to the queue server. I'll discuss Fresque and other load-testing details in Part 2 of this article.
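The shape of that pattern, sketched in Python (our actual queue is Fresque; here a plain Redis list stands in for it, and the Solr step is reduced to a comment):

import json
import redis
import requests

ES = "http://localhost:9200"  # assumed local Elasticsearch node
r = redis.StrictRedis()       # assumed local Redis acting as the queue

# API side: push the raw search parameters onto the queue and return immediately
def enqueue_search(params):
    r.lpush("search-log-jobs", json.dumps(params))

# Worker side: block for a job, enrich it, then write it to Elasticsearch
def work_forever():
    while True:
        _, payload = r.brpop("search-log-jobs")
        doc = json.loads(payload)
        # the real worker also runs the Solr query here and attaches the results
        doc["processed"] = True
        requests.post(ES + "/search-logs/search_log", data=json.dumps(doc),
                      headers={"Content-Type": "application/json"})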

Why Elasticsearch and not Solr?

There is a question that was nagging in our minds: why didn't we just go ahead and use Solr? It is also based on Apache Lucene, has great search features, and we already have experience managing it. Well, the answer lies in what we needed in this case. We chose Elasticsearch because of how it indexes data, the analyzers we can use, and the ability to model nested and parent-child data, but mainly because of the analytical queries it can run.

Solr is still geared much more toward text search, while Elasticsearch tilts more toward filtering and grouping, the analytical query workload, and not just text search. The Elasticsearch team has put effort into making such queries more efficient (lower memory footprint and CPU usage) in both Lucene and Elasticsearch. Elasticsearch is the better choice for us, as we don't need just text search but also complex search-time aggregations.
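For example, a question like "what are the most popular makes per region?" becomes a single aggregation query. A sketch against the illustrative index from earlier:

import json
import requests

ES = "http://localhost:9200"  # assumed local node; field names are illustrative

query = {
    "size": 0,  # we only want the aggregation buckets, not the individual hits
    "aggs": {
        "by_region": {
            "terms": {"field": "region"},
            "aggs": {
                "top_makes": {"terms": {"field": "make", "size": 5}}
            }
        }
    }
}
resp = requests.post(ES + "/search-logs/_search", data=json.dumps(query),
                     headers={"Content-Type": "application/json"}).json()
for bucket in resp["aggregations"]["by_region"]["buckets"]:
    print(bucket["key"], [m["key"] for m in bucket["top_makes"]["buckets"]])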

The way Elasticsearch manages shards and replication is also better than Solr's, as it is a native feature and gives more control, but we didn't factor that into the decision, although it is itself a good reason.

Conclusion

So, in the end, we went with Elasticsearch, accepting the larger data size in exchange for its ability to aggregate data, make it searchable, and let us run analytical queries, reducing the effort needed to process the data. Elasticsearch can transform data into searchable tokens with the tokenizer of our choice, apply any transformations on it, and then index the needed fields. It also supports both nested objects and parent-child objects, which is a great way to make sense of complex data. And then there is the wonderful Kibana, which can plot graphs from Elasticsearch and give us instant insight.

 

Next up

Elasticsearch – Part 2 – Implementation and what we learned.
Elasticsearch – Part 3 – A few weeks fast forwarded and the way ahead.

iCar Asia Product And Technology Hackathon Day

Winter is coming… We are going to have a Hackathon day… For some reason, both of these sentences meant the same thing to the fun peeps of the iCar Asia Product and IT team. Maybe because the idea had been over-discussed and never actually happened (for the year 2015), just like the brothers of the Night's Watch were told too much about the White Walkers. It stayed a myth for the brothers until they actually saw the White Walkers raid Hardhome as the Free Folk boarded ships bound for Castle Black [Game of Thrones, Season 5]. So we, the Product & Technology culture committee members (Faraz, Divya, Yi Fen, Syam, and Salam, myself), made sure the equivalent of the White Walkers' raid happened at iCar so that it couldn't stay a myth any more. Yes, I'm talking about arranging the Hackathon day for our team.

After the 'how and when are we going to arrange it' discussion in the culture committee meeting, this is the email I sent out to the team on 15th April 2015.

Hi Team,

As you all know, we have been discussing the 'Hackathon' for quite some time now, so let's actually do it.

The culture committee, as a team, has agreed on making it happen next 'Tuesday 21st April 2015', and Joey, our beloved CIO, has approved it too. So get your turbo-creativity charged: you're gonna need it.

There are a few rules which we are gonna share with you later this week. The basic idea is to start the Hackathon officially on Monday evening at 5 PM: you can form a team, think about the idea, and start working on it on Monday itself (after 5 PM). You can work on your 'great idea' until Tuesday 5 PM.

After 5 PM Tuesday, each team (turn by turn) will present whatever they have worked on, and the ideas will be ranked based on preset criteria (which we'll share with you later).

Don’t forget, there are prizes too (for the first and the second best teams).

Get ready folks: Winter is coming ;).

Thanks.
Salam Khan

There was mixed feedback about the email. Many thought it was just another promise email and nothing was going to happen; however, the push and energy from the culture committee made them feel it was real and not just another promise.

Hackathon Guidelines / Rules

To emphasize that this hackathon was real, the very next day I wrote these guidelines, discussed them with the culture committee, and shared them with everybody in the Product and Technology team. Some points were taken as a joke at first, but once they were explained, the team agreed to follow them.

Read to enjoy :).

Team Guidelines

  1. Each team must consist of more than 2 members but not more than 5 (follow the Hipster, Hacker, and Hustler approach)
  2. Syam and David cannot be in the same team
  3. Manju and Faraz cannot be in the same team
  4. Arvind and Tanveer cannot be in the same team
  5. Joey and Pedro cannot be a part of any team
  6. No team can have more than 2 .NET devs
  7. No team can have more than 2 PHP devs
  8. No team can have more than 2 QAs
  9. Syaiful and Juliana cannot be in the same team
  10. Alain, Geetha, and Celine cannot be in the same team
  11. Sonny and Jackson cannot be in the same team
  12. Albert and Salam cannot be in the same team
  13. Teams can be formed any time from now until Thursday 5 PM, but the actual work must not start before then
  14. Teams will have 24 hours, from 5 PM Thursday 23rd April 2015 to 5 PM Friday 24th April 2015, to work on their idea
  15. Teams can spend the 24 hours in the office if they want to
  16. A team's output does not have to be working software; it can be a prototype, a piece of software, or even a presentation
  17. There must not be a single P&T member left without being part of a team

Jury and the general rules

  1. Joey and Pedro (and the overall clapping for each team)
  2. Ideas will be ranked on the basis of:
    1. Innovation and creativity
    2. Impact on society
    3. Market viability
  3. Each team will get 5 to 7 minutes (not less than 5 minutes and not more than 7 minutes)
  4. No drugs or creativity-enhancing substances (other than Red Bull and coffee) can be used throughout the Hackathon

Prizes

  1. The first team gets a Raspberry Pi 2 Model B (for each member)
  2. The second team gets a 2015 iFlix annual subscription (for each member)
  3. All teams will get a certificate of Hackathon participation (for each member)

First draft by Salam; approved by the Culture Committee and Joey.

We told everybody they had only one day to form their teams and that the day after tomorrow (April 22nd) would be the Hackathon day. And this time we asked the culture committee members to make sad or angry faces. We did that, and it actually worked.

Hackathon Teams

Within the next day, we had these 4 teams formed.

ATAMS (Pronounced as ATOMS)

  1. Alain
  2. Tanveer
  3. Ashok
  4. Mayur
  5. Salam

HEAVYWEIGHT (Yeah, most of them are heavyweight indeed)

  1. Syaiful
  2. Bob
  3. Manju
  4. Syamsul
  5. David

The Winning Team (It doesn’t mean they won or something ;))

  1. Fahad
  2. Zeeshan
  3. Wei Fong
  4. Juliana
  5. Celine
  6. Jackson

Juz Bananas (Yeah, whatever, they won!)

  1. Faraz
  2. Shahzad
  3. Daniel
  4. Yi Fen
  5. Arvind
  6. Lakshami

Team Projects

It really happened. Every group worked very hard and used their creativity to come up with something new. The team projects were as follows.

Carlist Desktop Chat project


(Langkawi) Travel Mobile App


CanCan Lunch App


Carlist Desktop – One Stop Shop for buyers and sellers


And the first and the second prizes went to…

All four ideas were really appreciated by the Jury and the audience (business people from other departments). But in the end, the number one idea was the 'Chat Project for Carlist', which won the Jury's hearts, followed by the 'Chat App', which took the second prize.

Random clicks

Here you go, some random clicks from the Hack Day.

[Photos: iCar Hackathon Day]

In the end, I would like to use this platform to thank everybody (current and former alike) on the iCar Asia Product team who helped us arrange this amazing hack day. As we all believe that "a journey of a thousand miles begins with a single step", and the first step is always the toughest one, I hope that more of these Hackathons / Hack Days will keep happening at iCar Asia and that the fun peeps on the Product and Technology team will keep innovating.

Cheers.

Migrate Old URLs to New URL Structure Using Nginx and Redis.

While maintaining a website, webmasters may decide to move the whole site or parts of it to a new location. For example, you might have a URL structure that is not SEO- or user-friendly, and you have to change it. Changing the URL structure can involve a bit of effort, but it's worth doing properly.

It's very important to redirect all of your old URLs' traffic to the new locations using 301 redirects, and to make sure visitors can navigate the site without running into 404 error pages.

To start with, you will need to generate a list of old URLs and map them to their new destinations. This list can grow bigger and bigger depending on the size of your website, and how you store the mapping depends on your servers and the number of URLs. You can use a database, or configure URL rewriting on your server or in your application for common redirect patterns.

The problem with a database is that it is slow, while a file-based mapping (in nginx) can take a long time just to reload or restart nginx (and you need to reload or restart nginx whenever you add more redirect rules) and can also take a significant amount of memory, depending on the size of the mapping file.

Nginx + Redis – Migrate Old URLs to New URL Structure

Fortunately, by using Redis and the Nginx Lua module, you can make this transition smooth and the overall migration process painless.

Requirements:

1 – Install the nginx-extras & redis-server packages (http://www.dotdeb.org/instructions/)
2 – Install http://openresty.org/download/ngx_openresty-1.2.4.14.tar.gz
3 – Configure nginx
+ Add the following line at the start of the nginx config file (replace the path with the location where you installed the OpenResty Lua libraries).


lua_package_path "/usr/local/openresty/lualib/?.lua;;";

4 – Add the following location block to the nginx config file:


location ~ "^/[\d]{4}/[\d]{2}/[\d]{2}/(?<slug>[\w-]+)/?$" {

    content_by_lua '
        local redis = require "resty.redis"
        local red = redis:new()

        red:set_timeout(1000) -- 1 sec
        local ok, err = red:connect("127.0.0.1", 6379)
        if not ok then
            ngx.exit(503)
            return
        end

        local key = ngx.var.slug
        local res, err = red:get(key)

        if not res then
            ngx.exit(404)
            return
        end

        if res == ngx.null then
            ngx.exit(404)
            return
        end

        ngx.redirect(res, 301)
    ';
}

 

How does it work?


lua_package_path "/usr/local/openresty/lualib/?.lua;;";

This line tells nginx where to find the Lua libraries (such as resty.redis), since we intend to use Lua scripts in the configuration.


location ~ "^/[\d]{4}/[\d]{2}/[\d]{2}/(?<slug>[\w-]+)/?$"

This line makes all requests matching the old URL pattern fall under this block, capturing the slug into a variable (slug) that the Lua script can read via ngx.var.slug.


local redis = require "resty.redis"
local red = redis:new()

red:set_timeout(1000) -- 1 sec
local ok, err = red:connect("127.0.0.1", 6379)
if not ok then
ngx.exit(503)
return
end

The Lua script above tries to connect to the Redis server (on host 127.0.0.1 and port 6379) with a 1-second timeout.


local key = ngx.var.slug
local res, err = red:get(key)

The Lua script above fetches the key from Redis (using the slug variable we captured with the regex).

The rest is quite self-explanatory: it redirects with a 301 if the key is found, or returns a 404 if not.

+ NOTE: In the example above, replace the regex and the Redis server host & port according to your needs:
a. The regex above is for the URL pattern /{year}/{month}/{day}/{slug}
b. The Redis server host (i.e. 127.0.0.1) and port (i.e. 6379)
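
For the lookup to succeed, Redis obviously needs to hold the slug-to-new-URL mapping beforehand. A minimal way to load it (a sketch assuming Python with redis-py and a two-column urls.csv of old slugs and new URLs; adapt it to however your mapping is stored):

import csv
import redis

r = redis.StrictRedis(host="127.0.0.1", port=6379)

# urls.csv is assumed to have two columns: old slug, new absolute URL
with open("urls.csv") as f:
    for slug, new_url in csv.reader(f):
        r.set(slug, new_url)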

Know another, or perhaps a better, way to migrate old URLs to a new URL structure? Or have you used the same method for your website's URL migration? Share your experience with us in the comments. We are always happy to hear from you.