So what do you do when you have 240 Million profiles and you want to search on first, last, middle names + locations + tags ?

And you are stuck with MySQL cause Oracle is way out of your budget?

Well – give sphinx a try!

I know that sphinx is more of a text search / stats engine but with a little trial and error and some sneakiness you can achieve some cool results, so here we go:

Testing Sphinx 0.9.8-release (r1371)

Preliminary results (sphinx):

STAGE 1:

[root@web01 /]# search –config /usr/local/etc/sphinx.conf -q -a real -l 5 -s “results desc” -i test2

Sphinx 0.9.8-release (r1371)

Copyright (c) 2001-2008, Andrew Aksyonoff

using config file ‘/usr/local/etc/sphinx.conf’…

index ‘test2′: query ‘real ‘: returned 1000 matches of 34358 total in 0.039 sec

displaying matches:

1. document=32792, weight=1, results=827830, date_added=Wed Jun  6 11:23:14 2007

2. document=3376336, weight=1, results=4302, date_added=Tue Aug  5 10:59:12 2008

3. document=18765, weight=1, results=3586, date_added=Tue Jun  5 21:42:00 2007

4. document=772274, weight=1, results=3122, date_added=Wed May  7 21:13:59 2008

5. document=113495, weight=1, results=1215, date_added=Tue Sep 18 07:31:07 2007

words:

1. ‘real’: 34358 documents, 34402 hits

(Mysql 5.0.37):

mysql> select id from employers where name like ‘%real %’ order by results desc limit 5;

+———+

| id      |

+———+

|   32792 |

| 3376336 |

|   18765 |

|  772274 |

|  113495 |

+———+

5 rows in set (7.57 sec)


Umn … lets see… I’d go with SPHINX, want more ?

Here it is:

STAGE 2:

Goal:
a. Count how many profiles are there in “Terre Haute” city_id 1906883 and has WORK TAGS (Strategic Supply Chain Manager, Strategic Supply Chain Management 5 inc., Strategic Supply Manager, strategic supply)
b. Get first 20 profile ids order by links desc

Lets start with good old MySQL 5.0.37 :

a.
mysql> SELECT COUNT(1) AS cnt FROM region_global_tags_222 AS pgt JOIN search_city_3837 AS loct USING(profile_id) WHERE ((pgt.global_tag_type=”employer” && pgt.global_tag_id IN (223209,141623,4561058,331778))) AND pgt.region_id=3837 AND loct.city_id=1906883;

+—–+

| cnt |

+—–+

|   1 |

+—–+

1 row in set (2 min 7.14 sec)

b.

mysql> SELECT profile_id AS cnt FROM region_global_tags_222 AS pgt JOIN search_city_3837 AS loct USING(profile_id) WHERE ((pgt.global_tag_type=”employer” && pgt.global_tag_id IN (223209,141623,4561058,331778))) AND pgt.region_id=3837 AND loct.city_id=1906883 order by pgt.links desc limit 20;

+———–+

| cnt       |

+———–+

| 147563021 |

+———–+

1 row in set (3 min 4.47 sec)

TOTAL:  5 MINUTES and 11.51 SECONDS !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Lets try with SPHINX:

[root@web01 pavel]# search –config /usr/local/etc/sphinx.conf -e ‘@global_tag_type employer @global_tag_id (223209|141623|4561058|331778) @city_id 1906883′ -l 200 -s “links desc” -i tags_region_3837                                                                                                                                           Sphinx 0.9.8-release (r1371)

Copyright (c) 2001-2008, Andrew Aksyonoff

using config file ‘/usr/local/etc/sphinx.conf’…

index ‘tags_region_3837′: query ‘@global_tag_type employer @global_tag_id (223209|141623|4561058|331778) @city_id 1906883 ‘: returned 1 matches of 1 total in 0.014 sec :)


0.014 SECONDS vs. 5 MINUTES and 11.51 SECONDS

This is one of the uses of Sphinx that can increase your performance and decrease some DB load :)

Comments ?