Kyle Schmidt Software Craftsman

MySQL Indexing for Performance

I just learned about indexing on a MySQL database at a fairly high-level and will attempt to summarize what I’ve learned. MySQL has many different kinds of indexes (full-text indexes, spatial indexes, TokuDB - fractal tree index, ScaleDB - Patricia Trie, InfiniDB - special logic) but for most use cases, InnoDB and MyISAM are the two most common.

Until MySQL 5.5, the default index was MyISAM. MyISAM is good for analytical use cases because it has table-level locking which means two users cannot query the same table at the same time. Because of this, MyISAM can get away with not being ACID compliant as well as non-transactional. Furthermore, MyISAM doesn’t support relational contraints like Foreign Keys which means that it is better to have wider tables in MyISAM rather than a First Normal Form database.

InnoDB vs. MyISAM

Both InnoDB and MyISAM leverage B-tree data structures to improve query speeds. In a B-tree index, values are stored in order and each leaf is at the same distance from the root level. However, MyISAM and InnoDB again differ in that MyISAM stores keys and data in internal nodes whereas InnoDB only stores data in its leaf nodes with the internal nodes only containing pointers to their children. We call the InnoDB version of B-tree a B+tree to differentiate it from the one used in MyISAM. Due to the fact that data is only stored in the leaf nodes, InnoDB can store more indexes in memory as it traverses the tree and is therefore much smaller than a B-tree.

But you may be wondering, “Why B-trees?” B-trees are used because it speeds up data access. Storage engines are able to traverse the tree from the root node to the leaf node with the help of pointers even though the locations of these nodes might be spread out in the computer’s memory. We also see increased performance for full value (“Kyle Schmidt”), leftmost value or column prefix (“Kyle” from “Kyle Schmidt”), and range of values (1 to 99) query patterns as well as aiding our ORDER BY clauses because nodes point to each other in order (more to come).

Traversing Clustered B-Tree Indexes

Sometimes a single node representing a key still doesn’t offer the efficiency that we need. What if we want to store keys that are close in proximity such as employees of a certain profession? This is where cluster indexes come in to play. We will use cluster indexes throughout the rest of this article to talk about B-trees. Cluster indexes have the same structure as B-trees but each “node” of a cluster B-tree contains a cluster of keys rather than a single key.

The advantages of cluster indexes are that related data is stored close to each other leading to less disk I/O while retrieving sequential or range data rather than performing a full table scan for indexes that are similar. Furthermore, cluster indexes allow even further optimizations for data retrieval since data and indexes are stored together at leaf nodes. Lastly, leaf nodes point to each other. Thereby, if you perform a query on a range of data that is held within two internal cluster nodes of the cluster B-tree (in the simplest case) then the leaf nodes will be able to point to the next cluster in order without having to backtrack up the tree.

The disadvantages of a cluster index are that data insertion speed is dependent on the order of the primary key. The table needs to be re-indexed/optimized if inserted data is not ordered by primary key which could potentially take quite a long time - time that your application might not afford to spend. Clusters can split when new data is inserted which can lead to fragmentation of the cluster indexed B-tree structure. All of these contribute to poor performance if the cluster index isn’t re-optimized when performing an ill-adviced insertion.

B-trees support secondary indexes which further aid our queries. All tables should have multiple secondary indexes aside from primary indexes in order to aid query optimization. In InnoDB, the secondary index does not store actual data but only contains a pointer to the row of data. However, if the secondary index is coupled with a primary index then the secondary index will contain a pointer to the primary index. Secondary indexes are necessary because our queries usally make use of multiple columns of a single SQL table and therefore a single primary index will not suffice. For example, our first query could be on a column labeled first_name and a second query on a column labeled last_name. We would need two different indexes over both of these columns in order to optimize these queries. To carry-out our example with last_name being our secondary index and first_name as our primary index, the last_name index will point to the index for its corresponding first_name which will contain the data that it represents. Similarly, a last_name only query will require the secondary key index to find the appropriate last_name and then the pointer will jump back to the primary key index before data retrieval.

Traversing Clustered B-Tree Indexes

Yet, there is one last type optimization that I want to talk about, Hash Indexes. Hash indexes are built on top of hash tables and increase query performance for exact lookups. You can think of a hash as a function that when given an input computes the key of a map or dictionary-like data structure. Hash indexes work by creating a hash value or key for each row of data. Given an exact match of that row of data, the hash table will compute the hash of that match which ‘salts’ to the index of a hash table pointing to its representative row of data. Generally, different keys generate different hash codes but if multiple values have the same hash code, then the value of the hash code in the hash table will be a linked list of row pointers. Hash indexes are even faster than B-trees when a key doesn’t contain a linked list and effective since the table resides in memory.

Traversing Clustered B-Tree Indexes

However, there are limitations to Hash indexes. Hash indexes only contains the hash code and pointers to a row of data, not the data itself. Because it doesn’t contain the original data, hash indexes are not suitable for sorting or partial matching - it can only support the equal operator. Furthermore, I also mentioned that it is possible for multiple rows to have the same hash code, creating a linked list. This will slow down query time considerably in these cases, creating a linear seach for a node row pointer.

But what if there is a way to leverage both the efficiency of B-trees and Hash indexes? This is exactly what InnoDB does. It uses what is called an Adaptive Hash Index which sits on top of a B+tree. This comes “out-of-the-box” when you use InnoDB indexing with the caveat thast it can’t be modified of configured. The Adaptive Hash Index works by evaluating a query for its Hash Index _and_the B+tree Index, allowing us to leverage the quick lookup of hash tables and the traversal of relative nodes in a B+tree.

Slides provided by Pluralsight.com


Continue reading...

Rut

Rut (noun):

  1. A long deep track made by the repeated passage of the wheels of vehicles.
  1. A habit or pattern of behavior that has become dull and unpreductive but is hard to change.

Have you ever had that (seemingly temporary) feeling that you were in a rut? Well, what if I told you that the rut you’re feeling isn’t as temporary as it seems? What if I told you that ruts are there all along and that ruts are our monotonous habits that we only feel after they have finally settled in and made themselves comfortable? A few months ago I felt like I was in a rut and I wish to explain to you how I dug myself out and continue to work on my habits so that the rut doesn’t reappear.

When a rut encroaches, you really feel it. You feel lethargic, work becomes that much harder, and you start to feel sorry for yourself. When I feel a rut come on, I reach for comfort which for me are self-help books. This time, I came across a great book that I recommend for anyone needing inspiration and a career wake-up call: Big Magic by Elizabeth Gilbert.

Throughout the book, Gilbert emphasizes the fact that we are all creative people despite how uncreative we feel during our deepest ruts or how small we feel in the scheme of our wildest dreams. Through it all, we must remember the joy that creativity brings. It is a magical moment where, for intermittent periods of time, we are able to live outside of ourselves, to be bigger than who we currently are. Gilbert wants us and is begging us to never forget those moments of creativity and eudaimonia which we should use to fuel our passion and right the ship to align with our dreams.

But I argue that this is not enough. Our past habits are clearly not working and if we want to get out of our rut (definition 2) then we must adhere to rut (definition 1). This is what I have found to be more profound than looking internally (at our current habits) for answers.

What do I mean by rut (definition 1)? When facing a rut, I now look to my heroes, the giants in my field who I aspire to become. I study their art, their work ethic, and their philosophies. These are the men and women who have paved the path to success and by following your heroes, you too can achieve greatness. In order to dig yourself out of your rut (definition 2), you are going to have to become aware of and adhere to the ruts (definition 1) created by your heroes and their disciplined habits.

A well-known example of someone who looked to his heroes for inspiration was Benjamin Franklin. Franklin wished to improve his writing for his books, notes, and legislation while also widening his vocabulary. In order to do so, Franklin used The Spectator, a popular newspaper at the time and one he recognized for its impeccable writing style, as a tool he could use to emulate and elevate his writing skills. Franklin would:

  1. Read the essay in the paper.
  2. Make notes for each sentence he had read and set it aside.
  3. Look at the notes and try to replicate the essay in his own words (he would sometime jumble his notes to make this exercise more difficult).
  4. Compare his version of the essay to the original.
  5. Revise and improve his version.

If you adhere to your heroes’ ruts (definition 1) then you can dig yourself out of your slump and recover to a greater place than you were before. Coincidentally, this technique also happens to be one that Anders Ericsson heavily advises in his new book Peak, his study of the best athletes, musicians, and USA Memory Champions of the world and their practicing habits. This is a topic for another time but I will briefly say that by studying your heroes and giants you will also gain perfect mental representations or mental models of the skill you are trying to achieve which you can use as a guide for emulation.

Read Big Magic, I thoroughly enjoyed it and I believe you will as well but recognize that you will only get so far following Gilbert’s advice. If one wishes to break their habits then I advise them to look to their heroes and study them. Study their approach to their art and study their art itself. This should not only reinspire you but will provide a template or mental model that you can use to emulate, compare, and attempt to exceed.


Continue reading...

Emacs Beginner Resources

In October of 2015 I was thinking about branching out from Python, my first language and one that I had been using for about a year. I came across Peter Norvig’s essay Teach Yourself Programming in Ten Years and decided to learn Lisp which inevitably led me to Emacs. By then, I had been using Vim for about three quarters of a year and was familiar with the basics but I never felt comfortable. By now, I recognize that my lack of comfort in Vim at the time was due to the fact that I never gave it the respect it deserves. But in the process, I have stumbled upon something magical: a text editor… no, an operating system that is fully customizable and configurable to the way I work. Emacs is an environment that I have tailored so that no matter what I am trying to accomplish within a given day, I am still operating within the same familiar application.

However, my initial Emacs journey was slow to matriculate and I have since discovered teachings that I wish I had known when I began my Emacs pursuit. When I first started using Emacs, I was quick to attempt to customize my text editor, taking snippets of Emacs lisp from the masters’ init files. This soon grew unwieldy and I had Elisp snippets in my init that I had no idea of the functionality or benefit they provided. Instead, I have learned the importance of mastering basic navigation and keybindings pertaining to the major modes that I use the most. I didn’t discover this path on my own, this is the advice of redguardtoo. Since reading his advice, I went through the Emacs tutorial again taking care to deliberately practice and hone the basic keybindings. I then completed my first reading of the Emacs manual, started following blog posts, and reading the masters’ init files. At the same time, I read Mastering Emacs from which I gained an incredible amount of advice and knowledge. Through this book I learned the most important keybindings with which all Emacs users should be familiar, pertinent Emacs packages to go along with a basic Emacs installation, and most importantly how to ask Emacs questions using the built-in documentation when you need help. This book alone was an incredible resource; I wish I had read it sooner in my development and it will continue to be a resource that I constantly reference for help (along with the built-in Emacs info pages of course).

Let me know of some Emacs references that you found helpful during your development. What basic Emacs packages do you feel are pertinent to your workflow that I am missing from my init file?


Continue reading...