
Image via Wikipedia
I recently had the dubious pleasure of having to deal with a single exported database table with over 3 million records in it. That in itself isn’t a problem, but slugging around a 3GB CSV file and then having to wangle it six ways to Sunday to get it into a MySQL table had a little higher PITA factor than I was looking for.
One of my earlier attempts at getting the data into the MySQL database was to use the (relatively) new CSV engine that MySQL supports. This engine actually allows you to completely by-pass loading your CSV file into MySQL; rather, you can create a table using the CSV engine, and then just replace the data file within the MySQL directory with your actual CSV file. Presto blammo! Instantaneously loaded data. The only real downside from my perspective is that the CSV engine does not support indexes which, when you’re dealing with over 3 million records, is a deal breaker.
More information on using the MySQL CVS engine can be found here.
I ended up creating the table (which has 314 fields) and then using the MySQL LOAD DATA INFILE command to get the file into a regular old ISAM table which supports indexes. I then turned my thoughts to what, exactly, I should index.
First, keep in mind that an index is a method the database engine can employ to break rows up into smaller chunks so it can know very generally where things are in the table. When you query the table, the index is consulted to quickly get to a general set of records and then each record is examined to find the exact one(s) you want. Think of indexing like an address book with alphabetic tabs sticking out. If you’re looking for your friend George Smith’s address, you can open it to the S tab and then thumb through each page looking for George. That’s orders of magnitude faster than starting at page one of the As and then looking at every single page until you stumble across George way down in the Ses. Those tabs are essentially indexes.
Ok, so knowing that, we should index every field, right?
No. We shouldn’t. It is possible to over-index a table and end up shooting yourself in the foot. There are certain types of fields which benefit exactly nothing from being indexed. Here are my very basic, database engine agnostic, and high level requirements for choosing what fields to index in a table:
- If the field is unique in every row, do not index it. You will actually lose speed because the index will not prevent the database engine rom having to look at every row anyhow, and it has the additional baggage of the useless index to contend with.
- If the field is the same in every row, do not index it for the same reasons listed in the first point
- Fields that you think you will probably be using in your WHERE clause should probably be indexed because those are the fields that the database engine is going to be looking at. But remember the first two points.
- Fields that have over 50% unique values in your table are EXCELLENT candidates for indexing.
Those are my very general rules that I consider when determining what fields to index in a table. Each situation is unique and there are literally hundreds of other factors to take under consideration but if you stick with just these four, you’re more than halfway there.
Tags: Comma-separated values, Database, Database engine, FAQs Help and Tutorials, File Management, MySQL, Open Source, Table
Rate this post:
Some related posts:
`