
TRIM(), SUBSTR(), REPLACE(), INTR() and the Question from Hell
Note: I am just finishing my first SQL class on Coursera. So my judgement of how difficult a question might be needs to be understood on that basis. That’s for the fancy pants actually know SQL types. For those just learning like me… Everything I say below might be entirely wrong. Please do not take anything I say here as good advice regarding SQL. I mean, you really shouldn’t take my advice for anything.
So, final assignment for SQL for Data Science. We’re working with a database from Yelp!. Part 1 was a bunch of questions asking us to profile data from this database. How many entries here, how many there. Not too hard. Done. Part 2 we are asked to “come up with our own inferences and analysis of the data for a particular research question [we] want to answer.” On to Part 2, Question 1!
Part 2, Question 1 took me three hours. The whole project is supposed to take two hours.
The question asked us to look at the entries from some category, in some city, and compare the star ratings customers had given businesses to their opening hours and see if we could make any interesting inferences. The “stars” attribute had data in one-digit floats (e.g. 2.0, 3.5). Easy peasy to work with. Data in the “hours” attribute were strings of the form “Day|Openingtime-Closingtime,” for example “Thursday|1100-1700.”
So before I could really do anything with the opening hours, I had to get rid of the day and the ‘|’ from that string. Leaving me something like “1100-1700.” No problem…
TRIM()
I thought a while and then decided to poke around the class discussion boards for some ideas. I ignored anyone posting code because I wanted to solve the problem myself. I was really in the class to learn. I did need some inspiration though.
Some of the students were talking about TRIM(). I guess if we could just “trim” off those letters and the upright slashy thing (too lazy to look up its name) we’ll be good right? Yeah, no. That’s not really what TRIM() is for. You can’t trim patterns. I was able to trim a letter here or a letter there, but it was not meant to be.
REPLACE()
If I could just replace every letter and “|” with nothing I’ll be golden! Ok, how do we tell SQL that we wanna replace all the letters. All the letters are like [A-Za-z] right (like Python?)?
Yeah, no. It seems that REPLACE()will only take one particular string and replace it with another. But you can “nest” REPLACE()’s. So all I would have to do is tell SQL that I want to replace each letter and “|” with “”. So 53 nesting REPLACE()’s for each upper-case and lower-case latter. Great. (Yes, I know there’s another option. Now I know. The caffeine was getting to me then.)
TRANSLATE()
Luckily there’s TRANSLATE(). TRANSLATE() does exactly what I wanted REPLACE() to do above. I can feed it a pattern with wildcards and it will replace all of them with a particular string. Great. It’s nice to have this problem solved. Wait. SQLite doesn’t recognize TRANSLATE(). And we use SQLite. Why? I dunno. People on the internet don’t seem to know either.
SUBSTR()
This is clearly a SUBSTR() issue. All I have to do is create a substring that only includes the times. This would have been easy IF DAMN YELP USED FOUR DIGITS FOR ALL THEIR TIMES. 3pm is “1500.” 9am is not “0900,” it’s “900.” The actually time part of the string could be 7-9 digits long. Why Yelp?!?! Why? The SUBSTR() function takes for arguments the string and the starting point of the desired substring (and the number of characters you want, but we wouldn’t need that). If Yelp! used four-digit times, I could start all the substrings at –9 (can you do that in SQL? I know you can in Python, which I’m also badly learning). That would be great. But I don’t know where the time starts… or do I… I do! The “|”! How to I find the index of the “|”?
INSTR()
INSTR() will give me the index of a particular character or string! If I just feed that into SUBSTR(), it will find the “|” and I’ll have a beginning index for my substring!
So when INSTR() didn’t work in our project SQL environment I panicked (again). I looked back at the discussion forums to see if anyone else was dealing with this as well. They were. Most of the concerned messages went unanswered but one was. A mentor stated that INSTR() was not necessary to complete the course and therefore the course environment didn’t recognize it. Maybe it wasn’t necessary, but it would be really nice to have! Back to the drawing board. And back to…
Nested REPLACE()’s
For all you nerds out there and read the REPLACE() section and said, “ Why don’t you just…,” okay I finally did it.
Of course, I didn’t have to replace every individual letter. I only had to replace every day name. Seven nests…
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(h.hours, ‘Monday|’, ”),’Tuesday|’, ”), ‘Wednesday|’, ”), ‘Thursday|’, ”), ‘Friday|’, ”), ‘Saturday|’, ”), ‘Sunday|’, ”) as hours_trimmed
There was a bit more to do with this question, but this was the part that took a lot of thinking and research. And hours.
There’s still probably an easier way.