Using Word Frequency Charts for Better Word Clouds

Word clouds

Data scientists notoriously hate word clouds. Beyond showing what the top two or three words are (because they are the biggest), a word cloud makes it difficult to see how much one word is used relative to another. Unfortunately, clients and non-data people love word clouds and sometimes insist on them. What is a self-respecting data nerd to do?

Pair it with a word frequency chart!

The easiest way to do this is with Counter from Python’s collections module:

from collections import Counter
Counter(words).most_common()

Then you can use your favorite charting tool to make a bar chart of the results. I prefer D3.js.
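As a rough sketch of the counting step, assuming the text is already loaded into a string. The sample sentence, the simple lowercase-and-regex tokenizer, and the word_counts helper name are all illustrative and not from the original project:

```python
import re
from collections import Counter

def word_counts(text):
    # Lowercase the text and pull out runs of letters/apostrophes as words
    words = re.findall(r"[a-z']+", text.lower())
    # most_common() returns (word, count) pairs, highest count first
    return Counter(words).most_common()

# Made-up sample text for illustration
sample = "the cat sat on the mat and the dog sat too"
top_words = word_counts(sample)
print(top_words[:2])
```

The resulting list of (word, count) pairs can be handed to whatever charting tool you like as the data for a bar chart.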

Results

Word Frequency Chart

Word Cloud

If you see both together, you get a better understanding of the words being used. Of course, a single word doesn’t always capture sentiment. Single words can be helpful in smaller data sets, but common phrases are sometimes more helpful in larger ones. For common phrases, use n-gram analysis.
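A minimal sketch of n-gram counting, assuming the text has already been split into a list of words. The ngram_counts name and the sample sentence are hypothetical, and a real analysis would likely want better tokenizing and stop-word handling:

```python
from collections import Counter

def ngram_counts(words, n=2):
    # Slide a window of size n across the word list and count each phrase
    ngrams = zip(*[words[i:] for i in range(n)])
    return Counter(" ".join(gram) for gram in ngrams).most_common()

# Made-up sample text for illustration
words = "i love word clouds but i love bar charts more".split()
top_phrases = ngram_counts(words, n=2)
print(top_phrases[0])
```

Setting n=3 would count three-word phrases instead.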

For more on visualizing text, check out episode 62 of the Data Stories podcast and the Text Visualization Browser.

Identifying and killing background Python processes

Today I learned:

How to kill rogue background Python processes

I started a simple Python HTTP server as a background process at the end of a Python script. That was a mistake. Since it isn’t running in an open Terminal window, it is easy to forget about, and it will run until you kill the process or reboot your machine. I don’t like that.

Here is how to identify and kill background Python processes in Terminal.app:

Run:

ps -elf | grep python

This will return something like this:

501 74440 74439     4004   0  31  0  2505592   9276 -      S        0 ??         0:00.29 python -m Simple  8:43AM
501 77045 77016     4006   0  31  0  2436888    812 -      S+       0 ttys000    0:00.00 grep python       8:57AM

The second column is the PID. Make note of which one you want to kill. Then run:

kill -9 PID

For example, say I want to kill the python -m SimpleHTTPServer process. Its PID is 74440, so I would run:

kill -9 74440
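One way to avoid hunting for the PID afterward is to keep a handle on the background process in the script that started it. A minimal sketch, assuming the process is launched with subprocess; the sleeping child below is a hypothetical stand-in for the real server command:

```python
import subprocess
import sys

# Start a background process and keep the handle. The sleeping child here is
# a stand-in (an assumption) for something like a local HTTP server.
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
print("Started background process with PID", proc.pid)

# ... the rest of the script's work would go here ...

# Ask the process to stop (SIGTERM on macOS/Linux), then wait for it to exit
proc.terminate()
proc.wait()
print("Exit code:", proc.returncode)
```

On macOS and Linux, proc.terminate() sends the same signal as a plain kill PID, while proc.kill() is the equivalent of kill -9.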

Amending Commits, Matplotlib, and More Python

I’ve been on vacation and spent the last two days catching up and not doing a lot of learning, so I’ve been lazy about putting up TIL posts. That is over. (I did, however, push some updates to my Apple Photos Analysis project.) Here is a small collection of things I learned in the last week.


Amending commits

Say you forgot to add a file to your last commit or you made a typo in your commit message. You can amend it!

Make the necessary changes, then do this:

git commit --amend -m "Commit message here"

If you’ve already pushed it to an external repository, you’ll need to force the push since the external repo will look like it is ahead. If branch protection is turned on, you’ll need to make a new commit. Make sure you aren’t overwriting anything important!

git push origin master --force

Here are the docs.


Adding data labels to the top of bar charts in Matplotlib

Matplotlib is a great plotting library for Python.

from matplotlib import pyplot as plt

def autolabel(rects):
    # Attach a text label above each bar showing its height
    for rect in rects:
        height = rect.get_height()
        plt.text(rect.get_x() + rect.get_width() / 2., 5 + height,
                 '%d' % int(height),
                 ha='center', va='bottom')

rect = plt.bar(xs, counted_hours, color=color)

# To use:
autolabel(rect)

Saving images in matplotlib

plt.savefig('directory/filename.png')

Counting items that match a regex pattern

import re

def hour_finder(regex, lines):
    # Count the lines that match the regex (re.match anchors at the start)
    time_counter = 0
    for l in lines:
        if re.match(regex, l):
            time_counter = time_counter + 1
    return time_counter

# To use:
hour_finder(r'^8:[0-9]{2,}:[0-9]{2,}\sPM', time_csv)
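To see the counting pattern in action, here is a small self-contained sketch. The sample timestamps are made up, the count_matches name is hypothetical, and the pattern assumes each line starts with a time such as 8:15:02 PM:

```python
import re

def count_matches(pattern, lines):
    # re.match only matches at the start of each line
    return sum(1 for line in lines if re.match(pattern, line))

# Made-up sample data standing in for the real CSV column
sample_times = ["8:15:02 PM", "8:47:10 AM", "9:01:33 PM", "8:59:59 PM"]
eight_pm = count_matches(r"^8:[0-9]{2}:[0-9]{2}\sPM", sample_times)
print(eight_pm)
```

Only the two entries in the 8 PM hour are counted; the 8 AM and 9 PM entries are skipped.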

Splitting!

Split on a space ' ' and take the item after the split ([1], because counting starts at 0):

items = [i.split(' ')[1] for i in time_csv]

Reading CSVs, counting, lambda expressions, and plotting with Python

Today I learned a lot in Python

Get current directory

import os
os.getcwd()

Reading CSVs in Python

import csv

with open('/photo_dates_location.csv') as f:
    reader = csv.reader(f, delimiter=',', quotechar='"')
    next(reader)  # skip header
    day_csv = [row[0] for row in reader]

Counting in Python

Use Counter to build parallel lists of the unique items and their counts:

from collections import Counter

days = []
count = []
for (k, v) in Counter(day_csv).items():
    days.append(k)
    count.append(v)

Ordering lists with a lambda expression

According to Eric Davis, a lambda expression is a good way to write quick functions on the fly for organizing things like the Counter lists:

day_number = {
    'Monday': 1,
    'Tuesday': 2,
    'Wednesday': 3,
    'Thursday': 4,
    'Friday': 5,
    'Saturday': 6,
    'Sunday': 7
}
days_sorted = sorted(Counter(day_csv).items(), key=lambda e: day_number[e[0]])
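A quick worked example of that sort, using made-up day names in place of the CSV column. The sample data is hypothetical, and .items() is used so the sketch also runs on Python 3:

```python
from collections import Counter

day_number = {'Monday': 1, 'Tuesday': 2, 'Wednesday': 3, 'Thursday': 4,
              'Friday': 5, 'Saturday': 6, 'Sunday': 7}

# Made-up sample standing in for day_csv
sample_days = ['Sunday', 'Monday', 'Saturday', 'Monday', 'Saturday', 'Saturday']

# Each item is a (day, count) pair; the lambda looks up the day's position
days_sorted = sorted(Counter(sample_days).items(),
                     key=lambda e: day_number[e[0]])
print(days_sorted)
```

The pairs come back in weekday order rather than in the order the days first appeared in the data.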

Plotting with Python

Given that the lists days and count are built above by the Counter(), you can pass them to matplotlib for charting:

from matplotlib import pyplot as plt

######## Bar Chart ########
xs = [i + 0.1 for i, _ in enumerate(days)]
plt.bar(xs, count)
plt.ylabel("Number of photos taken")
plt.title("Photo frequency by day")
plt.xticks([i + 0.5 for i, _ in enumerate(days)], days)
plt.savefig('img/weekdays_bar.png')
plt.clf()

######## Pie chart ########
colors = ['red', 'orange', 'green', 'purple', 'lightcoral', 'lightskyblue', 'yellowgreen']
explode = [0, 0, 0, 0, 0, 0.1, 0]
plt.pie(count, explode=explode, labels=days, colors=colors, autopct='%1.1f%%')
plt.axis('equal')
plt.suptitle("Percent of total photos taken on a given day of the week", fontsize=18)
plt.savefig('img/weekdays_pie.png')

  • If you don’t want to save the images, you could just show them instead with plt.show()
  • plt.clf() clears the figure so you can plot something else on it. Otherwise you’d need to close it before continuing. plt.close() can do that.

Depending on the source CSV, the above creates these two charts: photos by day of week (count) and photos by day of week (percentage).

Pull requests, scraping Reddit, and flexbox quirks

Today I learned:

How to contribute to an open-source project on GitHub

  1. Fork the project.
  2. Make your changes, commit, and push them back up to your fork on GitHub.
  3. Go to the repo on GitHub that you want to propose a change to.
  4. On that page, choose your branch, compare and review, then create the pull request.

Pulling Reddit data with Python

  • Connect to Reddit and grab data with PRAW
  • Store the retrieved information in a MySQL database with PyMySQL

The above two items came together in one learning experience. I helped Seth Millerd debug a Python script he was working on (with considerable help from Eric Davis!). It is the first public repo I’ve contributed to, and the first time I’ve made a pull request. We use git at work, but we use a shared repo model instead of the fork & pull model.


Image scaling quirks with flexbox

All about flexbox (CSS)

Eric Davis and I ran into a strange CSS issue where an image was scaling in a funky way when we resized the browser. The height was staying fixed while the width was changing, but there was nothing in the CSS setting a specific height.

It turns out that one of the parent <div>s had display: flex; flex-direction: column; specified for layout order purposes, and when we turned that off the problem went away. So we went searching, and the quick-and-dirty fix is wrapping the image in a vanilla <div>. That is working for us for now, but I want to read through the W3 docs and see if there is something bigger we are missing or if this is a known bug.