Navel Gazing on Hacker News


Being the internet-dabbling dilettante that I am, I often gaze longingly at the front page of Hacker News, wondering when my own content will reach it and waste, er, merit the time of others. But alas, my time is precious, and I can't spend all day refreshing the browser to see when I've achieved my goal. Let's set something up to do that for me.

Mission: Write a script that scans the front page of HN for the presence of this blog post. If we detect it, update this blog post, including the time from when it reached the front page. Run the script(s) periodically for a day before accepting inevitable defeat.

How is this going to work?

Nothing fancy here - let's approach each of these steps individually before gluing them together:

  1. Create a go script that scrapes the front page of HN for our post.
  2. Create a bash script that calls the go script (and can tell whether it found our post or failed).
  3. Create a periodically running job that runs the bash script.
  4. Add conditional behaviors to the bash script:
    • If the go script succeeds in finding our post: modify our post markdown (succeeded!), create a git commit, and push to github to trigger a deploy. Clear the job!
    • If the go script fails (and we're out of time): modify our post (failed), create a git commit, and push to github to trigger a deploy. Clear the job!
    • If the go script fails (and we have time left): log some information.
  5. Deploy the blog post, submit it to HN, and start the script!
  6. Keep my computer awake and return in 24 hours.

We can test the little pieces by handing the go script existing or fictional targets to search for on the front page of HN.
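One way to make that kind of testing painless is to let the target title be overridden from the command line instead of editing the source each time. Here's a minimal sketch of the idea; targetTitle and defaultTitle are hypothetical names for illustration, not something the scripts in this post actually use:

```go
package main

import (
	"fmt"
	"os"
)

// defaultTitle is the post we ultimately care about.
const defaultTitle = "Navel Gazing on Hacker News"

// targetTitle returns the first argument if one was given, otherwise the
// default title. (Hypothetical helper, not part of gaze.go.)
func targetTitle(args []string) string {
	if len(args) > 0 && args[0] != "" {
		return args[0]
	}
	return defaultTitle
}

func main() {
	// os.Args[0] is the program name, so user arguments start at index 1.
	fmt.Println("Searching for:", targetTitle(os.Args[1:]))
}
```

Run with no arguments it searches for the real title; pass any other title in quotes to test against something that is (or isn't) currently on the front page.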

Sweet! Nothing to it but to do it. A few things to keep in mind before we start:

High Level Caveats

  1. I'm going to use launchd as my job scheduler because I'm working on a Mac. If you're on Linux, you're going to need to use something else.

  2. We will consider the "front page" to be anything in the top 30, since that's what loads on desktop before paginating. I may end up keeping track of "peak height reached", to salve my bruised ego - TBD.

  3. We're going to run the script every 10 minutes for a day and then give up. Judging from the front page of HN at this moment, it looks like nothing is up there that's more than 24 hours old. If we don't reach the front page by then, we're fine with that.

  4. We're going to write the results from running the script to a log file we can check later.
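For a sense of scale, here's the arithmetic behind those numbers as a small sketch. The constant and function names are made up for illustration; the values come straight from the caveats above:

```go
package main

import (
	"fmt"
	"time"
)

const (
	frontPageSize = 30               // top 30 counts as the "front page"
	checkInterval = 10 * time.Minute // how often the job runs
	giveUpAfter   = 24 * time.Hour   // when we accept defeat
)

// totalChecks is how many times the job fires before the deadline.
func totalChecks() int {
	return int(giveUpAfter / checkInterval)
}

// onFrontPage reports whether a 1-based rank is within the front page.
func onFrontPage(rank int) bool {
	return rank >= 1 && rank <= frontPageSize
}

func main() {
	fmt.Println("checks before giving up:", totalChecks())
	fmt.Println("rank 27 on front page?", onFrontPage(27))
	fmt.Println("rank 31 on front page?", onFrontPage(31))
}
```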

Although there are practical applications of periodically scraping websites and doing things with the data, this is obviously not one of them. Let's get to it.

Scraping Hacker News

Here's what I'm envisioning running the go script will look like:

$ go run gaze.go
- Scanning the front page of HN for our post...
Time: 14:13 EST (2020-11-30T14:13:23-05:00)
- Checking top 30 posts...
Did not find our post!

...

$ go run gaze.go
- Scanning the front page of HN for our post...
Time: 14:13 EST (2020-11-30T14:13:23-05:00)
- Checking top 30 posts...
We were 27/30 (front page!)

So let's get started by using someone's go client for HN. This one seems good enough. We can install it...

$ go get -u github.com/peterhellberg/hn
$ atom gaze.go

...and mostly use the code we need straight from the repo's example usage (with a few tweaks):

package main
// gaze.go v0

import (
  "fmt"
  "os"
  "time"
  "github.com/peterhellberg/hn"
)

func main() {
  hn := hn.DefaultClient
  fmt.Println("- Scanning the front page of HN for our post...")
  now := time.Now()
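  // "MST" is Go's layout placeholder for the zone abbreviation (EST, EDT, ...);
  // a literal "EST" in the layout would be printed verbatim year-round.
  fmt.Printf("Time: %s (%s)\n", now.Format("15:04 MST"),
    now.Format(time.RFC3339))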

  ids, err := hn.TopStories()
  if err != nil {
    panic(err)
  }

  fmt.Println("- Checking top 30 posts...")
  for i, id := range ids[:30] {
    item, err := hn.Item(id)
    if err != nil {
      panic(err)
    }

    if item.Title == "Navel Gazing on Hacker News" {
      rank := i + 1 // i is zero-based; HN ranks start at 1
      fmt.Println("We were", rank, "(front page!)")
      fmt.Println(rank, "–", item.Title, "\n   ", item.URL)
      os.Exit(0)
    }
  }

  fmt.Println("Did not find our post!")
  os.Exit(1)
}

I'm using os.Exit to hand back a status code. We're going to use a non-zero status to indicate that a process errored. Let's run it and see what happens. I haven't posted yet, so we should just fail to find our post.

$ go run gaze.go
- Scanning the front page of HN for our post...
Time: 14:21 EST (2020-11-30T14:21:17-05:00)
- Checking top 30 posts...
Did not find our post!
exit status 1

The trailing "exit status 1" line comes from go run, which reports any non-zero exit code from the program it ran, but we're fine with that. Looks good though - onwards and upwards.
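Since the rest of this post leans on that status code, here's a rough sketch of how a caller can read one from Go using os/exec. It assumes a POSIX sh is on the PATH, and exitCodeOf is a hypothetical helper; the real wrapper in this post is a bash script, not Go:

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// exitCodeOf runs a command and returns its exit status.
// A nil error from Run means the command exited with status 0.
func exitCodeOf(name string, args ...string) (int, error) {
	err := exec.Command(name, args...).Run()
	if err == nil {
		return 0, nil
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return exitErr.ExitCode(), nil
	}
	// The command could not be started at all (e.g. binary not found).
	return -1, err
}

func main() {
	// Stand-ins for "found our post" (0) and "did not find it" (1).
	for _, script := range []string{"exit 0", "exit 1"} {
		code, err := exitCodeOf("sh", "-c", script)
		if err != nil {
			fmt.Println("could not run command:", err)
			continue
		}
		fmt.Printf("%q -> exit status %d\n", script, code)
	}
}
```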

Creating a bash script

We need to make a bash script that calls the go script. Keeping it as simple as possible:

#!/bin/bash
# navel.sh v0

echo "Running navel.sh"
ret=$(go run gaze.go)
echo "$ret"

Let's set it as executable and run it:

$ chmod 755 navel.sh
$ ./navel.sh
Running navel.sh
exit status 1
- Scanning the front page of HN for our post...
Time: 15:05 EST (2020-11-30T15:05:32-05:00)
- Checking top 30 posts...
Did not find our post!

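The exit status line shows up before the rest of the output because go run writes it to stderr, which goes straight to the terminal, while $(...) captures only stdout and we echo that afterwards. That's fine for now.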
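Stdout and stderr are separate streams, which is what makes this kind of reordering possible. A small sketch of that separation, assuming a POSIX sh is available; runCaptured is a hypothetical helper and the shell one-liner merely stands in for go run gaze.go:

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
)

// runCaptured runs a shell snippet and returns what it wrote to stdout
// and to stderr as two separate strings.
func runCaptured(script string) (stdout, stderr string) {
	var out, errBuf bytes.Buffer
	cmd := exec.Command("sh", "-c", script)
	cmd.Stdout = &out
	cmd.Stderr = &errBuf
	// The exit error is ignored on purpose: we only care about the streams.
	_ = cmd.Run()
	return out.String(), errBuf.String()
}

func main() {
	stdout, stderr := runCaptured("echo 'Did not find our post!'; echo 'exit status 1' 1>&2; exit 1")
	fmt.Printf("stdout: %q\n", stdout)
	fmt.Printf("stderr: %q\n", stderr)
}
```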

Spinning up a periodic job

Half the fun of this is getting my computer to check regularly for me instead of pounding cmd + r. I initially looked into using cron and crontab, and saw the following when I ran man crontab:

...(Darwin note: Although cron(8) and crontab(5) are officially supported under Darwin, their functionality has been absorbed into launchd(8), which provides a more flexible way of automatically executing commands. See launchctl(1) for more information.)

I figure that this is a quiet way of saying "might be deprecated soon" so I'm going to use the more contemporary set of tools (launchd) for our script.

How can we create a launchd job? I feel like I've done this in a past life, but I followed this useful walkthrough by Chet Corcos to refresh my memory. For reference:

(launchd is) a unified, open-source service management framework for starting, stopping and managing daemons, applications, processes, and scripts.

...and launchctl is the command line tool we can use to start and stop those processes. We're basically going to create an XML file describing when to run our job and how frequently, and add it to the set of jobs that my computer is already running periodically using launchctl load ~/Library/LaunchAgents/foo. Unsurprisingly, we can unload the script with launchctl unload ..., referencing the same filepath.

Here's what my basic job looks like (~/Library/LaunchAgents/com.gc.navel.plist):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
  <dict>
    <key>Label</key>
    <string>com.gc.navel.plist</string>

    <key>RunAtLoad</key>
    <true/>

    <key>StartInterval</key>
    <integer>600</integer>

    <key>StandardErrorPath</key>
    <string>/Users/scott/home/navel/err.log</string>

    <key>StandardOutPath</key>
    <string>/Users/scott/home/navel/out.log</string>

    <key>EnvironmentVariables</key>
    <dict>
      <key>PATH</key>
      <string><![CDATA[/usr/local/bin:/usr/local/sbin:/usr/bin:/bin:/usr/sbin:/sbin]]></string>
    </dict>

    <key>WorkingDirectory</key>
    <string>/Users/scott/home/guycombinator/</string>

    <key>ProgramArguments</key>
    <array>
      <string>/Users/scott/home/navel/navel.sh</string>
    </array>
  </dict>
</plist>

Simple enough - the XML file is a set of key-value pairs that define a script to run (ProgramArguments), a frequency to run at (StartInterval, measured in seconds), and some output paths and environment information (WorkingDirectory, EnvironmentVariables, etc.). Note that right now we've configured the job to run every 10 minutes (600 seconds), which we may want to tweak for debugging later. Also notice that the working directory doesn't match the output paths I've referenced, and that I've moved the script to a separate directory too. I don't really want this code living in the repo for my blog, even though that's where we want to run the script (to possibly modify the markdown post and push with git).

$ cd ~/home
$ mkdir navel
$ tree -L 1 | grep -E "(navel|guycom)"
├── guycombinator
├── navel
$ mv guycombinator/navel.sh navel/

Nice. We've enabled RunAtLoad, so the script will run immediately. Let's see that everything still "works" by loading the job and checking out.log and err.log:

$ launchctl load ~/Library/LaunchAgents/com.gc.navel.plist
$ cd ~/home/navel
$ cat out.log
Running navel.sh

$ cat err.log
/Users/scott/home/navel/navel.sh: line 6: go: command not found
/Users/scott/home/navel/navel.sh: line 6: go: command not found

Aha - it doesn't look like the job knows how to run go commands. We need to make sure that the PATH value under EnvironmentVariables includes the directory containing our go binary. Easy enough to fix:

$ which go
/usr/local/go/bin/go

So let's add the directory where go is located (/usr/local/go/bin) to the end of our set of directories to look for executables (PATH) in the plist file:

...
<key>EnvironmentVariables</key>
<dict>
  <key>PATH</key>
  <string><![CDATA[/usr/local/bin:/usr/local/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/go/bin]]></string>
</dict>

Once more, with feeling:

$ launchctl unload ~/Library/LaunchAgents/com.gc.navel.plist
$ rm err.log out.log
$ launchctl load ~/Library/LaunchAgents/com.gc.navel.plist
$ sleep 30; cat out.log
Running navel.sh
- Scanning the front page of HN for our post...
Time: 12:27 EST (2020-12-01T12:27:04-05:00)
- Checking top 30 posts...
Did not find our post!
exit status 1

It takes a few seconds on my machine to fetch the results, so we wait a bit with sleep before we check the log. But it worked! So let's add some slightly more complex behaviors based on the result of the script running.

Conditionalizing the bash script

Let's make sure it can do different things based on the execution of our script. To get the exit code of the last command that ran in a bash script, we can use the special variable $?.

#!/bin/bash
# navel.sh v1

echo "Running navel.sh"

ret=$(go run gaze.go)

EXIT_CODE=$?

if [ ${EXIT_CODE} -ne 0 ]; then
  echo -e "Last command failed:\n$ret"
else
  echo -e "Last command succeeded:\n$ret"
fi

Let's try searching for our nonexistent post in the top 30 to check a failure:

$ launchctl unload ~/Library/LaunchAgents/com.gc.navel.plist
$ rm out.log err.log
$ launchctl load ~/Library/LaunchAgents/com.gc.navel.plist
$ sleep 30; cat out.log
Running navel.sh
Last command failed:
- Scanning the front page of HN for our post...
Time: 14:58 EST (2020-12-02T14:58:37-05:00)
- Checking top 30 posts...
Did not find our post!

And let's do it once more for something that does exist currently on the front page to check that the success case works too. I won't show it here, but I'm quickly changing gaze.go to reference something valid - "The startup failure of Oomba" was the spiciest title we could find. Quickly cleaning up and reloading the process:

$ launchctl unload ~/Library/LaunchAgents/com.gc.navel.plist
$ rm err.log out.log
$ launchctl load ~/Library/LaunchAgents/com.gc.navel.plist
$ sleep 30; cat out.log
Running navel.sh
Last command succeeded:
- Scanning the front page of HN for our post...
Time: 15:08 EST (2020-12-02T15:08:39-05:00)
- Checking top 30 posts...
We were 16 (front page!)

Nice. Let's quickly figure out how to clean up the job, since we'll want to do that when we succeed or run out of time.

Cancelling the job

Let's quickly check that a job can "unload" itself by spinning up a job that is scheduled to run every 30 seconds but also unloads itself immediately:

#!/bin/bash
# unload.sh v0

now=$(date +"%T")
echo "Running unload at: $now"
launchctl unload ~/Library/LaunchAgents/com.gc.unload.plist

It's quite similar to the previous job file, but I'll include it in full here (at ~/Library/LaunchAgents/com.gc.unload.plist):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
  <dict>
    <key>Label</key>
    <string>com.gc.unload.plist</string>

    <key>RunAtLoad</key>
    <true/>

    <key>StartInterval</key>
    <integer>30</integer>

    <key>StandardErrorPath</key>
    <string>/Users/scott/home/navel/err.log</string>

    <key>StandardOutPath</key>
    <string>/Users/scott/home/navel/out.log</string>

    <key>EnvironmentVariables</key>
    <dict>
      <key>PATH</key>
      <string><![CDATA[/usr/local/bin:/usr/local/sbin:/usr/bin:/bin:/usr/sbin:/sbin]]></string>
    </dict>

    <key>WorkingDirectory</key>
    <string>/Users/scott/home/navel/</string>

    <key>ProgramArguments</key>
    <array>
      <string>/Users/scott/home/navel/unload.sh</string>
    </array>
  </dict>
</plist>

Firing it up quickly:

$ cd ~/home/navel
$ launchctl load ~/Library/LaunchAgents/com.gc.unload.plist
$ sleep 40; cat out.log
Running unload at: 12:00:52

Sure enough, one line of output gets written to the file instead of the two we'd expect with a 30 second start interval. Easy enough to clean up after ourselves.

Setting up a timeout

Let's generate a timeout in the future so we can know when to stop checking HN. Using date:

$ date
Fri Dec  4 14:58:18 EST 2020
$ date +%s
1607111905
$ date -v +1d +%s
1607198311
$ date -v +1M +%s
1607111978

The documentation is a little cryptic - it took me some trial and error to figure out how to represent "now" and "one day in the future" as comparable values.

  • +%s formats the output as the number of seconds since the Unix epoch
  • The -v flag adjusts the current time by a relative amount before printing it
  • You can add or subtract using a signed number followed by one of the characters ymwdHMS, interpreted as year, month, week, day, Hour, Minute, and Second
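For comparison, here's roughly the same epoch arithmetic in Go. This is a sketch with hypothetical helper names, and it treats "one day" as exactly 24 hours, which can differ from a calendar-day adjustment around daylight saving changes:

```go
package main

import (
	"fmt"
	"time"
)

// oneDayLater returns the epoch timestamp exactly 24 hours after epoch.
func oneDayLater(epoch int64) int64 {
	return time.Unix(epoch, 0).Add(24 * time.Hour).Unix()
}

// oneMinuteLater returns the epoch timestamp 60 seconds after epoch.
func oneMinuteLater(epoch int64) int64 {
	return time.Unix(epoch, 0).Add(time.Minute).Unix()
}

func main() {
	now := time.Now().Unix() // comparable to `date +%s`
	fmt.Println("now:           ", now)
	fmt.Println("one day out:   ", oneDayLater(now))
	fmt.Println("one minute out:", oneMinuteLater(now))
}
```

Because all three values are plain integers, comparing "now" against a deadline is a single greater-than check.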

So hardcoding the time we want to stop running unload.sh looks like this:

#!/bin/bash
# unload.sh v1

nowSeconds=$(date +%s)
timeout=1607111978
# ^ copied from above

if [ $nowSeconds -gt $timeout ]; then
  echo "Timed out! Unloading..."
  launchctl unload ~/Library/LaunchAgents/com.gc.unload.plist
else
  now=$(date +"%T")
  echo "$now -> running unload"
fi

Setting the timeout for 1 minute in the future and then waiting does exactly what I want - the script runs until the timeout is reached and then shuts itself down.

$ launchctl load ~/Library/LaunchAgents/com.gc.unload.plist
$ sleep 60; cat out.log
14:58:03 -> running unload
14:58:33 -> running unload
Timed out! Unloading...

Feel free to experiment to develop an intuition for it. We're going to copy the pattern of hardcoding the timeout and comparing it to current time in navel.sh once we've set up modifying this post and deploying it with git.
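Combining the timeout with the search result gives three possible outcomes per run. Here's a sketch of that decision as a pure function; nextAction and the action names are hypothetical, and checking the timeout before looking at the search result is a choice made for this sketch:

```go
package main

import "fmt"

// Possible outcomes of a single run.
const (
	actionGiveUp    = "give up"    // past the deadline: record failure, unload the job
	actionCelebrate = "celebrate"  // found on the front page: record success, unload the job
	actionKeepGoing = "keep going" // nothing yet: log and wait for the next run
)

// nextAction decides what a run should do. Times are seconds since the epoch.
func nextAction(now, timeout int64, found bool) string {
	if now > timeout {
		return actionGiveUp
	}
	if found {
		return actionCelebrate
	}
	return actionKeepGoing
}

func main() {
	const timeout = 1607111978 // sample deadline in epoch seconds
	fmt.Println(nextAction(1607111905, timeout, false)) // before the deadline, not found
	fmt.Println(nextAction(1607111905, timeout, true))  // before the deadline, found
	fmt.Println(nextAction(1607112000, timeout, false)) // after the deadline
}
```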

Modifying the markdown post

We have two possible reasons to modify the markdown post - either we hit the front page, or our job timed out. So let's use sed to update this page with our results. A tiny refresher:

$ echo "Bell peppers and beef" > foo.txt
$ sed -i '.txt' 's/beef/no beef/' foo.txt
$ cat foo.txt
Bell peppers and no beef

The argument for -i specifies the extension appended to the filename of a backup copy, in case anything goes wrong (here the backup would be foo.txt.txt). Note that we could pass an empty string here to indicate we do not care to save a backup (sed -i '' 's/...'). We'll end up using something like the following to write our status message:

function replaceResultsWithStatus {
  needle="results"
  sed -i '' "s/%$needle%/$1/" foo/path/post.md
}

I can't actually include the literal string I'm intending to replace in this post, else it will get replaced here too. String interpolation to the rescue!
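The same trick carries over to other languages. Here's a rough Go analogue, as a sketch only: replacePlaceholder is a hypothetical helper, and unlike sed's s command, which replaces the first match on every line, this replaces just the first match in the whole text.

```go
package main

import (
	"fmt"
	"strings"
)

// replacePlaceholder swaps the first %needle% marker in content for status.
// Building the marker at runtime keeps the literal marker out of the source.
func replacePlaceholder(content, needle, status string) string {
	marker := "%" + needle + "%"
	return strings.Replace(content, marker, status, 1)
}

func main() {
	// The sample text is assembled in pieces for the same reason.
	post := "What happened?\n\n%" + "results" + "%\n"
	fmt.Print(replacePlaceholder(post, "results", "Did not reach the front page :("))
}
```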

Fun with git

After we update the post in place, the only thing left to do is create a commit and push our update to deploy. Since there are different cases where we end up doing that, let's write a small function to do that work:

function commitToGithub {
  git add .
  git commit -m "$1"
  sha=$(git rev-parse HEAD | cut -c1-7)
  echo "- Creating new commit with changes... ($sha)"
  echo "- Pushing branch 'main' to Github..."
  git push origin main
  echo "- Pushed to Github!"
}

Remember that our working directory is the root directory of the blog, so git should be able to add files, create commits, and push as usual. I'm using git rev-parse and cut to generate a readable SHA for the logs, but apart from that, everything should look pretty familiar. So what does our script end up looking like?
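The rev-parse and cut pipeline just keeps the first seven characters of the full hash. A tiny sketch of that truncation in Go, with a made-up hash and a hypothetical shortSHA helper:

```go
package main

import "fmt"

// shortSHA returns the first n characters of a commit hash,
// or the whole hash if it is shorter than n.
func shortSHA(sha string, n int) string {
	if len(sha) <= n {
		return sha
	}
	return sha[:n]
}

func main() {
	// Not a real commit; just 40 hex characters for illustration.
	full := "0123456789abcdef0123456789abcdef01234567"
	fmt.Println(shortSHA(full, 7))
}
```

If you'd rather skip cut entirely, git rev-parse --short HEAD prints an abbreviated hash directly.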

Final versions of scripts

Here's the final version of navel.sh:

#!/bin/bash
# navel.sh v2
echo "- navel.sh running"

nowSeconds=$(date +%s)
now=$(date +"%T")
timeout=1607278640
# ^ generated via `date -v +1d +%s`

function commitToGithub {
  git add .
  git commit -m "$1"
  sha=$(git rev-parse HEAD | cut -c1-7)
  echo "- Creating new commit with changes... ($sha)"
  echo "- Pushing branch 'main' to Github..."
  git push origin main
  echo "- Pushed to Github!"
}

function replaceResultsWithStatus {
  echo "- Updating post in place..."
  needle="results"
  sed -i '' "s/%$needle%/$1/" src/blog-posts/navel-gazing-on-hacker-news.md
}

if [ $nowSeconds -gt $timeout ]; then
  echo "Timed out!"
  replaceResultsWithStatus "Did not reach the front page :("
  commitToGithub "Timed out!"
  launchctl unload ~/Library/LaunchAgents/com.gc.navel.plist
else
  echo "- gaze.go running"

  ret=$(go run gaze.go)

  EXIT_CODE=$?

  echo -e "$ret"

  # nothing to do if we don't see it
  if [ ${EXIT_CODE} -eq 0 ]; then
    replaceResultsWithStatus "Made it to the front page at $now!"
    commitToGithub "Hit the front page!"

    echo '- running pageres...'
    # "take a picture, it'll last longer"
    pageres https://news.ycombinator.com 1600x900

    echo '- cleaning up with launchctl...'
    launchctl unload ~/Library/LaunchAgents/com.gc.navel.plist
    echo '- done!'
  fi
fi

The only small difference of note here is that I'm also using a utility called pageres to take a screenshot if we hit the front page. Let the record show I also updated my job's PATH to include pageres; I just haven't shown that here. If we do hit the front page and I can figure out how to serve static images in Gatsby, I'll upload the screenshot.

The final version of gaze.go:

package main
// gaze.go v1

import (
  "fmt"
  "os"
  "time"
  "github.com/peterhellberg/hn"
)

func main() {
  hn := hn.DefaultClient
  fmt.Println("- Scanning the front page of HN for our post...")
  now := time.Now()
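  // "MST" is Go's layout placeholder for the zone abbreviation (EST, EDT, ...);
  // a literal "EST" in the layout would be printed verbatim year-round.
  fmt.Printf("Time: %s (%s)\n", now.Format("15:04 MST"),
    now.Format(time.RFC3339))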

  ids, err := hn.TopStories()
  if err != nil {
    panic(err)
  }

  fmt.Println("- Checking top 100 posts...")
  for i, id := range ids[:100] {
    item, err := hn.Item(id)
    if err != nil {
      panic(err)
    }

    if item.Title == "Navel Gazing on Hacker News" {
      rank := i + 1 // i is zero-based; HN ranks start at 1
      fmt.Println(rank, "–", item.Title, "-", item.Score)

      if rank <= 30 {
        fmt.Println("We were", rank, "(front page!)")
        os.Exit(0)
      }
    }
  }

  fmt.Println("Did not find our post!")
  os.Exit(1)
}

The two main adjustments I've made to this script are:

  • I'm scanning the top 100 instead of the top 30, since the API client allows for that and I'm marginally interested in the post trajectory (if it has any 😬)
  • I'm logging what rank it is and the "score" each time I scan

This should make for slightly more useful log data after the fact if I want to review things down the line.
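Here's a sketch of that rank logic pulled out into a pure function so it can be checked without hitting the network. The names findRank and frontPageSize are hypothetical and the other titles are placeholders; ranks are 1-based to match what HN displays:

```go
package main

import "fmt"

const frontPageSize = 30

// findRank returns the 1-based position of target within the first limit
// titles, or 0 if it is not there. It never reads past the end of titles.
func findRank(titles []string, target string, limit int) int {
	if limit > len(titles) {
		limit = len(titles)
	}
	for i, title := range titles[:limit] {
		if title == target {
			return i + 1
		}
	}
	return 0
}

func main() {
	titles := []string{"Some other post", "Yet another post", "Navel Gazing on Hacker News"}
	rank := findRank(titles, "Navel Gazing on Hacker News", 100)
	switch {
	case rank == 0:
		fmt.Println("Did not find our post!")
	case rank <= frontPageSize:
		fmt.Println("We were", rank, "(front page!)")
	default:
		fmt.Println("We were", rank, "(not on the front page yet)")
	}
}
```

Clamping limit to the slice length avoids a panic if the API ever returns fewer stories than we ask to scan.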

Keeping my computer awake

I think I've configured all the normal "display shutoff" settings to keep my laptop awake, but just to be sure, I'm also going to use an app called Amphetamine. That's it! See you in 24 hours. 😊

Results

What happened?

Did not reach the front page :(

Post mortem

I'm mostly unsurprised at the result for a few reasons:

  • That was my account's first post on HN (hadn't gained any karma up until that point)
  • The title isn't that clickbait-y
  • I submitted this on a Saturday a little after 1PM EST, which might be a competitive time to post

It does look like I got a handful of upvotes, but probably not anywhere near enough to crack the front page. But without knowing the actual volume of posts per day, it's hard to say how reasonable a goal that was. Are 10% of posts destined for the front page? 1%? Does it vary by day posted? Time posted? Possible explorations for a future post! 💫

A reasonable question

Q. Why did you bother using go at all? Why not just use curl and some JSON parsing?

A. I read this post and was considering doing the entire thing in go. It looks like Keybase has written a package to interact with launchd that I could have used. My bash scripting is rusty enough that doing everything in go would have gone more smoothly. But I decided against that - it's valuable to have done this using the more widely available set of tools.
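For what it's worth, the "curl and some JSON parsing" route would lean on the public Hacker News API, which serves the top story IDs as a JSON array and each item as a JSON object. Here's a minimal sketch of just the parsing half in Go, using hardcoded sample payloads rather than live requests; the helper names and sample values are made up:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// story holds the item fields this post cares about.
type story struct {
	ID    int    `json:"id"`
	Title string `json:"title"`
	URL   string `json:"url"`
	Score int    `json:"score"`
}

// parseTopStories decodes the body of /v0/topstories.json (an array of IDs).
func parseTopStories(body []byte) ([]int, error) {
	var ids []int
	err := json.Unmarshal(body, &ids)
	return ids, err
}

// parseStory decodes the body of /v0/item/<id>.json.
func parseStory(body []byte) (story, error) {
	var s story
	err := json.Unmarshal(body, &s)
	return s, err
}

func main() {
	// Sample payloads standing in for real API responses.
	ids, err := parseTopStories([]byte(`[101, 102, 103]`))
	if err != nil {
		panic(err)
	}
	s, err := parseStory([]byte(`{"id": 101, "title": "Navel Gazing on Hacker News", "score": 3}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(len(ids), "ids; first story:", s.Title, "-", s.Score)
}
```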