When The Pager Goes Off Part 1

John woke slowly, disturbed by the latent buzz of his phone on vibrate. Funny how that noise was almost a ringtone unto itself.

Rolling over and checking his phone confirmed his suspicions: PagerDuty. Opening the message:

Alpha1: disk space full 95%

He crawled out of bed defiantly, nostalgic for the days when Saturday mornings meant cartoons and breakfast cereal, not investigating something that, with a little proactive effort, a computer could have fixed on its own.

“Ugh, what a dumb thing to happen in this day and age. I had to be woken up for this?” John couldn’t help muttering to himself. “A full disk.” There was no sense in complaining about it though, he decided. It was time to investigate.

John prepared himself for battle with the VPN, performing the two-factor auth ritual several times before finally appeasing it.

Next, he SSH’d into the server. Finally he could actually begin.

His first goal was to figure out what was causing this. Looking in the PagerDuty alert logs, he could find no earlier alerts of this kind since they’d finally enabled monitoring on this sort of thing 8 months ago.

So, what happened today that finally tipped this over instead of any day in the previous 8 months?

“Well, maybe there’s a huge file somewhere,” John thought. He’d seen that before: some gigantic file that just grows and grows and grows like grey goo.

He always found it easier to find the largest directories first and then look in each one.

He knew that it could also be a lot of small files, which is why he liked starting with a directory hunt. Whether it was lots of small files or a few large ones wouldn’t matter: as long as they were in the same place, he could find them.

John ran the command as root to make sure that he was considering all of the directories on the filesystem, not just the ones his unprivileged SSH user could read:

sudo du -a / | sort -nr | head -n 10

John wasn’t great at these sorts of incantations, despite what his coworkers thought. He’d simply used them a lot over the years, so he didn’t need a mnemonic for the esoteric flags.

This one was a bit easier for him. He used sudo all the time and knew he’d have to run as root here to make sure du could check all the files, and the -a flag made du report every file as well as every directory. He piped that into sort, ordering by the numeric size that du returned (-n) and in reverse (-r) so the largest came first. Since he was searching across the whole filesystem, he didn’t want to see everything, just the top 10 at the head of the list.

This gave him the 10 largest directories, from largest down. There wasn’t anything magical about 10; John just liked the round number and figured that for a disk space hunt it was a good place to start.
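For anyone following along at home, here is a minimal sketch of the same hunt wrapped in a reusable function. The name largest_entries and the extra -x and -k flags are assumptions added for illustration, not something John typed that morning:

```shell
#!/bin/sh
# Hypothetical helper: list the N largest entries under a directory.
# Usage: largest_entries [dir] [count]
largest_entries() {
    dir="${1:-.}"      # default to the current directory
    count="${2:-10}"   # default to the top 10
    # -a: include files as well as directories
    # -x: stay on one filesystem (skips /proc, network mounts, and so on)
    # -k: report sizes in 1 KiB blocks so sort -n compares like with like
    # Errors for unreadable paths are discarded; run under sudo to see everything.
    du -axk "$dir" 2>/dev/null | sort -nr | head -n "$count"
}
```

Calling `largest_entries / 10` as root would roughly mirror the command above, and calling it with no arguments covers the current directory.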

Exploring each directory, John then worked to see what the largest files were in these 10 most-wanted directories, running almost exactly the same command again but leaving out the directory so the current one was assumed:

sudo du -a | sort -nr | head -n 10

John started a timer. The clock was ticking for him anyway, so it was better to be aware of how much time had passed.

There was a time when John thought that phrase meant he should feel under pressure. He had since come to understand that he should relax and feel free to explore the environment, just budgeting a reasonable amount of time to do so.

It looked like he was right: the number one offender was a huge application log.

20GB?!? That’s insane!

John didn’t want to just delete it, but he knew that he couldn’t let this thing continue to grow either. If he stopped here, he’d just be dooming some future on-caller to the same fate. Who knows, it might even be him again.

With this in mind, he decided to spend the morning making sure this didn’t happen again.

To solve the problem for now and prevent it in the future, John decided to knock out a quick logrotate file so that the logs would be taken care of on their own once they tripped certain criteria. logrotate lets you specify when logs should be rotated, compressed, or deleted based on things like size or age.

John decided to keep this well organized, creating the file as /etc/logrotate.d/app:

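/var/log/app/*.log {
  # time-based schedule; "size" below is listed last and takes precedence
  daily
  # gzip rotated logs
  compress
  # copy the log, then truncate it in place so the app keeps its open file handle
  copytruncate
  # keep only one rotated copy
  rotate 1
  # rotate only once the log has grown past 100 MB
  size 100M
}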
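A rule like this only takes effect the next time logrotate runs, so it can be worth rehearsing it first. Here is a rough sketch that tries the same directives in a throwaway directory. The sandbox paths, the smaller 100k threshold, and the stand-in log are all assumptions made for the demo, and nothing under /etc or /var is touched:

```shell
#!/bin/sh
# Scratch-directory rehearsal of the rotation rule.
# Every path here is hypothetical and lives in a temp directory.
umask 022                      # logrotate refuses group- or world-writable configs
sandbox=$(mktemp -d)
mkdir "$sandbox/logs"

# A stand-in "application log" of about 200 KiB, above the 100k threshold below.
dd if=/dev/urandom of="$sandbox/logs/app.log" bs=1024 count=200 2>/dev/null

# Same directives as the real rule, with a smaller size so the demo is quick.
cat > "$sandbox/app.conf" <<EOF
$sandbox/logs/*.log {
  daily
  compress
  copytruncate
  rotate 1
  size 100k
}
EOF

if command -v logrotate >/dev/null 2>&1; then
    # -s keeps the state file inside the sandbox instead of the system default.
    logrotate -s "$sandbox/state" "$sandbox/app.conf"
    ls -l "$sandbox/logs"
else
    echo "logrotate not installed; skipping rehearsal"
fi
```

If the rule behaves as intended, app.log should end up empty with a compressed app.log.1.gz beside it. On the server itself, `sudo logrotate -d /etc/logrotate.d/app` does a dry run, and `sudo logrotate -f /etc/logrotate.d/app` forces a rotation right away instead of waiting for the scheduled run. Keep in mind that copytruncate copies the log before truncating it, so a forced run needs room for that copy.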

John decided to take a look at the disk space again, to make sure that everything was OK now:

df -h

It looked like fixing the log had given them the buffer they needed.
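Eyeballing df -h works fine at this hour, but the same check can be scripted. This is a small sketch under stated assumptions: the helper name disk_usage_percent is made up for illustration, and the 95 threshold is simply borrowed from the alert text:

```shell
#!/bin/sh
# Hypothetical helper: print the used-space percentage for the
# filesystem that holds the given path (defaults to /).
disk_usage_percent() {
    # -P forces the one-line-per-filesystem POSIX format,
    # so field 5 of the second line is the capacity column.
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

threshold=95   # assumption: same figure as the alert in the story
used=$(disk_usage_percent /)
if [ "$used" -ge "$threshold" ]; then
    echo "/ is at ${used}%, still at or above ${threshold}%"
else
    echo "/ is at ${used}%, below ${threshold}%"
fi
```

Parsing df -P rather than df -h keeps the column layout predictable, which matters more to a script than to a sleepy human.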

Now that he’d not only resolved the problem but also prevented it from happening again, for this box at least, John made some notes about the incident for when he went off his on-call shift next week.

He then realized that, while he had fixed it for this box, he had a hunch that if they hadn’t set up log rotation for this application, they likely hadn’t for any of their other applications either. John created a ticket in the backlog so that using logrotate for the logs on all machines could be put into their next sprint.

“First day on-call and I’ve already been woken up. May as well get on with my day. I wonder what the rest of the week will bring me,” John said to himself as he wandered off to make coffee.