
296: Speedy Performance with Nate Berkopec
Nate Berkopec is the author of the Complete Guide to Rails Performance, the creator of the Rails Performance Workshop, and the co-maintainer of Puma. He talks with Steph about being known as "The Rails Speed Guy," and how he ended up with that title, publishing content, working on workshops, and also contributing to open source projects. (You could say he's kind of a busy guy!)
Show Notes
- Speedshop
- Puma
- The Rails Performance Workshop
- The Complete Guide to Rails Performance
- How To Use Turbolinks to Make Fast Rails Apps
- Sidekiq
- Follow Nate Berkopec on Twitter
- Visit Nate's Website
- Sign up for Nate's Speedshop Ruby Performance Newsletter
Transcript:
STEPH: All right. I'll kick us off with our fancy intro. Hello and welcome to another episode of The Bike Shed, a weekly podcast from your friends at thoughtbot about developing great software. I’m Steph Viccari. And this week, Chris is taking a break. But while he's away, I'm joined by Nate Berkopec, who is the owner of Speedshop, a Ruby on Rails performance consultancy. And, Nate, in addition to running a consultancy, you're the co-maintainer of Puma. You're also an author as you wrote a book called The Complete Guide to Rails Performance. And you run the workshop called The Rails Performance Workshop. So, Nate, I'm sensing a theme here.
NATE: Yeah, make code go fast.
STEPH: And you've been doing that for quite a while, haven't you?
NATE: Yeah. It's pretty much been since 2015, or so I think. It all started when I actually wrote a blog post about Turbolinks that got a lot of pick up. My hot take at the time was that Turbolinks is actually a good thing. That take has since become uncontroversial, but it was quite controversial in 2015. So I got a lot of pick up on that, and I realized I liked working on performance, and people seem to want to hear about it. So I've been in that groove ever since.
STEPH: When you started down the path of really focusing on performance, were you running your own consultancy at that point, or were you working for someone else?
NATE: I would say it didn't really kick off until I actually published The Complete Guide to Rails Performance. So after that came out, which was, I think, March of 2016…I hope I’m getting that right. It wasn't until after that point when it was like, oh, I'm the Rails performance guy now. And I started getting emails inbound about that. I didn't really have any time when I was actually working on the CGRP to do that sort of thing. I just made that my full-time job to actually write, and market, and publish that. So it wasn't until after that that I was like, oh, I'm a performance consultant now. This is the lane I've driven myself into. I don't think I really had that as a strategy when I was writing the book. I wasn't like, okay, this is what I'm going to do. I'm going to build some reputation around this, and then that'll help me be a better consultant with this. But that's what ended up happening.
STEPH: I see. So it sounds like it really started more as a passion and something that you wanted to share. And it has manifested to this point where you are the speed guy.
NATE: Yeah, I think you could say that. I think when I started writing about it, I just knew...I liked it. I liked the work of performance. In a lot of ways, performance is a much more concrete discipline than a lot of other sub-disciplines of programming where I joke my job is number go down. It's very measurable, and it's very clear when you've made a difference. You can say, “Hey, this number was this, and now it's this. Look what I did.” And I always loved that concreteness of performance work. It makes it actually a lot more like a real kind of engineering discipline where I think of performance engineering as clarifying requirements and the limitations and then building a project that meets the requirements while staying within those limitations and constraints. And that's often not quite as clear for other disciplines like general feature work. It's kind of hard to say sometimes, like, did you actually make the user's life better by implementing such and such? That's more of a guess. That's more of a less clear relationship. And with performance, nobody's going to wake up ten years from today and wish that their app was slower. So we can argue about the relative importance of performance in an application, but we don't really argue about whether or not we made it faster because we can prove that.
STEPH: Yeah. That's one area that working with different teams (as I tend to shift the clients that I'm working with every six months) where we often push hard around feature work to say, “How can we measure this? How can we know that we are delivering something valuable to users?” But as you said, that's really tricky. It's hard to evaluate. And then also, when you add on the fact that if I am leaving that project in six months, then I don't have the same insights to understand how something went for that team. So I can certainly appreciate the satisfaction that comes from knowing that, yes, you are delivering a faster app. And it's very measurable, given the time that you're there, whether it's a short time or if it's a long time that you're with that team.
NATE: Yeah, totally. My consulting engagements are often really short. I don't really do a lot of super long-term stuff, and that's usually fine because I can point to stuff and say, “Yep. This thing was at A, and now it's at B. And that's what you hired me to do, so now it's done.”
STEPH: I am curious; given that you have so many different facets where you are running your consultancy, you are also often publishing a lot of content and working on workshops and then also contributing to open source projects. What does a typical week look like for you?
NATE: Well, right now is actually a decent example. I have client work two or three days a week. And I'm actually working on a new product right now that I'm calling Sidekiq in Practice, which is a course/workshop about scaling Sidekiq from zero to 1000 jobs per second. And I'll spend the other days of the week working on that. My content is...I always struggle with how much time to spend on blogging specifically because it takes so much time for me to come up with a post and publish that. But the newsletter that I write, which I try to write once a week, I haven't been doing so well with it lately. But I think I got 50 newsletters done in 2020 or something like that.
STEPH: Wow.
NATE: And so I do okay on the per-week basis. And it's all content I've never published anywhere else. So that actually is like 45 minutes of me sitting down on a Monday and being like rant, [chuckles] slam keyboard and rant and then hit send. And my open source work is mostly 15 minutes a day while I'm drinking morning coffee kind of stuff. So I try to spread myself around and do a lot of different stuff. And a lot of that means, I think, pulling back in terms of thinking how much you need to spend on something, especially with newsletters, email newsletters, it was very easy to overthink that and spend a lot of time revising and whatever. But some newsletter is better than no newsletter. And especially when it comes to content and marketing, I've learned that frequency and regularity is more important than each and every post is the greatest thing that's ever come out since sliced bread. So trying to build a discipline and a practice around doing that regularly is more important for me.
STEPH: I like that, some newsletter is better than no newsletter. I was listening to your chat with Brittany Martin on the Ruby on Rails podcast. And you said something very honest that I appreciated where you said, “Writing is really hard, and writing sucks.” And that made me laugh in the moment because even though I do enjoy writing, I still find it very hard to be disciplined, to sit down and make it happen. And then you go into that editor mode where you critique everything, and then you never really get it published because you are constantly fixing it. It sounds like...you've mentioned you set aside about 45 minutes on a Monday, and you crank out some work. How do you work through that inner critic? How do you get past it to the point where then you just publish?
NATE: You have to separate the steps. You have to not do editing and first drafting at the same time. And the reason why I say it sucks and it's hard is because I think a lot of people don't do a lot of regular writing, maybe get intimidated when they try to start. And they're like, “Wow, this is really hard. This is not fun.” And I'm just trying to say that's everybody's experience and if it doesn't get any better, because it doesn't, [chuckles] there's nothing wrong with you, that's just writing, it's hard. For me, especially with the newsletter, I just have to give myself permission not to edit and to just hit send when I'm done. I try to do some spell checking, and that's it. I just let it go. I'm not going back and reading it through again and making sure that I was very clear and cogent in all my points and that there's a really good flow through that newsletter. I think it comes with a little bit of confidence in your own ideas and your own experience and knowledge, believing that that's worth sharing and that's worth somebody's time, even if it's not a perfect expression of what's in your head. Like, a 75% expression is good enough, especially in a newsletter format where it's like 500 to 700 words. And it's something that comes once a week. And maybe not everyone's amazing, but some of them are, enough of them are that people stay subscribed. So I think a combination of separating editing and first drafting and just having enough confidence and the basis where you have to say, “It doesn't have to be perfect every single time.”
STEPH: Yeah, I think that's something that I learned a while back to apply to my coding process where I had to separate those two steps of where I have to let the creator in me just create and write some code and make it work, and then come back to the editing process, and taking a similar approach with writing. As you may be familiar with thoughtbot, we're big advocates when it comes to sharing content and sharing things that we have learned throughout the week and different projects that we're working on. And often when people join thoughtbot, they're very excited to contribute to the blog. But it is daunting for that first post because you think it has to be this really grand novel. And it has to be something that is really going to appeal to everybody, and it's going to help everyone. And then over time, you learn it's like, oh well, actually it can be this very just small thing that I learned that maybe only helps 20 people, but it still helped those 20 people. And learning to publish more frequently versus going for those grand pieces is more favorable and often more helpful for people.
NATE: Yeah, totally. That's something that is difficult for people at first. But everything in my experience has led me to believe that frequency and regularity is just as, if not more important than the quality of any individual piece of content that I put out. So that's not to say that...I guess it's weird advice to give because people will take it too far the other way and think that means he's saying quality doesn't matter. No, of course, it does, but I think just everyone's internal biases are just way too tuned towards this thing must be perfect. I've also learned we're just really bad judges internally of what is useful and good for people. Stuff that I think is amazing and really interesting sometimes I'll put that out, and nobody cares. [chuckles] And the other stuff I put out that's just like the 45-minute banging out newsletter, people email me back and say, “This is the most helpful thing anyone’s ever read.” So that quality bias also assumes that you know what is good and actually we're not really good at that, knowing every time what our audience needs is actually really difficult.
STEPH: That's totally fair. And I have definitely run into that too, where I have something that I'm very proud of and excited to share, and I realize it relates to a very small group of people. But then there's something small that I do every day, and then I just happen to tweet about it or talk about it, and suddenly that's the thing that everybody's really excited about. So yeah, you never know. So share it all.
NATE: Yeah. And it's important to listen. I pay attention to what people get interested in from what I put out, and I will do more of that in the future.
STEPH: You mentioned earlier that you are working on another workshop focused on Sidekiq. What can you tell me about that?
NATE: So it's meant to be a guide to scaling Sidekiq from zero to 1000 requests per second. And it's meant to be a missing guide to all the things that happen, like the situations that can crop up operationally when you're working on an application that does a lot of work with Sidekiq. Whereas Mike's Sidekiq wiki or the docs are great about how do you do this? What does this setting mean? And the basics of getting it just running, Sidekiq in Practice is meant to be the last half of that. How do you get it to run 1,000 jobs per second in a day-to-day application? So it's the collected wisdom and collected battle scars from five years of getting called in to fix people's Sidekiq installations and very much a product of what are the actual problems that people experience, and how do you fix and deal with those? So stuff about memory and managing Sidekiq memory usage, how to think about queues. Like, what should your queue structure be? How many should you have? Like, how do you organize jobs into queues, and how do you deal with problems like some client is dropping 10,000, 20,000 jobs into a queue. And now the other jobs I put in that queue have 20,000 jobs in front of them. And now this other job I've got will take three hours to get through that queue. How do you deal with problems like that? All the stuff that people have come to me with over the years and that I've had to help them fix.
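The backlog problem Nate describes can be put into rough numbers. The sketch below is not from the workshop; it is a back-of-the-envelope model in plain Ruby, and the method name, the 5.4-second average job time, and the concurrency of 10 are all assumptions picked so that the shared-queue case works out to about the three hours mentioned above.

```ruby
# Rough model of how long a newly enqueued job waits behind a backlog.
# Every number below is an illustrative assumption, not a measurement.
def estimated_wait_seconds(jobs_ahead:, avg_job_seconds:, concurrency:)
  raise ArgumentError, "concurrency must be positive" unless concurrency.positive?

  # The backlog is drained by `concurrency` threads working in parallel.
  (jobs_ahead * avg_job_seconds) / concurrency.to_f
end

# One shared queue: a bulk import has put 20,000 jobs ahead of ours.
shared = estimated_wait_seconds(jobs_ahead: 20_000, avg_job_seconds: 5.4, concurrency: 10)

# A separate queue for latency-sensitive work with nothing ahead of our job
# (this assumes some worker threads are reserved for that queue).
isolated = estimated_wait_seconds(jobs_ahead: 0, avg_job_seconds: 5.4, concurrency: 10)

puts format("shared queue:   %.1f hours of waiting", shared / 3600.0)
puts format("separate queue: %.1f hours of waiting", isolated / 3600.0)
```

The model ignores retries, uneven job durations, and scheduling details, so treat it as a way to reason about queue structure rather than a prediction.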
STEPH: That sounds really great. Because yeah, I find that teams who are often in this space with Sidekiq we just let it run until there's a fire. And then suddenly, we start to care as to how it's processing, and we care about our queue structure and how many workers that we have that are pulling from that queue. So that sounds really helpful. When you're building a workshop, do you often go back to any of those customers and pull more ideas from them, or do you find that you just have enough examples from your collective work with clients that that itself creates a course?
NATE: Usually, pretty much every chapter in the workshop I've probably implemented like three-plus times, so I don't really have to go back to any individual customer. I have had some interesting stuff with my current client, Gusto. And Gusto is going through some background job reorganization right now and actually started to implement a lot of the things that I'm advocating in the workshop actually without talking to me. It was a good validation of hey, we all actually think the same here. And a lot of the solutions that they were implementing were things that I was ready to put down into those workshops. So I'd like to see those solutions implemented and succeed. So I think a lot of the stuff in here has been pretty battle-tested.
STEPH: For the Rails Performance Workshop, you started off doing those live and in-person with teams, and then you have since switched to now it is a CLI course, correct?
NATE: That's correct. Yep.
STEPH: I love that very much. When you’ve talked about it, it does feel very appropriate in terms of developers and how we like to consume content and learn. So that is really novel and also, it seems like a really nice win for you. So then other people can take this course, but you are no longer the individual that has to deliver it to their team, that they can independently take the course and go through it on their own. Are you thinking about doing the same thing for the Sidekiq course, or what are your plans for that one?
NATE: Yeah, it's the exact same structure. So it's going to be delivered via the command line. Although I would say Sidekiq in Practice has more text components. So it's going to be a combination of a very short manual or book, and some video, and some hands-on exercises. So, an equal blend between all three of those components. And it's a lot of stuff that I've learned over having to teach, I guess, intermediate to advanced programming concepts for the last five years now, that people learn at different paces. And one of the great things about this kind of format is you can pick it up, drop it off, and move at your own speed. Whereas a lot of times when I would do this in person, I think I would lose people halfway through because they would get stuck on something that I couldn't go back to because we only had four hours of the day. And if you deliver it in a class format, you're one person, and I've got 24 other people in this room. So it's infinitely pausable and replayable, and you can go back, or you can just skip ahead. If you've got a particular problem and you're like, hey, I just want to figure out how to fix such and such; you can do that. You can just come in and do a particular thing and then leave, and that's fine. So it's a good format that way. And I've definitely learned a lot from switching to pre-recorded and pre-prepared stuff rather than trying to do this all live in person.
STEPH: That is one of the lessons that I've learned as well from the couple of workshops that I've led is that doing them in person, there's a lot of energy. And I really enjoy that part where I get to see people respond to the content. And then I get a lot of great feedback from people about what type of questions they have, where they are getting stuck. And that part is so important to me that I always love doing them live first. But then you get to the point, as you'd mentioned, where if you have a room full of 20 people and you have two people that are stuck, how do you help them but then still keep the class going forward? And then, if you are trying to tailor this content for a wide audience…so maybe beginners could take the Rails Performance Workshop, or they could also take the Sidekiq course. But you also want the more senior engineers to get something out of it as well. It's a very challenging task to make that content scale for everyone.
NATE: Yeah. What you said there about getting feedback and learning was definitely something that I got out of doing the Rails Performance Workshop in person like three dozen times, was the ability to look over people's shoulders and see where they got stuck. Because people won't email me and say, “Hey, this thing is really confusing.” Or “It doesn't work the way you said it does for me.” But when I'm in the same room with them, I can look over their shoulder and be like, “Hey, you're stuck here.” People will not ask questions. And you can get past that in an in-person environment. Or there are even certain questions people will ask in person, but they won't take the time to sit down and email me about. So I definitely don't regret doing it in person for so long because I think I learned a lot about how to teach the material and what was important and how people...what were the problems that people would encounter and stuff like that. So that was useful. And definitely, the Rails Performance Workshop would not be in the place that it is today if I hadn't done that.
STEPH: Yeah, helping people feel comfortable asking questions is incredibly hard and something I've gone so far in the past where I've created an anonymous way for people to submit questions. So during class, even if you didn't want to ask a question in front of everybody, you could submit a question to this forum, and I would get notified. I could bring it up, and we could answer it together. And even taking that strategy, I found that people wouldn't ask questions. And I guess it circles back to that inner critic that we have that's also preventing us from sharing knowledge that we have with the world because we're always judging what we're going to share and what we're going to ask in front of our peers who we respect. So I can certainly relate to being able to look over someone's shoulder and say, “Hey, I think you're stuck. We should talk. Let me walk you through this or help you out.”
NATE: There are also weird dynamics around in-person, not necessarily in a small group setting. But I think one thing I really picked up on and learned from RailsConf 2021, which was done online, was that in-person question asking requires a certain amount of confidence and bravado that you're not...People are worried about looking stupid, and they won't ask things in a public or semi-public setting that they think might make them look dumb. And so then the people that do end up asking questions are sometimes overconfident. They don't even ask a question. They just want to show off how smart they are about a particular issue. This is more of an issue at conferences. But the quality of questions that I got in the Q&A after RailsConf this year (They did it as Discord chats.) was way better. The quality of questions and discussion after my RailsConf talk was miles better than I've ever had at a conference before. Like, not even close. So I think experimenting with different formats around interaction is really good and interesting. Because it's clear there's no perfect format for everybody and experimenting with these different settings and different methods of delivery has been very useful to me.
STEPH: Yeah, that makes a ton of sense. And I'm really glad then for those opportunities where we're discovering that certain forums will help us get more feedback and questions from people because then we can incorporate that into future conferences where people can speak up and ask questions, and not necessarily be the one that's very confident and enjoys hearing their own voice. For the Rails Performance Workshop, what are some of the general things that you dive into for that workshop? I'm curious, what is it like to attend that workshop? Although I guess one can't attend it anymore. But what is it like to take that workshop?
NATE: Well, you still can attend it in some sense because I do corporate bookings for it. So if you want to buy 20 seats, then I can come in and basically do a Q&A every week while everybody takes the workshop. Anyway, I still do that. I have one coming up in July, actually. But my overall approach to performance is to always start with monitoring. So the course starts with goals and monitoring and understanding where you want to go and where you are when it comes to performance. So the first module of the Rails Performance Workshop is actually really a group exercise that's about what are our performance requirements and how can we set those? Both high-level and low-level. So what is our goal for page load time? How are we going to measure that? How are we going to use that to back into lower-level metrics? What is our goal for back-end response times? What is our goal for JavaScript bundle sizes? That all flows from a higher-level metric of how fast you want the page to load or how fast you want a route to change in a React app or something, and it talks about those goals. And then where should you even start with where those numbers should be? And then how are you going to measure it? What are the browser events that matter here? What tools are available to help you to get that data? Because without measurement, you don't really have a performance practice. You just have people guessing at what stuff is faster and what is not. And I teach performance as a scientific process as science and engineering. And so, in the scientific method, we have hypotheses. We test those hypotheses, and then we learn based on those tests of our hypotheses. So that requires us to A, have a hypothesis, so like, I think that doing X makes this faster. And I talk about how you generate hypotheses using profiling, using tools that will show you where all the time goes when you do this particular operation of your software. And then measuring what happens when you do that?
And that's benchmarking. So if you think that getting rid of method X or changing method X will speed up the app, benchmarking tells you did you actually speed it up or not? And there are all sorts of little finer points to making sure that that hypothesis and that experiment is tested in a valid way.
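As a small illustration of testing a hypothesis with a benchmark, here is a sketch using Ruby's standard `Benchmark` module. The two methods and the helper `median_seconds` are hypothetical stand-ins, not code from the workshop; the hypothesis being tested is that the built-in `Array#sum` is faster than a hand-written loop.

```ruby
require "benchmark"

# Two hypothetical implementations of the same operation.
def sum_with_each(numbers)
  total = 0
  numbers.each { |n| total += n }
  total
end

def sum_with_builtin(numbers)
  numbers.sum
end

# Time the given block several times and return the median wall-clock time,
# which is less sensitive to one-off noise than a single run.
def median_seconds(runs: 7)
  times = Array.new(runs) { Benchmark.realtime { yield } }
  times.sort[runs / 2]
end

numbers = (1..200_000).to_a

# Both versions must return the same answer, or the comparison is meaningless.
raise "results differ" unless sum_with_each(numbers) == sum_with_builtin(numbers)

before = median_seconds { sum_with_each(numbers) }
after = median_seconds { sum_with_builtin(numbers) }
puts format("before: %.5fs  after: %.5fs", before, after)
```

The numbers it prints depend on the machine and Ruby version, which is part of the point: the benchmark, not intuition, decides whether the change helped.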
I spend a lot of time in the workshop yapping about the differences between development/local environments and production environments and which ones matter. Because the differences that matter are often not the ones that we think about. Instead, it's differences like, in Rails apps, the asset packaging and asset pipeline performs very differently in production than it does in development, works very differently. And it's one of the primary reasons development is slower than production, so making sure that we understand how to change those settings to more production-like settings. I talk a lot about data. The other primary difference between development and production is that production has a million users, and development has 10. So when you call things like User.all, that behavior is very different in production than it is locally. So having decent production-like data is another big one that I like to harp on in the workshops. So it's a process in the workshop of you just go lesson by lesson. And it's a lot of video followed up by hands-on exercises that half of them are pre-baked problems where I'm like, hey, take a look at this Turbolinks app that I've given you and look at it in DevTools. And here's what you should see. And then the other half is like, go work on your application. And here are some pull requests I think you should probably go try on your app. So it's a combination of hands-on and videos of the actual experience going through it.
STEPH: I love how you start with a smaller application that everyone can look at and then start to learn how performant is this particular application that I'm looking at? Versus trying to assess, let’s say, their own application where there may be a number of other variables that they have to consider. That sounds really nice. You'd mentioned one of the first exercises is talking about setting some of those goals and perhaps some of those benchmarks that you want to meet in terms of how fast should this page load, or how quickly should a response from the API be? Do you have a certain set of numbers for those benchmarks, or is it something that is different for each product?
NATE: Well, to some extent, Google has suddenly given us numbers to work with. So as of this month, I think, June 2021, Google has started to use what they're calling Core Web Vitals in their ranking of search results. They've always tried to say it's not a huge ranking factor, et cetera, et cetera, but it does exist. It is being used. And that data is based on Chrome user telemetry. So every time you go to a website in Chrome, it measures three metrics and sends those back to Google. And those three metrics are Largest Contentful Paint (LCP), First Input Delay (FID), and Cumulative Layout Shift (CLS). And First Input Delay and Cumulative Layout Shift are more important for your single-page apps kind of stuff. It's hard to screw those up with a Golden Path Rails app that just does Turbolinks or Hotwire or whatever. But Largest Contentful Paint is an easy one to screw up. So Google's line in the sand that they've drawn is 2.5 seconds for Largest Contentful Paint. So that's saying that from clicking on your website in a Google search result, it should take no more than 2.5 seconds for the page to paint the largest component of that new page. That's often an image or a video or a large H1 tag or something like that. And to get to 2.5 seconds in Largest Contentful Paint, there are things that have to happen along the way. We have to download and execute all JavaScript. We have to download CSS. We have to send and receive back-end responses. In the case of a simple Hotwire app, it's one back-end response. But in the case of a single-page app, you got to download the document and then maybe download several XHR fetches or whatever. So there's a chain of events that has to happen there. And you have to walk that back now from 2.5 seconds in Largest Contentful Paint. So that's the line that I'm seeing getting drawn in the sand right now with Google's Core Web Vitals.
So pretty much any meaningful web application performance metric can be walked back from that.
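Walking a target back into lower-level budgets can be sketched as simple arithmetic. Only the 2.5-second Largest Contentful Paint figure comes from the conversation; the step names and their durations below are made-up assumptions, and `remaining_budget_ms` is a hypothetical helper.

```ruby
# Subtract the time spent on each critical-path step from the LCP target.
# The default target is the 2.5-second threshold discussed above.
def remaining_budget_ms(steps, target_ms: 2_500)
  target_ms - steps.values.sum
end

# Assumed durations (milliseconds) for a simple server-rendered page.
critical_path = {
  connection_setup: 300,
  backend_response: 500,
  css_download: 300,
  js_download_and_execute: 800,
  paint_largest_element: 400
}

left = remaining_budget_ms(critical_path)
status = left.negative? ? "over budget" : "within budget"
puts "#{left} ms left of the LCP target (#{status})"
```

Treating the steps as strictly sequential is a simplification, since browsers overlap downloads, but it shows how a back-end response time goal falls out of the page-level goal.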
STEPH: Okay. That's super helpful. I wasn't aware of the Core Web Vitals and that particular stat that Google is using to then rank the sites. I was going to ask, this kind of blends in nicely into when do you start caring about performance? So if you have a new application that you are just starting to get to market, based on the fact that Google is going to start ranking you right away, you do have to care some right out of the gate. But I am curious, when do you start caring more about performance, and are there certain tools and benchmarking that you want to have in place from day one versus other things that you'll say, “Well, we can wait until we have X numbers of users or other conditions before we add more profiling?”
NATE: I'd say as an approach, I teach people not to have a performance strategy of monitoring. So if your strategy is to have dashboards and look at them regularly, you're going to lose. Eventually, you're not going to look at that dashboard, or more often, you just don't understand what you're looking at. You just install New Relic or Datadog or whatever, and you don't know how to turn a dashboard into actual action. Also, it seems to just wear teams out, and there's no clear mechanism when you just have a dashboard of turning that into oh, well, this has to now be something that somebody on our team has to go work on. Contrast that with bugs, so teams usually have very defined processes around bugs. So usually, what happens is you'll get an Exception Notification through Sentry or Bugsnag or whatever your preferred Exception Notification service is. That gets read by a developer. And then you turn that into a Jira ticket or a Kanban board or whatever. And then that is where work is done and prioritized. Contrast that with performance; there’s often no clear mechanism for turning metrics into stuff that people actually work on. So understanding at your organization how that's going to work and setting up a process that automatically will turn performance issues into actual work that people get done is important.
The way that I generally teach people to do this is to focus, instead of on dashboards and monitoring, on alerts, on automated thresholds that get tripped and then send somebody an email or put something in the Kanban board or whatever. It just has to be something that automatically gets fired. Different tools have different ways of doing this. Datadog has pretty much built their entire product around monitoring and what they call monitors. That's a perfectly fine way to do it, whatever your chosen performance monitoring tool, which I would say is a required thing. I don't think there's really any good excuse in 2021 for not having a performance monitoring tool. There are a million different ways to slice it. You can do it yourself with OpenTelemetry and then like StatsD, I don't know, or pay someone else like everyone else does for Datadog or New Relic or AppSignal or whatever. But you got to have one installed. And then I would say you have to have some sort of automated alerting. Now that alerting means that you've also decided on thresholds. And that's the hard work that doesn't get done when your strategy is just monitoring. So it's very easy to just install a dashboard and say, “Hey, I have this average page load time dashboard. That means I'm paying attention to performance.” But if you don't have a clear answer to what number is good and what number is bad, then that dashboard cannot be turned into real action. So that's why I push monitors so hard: it allows people to ignore performance until it matters, and it forces you to make the decision upfront as to what number matters. So that is what I would say, install some kind of performance monitoring. I don't really care what kind.
Nowadays, I also think there's probably no excuse to not have Real User Monitoring. There's enough GDPR-compliant Real User Monitoring now that I think everyone should be using it. In industry terms, Real User Monitoring is just performance monitoring in the browser. It just uses browser APIs and sends the results back to you or your third-party provider. Having that means you are actually collecting back-end and front-end performance metrics and then making decisions around what is bad and what is good. Probably everybody should just start with a page load time monitor, a Largest Contentful Paint monitor. And if you've got a single-page app, probably hooking up some stuff around route changes or whatever your app...because you don't actually have page loads every single time you navigate. You have to instrument whatever those interactions are. So having those up and then just drawing some lines that say, “Hey, we want our React route changes to always be one second or less.” So I will set an alert that if the 95th percentile is one second or more, I'm going to get alerted. There's a lot of different ways to do that, and everybody will have different needs there. But having a handful of automated monitors is probably a place to start.
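The threshold idea Nate describes can be sketched in a few lines. This is a minimal illustration, not any monitoring vendor's API: the function names, the sample timings, and the nearest-rank percentile method are all assumptions chosen for the example. Only the rule itself (alert when the 95th percentile is one second or more) comes from the conversation.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of pct/100 * n, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def check_threshold(samples_ms, pct=95, limit_ms=1000):
    """Return an alert decision for one metric (hypothetical helper)."""
    observed = percentile(samples_ms, pct)
    return {
        "percentile": pct,
        "observed_ms": observed,
        "limit_ms": limit_ms,
        "alert": observed >= limit_ms,
    }


# Made-up route-change timings in milliseconds, as a RUM tool might report.
route_change_ms = [
    120, 180, 240, 310, 450, 520, 640, 700, 880, 930,
    950, 990, 1010, 1200, 300, 220, 410, 380, 275, 1500,
]

result = check_threshold(route_change_ms)
if result["alert"]:
    # In practice this is where a ticket or email would be created.
    print(f"p{result['percentile']} is {result['observed_ms']} ms, "
          f"limit is {result['limit_ms']} ms")
```

The useful part is not the arithmetic but that `limit_ms` has to be written down: someone has to decide what number is bad before the check can exist.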
STEPH: I like how you also focus on once you have decided those thresholds and have that monitoring in place, but then how do you make it actionable? Because I have certainly been part of teams where we get those alerts, but we don't necessarily...what you just mentioned, prioritize that work to get done until we have perhaps a user complaint about it. Or we start actually having pages that are timing out and not loading, and then they get bumped up in the priority queue. So I really like that idea that if we agree upon those thresholds and then we get alerted, we treat that alert as if it is a user that is letting us know that a page is too slow and that they are unable to use our application, so then we can prioritize that work.
NATE: And it's not all that dissimilar to bugs, really. And I think most teams have processes around correctness issues. And so, all that my strategy is really advocating for is to make performance fail loudly in the same way that most exceptions do. [chuckles] Once you get to that point, I think a lot of teams have processes around prioritization for bugs versus features and all that. And just getting performance into that conversation at least tends to make that solve itself.
STEPH: I'm curious, as you're joining teams and helping them with their performance issues, are there particular buckets or categories of performance issues that are the most common in terms of, let's say, 50% of issues are SQL-related N+1 issues? What tends to be the breakdown that you see?
NATE: So, when it comes to why something is slow in a Ruby application, I teach a method that I call DRM. And that doesn't have anything to do with actual DRM. It's just memorable because it reminds me of things I don't like. DRM stands for Database, Ruby, and Memory, in that order. So the most common issue is database, the second most common issue is issues with your Ruby code. The least common issue is memory. Specifically, I'm talking about allocation of objects, creating lots of objects. So probably 80% of your issues are in some way database-related. In Rails, 50% of those are probably N+1. And then 30% of database issues are probably what I would call unnecessary SQL. So it's not necessarily N+1, but it's a SQL query for information that you already had, or you could do in a more efficient way. So a common thing for unnecessary SQL would be people will filter an ActiveRecord collection like ten different ways when they could have just loaded the whole collection, filtered it with Ruby in the ten different ways afterwards, and that works really well if the collection that you're loading is like 10 or 20 records. Turning that into one database query, plus a bunch of calls to Enumerable methods, is often way faster than doing that as ten separate database queries. Also, that tends to be a more robust approach. This doesn't happen in most companies, but what could happen is the database is like a shared resource. It's a resource that everybody is affected by. So a performance degradation to the database is the worst possible scenario because everything is affected. But if you screw up what's happening at an individual Rails process, then only that Rails process is affected. The blast radius is tiny. It's just that one request. So doing less stuff in the database, while it can seem like, oh, that doesn't feel right, I'm supposed to do a lot of stuff in the database, can actually reduce the blast radius of performance issues because you're not doing it on this database that everyone has to have access to. There are a lot of areas of gray here. And I talk a lot in all my other material about why...there's a lot of nuance here.
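The "unnecessary SQL" pattern can be illustrated outside Rails. The sketch below uses Python's built-in sqlite3 module as a stand-in for ActiveRecord and list comprehensions as a stand-in for Ruby's Enumerable methods. The `orders` table, its columns, and the `run_query` helper are hypothetical. It assumes, as Nate does, that the collection is small enough to load in full.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, total_cents INTEGER)"
)
conn.executemany(
    "INSERT INTO orders (status, total_cents) VALUES (?, ?)",
    [("pending", 1200), ("paid", 5400), ("paid", 800),
     ("shipped", 2300), ("refunded", 800), ("pending", 9900)],
)

query_log = []  # records every statement sent to the database


def run_query(sql, params=()):
    """Hypothetical helper: run a query and count the round trip."""
    query_log.append(sql)
    return conn.execute(sql, params).fetchall()


statuses = ["pending", "paid", "shipped", "refunded"]

# Approach A: one database query per filter.
by_status_sql = {
    s: run_query(
        "SELECT id, status, total_cents FROM orders WHERE status = ? ORDER BY id",
        (s,),
    )
    for s in statuses
}
queries_a = len(query_log)

# Approach B: load the small collection once, then filter in memory.
query_log.clear()
all_orders = run_query("SELECT id, status, total_cents FROM orders ORDER BY id")
by_status_memory = {s: [row for row in all_orders if row[1] == s] for s in statuses}
queries_b = len(query_log)

print(queries_a, queries_b)
```

Both approaches produce the same groupings, but the second touches the shared database once instead of once per filter, which is the blast-radius argument made above.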
So database is the main stuff. Issues in how you write your Ruby code is probably the other one. Usually, that's just what I would call code that goes bump in the night. It's code that you don't know is running but actually is. Profilers are what help us figure that out. So oftentimes, I'll have someone open up a profiler on their controller action for the first time. And they're like, wait a minute, I had no idea that such and such was running during this controller action, and actually, we don't need to do that at all. So why is it here? So that's the second most common issue. And then the third issue that really doesn't actually come up all that often is object allocation, numbers of objects that get created. So primarily, this is a problem in index actions or actions that deal with big collections. So in Ruby, we often get overly focused on garbage collection, but garbage collection doesn't take any time if you just don't create objects. And object creation itself takes time. So looking at code through the lens of what objects does this code create? And trying to get rid of those object allocations can often be a pretty productive way to make stuff faster.
STEPH: You said a lot of amazing things there. So I'm debating on which one to follow up on. I think the one that stuck out to me the most where I have felt pain around this is you mentioned identifying code that goes bump in the night or code that is running, but it doesn't need to be run. And that is something that I've run into with applications where we have a code path that seems important, but yet I can't prove that it's being executed and exactly why it's there and what flow it's supporting. And I'm curious, do you have any tips or tricks in how you’ve helped teams identify that this code path isn't used and it's something that we can remove and then that itself will help speed up the performance of that particular endpoint?
NATE: Like, there's no performance cost to like 100 models in an application that never actually get used. There's really no performance downside to code in an app that doesn't actually ever get run. But instead, what happens is code gets added into callbacks, which is probably the biggest offender, that’s like, always do this thing after you do X. But then, two years later, you don't always need to do that thing after you do X. So the callbacks always run, but sometimes requirements change, and they don't always need to be run. So usually, it's enough to just pop the profiler on something. And I have people look at it, and they're like, “I don't know why any of this is happening.” Like, it's usually a pretty big Eureka moment once we look at a flame graph for the first time and people understand how to read those, and they understand what they're looking at. But sometimes there's a bit of a process, especially in a bigger app, where it's like, “Such and such is running, and this was an entire other team that's working on this. I have no idea what this even does.” So on bigger apps, there's going to be more learning that has to get done there. You have to learn about other parts of the application that maybe you've never learned about before. But profiling helps us to not only see what code is running but also what that relative importance is. Like, okay, maybe this one callback runs, and you don't know what it does, and it's probably unnecessary. But if it only takes 1% of the total time to run this action, that's probably less important than something that takes 20% of total time. And so profilers help us to not only just see all the code that's being run but also to know where that time goes and what time corresponds to what parts of the code.
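The point about relative importance can be sketched with Python's standard-library profiler standing in for a Ruby profiler. Every function name here is hypothetical, and the callback does deliberately wasteful work so that it shows up. `get_stats_profile()` requires Python 3.9 or later and reports times rounded to the millisecond.

```python
import cProfile
import pstats


def legacy_after_save_callback():
    """Hypothetical callback added years ago that nobody remembers."""
    total = 0
    for i in range(500_000):
        total += i * i
    return total


def validate(record):
    """Hypothetical cheap step the team expects to see in the profile."""
    return bool(record.get("name"))


def save_record(record):
    """Hypothetical action that always triggers the callback."""
    ok = validate(record)
    legacy_after_save_callback()
    return ok


profiler = cProfile.Profile()
profiler.enable()
save_record({"name": "example"})
profiler.disable()

# Per-function timings keyed by function name.
profile = pstats.Stats(profiler).get_stats_profile()
action_time = profile.func_profiles["save_record"].cumtime
callback_time = profile.func_profiles["legacy_after_save_callback"].cumtime
callback_share = callback_time / action_time

print(f"callback share of action time: {callback_share:.0%}")
```

A callback that accounts for most of the action's time is worth investigating first. One that accounts for 1% is probably not, even if nobody knows why it is there.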
STEPH: Yeah, that's often the code that makes me the most nervous is where it's code that I suspect is being run or maybe being run, but I don't understand why it's there and then figuring out if it can be removed and then figuring out ways to perhaps even log when a call is being made to that code to determine if it's truly in use or not or at least supported by a code path that a user is hitting. You have a blog post that I read recently that I really appreciated that talks about essentially gaming benchmarking where you talk about the importance of having context around benchmarks. So if someone says, “I've improved something where it is now 10% faster.” It's like, well, what is that 10% relative to? And if it's a tool that other people are using, what does that mean for them? Or did you improve something that was already very fast, and you made it 10% faster? Was that a really valuable use of your time?
NATE: Yeah. You know, something that I read recently that made me think of that again was this Hacker News post that went viral. It was like, how I optimized an AWS EC2 instance to take 1.5 million requests per second on my JSON API. And out of the box, it was like 500 requests per second, and then he got it to 1.5 million. And the whole article was presented with relative numbers. So it was like, “I made this change, and things got 33% faster. And if you do the whole thing right, 500 to 1.5 million requests per second, it's like my app is three times faster now,” or whatever. And that's true, but it would probably be more accurate to say, “I've taken three-millionths of a second out of every request in my app.” That's two ways of saying the same thing because latency and throughput are just related that way. But it's probably more accurate and more useful to say the absolute number, but it doesn't make for great blog posts, so that doesn't tend to get said. The kinds of improvements that were discussed in this article were really, really low-level stuff. That was like if you turn off...I think it was like turn off iptables or something like that. And it's like, that shaves a microsecond off of every time we make a syscall or something. And that is useful if your performance goal is to serve 1.5 million requests per second of Hello World responses off of an EC2 instance, which is what this person admittedly was doing. But there's a tendency to walk that back to if I do all the things in this article, my application will be three times faster. And that's just not what the evidence says. It's not what you were told. So there's just a tendency to use relative numbers when absolute numbers would be more useful for giving you the context of like, oh, well, this will improve my app or it won't. We get this a lot in Puma. We get benchmarks that are like, hey, this thing is going to help us to do 50,000 requests per second in Puma instead of 10,000. And another way of saying that is you took a couple of nanoseconds off of the overhead of every single request to Puma. And most Puma applications have a hundred millisecond response time. So it's like, yeah, I guess it's cool that you took a nanosecond off, and I’m sure it's going to help us have cool benchmarks, but none of our users are going to care. No one that's used Puma is going to care that their requests are one nanosecond faster now. So what did we really gain here?
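The relationship between a relative benchmark gain and the absolute time saved can be worked through directly. This sketch takes the Puma throughput figures quoted above at face value and assumes, purely for simplicity, a single worker serving requests one after another, so that time per request is the inverse of throughput. The helper name is hypothetical, and real servers handle requests concurrently, so treat the result as an order-of-magnitude illustration only.

```python
def seconds_per_request(requests_per_second):
    """Time spent per request under the serial, single-worker assumption."""
    return 1.0 / requests_per_second


before = seconds_per_request(10_000)   # benchmark before the change
after = seconds_per_request(50_000)    # benchmark after the change
saved = before - after                 # absolute time removed per request

typical_response = 0.100               # the 100 ms response time mentioned above
share_of_response = saved / typical_response

print(f"relative gain: {before / after:.0f}x throughput")
print(f"absolute gain: {saved * 1e6:.1f} microseconds per request")
print(f"share of a 100 ms response: {share_of_response * 100:.2f}%")
```

Under these assumptions, a fivefold benchmark improvement removes well under a tenth of a percent of a typical response, which is why the absolute number is the one that tells you whether your own app will improve.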
STEPH: Yeah, it makes sense that people would want to share those more...I want to call them sparkly stats and something that catches your attention, but they're not necessarily something that's going to translate to us in the way that we hoped that they will in terms of it's not going to speed up our app 30% or have those same rewards or benefits. Speaking of Puma, how is it being a co-maintainer of Puma? And how do you balance that role with all of your other work?
NATE: Actually, it doesn't take all that much of my time. I try to spend about 15 minutes a day on it. And that's really possible because of the philosophy I have around open-source maintenance. I think that open source projects are fundamentally about collaboration and about sharing our hard-fought extractions and fixes and knowledge together. And it's not about a single super contributor or super maintainer who is just out of the goodness of their heart releasing all of their incredible work and time into the public domain or into a free software license. Puma is a pretty popular piece of Ruby software, so a lot of people use it. And I have things on my back burner of if I ever got 20 hours to work on Puma, here’s stuff I would do. But there are a lot of other people that have more time than me to work on Puma. And they're just as smart, and they have other tools they've got in their locker that I don't have. And I realized that it was more important that I actually find ways to recruit and then unblock those people than it was for me to devote as much time as I could to Puma. And so my work on Puma now is really just more like management than anything else. It's more trying to recruit new contributors and trying to give them what they need to help Puma. And contributing to open source is a really fraught experience for a lot of people, especially their first time. And I think we should also be really conscious of that. Like, 95% of software developers have really never contributed to open source in a meaningful way. And that's a huge talent pool of people that could be helping us that aren't. So I'm less concerned about the problems of the 5% that are currently contributing than I am about why there are 95% of us that don't do anything. So that's what gets me excited to work on Puma now, is trying to change that ratio.
STEPH: I really like that mindset of where you are there to provide guidance but then essentially help unblock others as they're making contributions to the project but then still be there to have the history and full context and also provide a path forward of a good direction for Puma to head. In regards to encouraging mo