Software Engineering at Google

Title: Software Engineering at Google
Date: February 3, 2022
Duration: 1HR

SPEAKER
Titus Winters, Senior Staff Software Engineer, Google

MODERATOR
Hyrum Wright, Senior Staff Software Engineer, Google


Defensive C++: Software Engineering Principles & Types of Errors (Skillsoft course, free for ACM Members)
Interdisciplinary Approaches to Information Systems and Software Engineering (Skillsoft book, free for ACM Members)
Patterns in the Machine: A Software Engineering Guide to Embedded Development (Skillsoft book, free for ACM Members)
Software Engineering at Google (O’Reilly book, free for ACM Members)
Modern Software Engineering: Doing What Works to Build Better Software Faster (O’Reilly book, free for ACM Members)

It’s my pleasure to listen to this lecture. I would love to know what the standard for a software engineer is at a big company like Google.

I am a freshman at a very small liberal arts college in Kentucky. I have been trying to apply for internships and learn what I can but have been unsuccessful. Do you have any advice on getting myself “seen” as well as being able to show people that I can provide value?

I signed up for this event a while ago, but both the registration and webcast links are dead links, and the event is planned to start now.

So, the link I was given for this talk is not working… Anyone else with the same problem?

Please try again, perhaps in a different browser. It should work.

Sorry about your technical difficulties. The link should work (perhaps try a different browser). It is also being recorded in case you are still unable to get in.

To anyone having issues, try restarting your computer. That worked for me.

This talk was a pleasure to listen to. I really appreciate the effort by ACM Learning. I do research that requires me to work with fairly large frameworks and codebases. I would like to know more about better managing large-scale software engineering tasks.

Q: Is C++ good as a first language to learn?
A: Not really. As an educator I’d recommend something as simple as possible - the most important first step is to learn to be an algorithmic thinker. The more you can get language complexity out of the way while tackling that task, the better. I’m personally a fan of Python for intro programming, although eventually something like Rust, Go, or C++ is important for back-end / systems programming.
Q: Did you have experience before applying to your first software development job?
A: My first programming jobs were back in the late 90s and were very web-development focused. I had already been programming on my own for maybe 5 years by that point.
Q: In academia, I have been pounding my fist for years regarding the importance and value of a Software Engineering undergraduate degree, in contrast to Computer Science. Why do you think we do not see lots of SE degree programs in the US? Do you also believe there is a need/opportunity for higher ed to provide authentic SE degrees?
A: If SE is programming, time, and teamwork, it’s inherently challenging to make it authentic in an undergraduate curriculum. It’s hard to have projects or lessons where time becomes the dominant factor when we only have a couple months. Working on a team of people with the same background / experience is also awkward and inauthentic. I think the material we can/should teach undergrads can be improved on these points, but it’s hard. You’ll get more (and more authentic) experience in the first month on the job than you will in a class.

Q: Seemingly offhand, our enterprise production manager has determined that we cannot afford the man-hours to simply upgrade our campus cluster’s operating system (running CentOS 6.5) to support the installation of a container platform (Singularity). How do I/we go about assembling a business case for that upgrade?

A: I love citing security vulnerabilities for this. With Heartbleed, Spectre, Meltdown, log4j, etc all making the news it isn’t too hard to show that even the most common tech has the potential to be vulnerable. Even if there isn’t a published vulnerability today, that’s no promise for the future. Really you’ve got three options: stay current (many small/cheap upgrades), upgrade when it’s an emergency (one large, risky, and expensive upgrade), or don’t upgrade (risking a lot). From a business perspective, avoiding risk AND having a known set of costs are both valuable.

Q: Regarding Hyrum’s Law, if an API consumer depends upon observable behavior not covered by the API specification/documentation, isn’t the onus on the consumer to make changes to their client code when that behavior changes?
A: Yes and no. In a monorepo or CI/CD world, proving that a given breakage is because of inappropriate use still takes effort. And more often than not the breakage isn’t limited to the team that violated that contract - all of their users are also affected. But even in a totally distributed model, changing anything that people believed was working is going to cause some grumbling and reputation cost. It’s best to try to mitigate it from the start.
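Hyrum’s Law is easy to see in a toy example. The sketch below is illustrative only (the function and client names are made up): one client depends on the exact wording of an error message - observable behavior - rather than on the documented contract, the exception type.

```python
def lookup(d, key):
    """Documented contract: raises KeyError if key is absent."""
    if key not in d:
        # The message wording is NOT part of the contract.
        raise KeyError(f"no such key: {key}")
    return d[key]

def fragile_client(d, key):
    # Violates the contract by parsing the message text. If the
    # maintainer rewords the message, this client silently breaks -
    # Hyrum's Law in action.
    try:
        return lookup(d, key)
    except KeyError as e:
        if "no such key" in str(e):
            return None
        raise

def robust_client(d, key):
    # Depends only on the documented exception type, so it survives
    # any rewording of the message.
    try:
        return lookup(d, key)
    except KeyError:
        return None
```

Both clients behave identically today; only the robust one keeps working after an apparently harmless change to the message.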

Q: What should I do first to become a software engineer?
A: Learn to program. Then practice reading code, technical communication, and fixing bugs in code you didn’t write.
Q: What would be a good way to update projects that have 3rd party dependencies which themselves depend on old language versions with deprecated features?
A: There isn’t a cheap/easy answer to that. Either try to drop those 3p dependencies, or try to upgrade them and contribute those patches upstream.

Q: Do software engineers of large-scale systems have the luxury to try new things (language support, library updates) for better performance? They might not be merged, but just exploring the possibilities of new implementations.
A: Oh, absolutely. A surprising amount of our infrastructure revolves around “How can we run experiments safely?” and “Is X better performance than Y?”
Q: From the perspective of applying these principles, what do you see as the primary difference between software and hardware engineering?
A: Hardware scares me, specifically because the release cycle is so much longer. We know that high-performing software teams have fast release times and can release several times a week or more. Getting new hardware versions on a daily basis sounds like a recipe for chaos - so everything we know about the software process needs to be applied to hardware just prior to the “send the design to the factory” step. The stakes are just a lot higher, but that also means attention to quality process is at least as important.

Q: There is a plethora of software quality metrics such as cyclomatic complexity, Halstead complexity measures, function points, etc., yet it is very difficult to get a clear signal on software quality, since most of them are either not too informative or easily misleading. Which metrics do you recommend putting emphasis on?
A: I think the union of all of those tends to give some signal, but we still run the risk of streetlight effects. Those sorts of static metrics capture one form of complexity, but dynamic operations (like microservices and production environments) can’t generally be tracked by those. And those dynamic things tend to be harder/scarier. In the end, I think asking all of the devs/engineers on the project “Where do we have the most technical debt / unnecessary complexity?” is probably a better signal - it’s not quantitative/objective, but it does draw from reality.
Q: How can the team that triggers a change, and has to do the bulk of the work, make the necessary changes in systems and codebases they are not familiar with?
A: Local invariants and reasoning. Obviously you can’t replace an airplane with a minivan - the replacements have to be basically similar. And if you know “this is an airplane like X, but now it’s green and the cabin door is 500cm higher” it’s not really hard to figure out how to swap it out, even without knowing much about the local airport.
Q: What does he mean by Sub-Linear?
A: Assume you assign 1 person on a 10 person project to do one piece of the work. Now the work gets 10x larger and the team gets 10x larger - do you get 10 people to do that work? Or does it take 20? Or can you do it in 5? You want to do it in 5, but that requires automation, expertise, consistency, etc.
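As a back-of-envelope illustration of that arithmetic (the exponent below is a made-up number, not a Google figure): linear scaling means 10x the work needs 10x the people, while sub-linear scaling means staffing grows slower than the work.

```python
# Hypothetical scaling model: staff needed as a function of how much
# larger the project is than the baseline. exponent=1.0 is linear
# scaling; exponent < 1.0 is sub-linear (the goal).

def staff_needed(size_multiple, exponent=1.0, base_staff=1):
    """People needed when the work grows by `size_multiple`."""
    return base_staff * size_multiple ** exponent

linear = staff_needed(10, exponent=1.0)     # 10x work -> 10 people
sublinear = staff_needed(10, exponent=0.7)  # 10x work -> ~5 people,
                                            # via automation, expertise,
                                            # and consistency
```

The exponent is what investment in tooling and shared expertise buys you: the bigger the project gets, the more those investments pay off.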

Q: How do you reduce dependence on “long-lived dev branches”?
A: Feature flags, build from trunk, and commit in small pieces. (See the ending chapters in the Flamingo book or some of the flags/experiments/release chapters in the SRE books).
Q: On the slide for shifting left, why are “Unit Tests” listed as a post-submit test, rather than as a pre-submit test?
A: Should be both, really.
Q: If I understood it correctly, you said that there is a (scientifically validated) publication showing that working in the trunk leads to better results than working in branches. Could you please provide the link to this publication?
A: My favorite citation for this (there are several) is the book “Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations” by Forsgren et al.

Q: What strategies do you find useful to encourage small fry organizations to embrace these principles? I often hear a justification “We are not Google.”
A: I think “We are not Google” is fair - most of the things that would kill us at scale would be an annoyance to others. (Like, deciding on a merge strategy in a company-wide weekly meeting.) In the book we tried to be as clear as we could about what tradeoffs are involved in adopting any given policy/practice/technology. I don’t really want to say, “Do it this way, we know best,” I want y’all to honestly evaluate it and pick what will be best in your context. But I don’t see that: so many shops seem very fixated on short-term cost-cutting and ignore the long-term productivity implications. Maybe start small: “We are not Google, we don’t have a massive build farm. Buy us faster machines; we spend many times more on payroll than on computers.” :)
Q: How did you become confident in your own abilities?
A: Wait, did I? I don’t feel confident. (Honestly, imposter syndrome is very significant in Google and across the industry.)
Q: What would you recommend a college student do to better their chances at getting a software engineering job?
A: Practice. I think it honestly takes 2-3x more hours of programming drill and practice to become a fluent programmer than are required in a CS program. It’s at least as complex as learning to read - and it takes the same sort of time commitment and commitment to practice.
Q: Can you show that last quote again!
A: It’s programming if clever is a compliment.
It’s software engineering if clever is an accusation.
Q: What is a canary?
A: Rather than release a new version of the software to all users all at once, we do gradual rollouts. These “canary” releases are then checked to see if they use similar resources, don’t crash, produce similar results, etc., so we get some ground truth that the new potential release is “good.” (See the SRE book.)
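A minimal sketch of how percentage-based canary assignment might work (hypothetical, stdlib-only; real rollout systems also compare crash rates, resource usage, and results as described above): hash each user ID so the same user always lands in the same group, and only widen the percentage once the canary looks healthy.

```python
import hashlib

def in_canary(user_id: str, percent: int) -> bool:
    """Deterministic assignment: hashing the ID gives a stable
    bucket in 0..65535, so a user never flip-flops between versions."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = digest[0] * 256 + digest[1]
    return bucket < 65536 * percent // 100

def serve(user_id: str, percent: int = 5) -> str:
    # Route a small slice of traffic to the candidate release.
    return "v2-canary" if in_canary(user_id, percent) else "v1-stable"
```

Raising `percent` in stages (1%, 5%, 25%, 100%) limits the blast radius if the new version turns out to be bad.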
Q: How do you propagate these values across the corporation, with employee “churn”?
A: By writing things down and giving the same talks many, many times. It might be enough, but it’s definitely a challenge.
Q: With incremental development and Agile all the rage, how do you balance upfront design to take advantage of affordability of those phases versus the need to develop incrementally?
A: I don’t think every small incremental piece needs the same design attention, but we do need to design the big components first. Then those can be broken up in several Agile sprints (in parallel or series). I don’t see Agile as conflicting with any of this.
Q: Can you remind me, concisely, how expertise turns scale problems into a benefit? Or what you really said?
A: If you’ve got 10 people working on a project, having one of them be a superstar in (language, testing, design, graphics, whatever) is limited to what they can do directly + the influence they have on 9 people. But if you’ve got 100 people, that expert can influence 99 others - the balance starts to shift to having more potential impact through education and influence, but that depends on having scale to start with.
Q: What are the most interesting challenges you came across during your time at Google?
Q: GitHub Code Reviews seem to rely exclusively on branch-based Pull Requests. Do you have any suggestions on building good Code Review practices around trunk-based development?
A: Short-lived branches aren’t a problem - every commit to git or any other version control system is morally equivalent to a short-lived branch. The real concern is to ensure that everyone knows to commit to trunk, and to only depend on the version in trunk (not someone else’s work in flight).

Q: How’s the work-life balance @ Google?
A: It varies, Google is a huge place. Down in the areas where I work it’s very good and our management is very supportive. I can’t speak with any authority about the rest of the company - it’s too big to be consistent.
Q: Having extensive experience in the field, do you have any tips for those who are just starting their careers?
A: Practice, practice, practice. Read, watch talks, and consciously practice.
Q: Thoughts on monorepo based development?
A: Wildly in favor. I can’t imagine scaling up to even 100 people without something like this. But it doesn’t have to specifically be one repo, just like a filesystem can be composed of multiple storage devices - it’s the usage model, not the implementation that matters.
Q: How do you think software development life cycle management will change as a result of the exponential increase in cybersecurity attacks?
A: I expect more reliance on property-based testing, fuzzing, dynamic analysis, and test coverage. In most domains I see those approaches as being the sweet spot in the space between formal methods proofs of correctness vs. ad hoc test case generation (or no tests).
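To make the property-based-testing idea concrete, here is a hand-rolled, stdlib-only sketch (real tools such as Hypothesis or libFuzzer are far more capable, and the encoder is a made-up example): instead of hand-picking test cases, we generate random inputs and assert a property that must hold for all of them - here, that decoding inverts encoding.

```python
import random

def run_length_encode(s):
    """Encode a string as [[char, run_length], ...]."""
    out = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1][1] += 1
        else:
            out.append([ch, 1])
    return out

def run_length_decode(pairs):
    return "".join(ch * n for ch, n in pairs)

def check_roundtrip_property(trials=1000, seed=0):
    # Property: decode(encode(s)) == s for EVERY string s.
    # Random generation explores cases a human would never write down.
    rng = random.Random(seed)
    for _ in range(trials):
        s = "".join(rng.choice("ab") for _ in range(rng.randrange(20)))
        assert run_length_decode(run_length_encode(s)) == s
    return True
```

The security payoff is the same as with fuzzing: randomly generated inputs tend to find the edge cases (empty strings, long runs, adjacent duplicates) that hand-written cases miss.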
Q: How would you suggest addressing a need for change where you have highly entrenched features that are “bad” or negative for future customers but absolutely dependent for existing customers?
A: It depends a lot on context. Sometimes you can cut a final legacy release for those existing customers and move that legacy branch to be only maintenance or fully unsupported. Sometimes you can get those existing customers to explicitly opt-in to the legacy behavior, and then you can provide the new behavior as a default for new customers (which is much cleaner). In most/all cases you’ll have to find a way to have two versions of the behavior, either separated in time (distinct releases) or switched per user configuration, etc. It’s hard.
Q: A “people” question. Developers, like any other population, respond to incentives. How do you reward great developers? Classic problem: how do you differentiate programmers who cause and fix many bugs from those who seldom write buggy code in the first place? The latter folks often go unnoticed.
A: In theory this should be handled by having peers as part of performance reviews - if your peers are fed up with your hacks and buggy code, they’re gonna say something. In practice, I’m not sure we know exactly how to handle that rationally. You’re a lot more likely to get attention/funding/praise for heroically rescuing the company from an outage (that you may have contributed to) than for just consistently doing quiet, solid work.
Q: Should developers test and debug their code in a container-based environment?
A: I’m not sure “container” is the requirement, but that’s certainly one approach. It does need to be consistent across developers and production, and containers are a good way to get there. “It worked on my machine” is a pretty strong indicator that the software process isn’t quite as reliable as it should be.
Q: If you are working on projects which are expected to last years or decades, would you rather use standard code versus proprietary functions, as is an option in SQL (ISO vs. Microsoft, IBM, or Oracle proprietary functions)?
A: It depends, but largely I think it’s cheaper to build a thing yourself than to migrate off of an interface that you’ve already been using for years. Building it yourself means you pay more up front and have more ongoing maintenance (and training costs), but you won’t have the same sorts of surprises when that vendor goes away or you lose license rights and have to scramble to migrate away. That balance can go either way depending on the timeframes and circumstances.
Q: Do you have advice/insight on applying automated fixes (e.g. clang-tidy fixes, clang-format fixes) over a large codebase? Would you touch old modules not touched in several years?
A: Yep, definitely. The SREs have an important phrase, “No Haunted Graveyards.” This means there can’t be things in your software environment that people are afraid to touch - those are going to be where the bugs come from, and ignoring it just exacerbates the problem.
Q: Do you feel the field of software engineering in academia is properly addressing problems encountered in the wild, given that most groups/professors/PhD students are NOT working on “multi-person multi-version projects” themselves?
A: Generally not. There’s certainly some good work there, but I do see a fair number of papers in software engineering conferences/journals that are more “anthropology of software engineering” than software engineering. “We studied 20 software engineers and observed the following behaviors” is often not the same topic as “What is the role of testing in a high-performing organization” or “Why is it easier to refactor a function than a class/type?”
Q: Sometimes you don’t know the time aspect of the software in advance. How do you deal with the problem when code that started with a short time span grows into long-lived code? Do you have some inspection mechanism to handle that?
A: Nope. Try to overestimate? Being aware of the problem is a big step in the right direction, but it’s impossible to be perfect: that’d require us to accurately predict the future.
Q: There’s a strong tendency in many software cultures to favor flat organizational and communication structures that have a lot of democracy and broadcast-style communication, but does that negatively affect super-linear scaling and communication/synchronization costs? If so, is a deeper hierarchy in the organization the only solution?
A: Google seems to keep “discovering” that ~everything needs an owner or decider. The open discussion is still important, and most things deserve some level of consensus to avoid “because I said so,” but you can’t really have everyone have a veto on every topic. You can’t even really ask everyone to learn the details of everything. I don’t think hierarchy is inherently the answer, but “Everyone is in charge!” certainly fails as we grow. I suspect there’s a lot more (and better) information from business and management and organization thinkers - we’re not unique on this.

Q: How many of these challenges can be addressed, albeit incrementally, by better-engineered (AI-enabled) IDEs?
A: My uninformed gut estimate is that an all-powerful IDE could at most be a 2x productivity boost vs. a just-OK editor/IDE. But without future-vision or generalized AI, the full set of communication and planning and expertise problems won’t go away with tech. It takes … communication, future planning, and learning, I guess.
Q: Do these principles apply well for embedded software?
A: Yes.
Q: A common issue in SW eng projects is deciding when to use a tool that already exists, but that you have to study and learn, vs. writing a special-purpose tool yourself to solve your immediate problem. It often seems easier to just write a quickie yourself, but in the long run you end up reinventing the wheel and resolving old problems. How does Google balance this trade-off?
A: Not very well, honestly. Our engineers have exactly those same instincts. Again, I think of it as an educational gap: we need to tell new grads that it’s OK to not be writing code, that asking questions and learning is a huge part of their job. And we need to publicize the painful lessons that come up when we keep re-inventing things and have to decide between overlapping tools. Consistency is undervalued.
Q: How can we improve the way people are educated (i.e., in SWE classes at universities) to have more feeling of time / scale / practical decision-making?
A: I’m not sure yet, but I’m working on it. I think it certainly starts with talking about it. So much of “software engineering” education for undergrads is really Project Management lite, and that’s not the set of skills a new grad would really benefit from.
Q: Is the monorepo approach inextricably bound to the Time/Scale/Tradeoffs pillars? Or is it merely a decision that has worked well for Google?
A: I think the monorepo falls out of thinking about the software engineering workflow/environment in terms of Time and Scale, not the other way around.
Q: What are your thoughts on the situation where “flaky” tests (unit, integration) start to take over the build? Get rid of them? Big-bang refactoring?
A: Overall, I think we need to move to a model where we treat test results as a signals-processing problem, not a simple binary where any failure means STOP. If you’re running a ping test from an external IP to see if your server is still up, you don’t freak out when one ping gets lost, and I don’t think a single test failure should necessarily be treated any differently than that. But from that same signal-processing perspective, if your tests are so flaky that you can’t get signal from the noise, you need to invest some real work to de-flake them.
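One possible shape for that signals-processing view (an illustrative policy sketch, not Google’s actual infrastructure): rerun a failing test and act on the observed pass rate rather than on a single binary outcome.

```python
def classify_test(run_test, reruns=10, flake_threshold=0.9):
    """run_test: a zero-argument callable returning True (pass) or
    False (fail). Returns a triage decision based on the pass rate
    across reruns, treating the result as a noisy signal."""
    passes = sum(1 for _ in range(reruns) if run_test())
    rate = passes / reruns
    if rate == 1.0:
        return "pass"
    if rate >= flake_threshold:
        # Mostly passes: likely flaky. File a de-flake bug rather
        # than blocking the release on one lost "ping."
        return "flaky"
    # Consistent failure: a real signal, stop the release.
    return "fail"
```

The thresholds here are made-up knobs; the point is that occasional noise gets triaged, while persistent noise forces real de-flaking work.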
Q: For how long (absolute time and total time) do quality checks (unit tests, code scans, code reviews …) run before code can be submitted (pull request merged) at Google? Or do you have different approaches for different situations?
A: Wildly different approaches for different teams, I’m afraid.
Q: How do you balance the sharing of skills and prevent dependence on single individuals for certain expertise?
Q: You mentioned being able to upgrade over time as an important feature of sustainable software. Do you think that also applies to the very foundation of programs, i.e., the programming language itself? Does Google plan to upgrade programs, say from C++ to Rust?
A: Yes, I feel very strongly that programming languages need to be able to evolve. I’ve stepped back from C++ standards work specifically because that language has gotten itself into a place where it can only evolve through addition, and that leaves a lot of bugs and inefficiencies.
Q: When talking about best practices, many of them are formalizations of other engineers’ experience. Some of them might actually be bad. Sometimes they might not apply to one project. Other times, you have to deliberately suspend or abandon one in favour of other practices or design decisions you’ve made. How do we resist the urge to let too much architecture and practice creep into our development? Is that even a real problem?
A: It’s very important to recognize that software engineering is young (50 years or so) and still evolving rapidly. Best practices are not the same as Newton’s Laws or other truths of physics; they’re refined over time from the experiences of software engineers. Those engineers that can admit when they are wrong and change course are probably the ones you should go to for best practices - their collection will have evolved. Those engineers that Loudly Insist That They Are The Smartest Ones should probably not be in charge of best practices - their experience may have been valid for them, but whether it generalizes (or applies to current problems) is a little more iffy.
Q: What are best practices about testing at Google? Does Google practice Test-Driven Development?
A: We advocate for it, but it’s a large place and we don’t mandate much, so it depends on the team.
Q: Billy’s Law: all software lifespans will be at least one order of magnitude longer than originally expected.
A: Feels accurate.
Q: How do you know when features of a piece of software aren’t used anymore?
A: SREs will tell you that you can’t run reliable software in production without monitoring. I think the same is true of the codebase: you have to have some monitoring and visibility tools if you want to manage even a medium-sized codebase. Indexing, code search, indicators of what is still included in the build … these become increasingly important as you scale up.
Q: When you hack a line of code, you can leave a TODO for the future. What should be done if you hack at a larger scale?
A: It isn’t an accident that Google’s style guides changed the preferred form of TODO from TODO(username) to TODO(bug number). File a bug, categorize it as “internal cleanup,” and don’t overestimate its priority. Leave as much background information as you can, in case it gets handed off to someone else.
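A sketch of that TODO convention in code (the bug ID and function below are made up for illustration): the marker points at a tracked, categorized cleanup bug rather than at a person who may move teams.

```python
def parse_date(raw):
    # TODO(b/123456): hypothetical bug ID - replace this hand-rolled
    # parser with the shared date utilities. Background, priority,
    # and ownership live in the bug, not in this comment.
    year, month, day = raw.split("-")
    return int(year), int(month), int(day)
```

Because the bug outlives any one author, the cleanup stays discoverable even after the original engineer leaves the codebase.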
Q: How important is it to teach students about secure coding in computer programming courses?
A: Very.
Q: What are examples of Google software written 10+ years ago that are still being maintained and operational? (Software that hasn’t been completely re-written / re-engineered.)
A: Mostly infrastructure pieces. The first versions of some of our common infrastructure (string routines, command line flags, logging, etc) go back at least 20 years. They evolve, but you can still see the influence of the original designs. Higher-level product stuff goes through the same sorts of evolutions, but at a more macro scale.
Q: Hardware is imperfect, OSes are imperfect, so you will never have perfect software.
A: True. The chapters on releasing in the book talk a lot about there never being a perfect release; it’s more interesting to think of it as “defect neutral” or “value positive.”
Q: Are you saying that we should think of technical debt as a debt with really high interest? Say, 50%?
A: Yes and no? The payoff cost is quite high in general, but it doesn’t really compound in the same way as a financial instrument.
Q: Why not aim for the first version to be the best version and avoid or curtail future changes?
A: Any system where success is predicated on “the first version will be perfect” is either extraordinarily expensive or extraordinarily foolish. NASA can pull it off, but not by choice, and they do it at great expense and with a lot of time invested.
Q: We consider software engineers “real engineers” in our organization, so we tend to have very broad responsibilities for SEs spanning design, architecture, testing, security, infrastructure (as code), deployment, and monitoring. Of course we have champions specialized in specific areas who help improve practices and automation. How do you see this approach improving a complete understanding of software engineering work?
A: That sounds very wise. Are you hiring? :P

The book seems not to discuss documentation of requirements. What best practices do you recommend, and how much effort should be put into them?