DevOps

Horror stories from the depths of DevOps

Ever had a hungry ghost take over your Kubernetes environment? How about a dueling match with the clever adversaries behind Octopus Scanner? Perhaps you’ve felt compelled to liken old, stubborn CI/CD practices to flying monkeys in The Wizard of Oz. Whether you can relate or not to these spooky DevOps stories, there are certainly lessons to be learned from each. Gather around the proverbial campfire and prepare for some scares.  

In this DevOps Unbound roundtable, host Alan Shimel, DevOps.com (now Techstrong Group) CEO and Editor in Chief, is joined by co-host Mitch Ashley, CEO of Accelerated Strategies Group (now Techstrong Group), Alexander Sascha Mohr, Product Manager at Tricentis, Dr. Parag Doshi, VP of Engineering at Tricentis, Tracy Ragan, CEO and Co-Founder of DeployHub, Jasmine James, Engineering Manager at Twitter, and Derek Weeks, SVP and Chief Marketing Officer at The Linux Foundation and Co-Founder of All Day DevOps, as they share their most horrifying DevOps testing experiences and lessons learned. 

View the full session video below. 

Hunting for our Phoenix Project’s Brent 

Mitch [07:46] It started with my journey into DevOps, about 2013 or so. Alan, you told me about this thing called DevOps, and so I ended up re-reading The Phoenix Project. And I thought, this is great… let’s see what we can do. We did a little book club at work with the IT group that I was running, and I said, let’s read this together. But one of the things that happened is that they suddenly started sort of misusing the Halloween theme, with a kind of witch hunt for the “Brent” in our group. If you remember, Brent from The Phoenix Project was the bottleneck for everything. So we were trying to figure out: who was our Brent? We kind of figured out that we all had a bit of Brent in us, in our own ways.

The test center that never was 

Sascha [10:47] So one of my favorite stories is from a discussion with a customer about one and a half years ago. They were undergoing a digital transformation, and we discussed implementing a test center of excellence (TCoE). When you think about the test center of excellence, typically you can imagine it like a car manufacturer or a production line, where you get all the parts like the engine and the gearbox and the chassis. And you need to assemble all these together and then validate that what you’re building is really working as expected. Then, when we talk about DevOps, the question is always: does a test center of excellence fit into the DevOps world, or should we eliminate it? And it was kind of eye-opening for me because the customer said, “We think we need a test center, but our management is planning to reduce the center completely.” Their vision was that they don’t need test centers and that the team should do everything. One year later, when we checked in with the customer, they were without tests and without any kind of governance. They basically said, “It hasn’t worked out for us.” And that’s what we currently see when it comes to testing. Testing is not just what the team is doing; it’s always about the collaboration and the end-to-end test validation across the pipelines and across the application landscapes.

Not letting old habits die hard 

Jasmine [13:21] I am calling myself the Wicked Witch of Cultural Transformation. And it’s interesting that cultural transformation was number one [referring to an earlier poll where 50% of respondents indicated “culture” being their organization’s most troublesome area as it relates to DevOps] because that’s what my story stems from. I’m titling it also “lions and tigers and cultural transformation, oh, my!” So when you think about the Wicked Witch, within the Wizard of Oz story, yeah, that’s who I am. Thinking about my time at the multiple companies that I’ve been a part of, I think that this has been one of the most difficult things to really drive home, especially within large organizations. I led not only the tooling efforts for the tools that enable DevOps, but also how we level up on skills and create a culture of DevOps within the teams. That was the most troublesome jump for folks. When I was a part of a large airline, with a lot of the growth that we saw, one of the things that really came back over and over again was those old habits. It’s very hard to break old habits as they relate to CI/CD practices, you know, code quality, things of that nature, especially if they didn’t exist previously. I liken it to the flying monkeys that attacked Dorothy and her crew as they were traveling to the Wizard of Oz. They just kept popping up, those old habits, enticing teams to go back into their old ways. So thankfully, we were able to create that continuous adaptation to create value with teams and meet them where they were. And that’s really what helped us drive that cultural adoption and DevOps practice within the teams to not go back into the old world.

[Alan] How do we get past cultural challenges where the culture is highly driven by regulation, like in the banking industry? 

Stay safe out there 

Jasmine [15:53] I think the most regulation that we have is around privacy and data. But I think that cultural challenges are everywhere. I think it just takes a lot of tenacity from the team that’s driving that culture adoption and making sure that they’re creating an understanding within the teams and understanding within leadership — what are the benefits of doing these things, versus doing it just “because”. We have to understand the value. And then also relating it back to safety. When thinking about regulation, it’s all about safety. How can we even leverage some of the DevOps tooling standards to drive additional safety within a regulated industry? 

Make the CISO your friend not foe 

Parag [17:08] So speaking of regulated industries, in addition to financial services, another one is healthcare. And one of the stories there relates to that cultural shift. Even in healthcare, we’ve seen that to be a problem. I can tell you a horror story where we had our member portal for healthcare insurance go down one day, and the immediate reaction of all the executives was, “What’s going on? Let’s fix this as fast as possible. How did this happen? Who let the defect out?” Instead, what the engineering team did was redeploy the entire site within seconds, which was unheard of a year prior to that. This engineering team prioritized recovery over the impossible challenge of never having a defect in production. That’s just inherently not going to happen. Assume that things will happen in production as you release. And if you can prioritize how quickly you can recover over having a 0% defect leakage rate, I think that can go a long way. The second point, I’d say, in regulated industries, is, as Jasmine correctly pointed out, about security and data privacy. Here, culturally, one of the most important factors that I found is to make the CISO, the Chief Information Security Officer, your best friend. I was a VP of cloud, but if I did not have my CISO on my side, who also saw the vision, we would have been hopeless. And so we worked together to build out a security roadmap and tie in all the necessary security controls before we ever considered putting, or even talked about putting, protected information into a public cloud.

Beware of microservices sprawl 

Tracy [22:34] Well, I call this story the hungry ghost story. This happened right around the time that we all decided that object-oriented programming was going to be the way to go. This was around OS/2 as well. So OS/2 has some special things that happen when you build applications for it. And it’s the story about SOAR 3030.DLL. This was for a financial company that has a strictly audited process. And because it’s OS/2, when you load up a DLL into memory, if it’s already there, it doesn’t go reload it. And oftentimes when people would go to do a standard error, and this SOAR 3030 was all about a standard error routine, all the teams were required to use the same one. But we would get traps. That horrible tombstone, as we used to call it on OS/2, would show up, and it would be in the middle of a transaction. In the middle of working with a customer where there was an error routine, it would go call the SOAR 3030, and it would create a tombstone, as I would call it at the time. And I was given the task to go find out what happened. So I decided to just do a search for SOAR 3030.DLL on all of the servers throughout this entire organization, which had literally thousands of servers. And I piped it to the printer, thinking I’d get a page. And I went home. The next day, I went to the printer in the hall, and it had stopped. And I was like, hmm, that’s interesting. It stopped and I looked at what had printed out, and it was a pile.

When I dug a little deeper, I realized that every single application team was compiling shared libraries from their own repository, meaning that they had their own versions and they didn’t rename them. Now the industry decided to get really brilliant and solve this problem, not by solving the make step, which actually would have been a lot easier, but by renaming those libraries instead. And so now, instead of having a central place to fix things, you had to fix every version. Now, the reason why I bring up the story about SOAR 3030 is because we are entering the same exact situation with microservices. Microservices, if they’re implemented correctly, should be highly reused, and there should only be one common error routine. So if you need to fix it for the organization, you fix it, push it back out to your Kubernetes cluster, and depending on how your Kubernetes deployment file has been defined, everybody should pick it up. Similarly, in our healthcare situation, if we have a ransomware issue in a Kubernetes cluster, we should be able to fix it in one place, and everybody gets the repair. So don’t let a hungry ghost take over your Kubernetes environment, because that’s called sprawl. And in my case, during the object-oriented days, we had sprawl of a single function. That’s the scary part and the complexity around microservices.
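
As a rough illustration of the kind of search Tracy describes, the sketch below walks a directory tree, finds every copy of a given file name, and groups the copies by content hash. It is only a minimal Python sketch under stated assumptions: the function names and the file name `errors.dll` are hypothetical, and it scans one local directory rather than thousands of servers. More than one distinct hash for the same file name would suggest that a supposedly shared library has diverged into separately built versions.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_copies_by_digest(root, filename):
    """Group every copy of `filename` under `root` by its SHA-256 digest.

    Hypothetical helper: returns {digest: [paths with that content]}.
    """
    copies_by_digest = defaultdict(list)
    for path in Path(root).rglob(filename):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            copies_by_digest[digest].append(path)
    return dict(copies_by_digest)


def has_sprawl(copies_by_digest):
    """True when same-named copies differ in content (more than one digest)."""
    return len(copies_by_digest) > 1
```

Calling `find_copies_by_digest("/srv/apps", "errors.dll")` (a made-up path) and passing the result to `has_sprawl` would flag the situation in the story: many teams shipping a file with the same name but different contents.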

Octopus Scanner vs open source 

Derek [38:35] I have a bit of a trick-or-treat story. Starting with the treat. I spent a lot of time studying high-performance software development organizations, software supply chains, open-source use, and software development. And a lot of the use of open source within software development is part of the enablement of DevOps in moving faster in development. No one has to write all their code from scratch anymore. You can just download bits of code from the internet in seconds, use it for free, and that allows you to move faster. And about 80% of an application out there is open-source binaries or open-source packages that are built in. And something that I’ve seen since 2018 is the number of open-source-related breaches declining year over year. They’ve gone from about 30% of organizations to about 20% of organizations, and a lot of that has to do, I think, with the post-Equifax breach scenario. People are saying, I don’t want to be the next Equifax, so they’re making the investment. But the kind of trick in this, or goal in the pipeline, is that adversaries know that organizations are looking for the known vulnerable open-source components out there. So the adversaries have shifted tactics in the last few years to look at how to inject malicious code into the actual project and then have it downloaded by 1,000, 10,000, 100,000 people a week… We saw about 1,000 instances of those open-source project-based attacks last year, but the real kind of head scratcher that we saw was this vulnerability called Octopus Scanner. It was discovered by GitHub researchers, and what it did was modify the NetBeans IDE that developers were using, so that the NetBeans IDE would inject malicious payloads into any jar that was being built in the IDE. So they’d just go to the developer tools themselves and inject malicious code into anything that they’re building.
So the known vulnerable pipelines are the easiest place to find maliciously injected code, whereas at the beginning of a supply chain it’s a lot more difficult, if not impossible, to identify.