Video: Solving Visibility Challenges Behind Public Sector IT Incidents | Duration: 3612s | Summary: Solving Visibility Challenges Behind Public Sector IT Incidents | Chapters: Series Introduction (27.630001s), Visibility Challenges (96.495s), Tool Sprawl Challenge (215.6s), Incident Response Challenges (268.42502s), Tools Sprawl Challenge (364.62997s), Tool Consolidation Roadmap (469.56s), AI Terminology Clarified (554.16s), AI-Driven Observability (1015.14s), Breaking Down Silos (1286.34s), Case Study Example (1625.755s), Citizen Service Impact (1774.2201s), Key Takeaways (1842.95s), Next Steps Forward (1898.285s), Closing Remarks (1979.3351s)
Transcript for "Solving Visibility Challenges Behind Public Sector IT Incidents":
Welcome to episode two in our four-part webinar series on observability and public sector IT. In our first episode, we laid the foundation. We talked about zero trust architecture, federal and state mandates, and why observability is the bedrock of modern cybersecurity. If you missed it, I'd encourage you to go back and watch that one. Today, we're shifting focus from security frameworks to operational reality. We're gonna talk about the visibility challenges that cause public sector IT incidents to last longer than they should, cost more than they need to, and erode citizen trust. I'm Travis Gallo, and I'm joined today by three experts. Josh Stogberg is the group vice president of product management at SolarWinds, and he leads our product vision and AI strategy for the observability platform. Scott Pross is back from episode one. He's our vice president of technology at SolarWinds and our expert in professional services and product implementation. And Brian Chamberlain, also returning from episode one, handles public sector business development at SolarWinds and brings over twenty years of military IT and cybersecurity experience. Gentlemen, great to have you all here. Glad to be here, Travis. Yeah. Thanks, Travis. Looking forward to this one. Today, we're gonna cover the state of visibility in public sector IT, the real cost of tool sprawl and disconnected monitoring, how AI and AIOps are reshaping operations, and how SolarWinds is helping agencies solve these challenges. So let's get into it. Brian, let's set the stage. You talk to public sector IT leaders every day, and not too long ago, you were in their shoes. What are some of the challenges you faced in the Marine Corps that persist today when it comes to conversations around monitoring and observability? Yeah. Thanks, Travis. So, you know, two things that come to mind right away are integrating systems across a complex enterprise and actually being able to see everything on your network. So let me take you back to 2014.
I was the communications planner for a major multinational amphibious naval training exercise. That included 19 partner nations. You know, we had about 17,000 service members, 19 naval vessels, 450 aircraft. So pretty big. And here's the thing: a lot of our networks and applications across different warfighting areas just weren't talking to each other. So what did I do? I adopted a common platform, and I pushed it out to our partners where interoperability was impossible. So anything that wasn't absolutely essential to the mission, I cut it. Right? So that same mindset applies directly to how we think about networks today. The people responsible for sustaining and defending the network are so much more effective when they have a common operational picture. That's one observability platform that shows them everything. It tears down the silos between these teams. And frankly, you know, that's a real game changer. And just like in that exercise, you've gotta look hard at the applications running on your network. Duplicative, excessive tools, you know, those create fatigue. And it's not just nonessential, it's a liability. Josh, SolarWinds recently published the 2026 State of Monitoring and Observability report. What does the data tell us about this problem? The data tells us exactly what Brian is talking about. Right? Organizations, public and private sector, are struggling with tool sprawl. The average IT organization is managing something like 10 to 15 monitoring tools, and the challenge isn't just the number of tools. It's that each tool generates its own alerts, its own metrics, its own view of the world. So when you have 15 tools and an incident occurs, you're not dealing with one alert, you're dealing with dozens, sometimes hundreds, all coming from different sources with different formats and different thresholds.
The report also shows that organizations adopting unified observability platforms and AI-driven correlation are seeing significant improvements in both mean time to detect and mean time to resolve. Scott, you've been in the trenches implementing these solutions. Paint us a picture. What does a typical public sector IT incident look like when there's no unified visibility? So it's a real familiar story. You know, a public-facing portal goes down, let's say, a state benefits application system. The help desk starts getting calls, they check the monitoring tool, and they see that the application's unresponsive. They escalate it to the application team. You know, the app team checks their monitoring and sees that the servers are fine. CPU is normal. You know, memory is fine. Everything looks good, so they point to the network. Then the network team pulls up their dashboard and everything looks green to them. Meanwhile, the database team isn't even in the room yet because nobody thought to even check the database. An hour later, somebody finally looks at the database and discovers a long-running query that's locking tables, and it's causing timeouts. The whole incident could have been resolved in minutes if there had just been a single platform correlating the application response time and the database performance together with the network latency. Instead, it took an hour of finger pointing and context switching between the tools. And honestly, that's the best-case scenario. Yeah. Because at least it was a performance issue. You know, when it's a security incident, the stakes are a whole different ballgame. So we actually talked about this in episode one, how observability supports every pillar of zero trust. And the same principle holds true for operations.
You know, if you can't correlate network anomalies with application behavior and user activity in real time, you know, you're gonna miss threats, and you're gonna waste time that you simply don't have. Let's dig deeper into tool sprawl. Josh, why did agencies end up with so many monitoring tools in the first place? I think it happens organically. The network team needs visibility, so they buy a network monitoring tool. The application team has different requirements, so they buy an APM solution. Security needs a SIEM. The database team needs a performance analyzer. The virtualization team needs their own tool. Each purchase makes sense in isolation, but over five or ten years, you end up with a patchwork of tools that were never designed to work together. And in public sector, procurement cycles make it even harder to course correct. You're locked into multi-year contracts with different vendors on different renewal timelines. Brian, what does that cost agencies in practical terms? Beyond the obvious costs, licenses, maintenance, you know, those can add up fast. You know, I think the real hit is operational. So every tool, as an example, needs training. Every tool has its own alerting logic, which means you end up drowning in duplicate and conflicting alerts. Your staff isn't managing infrastructure anymore at that point. Right? They're managing tools. Now, I've worked with agencies that had full-time employees whose entire job was maintaining integrations between monitoring tools. So think about that for a second. That's not IT operations. That's tool operations. And when you start factoring in the downstream costs, extended outages, overtime, you know, emergency vendor support calls, the total cost of tool sprawl blows right past whatever the agencies are actually spending on the tools themselves. Scott, when agencies come to you wanting to fix this, what's your approach? Well, we start with a tool inventory and assessment.
You know, what are you running and what does each tool actually do? Where's the overlap? Where are the gaps? Almost every time we find significant overlap. There are four tools to monitor all the network devices, for example, but none of them are monitoring east-west traffic. Or multiple tools collecting the same logs, but none of them are correlating the logs with the application performance data. So once we have a full picture, we build a consolidation roadmap. Deploy the unified platform alongside existing tools, validate that you have coverage parity, and then retire the legacy tools one by one. We've done it a dozen times in the public sector, and typically, the result is a 30 to 40% reduction in tool count with significantly better visibility. But you don't have to rip and replace everything at once. You could do this in a phased migration so that there's no pain involved when you're actually doing it. I'd add that that's a core design principle for us at SolarWinds. The observability platform is built to be that single platform. Network performance, application monitoring, database performance, log management, security event management, configuration management, all integrated, all sharing data, all on a single pane of glass. We didn't build a network tool and then bolt on application monitoring as an afterthought. The platform is designed for cross-domain correlation from the ground up. Before we go further, I wanna take a few minutes to level set on terminology because in the AI space, words get used loosely and that creates real confusion when public sector leaders are trying to make procurement and policy decisions. Josh, you work in this space every day. Let's tackle some of the misconceptions you hear most often. Yeah.
This is probably the most valuable thing I think we can do on a regular basis before we get into the technical discussion because terminology is genuinely inconsistent and changing every day, right, even among vendors and, frankly, even across government documents. So let's clear a few things up. Let's start with the biggest one. When a vendor says their product uses AI, what does that actually mean? Honestly, without more context, it means very little. AI is an umbrella term, and that's not just my opinion. That's reflected in how it's defined under law. Under the National AI Initiative Act, codified at 15 U.S.C. 9401, artificial intelligence is defined as a machine-based system that can, for a given set of human-defined objectives, make predictions, recommendations, or decisions influencing real or virtual environments. That definition is deliberately broad. It covers machine learning, deep learning, generative AI, rule-based expert systems, and more. So when a vendor says our platform uses AI, they could mean a sophisticated anomaly detection model trained on millions of events, or they could mean a threshold rule with a marketing label on it. Always ask what kind of AI and how does it actually work? What about machine learning? People use that interchangeably with AI. Is that accurate? Not precisely. Machine learning is one method for achieving artificial intelligence. It's a subset, not a synonym. Not all AI uses machine learning, and not all machine learning produces what most people picture when they think of AI. When you're evaluating a platform, the distinction matters. A machine learning model that learns patterns from your telemetry data over time is a fundamentally different thing from a rules engine that someone programmed with if-then logic. Both might be called AI, but only one of them is actually learning. And that brings up AIOps specifically, which is a term we'll be using throughout today's discussion.
Is that one of those loosely used terms? It's worth being direct about this. AIOps is not defined in any statute. NIST hasn't published a standard definition for it. It's mainly an industry marketing term, which is exactly why you need to push past the label when evaluating tools. The technical concept is sound: applying AI and machine learning to IT operations data for anomaly detection, event correlation, and automation. But one vendor's AIOps might be a genuinely sophisticated ML model correlating telemetry across thousands of data points in real time. Another's might be a basic alert threshold dressed up with a new name. When you're evaluating platforms, don't buy the label. Ask what the model is actually doing, what data it was trained on, and how it makes decisions. Here's what I hear from skeptical executives. AI hallucinates, so how can I trust it for security operations? How do you respond? That's a great question, and I want to be precise here because the terminology actually matters for this audience. NIST AI 600-1, the generative AI profile published under the AI Risk Management Framework, uses the term confabulation as the formal technical term for what most people call hallucination. NIST defines it as the production of confidently stated but erroneous or false content and characterizes it as a natural consequence of how generative models are designed. They predict statistically likely outputs based on training data, which can diverge from fact. Hallucination is noted by NIST as the colloquial term. For a government context, confabulation is the more precise language. Now, why does that distinction matter practically? Because confabulation is a risk profile specific to generative AI, particularly large language models. When we talk about AI in observability platforms, anomaly detection, pattern correlation, event analysis, we're generally talking about a different class of models trained on structured telemetry data.
Those systems have different failure modes and different risk profiles. The right response to confabulation risk isn't to reject AI across the board. It's to understand which type of AI you're deploying, design for human review where the stakes are high, and use grounding techniques that anchor outputs to verified data, all of which the NIST AI RMF recommends explicitly. Speaking of human involvement, we hear autonomous AI, autonomous response frequently. Should agencies be comfortable with that framing? This is critical for government, and the NIST AI Risk Management Framework speaks directly to it. The RMF places significant emphasis on human oversight throughout the AI life cycle. It's one of the framework's core trustworthiness characteristics, and the statutory definition of AI itself frames these systems as operating under human-defined objectives. What that means in practice is that most responsible enterprise AI operates in what I call an assisted or augmented mode. The AI surfaces a recommendation, a correlation, a risk score, or a suggested action, and a human approves it. True autonomy, the system acting entirely without human involvement, is rare and should be scoped to very narrow, well-defined, low-stakes tasks. For public sector agencies managing critical infrastructure, 911 services, or citizen data, human in the loop is not a limitation. It is the appropriate design. We talked in episode one about automated response as a zero trust tenet. And I want to be clear, automation with human oversight is not a compromise of that tenet. It is the responsible implementation of it. If a vendor is telling you their system is fully autonomous with no human review, that should prompt questions, not confidence. Alright. Last one. I've heard IT leaders assume that AI systems are constantly learning from everything flowing through them. Is that true? Actually, it's almost never true.
And this matters significantly for agencies thinking about data sensitivity and governance. Most deployed AI models are static after training. When your operations team is interacting with an AI-assisted platform, the model is not absorbing your operational data and updating itself in real time unless the vendor has explicitly built a continuous learning pipeline and is being transparent about it, which is rare. What this means practically is that your sensitive telemetry, incident data, and network information is not being fed back into a shared model. That is actually a data governance advantage, and that's something agencies should verify explicitly with any vendor. Ask, how is this model trained? On what data? And does my operational data ever flow back into model updates? If a vendor can't give you a clear answer, that tells you something important. So what's the bottom line for public sector leaders evaluating AI-assisted tools? I'd say there's three questions. First, what kind of AI is this? Specifically, not AI, not AIOps, but what mechanism? What model type? What was it trained on? Second, where does a human remain in the loop, and is that documented in the system design? Third, is my operational data isolated from training pipelines, and can you show me that in writing? If a vendor can answer all three clearly and consistently, you're starting from a credible foundation. If they can't, well, there's your answer. Excellent framing. With that foundation in place, let's get into it. Josh, how is AI changing observability? AI is maybe the most significant evolution in observability since we moved from simple up-down monitoring to full-stack observability. There are several dimensions to this. First, intelligent alerting. Traditional monitoring uses static thresholds. If the CPU exceeds 90%, fire an alert. The problem is that 90% CPU might be a perfectly normal occurrence during a batch processing window, but a serious problem at 2 PM on a Tuesday.
AI learns what normal looks like for each device, each application, each time of day, each day of the week, and alerts when behavior deviates from that baseline. That alone can reduce alert volume by 70% while improving detection accuracy. That's a significant reduction. What about correlation? Well, that's the second dimension. When an incident occurs, it doesn't generate one alert. It generates dozens or hundreds as the impact cascades across systems. AI-driven event correlation groups all those related alerts into a single incident, identifies the probable root cause, and presents the analyst with a coherent story instead of a wall of noise. Instead of an analyst spending thirty minutes triaging 200 alerts to figure out what's actually happening, the platform does that in seconds and says here's the incident, here's the probable root cause, and here are recommended remediation steps. Brian, how does this resonate with public sector IT leaders? With cautious optimism. And I mean that genuinely. You know, public sector leaders are pragmatic. They've lived through enough technology hype cycles to know better, but here's what's different right now. They're also staring down a real workforce crisis. They can't hire enough skilled people, and the folks that they do have are completely overwhelmed by alert noise and manual triage. So when they see AI that can cut alert volume by, say, 70% and surface the actual root cause in seconds, you know, they get it. They recognize the force multiplier potential immediately. You know, there's a growing optimism among IT leaders around AI-driven observability. And, honestly, I'm seeing that same shift firsthand in my conversations with agencies every single day. Scott, can you give us a real-world example? Sure. You know, a county government has about 1,200 managed nodes and a three-person IT operations team. They're drowning in alerts. Right?
On an average day, they're gonna get 300 to 400 alerts across their monitoring tools. And most of them were noise. You know, they're just transient threshold breaches, duplicate alerts from overlapping tools, informational messages that didn't even require any action. But buried in that noise were real incidents. You know, we were able to deploy SolarWinds Observability with intelligent alerting and event correlation. And within, like, four weeks of the AI learning the new environment, alert volume dropped by 70%. But more importantly than that, the alerts that they were receiving were actionable. They pointed directly to the issue with context about what was affected and what to do about it. That three-person team went from spending most of their day, you know, triaging alerts to proactively managing their environment. And the predictive analytics capability takes it a step further. Instead of waiting for something to break, the platform identifies trends, storage volumes approaching capacity, network interfaces showing increasing error rates, application response time slowly degrading, and alerts the team before there's a user-facing impact. For public sector agencies running citizen-facing services, that shift from reactive to proactive is transformative. Yeah. And, you know, I wanna connect this back to episode one for a moment. You know, we talked about zero trust requiring continuous monitoring and automated response. Now, AI-driven observability is the technology that makes those tenets practical at scale. You can't have a human analyst continuously monitoring every single device, every application, every data flow in this modern hybrid environment. AI, on the other hand, can. You know? And automated response becomes much more effective when it's driven by AI that understands context rather than simple threshold rules. Let's talk about the people side of this.
Brian, you mentioned silos earlier, which could include leadership, procurement, IT, etcetera. How big of a problem is this? Yeah. I think it's a good question, Travis. I think it's one of the biggest challenges in public sector IT. And it's often harder to solve than the technology challenges. You can buy the best observability platform in the world, but if your network team, application team, security team, and the help desk are all working in isolation with no shared processes or shared data, you're still gonna have slow incident response. You know, the technology enables collaboration, but organizational change, you know, that's gotta happen alongside it. Scott, how do you approach this when you're working with agencies? So we build shared dashboards and shared workflows from day one. So when we deploy SolarWinds Observability, we don't just build a network dashboard for the network team and an application dashboard for the application team. We're gonna build service dashboards that show the health of a specific service like, you know, a citizen benefits portal across all layers. So included in there is gonna be network connectivity, server health, application performance, database response time, and security events. Every team can see the same data. And when an incident hits, we're all looking at the same picture. We also integrate alerting into the shared ITSM workflows so the incidents are triaged centrally and not in separate queues. And the platform is designed to support this. SolarWinds Observability provides role-based views. A network engineer sees network-centric data and an application developer sees app-centric data, but it's all coming from the same underlying telemetry. When they need to collaborate on an incident, they're speaking the same language and looking at the same timeline of events. That shared context is what eliminates the finger pointing Scott described earlier.
Let's bring this together with how SolarWinds specifically addresses the challenges. Josh, from a product perspective, how does the platform tackle tool consolidation? SolarWinds Observability is designed to be the single platform that replaces the patchwork of point tools. Let me walk through the key capabilities. For network visibility, you have Network Performance Monitor and NetFlow Traffic Analyzer, providing device health, topology mapping, and deep traffic analysis. For servers and applications, Server & Application Monitor covers over 1,200 applications out of the box with custom monitoring capabilities. Database Performance Analyzer handles database monitoring across SQL Server, Oracle, MySQL, Postgres, and more. Log Analyzer provides centralized log management. Security Event Manager handles security monitoring and SIEM. Network Configuration Manager handles device configuration compliance. And all of these share data through the observability platform: a single console, single database, single alerting engine. Brian, in episode one, you and Scott walked through how each of these modules maps to zero trust pillars. How do they map to the visibility challenges we've been discussing today? Yeah. It maps directly. The tool sprawl problem exists because agencies purchase separate tools for each domain: network, application, database, security, you know, you name it. So SolarWinds built all of those capabilities into a single integrated platform. The silo problem exists because each team has their own tool, you know, and their own data. SolarWinds gives all teams a shared platform with shared data. The correlation problem, not being able to trace an incident across infrastructure layers, is solved by having all the telemetry in one place where the platform can correlate automatically. And with the AI capabilities that Josh described, the platform doesn't just collect data. It actually makes sense of it.
Scott, let's talk about the hybrid cloud piece because a lot of agencies are in a hybrid state. Some workloads on-prem, some in the cloud, etcetera. Yeah. You know, that's really the reality for most of the agencies we work with. You know, they're not all fully cloud and they're not all fully on-prem. They're somewhere in between. And that visibility gets really challenging. Your on-prem tools don't see your cloud, and your cloud-native monitoring doesn't see your on-prem. And you're back to that silo problem. You know, SolarWinds Observability bridges that gap. So you can monitor AWS, Azure, and your GovCloud resources alongside your on-prem infrastructure from the same platform. You get a single topology map that shows dependencies across all your environments. And, critically, as we discussed in episode one, the self-hosted platform can be deployed on AWS GovCloud or Azure Government for agencies with those requirements. So while the on-premises deployment option doesn't require FedRAMP authorization, you get the best of both worlds. I also wanna highlight the API and integration story. No platform exists in a vacuum. Agencies have ITSM systems like ServiceNow, security tools like EDR platforms, identity providers, and more. SolarWinds Observability has extensive REST APIs and pre-built integrations, so it fits into the broader ecosystem. The goal isn't to replace everything. It's to be the observability backbone that integrates with the specialized tools agencies need to keep. Scott, can you share an example that ties all of this together? Sure. You know, I've got a state agency with about 5,000 employees. They have a mix of on-prem data centers and AWS cloud. They had 12 different monitoring tools across three teams: network operations, server and application support, and a security operations center. Each team had their own dashboards, their own alerting, their own incident queues.
The mean time to resolve for priority one incidents was over four hours because of the coordination overhead. We did a full assessment. We mapped the monitoring landscape, and we built out a consolidation plan. Over six months, we were able to migrate them to SolarWinds Observability, NPM, SAM, DPA, NTA, and SEM, replacing nine of the 12 legacy tools. We kept three specialized tools that filled genuine niche requirements and integrated them via, you know, an API. We built shared service dashboards for the top 10 citizen-facing applications and integrated alerting with their ServiceNow incidents. The results were simple. Tool count went from 12 to six, license and maintenance costs dropped by 35%, and mean time to resolution for priority one incidents dropped from over four hours to under ninety minutes. But the biggest win was cultural. For the first time, network ops, app support, and the SOC were all looking at the same data. Incident triage became a collaborative process instead of a blame exercise. And let me tell you, the teams really appreciate that. Let's talk about the bottom line. Brian, how does improved observability translate to cost savings for agencies? I think there are really two buckets here. You've got direct savings and indirect savings. So on the direct side is tool consolidation. Fewer licenses, fewer maintenance contracts, less time and money spent on training. So that alone can be pretty significant. The indirect savings is where it gets really interesting, though. So faster incident resolution means less overtime and fewer emergency vendor calls. And when you layer in AI-driven, you know, proactive management, you're actually preventing incidents that would have been expensive to clean up in the first place. So when you add it all up, consolidation savings, reduced incident costs, improved staff productivity, the return on investment on a unified observability platform comes into focus pretty quickly.
Josh, what about the impact on citizen services specifically? This is where it becomes mission critical. Public sector IT isn't like enterprise IT. When a state benefits portal goes down, citizens can't access services. And the predictive capabilities mean agencies can anticipate and prevent issues before citizens are impacted. That's the ultimate goal, not just faster incident response, but fewer incidents in the first place. I'd add that this connects directly to the mission, you know. Every minute of downtime on a citizen-facing application is a minute that somebody can't access benefits, can't renew a license, can't complete a transaction. And so when we improve observability, we're not just improving IT metrics, you know, we're improving government service delivery. Let's bring it home, because we've covered a lot of ground today. The visibility gap in public sector IT is real. Tool sprawl, siloed teams, and disconnected monitoring are causing incidents to last longer and cost more than they should. The path forward is clear. Unified observability that consolidates tools, breaks down silos, and leverages AI to move from reactive to proactive operations. SolarWinds Observability provides that unified platform with the deployment flexibility public sector agencies need and want. Brian, what's the one thing you want people to take away today? Yeah. I think here's the bottom line up front for me. In episode one, we said that you can't protect what you can't see. And today, I would extend that. You can't fix what you can't correlate. So unified observability isn't just a security imperative, it's an operational imperative. Josh, from a product standpoint, what should agencies be thinking about? Start by understanding your current tool landscape and where the gaps are, then look for a platform that can consolidate monitoring across all domains, network, application, database, security, with AI-driven intelligence built in.
Don't settle for another point tool that creates another silo. And think about the future. AI and AIOps capabilities are evolving rapidly. The platform you choose today should be positioned to take advantage of those advances. Scott, what should agencies do next? So three things. First, you need to inventory your current monitoring tools. You might be surprised at how many you have and how much overlap actually exists out there. Second, baseline your mean time to detect and mean time to resolve. You know, that baseline is important because that's gonna show your improvement. And third, reach out to us. You know, Monalytic, the professional services arm of SolarWinds, does assessments specifically designed for public sector agencies. You know, we'll map your current state, identify consolidation opportunities, and build a migration plan that gets you to unified observability without disrupting operations. And the SolarWinds THWACK community is an outstanding source to, you know, tell you what your peers are doing as well and can really help you with best practices. Josh Stogberg, Scott Pross, Brian Chamberlain, thanks for an excellent discussion today. To our audience, thank you for joining episode two. In our next episode, we'll continue to build on these themes throughout. Have a great day, everyone.