
Transforming DevOps With AI: Practical Strategies To Supercharge Your Workflows

· 5 min read
Tibo Frans
Sources

Source: article reproduced in full from DevOps.com
Original author: Junaid Jagalur


As someone interested in DevOps, I wondered how all the AI advances could benefit my field. OpenAI used deep learning to release groundbreaking products like ChatGPT and Sora. Microsoft used similar technologies to revamp its products, notably enhancing GitHub with Copilot. Many startups have sprung up, and large tech companies have poured billions into AI research.

Ultimately, how can engineers use this technology to reduce toil, add value to the software development lifecycle (SDLC), and increase development velocity? There are a few interesting options.

Random Forest Classifier for Self-Healing Systems

Say you have a distributed systems environment with lots of pods serving different services. You also have some observability tooling, like Prometheus, to provide a stream of system metrics, such as CPU usage, memory usage, disk I/O and network statistics, alongside container logs from your logging pipeline.

Random Forests can help you build a self-healing system. With some creative feature engineering to get rolling averages, rates of change, and error log counts, a Random Forest Classifier trained on past incidents, each labeled with the fix that resolved it, can recommend corrective actions. For example, it can roll back a service if pods fail to start after a deployment, or add pods to a service if CPU usage and network queries per second (QPS) exceed a threshold.
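The feature engineering step could look something like the sketch below. It uses only the Python standard library, and the function name, feature names, window size and sample values are all assumptions made for illustration, not something a particular monitoring stack provides.

```python
from collections import deque
from statistics import mean


def build_features(cpu_samples, error_counts, window=3):
    """Turn raw per-interval samples into one feature row per interval.

    cpu_samples:  CPU utilisation per scrape interval (0.0 to 1.0)
    error_counts: number of error log lines seen in the same interval
    """
    recent_cpu = deque(maxlen=window)     # sliding window of CPU samples
    recent_errors = deque(maxlen=window)  # sliding window of error counts
    previous_cpu = None
    rows = []
    for cpu, errors in zip(cpu_samples, error_counts):
        recent_cpu.append(cpu)
        recent_errors.append(errors)
        rows.append({
            "cpu_rolling_avg": mean(recent_cpu),
            # Rate of change versus the previous interval (0.0 for the first).
            "cpu_delta": 0.0 if previous_cpu is None else cpu - previous_cpu,
            "errors_in_window": sum(recent_errors),
        })
        previous_cpu = cpu
    return rows


# Made-up sample data: four scrape intervals for one service.
features = build_features(
    cpu_samples=[0.20, 0.40, 0.90, 0.70],
    error_counts=[0, 1, 5, 2],
)
```

Each row in `features` could then be paired with the corrective action that was taken at that moment and used as a training example for the classifier.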

A Random Forest Classifier is an ensemble of many decision trees, each trained on a bootstrap sample of the data and restricted to a random subset of the features at each split. It can model non-linear relationships, it reduces overfitting by combining the votes of many trees, and it handles multi-class outputs natively. That’s why I would prefer it over a single decision tree, which overfits easily, or a support vector machine (SVM), which needs extra work for multi-class problems.
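To make the ensemble idea concrete, here is a deliberately simplified stand-in for a random forest: bootstrap-aggregated one-split decision stumps with majority voting, in pure Python. A real project would more likely use a library implementation with full-depth trees. The feature names, labels and training rows are invented for the example.

```python
import random
from collections import Counter

FEATURES = ("cpu", "qps")  # hypothetical feature names


def majority(labels):
    """Return the most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows, labels, feature):
    """Fit a one-split tree: pick the threshold with the fewest training errors."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
        right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
        if not left or not right:
            continue
        hits = left.count(majority(left)) + right.count(majority(right))
        if best is None or hits > best[0]:
            best = (hits, threshold, majority(left), majority(right))
    if best is None:
        # Every sampled row has the same feature value: always predict the majority.
        label = majority(labels)
        return feature, float("inf"), label, label
    _, threshold, left_label, right_label = best
    return feature, threshold, left_label, right_label


def train_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and one randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        picks = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        sample_rows = [rows[i] for i in picks]
        sample_labels = [labels[i] for i in picks]
        forest.append(train_stump(sample_rows, sample_labels, rng.choice(FEATURES)))
    return forest


def predict(forest, row):
    """Majority vote across all stumps."""
    votes = [
        left if row[feature] <= threshold else right
        for feature, threshold, left, right in forest
    ]
    return majority(votes)


# Made-up training data: quiet services versus overloaded ones.
rows = [
    {"cpu": 0.20, "qps": 100}, {"cpu": 0.30, "qps": 150},
    {"cpu": 0.25, "qps": 120}, {"cpu": 0.35, "qps": 180},
    {"cpu": 0.85, "qps": 700}, {"cpu": 0.90, "qps": 800},
    {"cpu": 0.80, "qps": 650}, {"cpu": 0.95, "qps": 900},
]
labels = ["no_action"] * 4 + ["scale_up"] * 4
forest = train_forest(rows, labels)
```

Calling `predict(forest, {"cpu": 0.97, "qps": 950})` asks every stump for a vote and returns the winning label. Because no single stump decides alone, one stump trained on an unlucky sample has little influence on the result.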

LSTMs for Anomaly Detection Based Smart Alerting

Metrics such as CPU usage and network stats like QPS are continuously emitted by a service. You can naturally interpret them as time series and use a long short-term memory (LSTM) neural network to find anomalies. LSTMs are a type of recurrent neural network (RNN) that are good at “remembering” long-term events.

An LSTM trained on, say, CPU utilization data can learn to treat seasonal events such as nightly backups as normal, along with unimportant events like minor rises during maintenance work such as host-to-host migrations. It can also alert on things like an unexplained rise during off-peak hours caused by a security breach.

For example, you could train the LSTM on historical data of overall traffic QPS. Once deployed, it takes the recent QPS values and predicts the next ones, which you then compare to the values that actually arrive. If the difference is too large, it sets off an alert. The detected anomalies can also be cataloged and used to fine-tune the model to improve its performance over time.
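The alerting logic around the model can be sketched without the model itself. In the example below a naive moving average stands in for the trained LSTM, purely so the code runs with the standard library; the function names, the 50% tolerance and the QPS values are assumptions.

```python
from statistics import mean


def naive_forecast(history, window=3):
    """Placeholder forecaster: mean of the last few points (stand-in for an LSTM)."""
    return mean(history[-window:])


def find_anomalies(series, forecast=naive_forecast, warmup=3, tolerance=0.5):
    """Flag points whose actual value differs from the forecast by more than
    `tolerance`, expressed as a fraction of the forecast."""
    anomalies = []
    for i in range(warmup, len(series)):
        predicted = forecast(series[:i])  # predict point i from earlier points only
        actual = series[i]
        if abs(actual - predicted) > tolerance * predicted:
            anomalies.append((i, actual, predicted))
    return anomalies


# Made-up QPS series with one unexplained spike at index 5.
qps = [100, 104, 98, 101, 99, 250, 102, 100]
alerts = find_anomalies(qps)
```

Here `alerts` contains a single entry for the spike at index 5. Swapping in a real model would only mean passing a different `forecast` callable; the comparison and threshold stay the same.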

Q-Learning for Autonomous System Configuration Updates

Getting a bit more advanced, Q-Learning is a type of reinforcement learning algorithm that can look at performance metrics, logs, and events such as restarts and deployments, and then update the system configuration. For example, if the metrics point to strained CPUs, the algorithm can decide whether to scale up the existing pods by giving them more CPU or to scale out by spinning up additional pods. A reinforcement learning agent can learn from each action’s outcome and adjust its policy without human intervention.

To build this, you would have to define the states (various performance metrics), actions (scale up pods, etc.), and rewards (system performance) following an action. Then the model can be trained on historical data without much manual labeling, and it could be back-tested to ensure good performance.
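As a rough illustration of those three ingredients, here is a tabular Q-Learning sketch against a toy, fully simulated environment. The three load levels, the action names, the reward values and the hyperparameters are all assumptions chosen to keep the example small; a real system would derive states from live metrics.

```python
import random

STATES = ("low", "normal", "high")            # discretised CPU load (assumed)
ACTIONS = ("scale_down", "hold", "scale_up")  # hypothetical action names


def step(state, action):
    """Toy environment: adding pods lowers load, removing pods raises it."""
    shift = {"scale_down": 1, "hold": 0, "scale_up": -1}[action]
    index = min(max(STATES.index(state) + shift, 0), len(STATES) - 1)
    next_state = STATES[index]
    # Reward healthy load; penalise both overload and wasted capacity.
    reward = 1.0 if next_state == "normal" else -1.0
    return next_state, reward


def train(episodes=500, steps=20, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(episodes):
        state = rng.choice(STATES)
        for _ in range(steps):
            # Epsilon-greedy: explore sometimes, otherwise take the best known action.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state, reward = step(state, action)
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            # Q-Learning update: move the estimate towards reward + discounted future value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q


q_table = train()
# The learned policy picks the highest-valued action in each state.
policy = {s: max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in STATES}
```

In this toy setup the agent learns to scale up under high load, hold at normal load, and scale down when capacity is wasted, without ever being told those rules explicitly.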

LLMs for System Management With Natural Language

I bet you know by now what a large language model (LLM) is and how powerful LLMs can be. While training an LLM from scratch is realistic only for well-resourced companies like OpenAI, the APIs for these models are available for public use, albeit at a steep price. But if you do have access to an LLM, you can build a bot that accepts natural language instructions, like “Set up a new service using X image in the development environment and write a deployment report,” after which a well-integrated LLM can carry out the actions.

If an LLM is not at hand, Bidirectional Encoder Representations from Transformers (BERT) is a much smaller alternative. BERT is an encoder-only model rather than a sequence-to-sequence (seq2seq) one, so it does not generate text, but fine-tuned for intent classification and slot filling it can accomplish similar things in a more limited but more controlled way. For example, “Restart the web server on node 5” would translate to {"action": "restart", "target": "web server", "node": 5}, which can be interpreted by a traditional program.
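The "traditional program" on the receiving end could be as small as the sketch below. It covers only the dispatch side, assuming the model has already produced the structured command as JSON. The handler and the `HANDLERS` allow-list are hypothetical placeholders; a real handler would call your orchestration tooling.

```python
import json


def restart(target, node):
    # Placeholder: a real handler would call your orchestration tooling here.
    return f"restarting {target} on node {node}"


HANDLERS = {"restart": restart}  # allow-list of supported actions


def execute(command_json):
    """Validate a structured command and route it to its handler."""
    command = json.loads(command_json)
    action = command.pop("action", None)
    if action not in HANDLERS:
        raise ValueError(f"unsupported action: {action!r}")
    # Remaining keys become keyword arguments for the handler.
    return HANDLERS[action](**command)


result = execute('{"action": "restart", "target": "web server", "node": 5}')
```

Keeping an explicit allow-list is what makes this approach "more controlled": anything the model emits that is not a known action is rejected instead of executed.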

Looking Forward

AI is changing the face of product development and has become a common fixture of software and technology in general. While it’s exciting to use the cutting edge of AI, it’s also enlightening to learn how it can be used directly to reduce toil and improve business outcomes. Simple models can achieve this end by offering proven results for smaller investments, while larger models could revolutionize the SDLC at your company. The only wrong choice is not thinking about it at all and continuing with business as usual.