The Challenges of Building a Monitoring System: Searching High and Low
1. Missing key system indicators
Building out monitoring is an ongoing process; there is no one-and-done solution. Through continuous operation and maintenance we keep enriching and improving what is monitored. The common system-layer and application-layer monitoring indicators are as follows.
The figure above shows that the users of monitoring need a very broad set of monitoring indicators. Whichever monitoring system on the market is chosen, its default indicators may not meet the requirements of real-world scenarios. As the monitoring system keeps running and the business grows, the need to collect more indicators becomes increasingly urgent. We therefore want the monitoring system to be freely extensible and flexibly customizable.
2. Function expansion is difficult
As the monitoring system is built out, we keep refining the collected indicators according to actual demand. Whether the monitoring platform natively supports a variety of collection methods can therefore limit what we are able to do. For example, to monitor network, storage, and other hardware devices we must use the SNMP protocol to obtain indicator data, so SNMP support should be a basic capability of the platform. If we had to implement it from scratch, it would amount to writing a small monitoring program. The monitoring system should therefore be extensible, through either open source code or open interfaces, so that we can add modules and components as needed to keep up with business development.
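One way to picture an "open interface" for custom indicators is a small plugin registry. The sketch below is only an illustration under assumptions: the names `COLLECTORS`, `collector`, `collect_all`, and the indicator `app.queue_depth` are hypothetical and do not come from any particular monitoring product.

```python
from typing import Callable, Dict

# Registry mapping an indicator name to the function that collects it.
# Both the registry and the decorator are hypothetical names for this sketch.
COLLECTORS: Dict[str, Callable[[], float]] = {}


def collector(name: str):
    """Register a function as the collector for the indicator `name`."""
    def register(func: Callable[[], float]) -> Callable[[], float]:
        COLLECTORS[name] = func
        return func
    return register


@collector("app.queue_depth")
def queue_depth() -> float:
    # Placeholder value; a real collector would query the application,
    # an SNMP agent, or another data source.
    return 42.0


def collect_all() -> Dict[str, float]:
    """Run every registered collector and return one sample per indicator."""
    samples = {}
    for name, func in COLLECTORS.items():
        try:
            samples[name] = func()
        except Exception:
            # One failing plugin should not stop the others from reporting.
            continue
    return samples
```

With a registry like this, adding a new indicator means writing one decorated function rather than changing the platform itself.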
As the set of monitoring indicators keeps expanding, the volume of indicator data to be stored also grows, which places very high demands on the QPS (queries per second) the monitoring system can sustain. Whether the system can support highly concurrent requests directly determines whether it is usable. If a monitoring system breaks down every few days, users may lose confidence in it, or even abandon it and look for a better solution.
Users may expect a 100 percent SLA from the monitoring platform, while the SLA actually achievable may be 99.9 percent (about 9 hours of downtime per year). As business and technology develop, users demand more and more of the monitoring system. Can the SLA keep improving?
In fact, adding one more nine to the SLA poses a great challenge to the system. Many factors directly determine whether the SLA can keep improving: whether the architecture supports it, whether the design is sound, whether there is redundancy, whether it can scale horizontally, whether there is a single point of failure, whether server resources are sufficient, whether concurrency capacity scales linearly, and so on. Ideally the architecture is redundant: when a link fails, traffic switches automatically to a backup machine; when capacity is insufficient, it can be expanded by adding servers, with load balanced automatically.
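To make the cost of each additional nine concrete, the following sketch converts an availability target into the downtime it allows per year. It assumes a 365-day year; the function name is chosen here for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(sla_percent: float) -> float:
    """Return the yearly downtime allowed by an availability target."""
    return HOURS_PER_YEAR * (1 - sla_percent / 100)


for sla in (99.0, 99.9, 99.99, 99.999):
    print(f"{sla}% -> {downtime_hours_per_year(sla):.2f} hours/year")
```

At 99.9 percent the budget is 8.76 hours per year, which matches the "about 9 hours" figure; each extra nine cuts the budget by a factor of ten.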
Therefore, when designing the monitoring system, we must learn from the high-concurrency, high-availability, and distributed architecture designs used in Internet systems. With such an architecture in place, whether we are adding functional modules or upgrading the whole system, upgrades and expansion can be carried out on demand without worrying about availability, and the changes are imperceptible to users.
3. System reliability is not guaranteed
When the number of monitored hosts reaches 5,000 or 10,000, a typical monitoring system hits bottlenecks. The system's QPS keeps growing, and running stably 24 hours a day, 7 days a week, 365 days a year is a major challenge. A monitoring system is always a high-concurrency system, and it is also a large database system: it may grow by 5 TB, 10 TB, or 50 TB of data. Detailed historical data must be kept for 7 days, 30 days, or even 1 year, while trend data (archived historical data, such as hourly max, min, and avg) must be kept for 1 year, 2 years, or longer. The data in a monitoring system may therefore reach the PB level, and it is processed much as in other big data systems: collection -> cleaning -> analysis -> storage -> use.
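The trend data mentioned above (hourly max, min, and avg) can be produced by a simple rollup over the detailed samples. This is a minimal in-memory sketch, assuming samples arrive as (Unix timestamp in seconds, value) pairs; a production system would do this inside its storage or stream-processing layer.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

SECONDS_PER_HOUR = 3600


def hourly_rollup(samples: Iterable[Tuple[int, float]]) -> Dict[int, Dict[str, float]]:
    """Aggregate (unix_timestamp, value) samples into hourly max/min/avg."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Truncate the timestamp to the start of its hour.
        buckets[ts - ts % SECONDS_PER_HOUR].append(value)
    return {
        hour: {"max": max(vals), "min": min(vals), "avg": sum(vals) / len(vals)}
        for hour, vals in sorted(buckets.items())
    }
```

Keeping only three numbers per hour instead of every raw sample is what makes retention of 1 year or longer affordable for trend data.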
There are generally three reasons for delayed data reporting. First, the collector has a problem and cannot collect data on the configured cycle, either because the source data does not exist or because the collector has reached its own performance limit. Second, the cleaning, processing, and analysis stages of the monitoring system are blocked: collection and reporting are normal, but the data is not stored. Third, the processed data cannot be written to storage normally because of a problem with the monitoring database, which shows up as slow writes, slow queries, or the database exceeding its capacity limit. Whichever of the three occurs, the result for users is a system that is unavailable.
No false alarms, no missed alarms, no delays: these are the basic requirements of a monitoring system. False alarms are a data processing problem that reduces users' trust in monitoring; if they persist, users lose that trust and gradually abandon the system. A missed alarm is one that should have been issued but was never sent. This is even more serious: it disrupts users' normal work and badly undermines their expectations. A delayed alarm is like a late plane that does not reach its destination on time: the alarm is generated now but not received until tomorrow, which means the monitoring system is already effectively unavailable. If no alarm arrives while the fault is happening and one is sent only after it is over, users will distrust the monitoring system completely. A monitoring system that cannot even do the basic job of alarming is not a qualified monitoring system, and users will treat it as noise.
Once data reporting and alarm delivery work normally, a new problem appears. Imagine that the alarm module works as designed and a user receives 1,000 or even 10,000 alarms every day. The user will be overwhelmed by this alarm bombardment: too many alarms become noise and interfere with normal judgment. Whether alarms can be converged therefore becomes a top priority.
Alarm convergence is a way of sending alarms in which multiple alarms triggered by the same policy for different targets are merged according to certain rules and sent as one. For example, a network failure in one of our cloud areas makes every device in that area unreachable, so ping-unreachable alarms are sent one by one. If there are 1,000 machines in the area, is it better to send one alarm or 1,000? Most people would want to receive just one important alarm. Convergence greatly improves the accuracy of alarms and lets us respond calmly instead of being on edge every day as if someone were constantly crying wolf. Without it, the flood of alarms works like slowly boiling a frog: we gradually become desensitized and stop paying attention to alarms.
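The cloud-area example can be sketched as a grouping step. This is a simplified illustration, assuming the convergence rule is "same policy and same region"; the `Alarm` fields and the message format are hypothetical, and real systems usually also apply time windows.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Alarm:
    policy: str   # e.g. "ping_unreachable"
    region: str   # e.g. a cloud area name
    target: str   # the affected host


def converge(alarms: List[Alarm]) -> List[str]:
    """Merge alarms that share a policy and region into one message each."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for alarm in alarms:
        groups[(alarm.policy, alarm.region)].append(alarm.target)
    return [
        f"{policy} in {region}: {len(targets)} targets affected"
        for (policy, region), targets in groups.items()
    ]
```

Feeding in 1,000 ping-unreachable alarms from one area yields a single message, while alarms from a different policy or area still produce their own message.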
Can we rest easy once we have alarm convergence? No: we also need fault correlation and automatic fault analysis. Why? Suppose a rack power failure takes down all 15 devices in the rack and triggers a series of failures, such as API timeouts, failed HTTP dial tests, and a growing number of DB connections. Can the system find the root cause? Can it send one important alarm that helps us analyze the root cause automatically? This is where fault correlation and automatic fault analysis become particularly important. The monitoring system must therefore be capable of fault correlation analysis so that it provides more accurate information for our operations decisions.
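One simple form of fault correlation uses a dependency map: an alarming component is a root-cause candidate only if none of the things it directly depends on is also alarming. The sketch below assumes such a map exists; the component names in `DEPENDS_ON` are invented for the rack-power example and are not taken from any real topology.

```python
from typing import Dict, List, Set

# Hypothetical dependency map: each component lists what it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "api": ["db", "host-01"],
    "http_probe": ["api"],
    "db": ["host-02"],
    "host-01": ["rack-A-power"],
    "host-02": ["rack-A-power"],
    "rack-A-power": [],
}


def root_causes(alarming: Set[str]) -> Set[str]:
    """Return alarming components whose direct dependencies are all healthy."""
    return {
        component
        for component in alarming
        if not any(dep in alarming for dep in DEPENDS_ON.get(component, []))
    }
```

When every component in the map is alarming, only `rack-A-power` is returned. The check looks at direct dependencies only, so a gap in alarm coverage in the middle of a chain can produce more than one candidate.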
In addition, the ability to analyze the performance and capacity of the current environment is an important capability: predicting trends, telling us when to scale servers out and when to scale them in, whether current performance is sufficient, and whether there is room for optimization. Because the monitoring system holds the data, it can supply this information as an important basis for such decisions.
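As a rough illustration of trend prediction, the sketch below fits a least-squares line to daily usage samples and estimates how many days remain before a threshold is reached. A straight-line fit is an assumption made here for simplicity; real capacity planning often needs seasonality-aware models.

```python
from typing import Optional, Sequence


def days_until_threshold(usage: Sequence[float], threshold: float) -> Optional[float]:
    """Fit a least-squares line to daily usage and estimate the days until
    the threshold is reached, counted from the last sample.
    Returns None when usage is flat or shrinking."""
    n = len(usage)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(usage) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(usage))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_day = (threshold - intercept) / slope
    # Clamp to zero when the threshold has already been crossed.
    return max(0.0, crossing_day - (n - 1))
```

For example, disk usage of 50, 52, 54, 56, and 58 percent on five consecutive days grows by 2 points per day, so an 80 percent threshold is about 11 days away from the last sample.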
4. Technology lags behind business development
As the business develops, the organizational structure is adjusted to fit it. Different people need different permissions, and more roles must be defined, such as super administrator, hierarchical administrator, ordinary administrator, and ordinary user; there may even be requirements for finer-grained permission control at the level of individual menus and buttons. If the monitoring system does not take these needs into account from the beginning, it will struggle to keep up with business growth. To reduce costs, a company may also outsource some routine work, and hierarchical permission control then becomes especially important. At that point the monitoring system is no longer an isolated system: it must integrate with the unified user login and authentication system and separate configuration from query in order to support the organization's business development.
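The separation of configuration from query can be sketched as a role-to-permission table. The role names follow the text, but the permission sets assigned to each role are assumptions for illustration, and the hierarchical administrator is omitted because its scope depends on the organization.

```python
from enum import Enum, auto
from typing import Dict, Set


class Permission(Enum):
    QUERY = auto()         # view dashboards and alarm history
    CONFIGURE = auto()     # edit collection and alarm policies
    MANAGE_USERS = auto()  # grant or revoke roles


# The role names come from the text; the permission sets are assumptions.
ROLE_PERMISSIONS: Dict[str, Set[Permission]] = {
    "super_administrator": {Permission.QUERY, Permission.CONFIGURE, Permission.MANAGE_USERS},
    "ordinary_administrator": {Permission.QUERY, Permission.CONFIGURE},
    "ordinary_user": {Permission.QUERY},
}


def is_allowed(role: str, permission: Permission) -> bool:
    """Unknown roles get no permissions (deny by default)."""
    return permission in ROLE_PERMISSIONS.get(role, set())
```

An outsourced operator could then be given the `ordinary_user` role, which allows queries but no configuration changes.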
As the business develops, it also places higher demands on the monitoring system, such as "microsecond-level monitoring data sampling" and "second-level alarms". It requires the monitoring system to provide 100 percent reliable information, and it scales the business application's target servers and containers out or in according to the system capacity data the monitoring system provides. As an underlying system that others depend on, the monitoring system becomes even more valuable at this point.


