The Mysterious Case of Logstash Failing to Write to ES

Problem Description

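Just before dinner, several people @-mentioned me in the QQ group at once: "xxx, why can't we find any new data in ES? Did Logstash die?"
Weird. ES had been running smoothly for so long; how could it just go down? Or was something wrong with Logstash? I opened the Logstash log and found it packed with error messages: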

...
[2019-10-24T10:51:15,601][ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch' but Elasticsearch appears to be unreachable or down! {:error_message=>"Elasticsearch Unreachable: [http://127.0.0.1:9200/][Manticore::SocketTimeout] Read timed out", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError", :will_retry_in_seconds=>2}
[2019-10-24T10:51:15,929][WARN ][logstash.outputs.elasticsearch] Marking url as dead. Last error: [LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError] Elasticsearch Unreachable: [http://127.0.0.1:9200/][Manticore::SocketTimeout] Read timed out {:url=>http://127.0.0.1:9200/, :error_message=>"Elasticsearch Unreachable: [http://127.0.0.1:9200/][Manticore::SocketTimeout] Read timed out", :error_class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError"}
[2019-10-24T10:51:15,932][ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch' but Elasticsearch appears to be unreachable or down! {:error_message=>"Elasticsearch Unreachable: [http://127.0.0.1:9200/][Manticore::SocketTimeout] Read timed out", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError", :will_retry_in_seconds=>2}
[2019-10-24T10:51:15,959][WARN ][logstash.outputs.elasticsearch] Marking url as dead. Last error: [LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError] Elasticsearch Unreachable: [http://127.0.0.1:9200/][Manticore::SocketTimeout] Read timed out {:url=>http://127.0.0.1:9200/, :error_message=>"Elasticsearch Unreachable: [http://127.0.0.1:9200/][Manticore::SocketTimeout] Read timed out", :error_class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError"}
...

Resolution

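In plain terms: Logstash tried to send data to ES through the bulk API, but ES was unreachable (the read timed out while waiting for a response).
Strange. What is going on? Digging further, I checked http://127.0.0.1:9200/_cat/thread_pool?v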

node_name     name  active  queue  rejected
node-x.x.x.x  bulk  8       988    12338
...
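
For repeated checks, the `_cat/thread_pool?v` output above can be turned into records instead of being read by eye. The following is only a minimal sketch: `parse_cat_table` is a hypothetical helper name, and it assumes the response is a plain whitespace-separated table whose first line is the header, as in the sample shown here.

```python
def parse_cat_table(text):
    """Parse whitespace-separated `_cat` output (requested with ?v) into dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) != len(header):
            continue  # skip elided or malformed lines such as "..."
        rows.append(dict(zip(header, fields)))
    return rows


# Sample copied from the output above; values stay as strings.
sample = """node_name name active queue rejected
node-x.x.x.x bulk 8 988 12338
...
"""
rows = parse_cat_table(sample)
print(rows)
```

The values are left as strings on purpose; convert `active`, `queue`, and `rejected` with `int()` wherever numeric comparison is needed.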

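Whoa: all 8 bulk threads are busy (active 8) and the queue has reached 988, which means requests are arriving faster than ES can consume them, and some have already been rejected (rejected 12338)!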
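
The reading of those three columns can be written down as a small check. This is an illustrative sketch, not an official Elasticsearch rule: `diagnose_pool` is a hypothetical helper, and `pool_size` (8 here, matching the 8 threads mentioned above) has to be supplied by the caller. As far as I know, `rejected` is a cumulative counter, so a non-zero value says rejections have happened at some point, not necessarily right now.

```python
def diagnose_pool(active, queue, rejected, pool_size):
    """Return a list of human-readable findings for one thread pool row."""
    findings = []
    if active >= pool_size:
        findings.append("all threads busy")
    if queue > 0:
        findings.append("requests waiting in queue")
    if rejected > 0:
        findings.append("requests have been rejected")
    return findings


# Numbers taken from the bulk row above; pool_size=8 is an assumption.
result = diagnose_pool(active=8, queue=988, rejected=12338, pool_size=8)
print(result)
```

An empty list means none of the three symptoms is present; in this incident all three fire at once.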
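Wow, who is hammering it this hard? Even after I stopped Logstash, the queue stayed high. Then I remembered that a colleague had been operating on ES directly, bypassing Logstash. One phone call later, sure enough, it was him. Culprit found; time for him to fix it = =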