ELK throughput tuning

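I load-tested the log endpoint today and got only about 400 QPS. Some stage in the pipeline must be a bottleneck, so there is only one thing to do: investigate.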

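The endpoint service is written in OpenResty and has no throughput bottleneck of its own, so the problem is most likely in the ELK configuration.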

The ELK pipeline is fairly long (filebeat -> logstash -> es). How do we find which stage is the bottleneck?

First, check the ES thread pool statistics. The rejected count is 0, so ES is not the bottleneck.

http://x.x.x.x:9200/_nodes/stats/thread_pool?pretty
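This check is easy to script. Below is a minimal sketch in Python, assuming the response has the shape nodes -> node id -> thread_pool -> pool name -> rejected. The function names are made up for illustration and are not part of any ES client.

```python
import json
from urllib.request import urlopen


def rejected_by_pool(stats):
    """Sum the `rejected` counter of every thread pool across all nodes."""
    totals = {}
    for node in stats.get("nodes", {}).values():
        for pool, counters in node.get("thread_pool", {}).items():
            totals[pool] = totals.get(pool, 0) + counters.get("rejected", 0)
    return totals


def fetch_thread_pool_stats(base_url):
    """Fetch the stats over HTTP; base_url is e.g. 'http://x.x.x.x:9200'."""
    with urlopen(base_url + "/_nodes/stats/thread_pool") as response:
        return json.load(response)


# Offline sample so the sketch runs without a cluster.
sample = {"nodes": {"n1": {"thread_pool": {"bulk": {"rejected": 0}}}}}
print(rejected_by_pool(sample))
```

Any pool with a non-zero total is worth a closer look; all zeros means ES is keeping up.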

Filebeat simply reads files and ships the data, so it should not be the problem either. That leaves logstash as the main suspect.

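Opening the logstash configuration file, we found stdout {codec=>rubydebug} in it. This statement prints every filtered event to standard output and is normally only used for debugging. After removing it, throughput rose to about 4500 QPS. That is fairly close to the QPS of the OpenResty endpoint itself, so it meets the requirement for now.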
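A leftover debug output like this is easy to miss, so it can be worth scanning pipeline configs for it. Here is a rough sketch in Python; the helper name find_stdout_outputs is hypothetical, and the comment handling is naive (it treats every '#' as the start of a comment, even inside a quoted string).

```python
import re

# Matches a stdout output plugin such as `stdout {codec=>rubydebug}`.
STDOUT_PLUGIN = re.compile(r"\bstdout\s*\{")


def find_stdout_outputs(config_text):
    """Return 1-based line numbers where a stdout plugin appears outside comments."""
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        code = line.split("#", 1)[0]  # drop anything after '#'
        if STDOUT_PLUGIN.search(code):
            hits.append(number)
    return hits


config = """output {
  elasticsearch { hosts => ["x.x.x.x:9200"] }
  stdout {codec=>rubydebug}
}
"""
print(find_stdout_outputs(config))
```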

How can throughput be increased further?

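Scale the OpenResty endpoint horizontally
Deploy multiple logstash instances and load-balance across them (filebeat supports this)
Logstash and es currently run on the same machine, and its CPU usage goes far above 100% during the load test; deploying logstash and es on separate machines should improve performance further
Run es as a cluster
Put es on SSDs
For large data volumes, switch from daily indices to hourly indices
Optimize the logstash grok patterns and use DATA as little as possible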
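Switching from daily to hourly indices only changes the time suffix of the index name. Here is a small sketch of that naming in Python, assuming a logstash- prefix and UTC timestamps. The helper name index_name is made up for illustration.

```python
from datetime import datetime, timezone


def index_name(ts, hourly=False, prefix="logstash-"):
    """Build a time-based index name: one per day, or one per hour if hourly=True."""
    pattern = "%Y.%m.%d.%H" if hourly else "%Y.%m.%d"
    return prefix + ts.astimezone(timezone.utc).strftime(pattern)


ts = datetime(2017, 3, 18, 9, 46, 21, tzinfo=timezone.utc)
print(index_name(ts))               # logstash-2017.03.18
print(index_name(ts, hourly=True))  # logstash-2017.03.18.09
```

Hourly indices keep each index smaller, at the cost of 24 times as many indices and shards to manage.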

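Problem 1
Recently logs were reaching kibana very slowly and often could not be found. All log collection shipped to a single logstash instance for collection and filtering, so we suspected that logstash throughput had become the bottleneck. On checking, it had indeed hit its limit.
Optimization process
After going through the full logstash configuration file, we found several parameters that needed adjusting: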
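# Number of pipeline worker threads; the official recommendation is to match the number of CPU cores
pipeline.workers: 24
# Number of threads actually used for output
pipeline.output.workers: 24
# Number of events sent per batch
pipeline.batch.size: 3000
# Send delay (milliseconds)
pipeline.batch.delay: 5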
PS: our ES cluster holds a large volume of data (>28T), so choose the actual values according to your own production environment.
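One thing to check before raising these values: the Logstash documentation describes the number of in-flight events as the product of pipeline.workers and pipeline.batch.size, and those events are held in the JVM heap. Here is a rough sketch of that arithmetic in Python. The average event size is an assumed figure for illustration, not a measurement from this cluster.

```python
def inflight_events(workers, batch_size):
    """Upper bound on events held in memory by the pipeline workers."""
    return workers * batch_size


def inflight_bytes(workers, batch_size, avg_event_bytes):
    """Rough payload size of those events; ignores per-object JVM overhead."""
    return inflight_events(workers, batch_size) * avg_event_bytes


print(inflight_events(24, 3000))  # 72000
# 1 KiB per event is an assumption for illustration only.
print(inflight_bytes(24, 3000, 1024) / 2**20)  # 70.3125 (MiB)
```

If the heap is too small for this many events, a larger batch size will cost more in garbage collection than it gains.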
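Optimization result
ES throughput rose from 9817/s to 41183/s; the details can be seen in the X-Pack monitoring view.
Problem 2
While going through the logstash logs, we saw a large number of errors like the following: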
[2017-03-18T09:46:21,043][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$6@6918cf2e on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@55337655[Running, pool size = 24, active threads = 24, queued tasks = 50, completed tasks = 1767887463]]"})
[2017-03-18T09:46:21,043][ERROR][logstash.outputs.elasticsearch] Retrying individual actions
The official documentation confirms that this means ES writes have hit a bottleneck:
Make sure to watch for TOO_MANY_REQUESTS (429) response codes (EsRejectedExecutionException with the Java client), which is the way that Elasticsearch tells you that it cannot keep up with the current indexing rate. When it happens, you should pause indexing a bit before trying again, ideally with randomized exponential backoff.
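The log lines above show that logstash already retries on 429 by itself, so the following only matters for a custom bulk client. It is a minimal sketch of randomized exponential backoff in Python. TooManyRequests and send_with_backoff are hypothetical names, and the base delay and cap are arbitrary example values.

```python
import random
import time


class TooManyRequests(Exception):
    """Stand-in for a 429 / es_rejected_execution_exception response."""


def backoff_delays(base=0.1, cap=10.0, attempts=5, rng=random.random):
    """Yield jittered delays, each uniform in [0, min(cap, base * 2**n))."""
    for n in range(attempts):
        yield rng() * min(cap, base * 2 ** n)


def send_with_backoff(send, attempts=5, sleep=time.sleep, rng=random.random):
    """Call send(); on TooManyRequests wait a randomized, growing delay and retry."""
    for delay in backoff_delays(attempts=attempts, rng=rng):
        try:
            return send()
        except TooManyRequests:
            sleep(delay)
    return send()  # final attempt; let the exception propagate
```

The random factor spreads retries from many clients over time, so they do not all hit ES again at the same instant.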
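Our first thought was to adjust the ES thread count, but the official documentation says "Don't Touch These Settings!". So what now? The official recommendation is to change the logstash parameter pipeline.batch.size instead.
From ES 5.0 onward, es keeps the bulk, flush, get, index, search and other thread pools fully separate, so write load does not affect the performance of the other functions.
Let's look at the current state of the ES thread pools: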
GET _nodes/stats/thread_pool?pretty
{
  "_nodes": {
    "total": 6,
    "successful": 6,
    "failed": 0
  },
  "cluster_name": "dev-elasticstack5.0",
  "nodes": {
    "nnfCv8FrSh-p223gsbJVMA": {
      "timestamp": 1489804973926,
      "name": "node-3",
      "transport_address": "192.168.3.***:9301",
      "host": "192.168.3.***",
      "ip": "192.168.3.***:9301",
      "roles": [
        "master",
        "data",
        "ingest"
      ],
      "attributes": {
        "rack": "r1"
      },
      "thread_pool": {
        "bulk": {
          "threads": 24,
          "queue": 214,
          "active": 24,
          "rejected": 30804543,
          "largest": 24,
          "completed": 1047606679
        },
        ......
        "watcher": {
          "threads": 0,
          "queue": 0,
          "active": 0,
          "rejected": 0,
          "largest": 0,
          "completed": 0
        }
      }
    }
  }
}
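Here the "bulk" thread pool has 24 threads and 24 active threads, which shows that every thread is busy; the queue holds 214 tasks and rejected is 30804543. So the problem is found: all threads are busy, and once the queue is full any further write request is rejected. The rejected count currently stands at 30804543.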
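To make the rejection mechanism concrete, here is a toy model in Python. It is not how ES schedules work internally. It assumes each thread finishes exactly one request per tick, and it borrows the pool size of 24 and queue capacity of 50 from the log line quoted earlier.

```python
def simulate(threads, queue_capacity, arrivals_per_tick, ticks):
    """Toy model: each thread finishes one task per tick; overflow is rejected."""
    queued = completed = rejected = 0
    for _ in range(ticks):
        # New requests wait in the queue; anything beyond capacity is rejected.
        accepted = min(arrivals_per_tick, queue_capacity - queued)
        rejected += arrivals_per_tick - accepted
        queued += accepted
        # Each thread takes one queued request and completes it this tick.
        done = min(threads, queued)
        queued -= done
        completed += done
    return {"queued": queued, "completed": completed, "rejected": rejected}


# Arrival rate below capacity: nothing is rejected.
print(simulate(24, 50, 20, 100))  # {'queued': 0, 'completed': 2000, 'rejected': 0}
# Arrival rate above capacity: the queue fills, then rejections accumulate.
print(simulate(24, 50, 30, 100))  # {'queued': 26, 'completed': 2400, 'rejected': 574}
```

Once arrivals exceed what the threads can finish, a bigger queue only delays the first rejection; it does not change the steady-state rejection rate.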
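Optimization plan
With the problem identified, how do we fix it? The official recommendation is to increase the number of events per batch and adjust the interval between sends. With a larger batch.size the same events reach es in fewer bulk requests, so fewer requests compete for the bulk queue and writes go through faster.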
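A quick way to see the effect is to count requests. The Python sketch below assumes each batch becomes exactly one bulk request, which is a simplification, and uses one million events as an arbitrary example volume.

```python
import math


def bulk_requests(total_events, batch_size):
    """Number of batches needed to ship total_events at a given batch size."""
    return math.ceil(total_events / batch_size)


total = 1_000_000  # arbitrary example volume
print(bulk_requests(total, 3000))   # 334
print(bulk_requests(total, 10000))  # 100
```

Each bulk request takes one slot in the bulk queue no matter how many events it carries, so fewer and larger requests are less likely to be rejected.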
vim /etc/logstash/logstash.yml
#
pipeline.workers: 24
pipeline.output.workers: 24
pipeline.batch.size: 10000
pipeline.batch.delay: 10
As a rule, set workers and output.workers equal to the number of CPUs, and find the best batch.size and batch.delay by raising them step by step against your real data volume.