The explanations of this parameter you find online all describe it as the timeout for the cluster's ping phase.
But I noticed that the larger this parameter is set, the longer master election takes. I configured it to one minute, and every time the master node restarted the whole cluster became unavailable for a while and election was very slow; after I changed it to 10s, election became much faster: exceptions were still thrown, but a new master was elected quickly.
So I read the code, and it turns out the ping callback really does wait for the duration configured by discovery.zen.ping_timeout before it returns. The code is below.
At the beginning of findMaster in the ZenDiscovery class there is this line, which is the call that drives master election:
ZenPing.PingResponse[] fullPingResponses = pingService.pingAndWait(pingTimeout);
The time this method takes to execute is essentially the time master election takes. Reading on:
public PingResponse[] pingAndWait(TimeValue timeout) {
    final AtomicReference<PingResponse[]> response = new AtomicReference<>();
    final CountDownLatch latch = new CountDownLatch(1);
    ping(new PingListener() {
        @Override
        public void onPing(PingResponse[] pings) {
            // the latch is released only once the whole ping round has reported back
            response.set(pings);
            latch.countDown();
        }
    }, timeout);
    try {
        // block here, with no upper bound, until onPing fires
        latch.await();
        return response.get();
    } catch (InterruptedException e) {
        logger.trace("pingAndWait interrupted");
        return null;
    }
}
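The shape of pingAndWait above is just the usual trick of turning an asynchronous callback into a blocking call with a CountDownLatch. Here is a minimal, self-contained sketch of the same pattern (plain JDK code, all names made up, nothing Elasticsearch-specific):

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class BlockOnCallbackDemo {
    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        AtomicReference<String> result = new AtomicReference<>();
        CountDownLatch latch = new CountDownLatch(1);
        // simulated async "ping" whose callback only fires 1500 ms later
        pool.schedule(() -> {
            result.set("pong");
            latch.countDown(); // releasing the latch is what ends the caller's wait
        }, 1500, TimeUnit.MILLISECONDS);
        long start = System.nanoTime();
        latch.await(); // the caller blocks here, just like pingAndWait does
        System.out.println(result.get() + " after ~"
                + TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start) + " ms");
        pool.shutdown();
    }
}

The caller's wall-clock time is therefore entirely determined by when the callback decides to count the latch down.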
pingAndWait runs ping and then waits for the callback notification before continuing, so what exactly does the timeout do?
@Override
public void ping(PingListener listener, TimeValue timeout) {
    List<? extends ZenPing> zenPings = this.zenPings;
    // aggregates the responses of all registered ZenPing implementations before notifying the caller
    CompoundPingListener compoundPingListener = new CompoundPingListener(listener, zenPings);
    for (ZenPing zenPing : zenPings) {
        try {
            zenPing.ping(compoundPingListener, timeout);
        } catch (EsRejectedExecutionException ex) {
            logger.debug("Ping execution rejected", ex);
            compoundPingListener.onPing(null);
        }
    }
}
This is the top-level ping implementation: it delegates to every registered ZenPing, which is what ends up pinging all the nodes. The real work happens in the ping call inside the try-catch; the unicast implementation of it looks like this:
@Override
public void ping(final PingListener listener, final TimeValue timeout) {
    final SendPingsHandler sendPingsHandler = new SendPingsHandler(pingHandlerIdGenerator.incrementAndGet());
    try {
        receivedResponses.put(sendPingsHandler.id(), sendPingsHandler);
        try {
            // round 1: ping the candidate nodes right away
            sendPings(timeout, null, sendPingsHandler);
        } catch (RejectedExecutionException e) {
            logger.debug("Ping execution rejected", e);
            // The RejectedExecutionException can come from the fact unicastConnectExecutor is at its max down in sendPings
            // But don't bail here, we can retry later on after the send ping has been scheduled.
        }
        // round 2: scheduled timeout/2 later
        threadPool.schedule(TimeValue.timeValueMillis(timeout.millis() / 2), ThreadPool.Names.GENERIC, new AbstractRunnable() {
            @Override
            protected void doRun() {
                sendPings(timeout, null, sendPingsHandler);
                // round 3: another timeout/2 later; this round additionally waits up to timeout/2
                // for responses, and only then is listener.onPing invoked
                threadPool.schedule(TimeValue.timeValueMillis(timeout.millis() / 2), ThreadPool.Names.GENERIC, new AbstractRunnable() {
                    @Override
                    protected void doRun() throws Exception {
                        sendPings(timeout, TimeValue.timeValueMillis(timeout.millis() / 2), sendPingsHandler);
                        sendPingsHandler.close();
                        listener.onPing(sendPingsHandler.pingCollection().toArray());
                        for (DiscoveryNode node : sendPingsHandler.nodeToDisconnect) {
                            logger.trace("[{}] disconnecting from {}", sendPingsHandler.id(), node);
                            transportService.disconnectFromNode(node);
                        }
                    }

                    @Override
                    public void onFailure(Throwable t) {
                        logger.debug("Ping execution failed", t);
                        sendPingsHandler.close();
                    }
                });
            }

            @Override
            public void onFailure(Throwable t) {
                logger.debug("Ping execution failed", t);
                sendPingsHandler.close();
            }
        });
    } catch (EsRejectedExecutionException ex) { // TODO: remove this once ScheduledExecutor has support for AbstractRunnable
        sendPingsHandler.close();
        // we are shutting down
    } catch (Exception e) {
        sendPingsHandler.close();
        throw new ElasticsearchException("Ping execution failed", e);
    }
}
As noted above, the election time in findMaster is determined by pingAndWait, and that method keeps waiting until the onPing callback has run, so it only returns once onPing finishes. Therefore we just need to figure out when PingListener.onPing is triggered to know when master election completes.
Clearly it is triggered from the scheduler. Look at threadPool.schedule: it is just a wrapper around ScheduledThreadPoolExecutor, and its first argument corresponds to ScheduledThreadPoolExecutor's delay, i.e. how long to wait before running the task. The value passed in is (timeout.millis() / 2), half of discovery.zen.ping_timeout.
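To make the delay semantics concrete, here is a tiny stand-alone example (plain JDK; the 1500 ms literal simply stands in for timeout.millis() / 2 with the default 3s ping_timeout):

import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ScheduleDelayDemo {
    public static void main(String[] args) {
        ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(1);
        long start = System.nanoTime();
        // the last two arguments are the delay: the task runs no earlier than 1500 ms from now
        scheduler.schedule(
                () -> System.out.println("ran after ~"
                        + TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start) + " ms"),
                1500, TimeUnit.MILLISECONDS);
        scheduler.shutdown(); // already-scheduled delayed tasks still run by default
    }
}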
Then there is sendPings. The last call also passes a wait time, and if you step into the method you will see that this wait time is, again, half of ping_timeout:
void sendPings(final TimeValue timeout, @Nullable TimeValue waitTime, final SendPingsHandler sendPingsHandler) {
    final UnicastPingRequest pingRequest = new UnicastPingRequest();
    pingRequest.id = sendPingsHandler.id();
    pingRequest.timeout = timeout;
    DiscoveryNodes discoNodes = contextProvider.nodes();
    pingRequest.pingResponse = createPingResponse(discoNodes);
    HashSet<DiscoveryNode> nodesToPingSet = new HashSet<>();
    for (PingResponse temporalResponse : temporalResponses) {
        // Only send pings to nodes that have the same cluster name.
        if (clusterName.equals(temporalResponse.clusterName())) {
            nodesToPingSet.add(temporalResponse.node());
        }
    }
    for (UnicastHostsProvider provider : hostsProviders) {
        nodesToPingSet.addAll(provider.buildDynamicNodes());
    }
    // add all possible master nodes that were active in the last known cluster configuration
    for (ObjectCursor<DiscoveryNode> masterNode : discoNodes.getMasterNodes().values()) {
        nodesToPingSet.add(masterNode.value);
    }
    // sort the nodes by likelihood of being an active master
    List<DiscoveryNode> sortedNodesToPing = electMasterService.sortByMasterLikelihood(nodesToPingSet);
    // new add the the unicast targets first
    List<DiscoveryNode> nodesToPing = CollectionUtils.arrayAsArrayList(configuredTargetNodes);
    nodesToPing.addAll(sortedNodesToPing);
    final CountDownLatch latch = new CountDownLatch(nodesToPing.size());
    for (final DiscoveryNode node : nodesToPing) {
        // make sure we are connected
        final boolean nodeFoundByAddress;
        DiscoveryNode nodeToSend = discoNodes.findByAddress(node.address());
        if (nodeToSend != null) {
            nodeFoundByAddress = true;
        } else {
            nodeToSend = node;
            nodeFoundByAddress = false;
        }
        if (!transportService.nodeConnected(nodeToSend)) {
            if (sendPingsHandler.isClosed()) {
                return;
            }
            // if we find on the disco nodes a matching node by address, we are going to restore the connection
            // anyhow down the line if its not connected...
            // if we can't resolve the node, we don't know and we have to clean up after pinging. We do have
            // to make sure we don't disconnect a true node which was temporarily removed from the DiscoveryNodes
            // but will be added again during the pinging. We therefore create a new temporary node
            if (!nodeFoundByAddress) {
                if (!nodeToSend.id().startsWith(UNICAST_NODE_PREFIX)) {
                    DiscoveryNode tempNode = new DiscoveryNode("",
                            UNICAST_NODE_PREFIX + unicastNodeIdGenerator.incrementAndGet() + "_" + nodeToSend.id() + "#",
                            nodeToSend.getHostName(), nodeToSend.getHostAddress(), nodeToSend.address(), nodeToSend.attributes(), nodeToSend.version()
                    );
                    logger.trace("replacing {} with temp node {}", nodeToSend, tempNode);
                    nodeToSend = tempNode;
                }
                sendPingsHandler.nodeToDisconnect.add(nodeToSend);
            }
            // fork the connection to another thread
            final DiscoveryNode finalNodeToSend = nodeToSend;
            unicastConnectExecutor.execute(new Runnable() {
                @Override
                public void run() {
                    if (sendPingsHandler.isClosed()) {
                        return;
                    }
                    boolean success = false;
                    try {
                        // connect to the node, see if we manage to do it, if not, bail
                        if (!nodeFoundByAddress) {
                            logger.trace("[{}] connecting (light) to {}", sendPingsHandler.id(), finalNodeToSend);
                            transportService.connectToNodeLight(finalNodeToSend);
                        } else {
                            logger.trace("[{}] connecting to {}", sendPingsHandler.id(), finalNodeToSend);
                            transportService.connectToNode(finalNodeToSend);
                        }
                        logger.trace("[{}] connected to {}", sendPingsHandler.id(), node);
                        if (receivedResponses.containsKey(sendPingsHandler.id())) {
                            // we are connected and still in progress, send the ping request
                            sendPingRequestToNode(sendPingsHandler.id(), timeout, pingRequest, latch, node, finalNodeToSend);
                        } else {
                            // connect took too long, just log it and bail
                            latch.countDown();
                            logger.trace("[{}] connect to {} was too long outside of ping window, bailing", sendPingsHandler.id(), node);
                        }
                        success = true;
                    } catch (ConnectTransportException e) {
                        // can't connect to the node - this is a more common path!
                        logger.trace("[{}] failed to connect to {}", e, sendPingsHandler.id(), finalNodeToSend);
                    } catch (RemoteTransportException e) {
                        // something went wrong on the other side
                        logger.debug("[{}] received a remote error as a response to ping {}", e, sendPingsHandler.id(), finalNodeToSend);
                    } catch (Throwable e) {
                        logger.warn("[{}] failed send ping to {}", e, sendPingsHandler.id(), finalNodeToSend);
                    } finally {
                        if (!success) {
                            latch.countDown();
                        }
                    }
                }
            });
        } else {
            sendPingRequestToNode(sendPingsHandler.id(), timeout, pingRequest, latch, node, nodeToSend);
        }
    }
    if (waitTime != null) {
        try {
            latch.await(waitTime.millis(), TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // ignore
        }
    }
}
Note the final if: when waitTime != null, the method blocks on latch.await. And the waitTime passed in from outside is, once again, half of ping_timeout.
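Putting the pieces together, here is a rough back-of-the-envelope model of one full ping cycle, i.e. of how long pingAndWait stays blocked. This is only my reading of the code quoted above; the class and method names below are made up for illustration:

public class PingCycleEstimate {
    // Rough model of the unicast ping() method quoted above:
    //   round 1 at t = 0, round 2 at t = timeout/2, round 3 at t = timeout,
    //   and round 3 may additionally wait up to timeout/2 on its latch
    //   before listener.onPing (and therefore pingAndWait) returns.
    static long bestCaseMillis(long timeoutMillis) {
        return 2 * (timeoutMillis / 2);                     // all nodes answer promptly: ~1.0 x timeout
    }

    static long worstCaseMillis(long timeoutMillis) {
        return 2 * (timeoutMillis / 2) + timeoutMillis / 2; // plus the full final wait: ~1.5 x timeout
    }

    public static void main(String[] args) {
        for (long timeout : new long[]{3_000, 10_000, 60_000}) {
            System.out.printf("ping_timeout=%ds -> election blocked for roughly %d..%d ms%n",
                    timeout / 1000, bestCaseMillis(timeout), worstCaseMillis(timeout));
        }
    }
}

Under this rough model, a 60s ping_timeout means the cluster sits without a master for at least about a minute, while 3s or 10s keeps it down to a few seconds, which lines up with the behaviour I described at the top.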
So my preliminary conclusion is that ping_timeout is the timeout for the ping requests themselves, but at the same time it acts as the delay time for master election.
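For completeness, resolving such a setting with its default would typically look something like the sketch below. This is a hypothetical helper, not the actual ZenDiscovery code, and it assumes the Settings#getAsTime API:

import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.unit.TimeValue;

class PingTimeoutSetting {
    // hypothetical: resolve discovery.zen.ping_timeout with a 3s default
    static TimeValue resolve(Settings settings) {
        return settings.getAsTime("discovery.zen.ping_timeout", TimeValue.timeValueSeconds(3));
    }
}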
How do you folks in the community understand this?
4 replies
kennywu76 - Wood
Upvoted by: code4j, famoss, kwan, PerryPcd
In our production environment we keep discovery.zen.ping_timeout at the default 3s, and tune a few of the discovery.zen.fd.* parameters as well.
After long-term operation, these settings have proven to work quite well in our environment. Only on the most heavily loaded logging cluster, during nightly force merges, do we occasionally see a node wrongly judged to have left the cluster and then immediately rejoin, because some shards are too large (300-400 GB) and the heavy IO pushes machine load very high. Increasing the parameters above would further reduce the chance of such false positives, but then, if a node really did fail, it would take too long to evict it. So instead we add more shards and cap the shard size to ease the force-merge pressure and lower the odds of a heavily loaded node being wrongly dropped.
kennywu76 - Wood
Upvoted by: code4j, wayne
Assume discovery.zen.ping_timeout is at its default of 3s and all nodes are working normally and respond to pings immediately. Adding up the steps above, the whole process takes about 3s, i.e. master election takes roughly as long as the timeout.
Now assume only the first round of pings times out and the following two rounds go smoothly; the process then takes roughly 6s in total.
So setting discovery.zen.ping_timeout to a larger value reduces the risk of the master being kicked out of the cluster just because it is overloaded, but at the same time, if the master really does fail, re-election takes much longer.
PS. On my local machine, with the default 3s timeout, I tested stopping the master; electing a new one took just a little over 3s.
yayg2008
Upvoted by:
So my advice is to keep this parameter at its default of 3 seconds. As for nodes dropping out of the cluster, that is actually controlled by a different setting, discovery.zen.fd.ping_timeout.
liuliuliu
Upvoted by: