Skip to content

restores highly available service#6153

Open
keith-turner wants to merge 5 commits intoapache:mainfrom
keith-turner:highly-avail-service
Open

restores highly available service#6153
keith-turner wants to merge 5 commits intoapache:mainfrom
keith-turner:highly-avail-service

Conversation

@keith-turner
Copy link
Contributor

@keith-turner keith-turner commented Feb 25, 2026

Restored highly available service in order to support multiple managers #6139 . This change should also avoid clients that connect to a non-primary manager from getting stuck. Made all thrift methods throw ThriftNotActiveServiceException, if the method does not declare this it will cause a server side exception. There was a single one way thrift method that could not throw this, so it would still cause server side exceptions if its called. Delaying advertising should prevent this in most cases.

Restored highly available service in order to support multiple managers.
This change should also avoid clients that connect to a non-primary
manager from getting stuck.  Made all thrift methods throw
ThriftNotActiveServiceException, if the method does not declare this it
will cause a server side exception.  There was a single one way thrift
method that could not throw this, so it would still cause server side
exceptions if its called.  Delaying advertising should prevent this in
most cases.
@keith-turner keith-turner added this to the 4.0.0 milestone Feb 25, 2026
@keith-turner
Copy link
Contributor Author

Now that the compaction coordinator can throw ThriftNotActiveServiceException the compactors should retry if they see this exception. Looked at the compactors and it seems like they will retry as all of their calls use RetryableThriftCall.

@dlmarion
Copy link
Contributor

I wonder if there is a way to automate a check that all Thrift RPC methods handled by the Manager throws the ThriftNotActiveServiceException

@keith-turner
Copy link
Contributor Author

I wonder if there is a way to automate a check that all Thrift RPC methods handled by the Manager throws the ThriftNotActiveServiceException

Not sure this functionality has any tests. There is the server side aspect of throwing the exception, if we could automate test for that it would be nice. Then there is the client side code that retries when it sees this exception, that code is sprinkled all over the place. The server side code seems easier to test. For the client side code, maybe the best way to test is to run random walk tests while starting/stopping managers.

@keith-turner
Copy link
Contributor Author

keith-turner commented Feb 25, 2026

For the client side code, maybe if we refactored to use more common code on the client side then we could focus on testing that common code. I noticed the compactor code did not need to change because it uses common code for retry. That common code retries on TException though which may be too broad for the general case, like we probably should not retry on a thrift security exception.

@dlmarion
Copy link
Contributor

I looked at trying to write a test for this, but could not figure it out quickly. Another option would be to fail at runtime on the server side, which should be caught during testing. To do this, we could include the following method in the ThriftProcessorTypes class, then call it for each interface passed to the getManagerTProcessor method.

  private void validateHAServerExceptions(Class<?> thriftInterface) {
    String className = thriftInterface.getClass().getName();
    Method[] methods = thriftInterface.getClass().getMethods();
    for (Method m : methods) {
      Class<?>[] exceptionClasses = m.getExceptionTypes();
      if (exceptionClasses.length == 0) {
        throw new IllegalStateException("Method " + m.getName() + " on " + className + ""
            + " does not declare ThriftNotActiveServiceException to be thrown");
      }
      boolean found = false;
      for (Class<?> ec : exceptionClasses) {
        if (ThriftNotActiveServiceException.class.getName().equals(ec.getClass().getName())) {
          found = true;
          break;
        }
      }
      if (!found) {
        throw new IllegalStateException("Method " + m.getName() + " on " + className + ""
            + " does not declare ThriftNotActiveServiceException to be thrown");        
      }
    }
  }

@keith-turner
Copy link
Contributor Author

Re using reflection to look for absence of the exception, thrift one way methods can not throw an exception. Not sure if we can detect if a method is one way. If we could that would be nice in the HAService code, it could just ignore oneway thrift calls when not the primary instead of throwing an exception.

@dlmarion
Copy link
Contributor

It looks like the information that's needed is in the Processor class, specifically a subclass that has the same name as the method, then you call isOneWay on it.

For example, for the ManagerClientService.Iface.waitForFlush method, there is a ManagerClientService.Processor.waitForFlush class, and its isOneWay method returns false.

@keith-turner
Copy link
Contributor Author

@dlmarion in 8be2acf validated exceptions and handled one way methods.

if (onewayMethods.contains(method.getName())) {
// if thrift one way method throws an exception it will just log an error
LOG.debug("Ignoring one way thrift call because not active : {} {}", method.getName(),
Arrays.asList(args));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to return on this line, or it will fall through and invoke the method.

if (onewayMethods.contains(method.getName())) {
// if thrift one way method throws an exception it will just log an error
LOG.debug("Ignoring one way thrift call during upgrade : {} {}", method.getName(),
Arrays.asList(args));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to return on this line, or it will fall through and invoke the method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, that would have spammed the logs w/ errors. Fixed in d5670e6

try {
method = ProcessFunction.class.getDeclaredMethod("isOneway");
} catch (NoSuchMethodException e) {
throw new IllegalStateException(e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to log the class name for debugging purposes to find out why it doesn't have the declared method.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants