-
Notifications
You must be signed in to change notification settings - Fork 25
Lhm progress struct #116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Lhm progress struct #116
Conversation
|
I was encountering some git rebase issues in the previous PR - #113, and was completely messed up. So I decided to open a new PR, with the updated code based on the few comments I received in the previous PR. Apologies for this. |
8efaeac to
bad8fde
Compare
| @printer = Lhm::Printer::Percentage.new | ||
| end | ||
|
|
||
| it 'prints the percentage' do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
printer.notify now just logs the message, so didn't felt much value in having tests for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should keep this test here to not break the interface for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is still removed and not reinstated.
shuhaowu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks OK, but there are a few important issues to resolve:
- We shouldn't break
printer.notify's API. - We should pass the progress struct hash directly out and let the printer print, instead of passing it via a string and having to parse it later.
- I'm a bit concerned about the way we estimate progress, based on an internal sum as opposed to checking information schema every time.
lib/lhm/chunker.rb
Outdated
| @chunk_finder.validate | ||
| end | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whitespace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whitespace is still here? Should we setup something like rubocop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that might be good. I will set it up in a follow up PR.
lib/lhm/progress.rb
Outdated
| STATE_COPYING = "copying" | ||
| STATE_COPYING_DONE = "copying_done" | ||
| STATE_COPYING_FAILED = "copying_failed" | ||
| STATE_INITIAL = "initial" | ||
| STATE_SETUP_FAILED = "setup_failed" | ||
| STATE_SETUP_DONE = "setup_done" | ||
| STATE_SWITCHING_TABLES = "switching_tables" | ||
| STATE_SWITCHED_TABLES = "switched_tables" | ||
| STATE_SWITCHING_TABLES_FAILED = "switching_tables_failed" | ||
| STATE_TRIGGERS_DROPPED = "trigggers_dropeed" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a lot of states. Do we need this much? It looks like we'll spend the majority of the time in copying and anything polling the state is unlikely to ever hit the other states. If that's the case, the updating of some of these states is more like a log line than switching the states. The other thing is that there's no "state machine" per say. It's more like status changes, but that's just semantics I guess.
Is probably fine. Just a comment.
Also TRIGGERS_DROPPED has typos in the value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I identified the states based on the failure points in Lhm, so that's why there are many states. Also even if someone is not polling these many states, I don't see much downside in having more number of states that we have anticipated as these states are just of a Hash and don't really trigger any extra behaviour within the library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other thing is that there's no "state machine" per say. It's more like status changes, but that's just semantics I guess
True. My wording is a bit misleading as there isn't a real state machine here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's still a typo in STATE_TRIGGERS_DROPPED
| @completion_percentage = 0 | ||
| @start_time = Time.now | ||
|
|
||
| @retry_helper = SqlRetry.new( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc: @EtienneBerubeShopify. There may be merge conflicts.
lib/lhm/progress.rb
Outdated
| end | ||
|
|
||
| def update_state(state) | ||
| populate_table_stats if state == Lhm::STATE_COPYING |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand the intention of this line. You're saying that if update_state is called with STATE_COPYING, then you want to update the cached avg_row_length and cached total_bytes? Then use the values to calculate the progress?
I would call this method explicitly, instead of calling it implicitly when update_state is called with STATE_COPY as the argument. This makes it less surprising for the reader, as update_state doesn't suggest to me that it can update some cached variables. That said, I don't know if this is needed anyway, because we should figure out how we want to calculate progress.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get your point, that putting populate_stats is confusing in here. Rather then calling it explicitly outside this class, I have transfered populate_stats to an another method called progress.update_before_copy which is called before chunker starts. This made more sense to me.
| AND table_name = '#{@origin.name}' | ||
| } | ||
| @retry_helper.with_retries do |retriable_connection| | ||
| retriable_connection.select_rows(query.join(' ')).first |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc: @EtienneBerubeShopify . Also note to you that this is a select_rows 😛 (we just had a IRL conversation about a related topic)
spec/test_helper.rb
Outdated
| def parse_logs(buffer) | ||
| logs = buffer.string.split("\n") | ||
| line_logs = [] | ||
| json_logs = [] | ||
| cnt = 0 | ||
| json_log = "" | ||
|
|
||
| logs.each do |log| | ||
| line = log_expression_message(log) | ||
| if cnt > 0 || line == "{" | ||
| json_log += line | ||
| end | ||
| if line == "{" | ||
| cnt += 1 | ||
| elsif line == "}" | ||
| cnt -= 1 | ||
| if cnt == 0 | ||
| json_logs << JSON.parse(json_log) | ||
| json_log = "" | ||
| end | ||
| elsif line != "{" && line != "}" && cnt == 0 | ||
| line_logs << line | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can simplify this a lot if printer.notify sent the hash directly, as opposed to a pretty json.
| @printer = Lhm::Printer::Percentage.new | ||
| end | ||
|
|
||
| it 'prints the percentage' do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should keep this test here to not break the interface for now.
shuhaowu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty close to be able to merge. There are some white space errors and some changes requested from the previous review was not performed. Please take care of it all in this round so we can merge it in the next one.
lib/lhm/chunker.rb
Outdated
| current_time = Time.now | ||
| if current_time - @start_time > (5 * 60) | ||
| if current_time - start_time > (5 * 60) | ||
| Lhm.logger.info("Inserted #{affected_rows} rows into the destination table from #{bottom} to #{top}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found this message to not be super useful. All it says is that it inserted 2000 rows (or whatever the chunk size is set to) every 5 minutes. Can we have it emit some part from the progress every 5 minutes instead now that we have the progress tracking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Chaged this log message which looks something such as :-
"Inserted 2000 rows into the destination table
upto 2000 with a speed of 20000 bytes/sec"
lib/lhm/chunker.rb
Outdated
| @chunk_finder.validate | ||
| end | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whitespace is still here? Should we setup something like rubocop?
| if (extra = @max_length - message.length) < 0 | ||
| @max_length = message.length | ||
| extra = 0 | ||
| end | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was this removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally I removed this since max_length is not used anywhere in the code except in the unit tests, so I decided to omit this out.
But I have added it back, since we don't want to change the existing structure of Printer.
lib/lhm/progress.rb
Outdated
| STATE_COPYING = "copying" | ||
| STATE_COPYING_DONE = "copying_done" | ||
| STATE_COPYING_FAILED = "copying_failed" | ||
| STATE_INITIAL = "initial" | ||
| STATE_SETUP_FAILED = "setup_failed" | ||
| STATE_SETUP_DONE = "setup_done" | ||
| STATE_SWITCHING_TABLES = "switching_tables" | ||
| STATE_SWITCHED_TABLES = "switched_tables" | ||
| STATE_SWITCHING_TABLES_FAILED = "switching_tables_failed" | ||
| STATE_TRIGGERS_DROPPED = "trigggers_dropeed" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's still a typo in STATE_TRIGGERS_DROPPED
| attr_reader :state, :completion_percentage | ||
|
|
||
| def update_before_copy(start, limit) | ||
| populate_table_stats |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: since this method is only ever used once, and it is somewhat cryptically named (what is table stats and what are we populating it with?), can we just move the content of the 3 lines of code of this method directly inline here?
lib/lhm/progress.rb
Outdated
| @copy_speed = @speedometer.speed | ||
|
|
||
| current_time = Time.now | ||
| if (current_time - @start_time > (5 * 60)) || @options.include?(:verbose_logging) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can probably notify a bit faster, since this won't actually print anything, maybe even once a minute?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You changed the once a minute in the logs, but not here.
spec/integration/chunker_spec.rb
Outdated
| execute("insert into custom_primary_key_dest set id = 1001, pk = 2") | ||
|
|
||
| Lhm::Chunker.new(migration, connection, {throttler: throttler, printer: printer} ).run | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More whitespaces?
spec/integration/chunker_spec.rb
Outdated
|
|
||
| assert_equal 2, log_messages.length | ||
| assert log_messages[1].include?("Unexpected warning found for inserted row: Duplicate entry '1001' for key 'index_custom_primary_key_on_id'"), log_messages | ||
| assert log_messages[0].include?("Unexpected warning found for inserted row: Duplicate entry '1001' for key 'index_custom_primary_key_on_id'"), log_messages |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why were the log_messages' indices changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These were changed initially due to the fact that I had removed one of the logging lines, and as a result the tests wer failing.
Just found out which logging line I had removed earlier and reinstated it back. I will change the indices to be the same.
| @printer = Lhm::Printer::Percentage.new | ||
| end | ||
|
|
||
| it 'prints the percentage' do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is still removed and not reinstated.
spec/unit/unit_helper.rb
Outdated
| module UnitHelper | ||
| LOG_EXPRESSION = /([\w]+),\s+\[([^\]\s]+)\s+#([^\]]+)]\s+(\w+)\s+--\s+(\w+)?:\s+(.+)/ | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whitespace?
5ec4ff5 to
1487811
Compare
| table_create(:users) | ||
|
|
||
| Lhm.change_table(:users, :atomic_switch => false) do |t| | ||
| Lhm.change_table(:users, {:atomic_switch => false}) do |t| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Lhm.change_table methods's signature looks like :-
def change_table(table_name, options = {}, &block)
so the options should be passed as a hash and not symbols. I don't understand why this was done in the first place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In ruby 2.7 this was allowed. I think the implicit hash arguments thing is getting deprecated
1487811 to
69b4bcc
Compare
shuhaowu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems fine to me now. That said let's NOT merge this during infrastructure freeze. This way if we need an emergency fix, we don't have this in there as well.
lib/lhm/progress.rb
Outdated
| @copy_speed = @speedometer.speed | ||
|
|
||
| current_time = Time.now | ||
| if (current_time - @start_time > (5 * 60)) || @options.include?(:verbose_logging) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You changed the once a minute in the logs, but not here.
| table_create(:users) | ||
|
|
||
| Lhm.change_table(:users, :atomic_switch => false) do |t| | ||
| Lhm.change_table(:users, {:atomic_switch => false}) do |t| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In ruby 2.7 this was allowed. I think the implicit hash arguments thing is getting deprecated
7b6de04 to
b9f56d5
Compare
b9f56d5 to
00cef28
Compare
tests unit tests fixes add integration tests fixes fix chunker unit tests remove unnecessary test helper methods addressing pr reviews remove parse_logs method addressing pr reviews addressing pr reviews addressing pr reviews changes changes changes changes changes addressing pr reviews modifications in completion percentage changes fixes
00cef28 to
a4f7dc6
Compare
Introducing a
Progressin order to enhance logging structure of the library.One of the fundamental problems I realized with the library is there is not a central progress struct which the lhm library exposes. Only thing that comes close to a progress currently is that lhm currently maintains something an internal
percentagevalue when copying data.In order to solve this I am introducing a
Progressstruct in the current library wich maintains the information throughtout the execution of an lhm, where it emits some important attributes such asstate- It can prove useful while debugging,, that which state the lhm is. That states could represent information such aslhm is copying, switching_tables, cleanup state etc. Here is a diagram of a simple state machine of how the states changes during a
lhm.change_table. The progress struct updates the state variable based on the state machine diagram represent above.bytes_copied- Bytes copied to thelhmn_table.total_bytes- Number of bytes in the original table. Note :- This is recorded at the start of an Lhm run, so the value might not be accurate as the size of these tables could reduce or increases based on traffic and changes made by the apps.speed- current speed of copy as estimated via linear regression with the aforementioned numbers over the last X minutes.Note :- The above metrics such as
bytes_copied,total_bytesandspeedare introduced in order to give us a better estimate of completion but these are just estimated and not accurate values to be relied on 100% percent. Additionally, there is a possibility that the addition/removal of a column/index will drastically alter the size of the table, which may make the progress in accurate. Example: if the data increases by 20% after a LHM, the progress will get to a maximum of 80%. Conversely, if the data decreases by 20% after a LHM, the progress will get to a maximum of 120%.Here is an example of the how the progress would look like when emitted values :-
I am also attaching a logs file which contains the logs of an lhm execution for adding a column to a table which has
3500 rows. This is so that we can see how the logs would like from beginning to end for an Lhm run.logs.txt