Skip to content

Conversation

@Manan007224
Copy link

@Manan007224 Manan007224 commented Nov 2, 2021

Introducing a Progress in order to enhance logging structure of the library.

One of the fundamental problems I realized with the library is there is not a central progress struct which the lhm library exposes. Only thing that comes close to a progress currently is that lhm currently maintains something an internal percentage value when copying data.

In order to solve this I am introducing a Progress struct in the current library wich maintains the information throughtout the execution of an lhm, where it emits some important attributes such as

  • state - It can prove useful while debugging,, that which state the lhm is. That states could represent information such as
    lhm is copying, switching_tables, cleanup state etc. Here is a diagram of a simple state machine of how the states changes during a lhm.change_table. The progress struct updates the state variable based on the state machine diagram represent above.

Screen Shot 2021-10-19 at 6 30 11 AM

  • bytes_copied - Bytes copied to the lhmn_ table.
  • total_bytes - Number of bytes in the original table. Note :- This is recorded at the start of an Lhm run, so the value might not be accurate as the size of these tables could reduce or increases based on traffic and changes made by the apps.
  • speed - current speed of copy as estimated via linear regression with the aforementioned numbers over the last X minutes.

Note :- The above metrics such as bytes_copied, total_bytes and speed are introduced in order to give us a better estimate of completion but these are just estimated and not accurate values to be relied on 100% percent. Additionally, there is a possibility that the addition/removal of a column/index will drastically alter the size of the table, which may make the progress in accurate. Example: if the data increases by 20% after a LHM, the progress will get to a maximum of 80%. Conversely, if the data decreases by 20% after a LHM, the progress will get to a maximum of 120%.

Here is an example of the how the progress would look like when emitted values :-

{
  "state": "copying",
  "version": "3.4.2",
  "source_table": "table1",
  "destination_table": "lhmn_table1",
  "throttler": "None",
  "table_switcher": "LockedSwitcher",
  "avg_row_length": 42 bytes,
  "estimated_bytes_copied": 117600,
  "estimated_total_bytes": 147456,
  "completion_percentage": "79.75% complete",
  "estimated_copy_speed": "236731.08 bytes/sec",
  "min_pk_key": 1,
  "max_pk_key": 3500,
  "rows_written": 2800
}

I am also attaching a logs file which contains the logs of an lhm execution for adding a column to a table which has 3500 rows. This is so that we can see how the logs would like from beginning to end for an Lhm run.
logs.txt

@Manan007224
Copy link
Author

I was encountering some git rebase issues in the previous PR - #113, and was completely messed up. So I decided to open a new PR, with the updated code based on the few comments I received in the previous PR. Apologies for this.

@Manan007224 Manan007224 changed the title Lhm progress struct 2 Lhm progress struct Nov 2, 2021
@Manan007224 Manan007224 marked this pull request as draft November 2, 2021 06:19
@Manan007224 Manan007224 force-pushed the lhm-progress-struct-2 branch from 8efaeac to bad8fde Compare November 2, 2021 06:21
@printer = Lhm::Printer::Percentage.new
end

it 'prints the percentage' do
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

printer.notify now just logs the message, so didn't felt much value in having tests for it.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should keep this test here to not break the interface for now.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is still removed and not reinstated.

@Manan007224 Manan007224 marked this pull request as ready for review November 2, 2021 06:32
Copy link

@shuhaowu shuhaowu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks OK, but there are a few important issues to resolve:

  1. We shouldn't break printer.notify's API.
  2. We should pass the progress struct hash directly out and let the printer print, instead of passing it via a string and having to parse it later.
  3. I'm a bit concerned about the way we estimate progress, based on an internal sum as opposed to checking information schema every time.

@chunk_finder.validate
end

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whitespace?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whitespace is still here? Should we setup something like rubocop?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that might be good. I will set it up in a follow up PR.

Comment on lines 5 to 14
STATE_COPYING = "copying"
STATE_COPYING_DONE = "copying_done"
STATE_COPYING_FAILED = "copying_failed"
STATE_INITIAL = "initial"
STATE_SETUP_FAILED = "setup_failed"
STATE_SETUP_DONE = "setup_done"
STATE_SWITCHING_TABLES = "switching_tables"
STATE_SWITCHED_TABLES = "switched_tables"
STATE_SWITCHING_TABLES_FAILED = "switching_tables_failed"
STATE_TRIGGERS_DROPPED = "trigggers_dropeed"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a lot of states. Do we need this much? It looks like we'll spend the majority of the time in copying and anything polling the state is unlikely to ever hit the other states. If that's the case, the updating of some of these states is more like a log line than switching the states. The other thing is that there's no "state machine" per say. It's more like status changes, but that's just semantics I guess.

Is probably fine. Just a comment.

Also TRIGGERS_DROPPED has typos in the value.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I identified the states based on the failure points in Lhm, so that's why there are many states. Also even if someone is not polling these many states, I don't see much downside in having more number of states that we have anticipated as these states are just of a Hash and don't really trigger any extra behaviour within the library.

Copy link
Author

@Manan007224 Manan007224 Nov 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other thing is that there's no "state machine" per say. It's more like status changes, but that's just semantics I guess

True. My wording is a bit misleading as there isn't a real state machine here.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's still a typo in STATE_TRIGGERS_DROPPED

@completion_percentage = 0
@start_time = Time.now

@retry_helper = SqlRetry.new(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc: @EtienneBerubeShopify. There may be merge conflicts.

end

def update_state(state)
populate_table_stats if state == Lhm::STATE_COPYING
Copy link

@shuhaowu shuhaowu Nov 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand the intention of this line. You're saying that if update_state is called with STATE_COPYING, then you want to update the cached avg_row_length and cached total_bytes? Then use the values to calculate the progress?

I would call this method explicitly, instead of calling it implicitly when update_state is called with STATE_COPY as the argument. This makes it less surprising for the reader, as update_state doesn't suggest to me that it can update some cached variables. That said, I don't know if this is needed anyway, because we should figure out how we want to calculate progress.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get your point, that putting populate_stats is confusing in here. Rather then calling it explicitly outside this class, I have transfered populate_stats to an another method called progress.update_before_copy which is called before chunker starts. This made more sense to me.

AND table_name = '#{@origin.name}'
}
@retry_helper.with_retries do |retriable_connection|
retriable_connection.select_rows(query.join(' ')).first
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc: @EtienneBerubeShopify . Also note to you that this is a select_rows 😛 (we just had a IRL conversation about a related topic)

Comment on lines 63 to 85
def parse_logs(buffer)
logs = buffer.string.split("\n")
line_logs = []
json_logs = []
cnt = 0
json_log = ""

logs.each do |log|
line = log_expression_message(log)
if cnt > 0 || line == "{"
json_log += line
end
if line == "{"
cnt += 1
elsif line == "}"
cnt -= 1
if cnt == 0
json_logs << JSON.parse(json_log)
json_log = ""
end
elsif line != "{" && line != "}" && cnt == 0
line_logs << line
end
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can simplify this a lot if printer.notify sent the hash directly, as opposed to a pretty json.

@printer = Lhm::Printer::Percentage.new
end

it 'prints the percentage' do
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should keep this test here to not break the interface for now.

@Manan007224 Manan007224 requested a review from shuhaowu November 8, 2021 00:53
Copy link

@shuhaowu shuhaowu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty close to be able to merge. There are some white space errors and some changes requested from the previous review was not performed. Please take care of it all in this round so we can merge it in the next one.

current_time = Time.now
if current_time - @start_time > (5 * 60)
if current_time - start_time > (5 * 60)
Lhm.logger.info("Inserted #{affected_rows} rows into the destination table from #{bottom} to #{top}")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this message to not be super useful. All it says is that it inserted 2000 rows (or whatever the chunk size is set to) every 5 minutes. Can we have it emit some part from the progress every 5 minutes instead now that we have the progress tracking?

Copy link
Author

@Manan007224 Manan007224 Nov 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chaged this log message which looks something such as :-

"Inserted 2000 rows into the destination table 
  upto 2000 with a speed of 20000 bytes/sec"

@chunk_finder.validate
end

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whitespace is still here? Should we setup something like rubocop?

Comment on lines 44 to 48
if (extra = @max_length - message.length) < 0
@max_length = message.length
extra = 0
end

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this removed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally I removed this since max_length is not used anywhere in the code except in the unit tests, so I decided to omit this out.

But I have added it back, since we don't want to change the existing structure of Printer.

Comment on lines 5 to 14
STATE_COPYING = "copying"
STATE_COPYING_DONE = "copying_done"
STATE_COPYING_FAILED = "copying_failed"
STATE_INITIAL = "initial"
STATE_SETUP_FAILED = "setup_failed"
STATE_SETUP_DONE = "setup_done"
STATE_SWITCHING_TABLES = "switching_tables"
STATE_SWITCHED_TABLES = "switched_tables"
STATE_SWITCHING_TABLES_FAILED = "switching_tables_failed"
STATE_TRIGGERS_DROPPED = "trigggers_dropeed"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's still a typo in STATE_TRIGGERS_DROPPED

attr_reader :state, :completion_percentage

def update_before_copy(start, limit)
populate_table_stats
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: since this method is only ever used once, and it is somewhat cryptically named (what is table stats and what are we populating it with?), can we just move the content of the 3 lines of code of this method directly inline here?

@copy_speed = @speedometer.speed

current_time = Time.now
if (current_time - @start_time > (5 * 60)) || @options.include?(:verbose_logging)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can probably notify a bit faster, since this won't actually print anything, maybe even once a minute?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You changed the once a minute in the logs, but not here.

execute("insert into custom_primary_key_dest set id = 1001, pk = 2")

Lhm::Chunker.new(migration, connection, {throttler: throttler, printer: printer} ).run
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More whitespaces?


assert_equal 2, log_messages.length
assert log_messages[1].include?("Unexpected warning found for inserted row: Duplicate entry '1001' for key 'index_custom_primary_key_on_id'"), log_messages
assert log_messages[0].include?("Unexpected warning found for inserted row: Duplicate entry '1001' for key 'index_custom_primary_key_on_id'"), log_messages
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why were the log_messages' indices changed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These were changed initially due to the fact that I had removed one of the logging lines, and as a result the tests wer failing.

Just found out which logging line I had removed earlier and reinstated it back. I will change the indices to be the same.

@printer = Lhm::Printer::Percentage.new
end

it 'prints the percentage' do
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is still removed and not reinstated.

module UnitHelper
LOG_EXPRESSION = /([\w]+),\s+\[([^\]\s]+)\s+#([^\]]+)]\s+(\w+)\s+--\s+(\w+)?:\s+(.+)/

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whitespace?

@Manan007224 Manan007224 force-pushed the lhm-progress-struct-2 branch 5 times, most recently from 5ec4ff5 to 1487811 Compare November 8, 2021 23:48
@Manan007224 Manan007224 requested a review from shuhaowu November 9, 2021 00:33
table_create(:users)

Lhm.change_table(:users, :atomic_switch => false) do |t|
Lhm.change_table(:users, {:atomic_switch => false}) do |t|
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Lhm.change_table methods's signature looks like :-
def change_table(table_name, options = {}, &block)

so the options should be passed as a hash and not symbols. I don't understand why this was done in the first place.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In ruby 2.7 this was allowed. I think the implicit hash arguments thing is getting deprecated

@Manan007224 Manan007224 force-pushed the lhm-progress-struct-2 branch from 1487811 to 69b4bcc Compare November 9, 2021 09:01
Copy link

@shuhaowu shuhaowu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine to me now. That said let's NOT merge this during infrastructure freeze. This way if we need an emergency fix, we don't have this in there as well.

@copy_speed = @speedometer.speed

current_time = Time.now
if (current_time - @start_time > (5 * 60)) || @options.include?(:verbose_logging)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You changed the once a minute in the logs, but not here.

table_create(:users)

Lhm.change_table(:users, :atomic_switch => false) do |t|
Lhm.change_table(:users, {:atomic_switch => false}) do |t|

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In ruby 2.7 this was allowed. I think the implicit hash arguments thing is getting deprecated

@Manan007224 Manan007224 force-pushed the lhm-progress-struct-2 branch from 7b6de04 to b9f56d5 Compare November 17, 2021 11:39
@Manan007224 Manan007224 force-pushed the lhm-progress-struct-2 branch from b9f56d5 to 00cef28 Compare January 5, 2022 20:51
tests

unit tests

fixes

add integration tests

fixes

fix chunker unit tests

remove unnecessary test helper methods

addressing pr reviews

remove parse_logs method

addressing pr reviews

addressing pr reviews

addressing pr reviews

changes

changes

changes

changes

changes

addressing pr reviews

modifications in completion percentage

changes

fixes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants