-
Notifications
You must be signed in to change notification settings - Fork 11
Exercises #6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
maria65mj
wants to merge
2
commits into
amangup:master
Choose a base branch
from
maria65mj:patch-2
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Exercises #6
Changes from all commits
Commits
Show all changes
2 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,123 @@ | ||
| # Problems | ||
|
|
||
| ## A: Drinking water availability | ||
|
|
||
| 1. Write a query which outputs the **total** drinking water availability for the most recent year in data, for each region. Sort the output by the value of availability in descending order. | ||
| ```SQL | ||
| SELECT region_name, year, safe_drinking_water_total as total_drinking_water | ||
| FROM public.drinking_water_availability | ||
| WHERE year=2015 AND safe_drinking_water_total is NOT NULL | ||
| ORDER BY total_drinking_water DESC | ||
| ``` | ||
|
|
||
| 2. **Rural/Urban ratio**: For regions where both rural and urban data are available, for most recent year, find the ratio of rural/urban drinking water availability. Sort the table by this ratio, in ascending order. | ||
| ```SQL | ||
| SELECT region_name as region, year, (safe_drinking_water_rural::float/safe_drinking_water_urban) as rural_urban_ratio | ||
| FROM public.drinking_water_availability | ||
| WHERE safe_drinking_water_rural is NOT NULL AND safe_drinking_water_urban is not NULL AND year=2015 | ||
| ORDER BY rural_urban_ratio | ||
| ``` | ||
|
|
||
| 3. **Growth**: For regions where **total** drinking water availability value is available, find the growth in drinking water availability. Growth is defined as the ratio of latest value to the oldest value. | ||
| ```SQL | ||
| SELECT region, y2005/y2015 as growth | ||
| FROM (SELECT region_name as region, | ||
| sum(CASE WHEN year=2005 THEN safe_drinking_water_total ELSE NULL END) AS y2005, | ||
| sum(CASE WHEN year=2015 THEN safe_drinking_water_total ELSE NULL END) AS y2015 | ||
| FROM public.drinking_water_availability | ||
| GROUP By region_name) as origin | ||
| WHERE y2005 IS NOT NULL AND y2015 IS NOT NULL | ||
| ORDER BY growth DESC | ||
| ``` | ||
|
|
||
| ## B: GDP | ||
|
|
||
| 1. Write a query which outputs the GDP per capita for the most recent year in data, for each region. Sort the output by the value of GDP per capita in descending order. | ||
| ```SQL | ||
| SELECT region_name, year, gdp_per_capita | ||
| FROM public.gdp | ||
| WHERE year=2016 | ||
| ORDER BY gdp_per_capita DESC | ||
| ``` | ||
|
|
||
| 2. **Comparison with average growth**: The _average growth_ of GDP per capita can be calculated by comparing the oldest and the most recent value for the "Total, all countries" region. We want to know which countries have grown faster than average, and which have grown slower than average. Write a query that outputs two rows - one row contains the list of all regions where growth is lower than average, and another which contains the regions where growth is higher than average. | ||
|
|
||
| Here is how the output looks like (list of countries is truncated) | ||
|
|
||
| | Growth Category | Countries | ||
| | --- | --- | ||
| LOW GROWTH | Brunei Darussalam, Syrian Arab Republic, San Marino, Colombia, Ukraine, Saudi Arabia, Sao Tome and Principe, Latvia, Dem. People's Rep. Korea, Algeria, ... | ||
| HIGH GROWTH | Australia and New Zealand, Bangladesh, Luxembourg, Sweden, Dominican Republic, Ireland, Cambodia, Singapore, Portugal, Malta, Albania, Oceania, Maldives, Israel, Malaysia, State of Palestine, Cyprus, China, ... | ||
|
|
||
| You will need to know this function in PostgreSQL: [string_agg()](https://www.dbrnd.com/2016/09/postgresql-string_agg-to-concatenate-string-per-each-group-like-sql-server-stuff-string-aggregation-function/). This is an aggregate function that allows you to list out all strings. | ||
|
|
||
| ```SQL | ||
| SELECT region_name, (y2016::float/y1985) as growth, | ||
| (CASE WHEN (y2016::float/y1985)>=3.652 THEN 'High Growth' ELSE 'Low Growth' END) as comparision_growth | ||
| FROM (SELECT region_name, | ||
| sum(CASE WHEN year=1985 THEN gdp_per_capita ELSE NULL END) AS y1985, | ||
| sum(CASE WHEN year=2016 THEN gdp_per_capita ELSE NULL END) AS y2016 | ||
| FROM public.gdp | ||
| GROUP BY region_name) as original | ||
| WHERE y2016 IS NOT NULL AND y1985 IS NOT NULL | ||
| ``` | ||
| **UGH HOW DO YOU DO THIS PROBLEM!!!!!!** | ||
|
|
||
| # C: Internet penetration | ||
|
|
||
| 1. Write a query which outputs the internet penetration for the most recent year in data, for each region. Sort the output by the value of availability in descending order. | ||
| ```SQL | ||
| SELECT region_name, percent_internet_penetration as percent_internet_penetration_2016 | ||
| FROM public.internet_penetration | ||
| WHERE year=2016 | ||
| ORDER BY percent_internet_penetration DESC | ||
| ``` | ||
|
|
||
| 2. **Growth**: Find the growth in internet penetration for each region. Growth is defined as the ratio of latest value to the oldest value. | ||
| ```SQL | ||
| SELECT region_name, | ||
| SUM(CASE WHEN year = 2016 THEN percent_internet_penetration ELSE NULL END)/ | ||
| SUM(CASE WHEN year = 2000 THEN percent_internet_penetration ELSE NULL END)::float as growth | ||
| FROM public.internet_penetration | ||
| GROUP BY region_name | ||
| HAVING SUM(CASE WHEN year = 2000 THEN percent_internet_penetration ELSE NULL END) != 0 AND | ||
| SUM(CASE WHEN year = 2016 THEN percent_internet_penetration ELSE NULL END)/ | ||
| SUM(CASE WHEN year = 2000 THEN percent_internet_penetration ELSE NULL END)::float IS NOT NULL | ||
| ORDER BY growth DESC | ||
| ``` | ||
|
|
||
| # D: Correlations | ||
|
|
||
| The main reason I chose these datasets was to see how these different markers correlate with each other. Does Internet penetration increase with GDP? Are there countries where more people are getting access to internet while drinking water availability is not improving? | ||
|
|
||
| - To quantify correlation, we will use the [Pearson correlation cofficient (wikipedia link)](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient). | ||
| - This is a value which ranges from -1 to +1 : | ||
| - -1 signifies that the series whose correlation is being measured have opposite growth patterns, | ||
| - 0 means one is growing independent of the other, and | ||
| - 1 means that their growth is perfectly aligned. | ||
|
|
||
| - To calculate the Pearson correlation coefficient. The [list of all aggregate functions](https://www.postgresql.org/docs/10/functions-aggregate.html) includes the descriptions of the `corr()` function in PostgreSQL, which we will use. | ||
|
|
||
| - As we've done previously in this exercise set, we will define **Growth** as the ratio of most recent value to the oldest value. But, as we are going to compare growth across different markers (gdp, internet connectivity, etc.), we need to make sure we measure growth over the same period of time, for each marker. | ||
| - For these exercises, take the longest overlapping time period to measure growth. | ||
|
|
||
|
|
||
| ## Problems | ||
|
|
||
| 1. Is there any region where internet connectivity is higher than total drinking availability? | ||
| ```SQL | ||
| SELECT region_name, year, | ||
| (CASE WHEN percent_internet > water THEN 'Internet' ELSE NULL END) as winner | ||
| FROM (SELECT i.region_name, i.year, i.percent_internet_penetration as percent_internet, w.safe_drinking_water_total as water | ||
| FROM public.internet_penetration as i | ||
| INNER JOIN public.drinking_water_availability as w | ||
| ON i.year = w.year and i.region_name = w.region_name | ||
| WHERE i.percent_internet_penetration IS NOT NULL AND | ||
| w.safe_drinking_water_total IS NOT NULL) as original | ||
| WHERE CASE WHEN percent_internet>water THEN 'Internet' ELSE NULL END IS NOT NULL | ||
| ``` | ||
| 2. List all countries where growth in internet connectivity is lower than growth in drinkable water availability. | ||
| 3. Find the correlation between GDP per capita and GDP in current prices. Does the value look strange? What explains the value? | ||
| 4. Find the correlation between growth of internet connectivity and total drinkable water availability. | ||
| 5. Find the correlation between growth of internet connectivity and growth of GDP per capita | ||
| 6. Find the correlation between growth of GDP per capita and drinking water availability. | ||
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if I am correct, but based on the proposed output, the
region_nameneeds to be aggregated using the string_agg() function and not out on its own. And then we can group by the growth (what you have as 'comparison_growth'.Based on your code it would look something like this: