Commit 6062e2a

Merge pull request #5 from QuantGov/dev
Documentation updates, values error catching, and small fixes to list functions
2 parents f09ad8b + 488bd34 commit 6062e2a

3 files changed

+45
-55
lines changed

README.md

Lines changed: 26 additions & 17 deletions
@@ -21,30 +21,26 @@ import regcensus as rc
2121

2222
## Structure of the API
2323

24-
The API organizes data around __topics__, which are then divided into __series__. Within each series are __values__, which are the ultimate values of interest. Values are available by three sub-groups: agency, industry, and occupation. Presently, there are no series with occupation subgroup. However, these are available for future use. Topics broadly define the data available. For example, RegData for regulatory restrictions is falls under the broad topic "Regulatory Restrictions." Within Regulatory Restrictions topic, there are a number of series available. These include Total Restrictions, Total Wordcount, Total "Shall," etc.
24+
The API organizes data around __document types__, which are then divided into __series__. Within each series are __values__, which are the ultimate values of interest. Values are available by three subgroups: agency, industry, and occupation. Presently, no series has an occupation subgroup, but the subgroup is reserved for future use. Document types broadly define the data available. For example, RegData for regulatory restrictions falls under the broad document type "Regulatory Restrictions." Within the Regulatory Restrictions document type, a number of series are available. These include Total Restrictions, Total Wordcount, Total "Shall," etc.
2525

2626
A fundamental concept in RegData is the "document." In RegData, a set of documents represents a body of regulations for which we have produced regulatory restriction counts. For example, to produce data on regulatory restrictions imposed by the US Federal government, RegData uses the Code of Federal Regulations (CFR) as the source documents. Within the CFR, RegData identifies a unit of regulation as the title-part combination. The CFR is organized into 50 titles; within each title are parts, which may contain subparts, and under the parts are sections. Determining this unit of analysis is critical for the context of the data produced by RegData. Producing regulatory restriction data for US states follows the same strategy but uses the state-specific regulatory code.
2727

28-
In requesting data through the API, you must specify the document type and the indicate a preference for *summary* or *document-level*. By default, RegCensus API returns summarized data for the period of interest. This means that if you do not specify the *summary* preference, you will receive the summarized data for a period. The __list_series_period__ helper function (described below) returns the periods available for each series.
28+
In requesting data through the API, you must specify the document type and indicate a preference for *summary* or *document-level* data. By default, the RegCensus API returns summarized data for the period of interest, so if you do not specify the *summary* preference, you will receive summarized data. The __get_periods__ helper function (described below) returns the periods available for each series.
2929

3030
RegCensus API defines a number of periods depending on the series. For example, the total restrictions series of Federal regulations uses two main periods: daily and annual. The daily data produces the number of regulatory restrictions issued on a particular date by the US Federal government. The same data are available on an annual basis.
3131

32-
There are six helper functions to retrieve information about these key components of regdata. These functions provider the following information: topics, documents, jurisdictions, series, agencies, and years with data. The list functions begin with __list__. For example, to view the list of topics call __list_topics__. When an topic id parameter is supplied, the function returns the details about a specific topic.
33-
34-
```
35-
rc.list_document_subtype()
36-
```
37-
38-
Each subtype comprises one or more *series*. The __list_series__ function returns the list of all series when no series id is provided. This call is a great place to start if you are looking for data based on a **topic** first.
32+
There are five helper functions to retrieve information about these key components of RegData. These functions provide the following information: document types, jurisdictions, series, agencies, and periods with data. The list functions begin with __list__.
3933

34+
Each document type comprises one or more *series*. The __list_series__ function returns the list of all series when no series id is provided.
4035

4136
```
4237
rc.list_jurisdictions()
4338
```
44-
Just like the above function call, listing the jurisdictions is another great place to start. If you are looking for data for a specifc jurisdiction(s), this function
39+
40+
Listing the jurisdictions is another great place to start. If you are looking for data for a specific jurisdiction or jurisdictions, this function
4541
will return the jurisdiction_id for every jurisdiction, which is key for retrieving data on any individual jurisdiction.
4642

47-
The __get_periods__ function returns a list of all seriesa and the years with data available.
43+
The __get_periods__ function returns a list of all series and the years with data available for each jurisdiction.
4844

4945
The output from this function can serve as a reference for the valid values that can be passed to parameters in the __get_values__ function. The number of records returned is the unique combination of series and jurisdictions that are available in RegData. The function takes the optional argument jurisdiction id.
5046

@@ -106,10 +102,10 @@ The __get_values__ function is the primary function for obtaining RegData from t
106102
* country (optional) - specify if all values for a country's jurisdiction ID should be returned. Default is False.
107103
* verbose (optional) - value specifying how much debugging information should be printed for each function call. Higher number specifies more information, default is 0.
108104

109-
In the example below, we are interested in the total number of restrictions and total numbe rof words (get_topics(1)) for the US (get_jurisdictions(38)) for the period 2010 to 2018.
105+
In the example below, we are interested in the total number of restrictions and total number of words for the US (get_jurisdictions(38)) for the period 2010 to 2019.
110106

111107
```
112-
rc.get_values(series = [1,2], jurisdiction = 38, date = [2010, 2018])
108+
rc.get_values(series = [1,2], jurisdiction = 38, date = [2010, 2019])
113109
```
114110

115111
### Values by Subgroup
@@ -118,7 +114,7 @@ You can obtain data for any of the three subgroups for each series - agencies, i
118114

119115
#### Values by Agencies
120116

121-
To obtain the restrictions for a specific agency (or agencies), the series id supplied must be in the list of available series by agency. To recap, the list of available series for an agency is available via the __list_series(id,by='agency')__ function, and the list of agencies with data is available via __get_agencies__ function.
117+
To obtain the restrictions for a specific agency (or agencies), the series id supplied must be in the list of available series by agency. To recap, the list of available series for an agency is available via the __list_series__ function, and the list of agencies with data is available via the __get_agencies__ function.
122118

123119
```
124120
# Identify all agencies
@@ -132,14 +128,26 @@ rc.get_values(series = 91, jurisdiction = 38, date = [1990, 2018], agency = [81,
132128

133129
Some agency series may also have data by industry. For example, under the Total Restrictions topic, RegData includes the industry-relevant restrictions, which estimates the number of restrictions that apply to a given industry. These are available in both the main series - Total Restrictions, and the sub-group Restrictions by Agency.
134130

135-
To pull industry-relevant restrictions for an agency, call __get_agencies__ with the *industry* variable. The industry variable is of type string, and valid values include the industry codes specified in the classification system obtained by calling the __get_industries(jurisdiction)__ function.
131+
Valid values for industries include the industry codes specified in the classification system obtained by calling the __get_industries(jurisdiction)__ function.
136132

137133
In the example below, for series 92 (Restrictions by Agency and Industry), we request data for the two industries 111 and 33 with the following code snippet.
138134

139135
```
140-
rc.get_values(series = 92, jurisdiction = 38, , time = c(1990,2000), industry = c('111','33'), agency = 66)
136+
rc.get_values(series = 92, jurisdiction = 38, date = [1990, 2000], industry = [111, 33], agency = 66)
137+
```
138+
139+
### Document-Level Values
140+
141+
For most use cases, our summary-level data will be enough. Document-level data is also available, though most of these queries take much longer to return results; multi-year and industry queries for jurisdiction 38 are especially slow. If you want the full dataset for United States Federal, consider using our bulk downloads, available at the [QuantGov website][2].
142+
143+
We can request the same data from above, but at the document level, using the following code snippet.
144+
145+
```
146+
rc.get_values(series = [1,2], jurisdiction = 38, date = ['2010-01-01', '2019-01-01'], summary=False)
141147
```
142148

149+
Note that for document-level queries, a full date (not just the year) is often required. See the __get_periods__ function for specifics by jurisdiction.
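
As a rough illustration of the note above, a year range can be widened into full ISO dates before it is passed as `date`. This is only a sketch: the helper name `year_range_to_dates` is hypothetical and not part of regcensus, and the choice of January 1 as the boundary is an assumption that should be checked against __get_periods__ for the jurisdiction in question.

```python
from datetime import date


def year_range_to_dates(start_year, end_year):
    """Return [start, end] as ISO date strings for January 1 of each year.

    Hypothetical helper for illustration; not part of regcensus.
    """
    if start_year > end_year:
        raise ValueError("start_year must not be after end_year")
    return [
        date(start_year, 1, 1).isoformat(),
        date(end_year, 1, 1).isoformat(),
    ]


dates = year_range_to_dates(2010, 2019)
print(dates)  # ['2010-01-01', '2019-01-01']
```

The resulting list has the same shape as the `date` argument in the document-level example above.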
150+
143151
### Merging with Metadata
144152

145153
To minimize the network bandwidth required to use the RegCensus API, the data returned by the __get_values__ function contains very little metadata. Once you pull the values with __get_values__, you can use the Pandas library to merge in the metadata.
@@ -160,4 +168,5 @@ agency_restrictions_ind = agency_by_industry.merge(
160168
agencies, on='agency_id')
161169
```
162170

163-
[1]:http://ec2-3-89-6-158.compute-1.amazonaws.com:8080/regdata/swagger-ui.html
171+
[1]:https://api.quantgov.org/swagger-ui.html
172+
[2]:https://www.quantgov.org/download-interactively

regcensus/__init__.py

Lines changed: 0 additions & 4 deletions
@@ -1,13 +1,11 @@
11
__all__ = [
22
'get_values',
3-
'get_topics',
43
'get_series',
54
'get_agencies',
65
'get_jurisdictions',
76
'get_periods',
87
'get_industries',
98
'get_documents',
10-
'list_topics',
119
'list_series',
1210
'list_document_types',
1311
'list_agencies',
@@ -17,14 +15,12 @@
1715

1816
from . api import (
1917
get_values,
20-
get_topics,
2118
get_series,
2219
get_agencies,
2320
get_jurisdictions,
2421
get_periods,
2522
get_industries,
2623
get_documents,
27-
list_topics,
2824
list_series,
2925
list_document_types,
3026
list_agencies,

regcensus/api.py

Lines changed: 19 additions & 34 deletions
@@ -128,20 +128,12 @@ def get_values(series, jurisdiction, date, filtered=True, summary=True,
128128

129129
# Puts flattened JSON output into a pandas DataFrame
130130
output = pd.io.json.json_normalize(requests.get(url_call).json())
131-
return clean_columns(output)
132-
133-
134-
def get_topics(topicID=''):
135-
"""
136-
Get metadata for all or one specific topic
137-
138-
Args: topicID (optional): ID for the topic
139-
140-
Returns: pandas dataframe with the metadata
141-
"""
142-
output = pd.io.json.json_normalize(
143-
requests.get(URL + f'/topics/{topicID}').json())
144-
return clean_columns(output)
131+
# Prints error message if call fails
132+
if (output.columns[:3] == ['title', 'status', 'detail']).all():
133+
print('WARNING:', output.iloc[0][-1])
134+
# Returns clean data if no error
135+
else:
136+
return clean_columns(output)
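
The error check added here inspects the first three DataFrame columns. One possible alternative, sketched below under assumptions, is to inspect the decoded JSON before it is normalized. It assumes, as the column check suggests, that a failed call returns a JSON object with `title`, `status`, and `detail` keys; the function name `error_detail` and both sample payloads are invented for illustration.

```python
ERROR_KEYS = ("title", "status", "detail")


def error_detail(payload):
    """Return the 'detail' message if payload looks like an API error, else None.

    Assumption: an error response is a JSON object carrying all of ERROR_KEYS.
    """
    if isinstance(payload, dict) and all(key in payload for key in ERROR_KEYS):
        return payload["detail"]
    return None


# Sample payloads, invented for illustration only
failed = {"title": "Bad Request", "status": 400, "detail": "Invalid series ID"}
succeeded = [{"seriesValue": 1.0}]

print(error_detail(failed))     # Invalid series ID
print(error_detail(succeeded))  # None
```

Checking the raw payload does not depend on the order or number of columns that normalization produces.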
145137

146138

147139
def get_series(seriesID=''):
@@ -183,7 +175,7 @@ def get_jurisdictions(jurisdictionID=''):
183175
return clean_columns(output)
184176

185177

186-
def get_periods(jurisdictionID=''):
178+
def get_periods(jurisdictionID='', documentType=3):
187179
"""
188180
Get dates available for all or one specific jurisdiction
189181
and compatible series IDs
@@ -195,7 +187,8 @@ def get_periods(jurisdictionID=''):
195187
if jurisdictionID:
196188
output = pd.io.json.json_normalize(
197189
requests.get(
198-
URL + f'/periods?jurisdiction={jurisdictionID}').json())
190+
URL + (f'/periods?jurisdiction={jurisdictionID}&'
191+
f'documentType={documentType}')).json())
199192
else:
200193
output = pd.io.json.json_normalize(
201194
requests.get(URL + f'/periods/available').json())
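
The two branches above build either a filtered query or the `/periods/available` path. A minimal sketch of the same URL construction using `urllib.parse.urlencode` follows; `BASE_URL` is a placeholder rather than the real API root, and `periods_url` is a hypothetical name, not a function in this package.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.example.org"  # placeholder, not the real API root


def periods_url(jurisdiction_id=None, document_type=3):
    """Build the periods URL the way the two branches of get_periods do."""
    if jurisdiction_id:
        # urlencode joins the parameters with '&' and escapes the values
        query = urlencode(
            {"jurisdiction": jurisdiction_id, "documentType": document_type})
        return f"{BASE_URL}/periods?{query}"
    return f"{BASE_URL}/periods/available"


print(periods_url(38))  # https://api.example.org/periods?jurisdiction=38&documentType=3
print(periods_url())    # https://api.example.org/periods/available
```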
@@ -235,37 +228,29 @@ def get_documents(jurisdictionID, documentType=3):
235228
return clean_columns(output)
236229

237230

238-
def list_topics():
231+
def list_document_types():
239232
"""
240-
Returns: a dictionary containing names of topics and associated IDs
233+
Returns: a dictionary containing names of documenttypes and associated IDs
241234
"""
242-
json = requests.get(URL + f'/topics/').json()
243-
return dict(sorted({t["topicName"]: t["topicID"] for t in json}.items()))
235+
json = requests.get(URL + f'/documenttypes').json()
236+
return dict(sorted({
237+
d["subtypeName"]: d["documentSubtypeID"]
238+
for d in json if d["subtypeName"]}.items()))
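
The list functions share one pattern: map each record's name to its ID, drop records without a name, and sort by name. The sketch below shows that pattern on invented sample records; the key names mirror those in `list_document_types`, but the subtype names and IDs are made up, and `names_to_ids` is a hypothetical helper.

```python
def names_to_ids(records, name_key, id_key):
    """Return a name -> ID dict sorted by name, skipping records with no name."""
    return dict(sorted(
        (record[name_key], record[id_key])
        for record in records if record[name_key]))


# Invented sample records; real subtype names and IDs come from the API
sample = [
    {"subtypeName": "Regulatory text", "documentSubtypeID": 3},
    {"subtypeName": None, "documentSubtypeID": 7},
    {"subtypeName": "Administrative code", "documentSubtypeID": 1},
]

print(names_to_ids(sample, "subtypeName", "documentSubtypeID"))
# {'Administrative code': 1, 'Regulatory text': 3}
```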
244239

245240

246241
def list_series():
247242
"""
248243
Returns: dictionary containing names of series and associated IDs
249244
"""
250-
json = requests.get(URL + f'/series/').json()
245+
json = requests.get(URL + f'/series').json()
251246
return dict(sorted({s["seriesName"]: s["seriesID"] for s in json}.items()))
252247

253248

254-
def list_document_types():
255-
"""
256-
Returns: a dictionary containing names of documenttypes and associated IDs
257-
"""
258-
json = requests.get(URL + f'/documenttypes/').json()
259-
return dict(sorted({
260-
d["subtypeName"]: d["documentSubtypeID"]
261-
for d in json if d["subtypeName"]}.items()))
262-
263-
264249
def list_agencies():
265250
"""
266251
Returns: dictionary containing names of agencies and associated IDs
267252
"""
268-
json = requests.get(URL + '/agencies/').json()
253+
json = requests.get(URL + '/agencies').json()
269254
return dict(sorted({
270255
a["agencyName"]: a["agencyID"]
271256
for a in json if a["agencyName"]}.items()))
@@ -275,7 +260,7 @@ def list_jurisdictions():
275260
"""
276261
Returns: dictionary containing names of jurisdictions and associated IDs
277262
"""
278-
json = requests.get(URL + f'/jurisdictions/').json()
263+
json = requests.get(URL + f'/jurisdictions').json()
279264
return dict(sorted({
280265
j["jurisdictionName"]: j["jurisdictionID"] for j in json}.items()))
281266

@@ -287,7 +272,7 @@ def list_industries(jurisdictionID):
287272
Returns: dictionary containing names of industries and their NAICS codes
288273
"""
289274
json = requests.get(
290-
URL + f'/industries?jurisdiction={jurisdictionID}/').json()
275+
URL + f'/industries?jurisdiction={jurisdictionID}').json()
291276
return dict(sorted({
292277
i["industryName"]: i["industryCode"] for i in json}.items()))
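
The trailing slash removed in this hunk sat after a query string, so it became part of the parameter value rather than the path. The sketch below shows this with the standard library only; the host is a placeholder, and how the server treated the old value is not stated in the commit.

```python
from urllib.parse import parse_qs, urlsplit


def query_params(url):
    """Return the parsed query parameters of a URL as a dict of lists."""
    return parse_qs(urlsplit(url).query)


# Placeholder host; only the query-string handling matters here
old = query_params("https://api.example.org/industries?jurisdiction=38/")
new = query_params("https://api.example.org/industries?jurisdiction=38")

print(old)  # {'jurisdiction': ['38/']}
print(new)  # {'jurisdiction': ['38']}
```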
293278
