Commit d87324c

Merge remote-tracking branch 'dspace-data-collection/master'
2 parents a108b24 + 57ce64b

22 files changed: +2205 −2 lines

README.md

Lines changed: 64 additions & 2 deletions
```diff
@@ -1,6 +1,6 @@
-# dspace-editing
+# dspace-api
 
-**Note**: Upgraded to Python 3 in 02/2019. The Python 2.x version can be downloaded [here](https://github.com/ehanson8/dspace-editing/releases)
+This repository was created from the merger of https://github.com/ehanson8/dspace-editing and https://github.com/ehanson8/dspace-data-collection, both of which have been archived. All further development will occur in this repository.
 
 **Note**: These scripts were updated in 05/2018 for the new authentication method used by DSpace 6.x
 
@@ -30,6 +30,12 @@
 #### [addNewItemsToCollection.py](addNewItemsToCollection.py)
 Based on user input, adds new items to the specified collection. In the specified directory, the script creates items and associated metadata based on a 'metadataNewFiles.json' file in the directory. The script then posts files for the appropriate items, which is determined by having the file name (minus the file extension) in a 'dc.identifier.other' field in the item metadata record.
 
+#### [compareTwoKeysInCommunity.py](compareTwoKeysInCommunity.py)
+Based on user input, extracts the values of two specified keys from a specified community to a CSV file for comparison.
+
+#### [countInitialedNamesByCollection.py](countInitialedNamesByCollection.py)
+Based on [mjanowiecki's](https://github.com/mjanowiecki) [findInitialedNamesByCollection.py](https://github.com/mjanowiecki/dspace-data-collection/blob/master/findInitialedNamesByCollection.py), finds values in name fields that appear to contain first initials that could be expanded to full names, and provides a count for each collection when the count is more than zero.
+
 #### [createItemMetadataFromCSV.py](createItemMetadataFromCSV.py)
 Based on user input, creates a JSON file of metadata that can be added to a DSpace item from the specified CSV file or from values directly specified in the script. The 'createMetadataElementCSV' function in the script is used to create a metadata element from the specified CSV file and has three variables:
 
@@ -58,9 +64,65 @@
 #### [editBitstreamsNames.py](editBitstreamsNames.py)
 Based on a specified CSV file of DSpace item handles and replacement file names, replaces the name of bitstreams attached to the specified items.
 
+#### [exportSelectedRecordMetadataToCSV.py](exportSelectedRecordMetadataToCSV.py)
+Based on a CSV of item handles, extracts all metadata (except 'dc.description.provenance' values) from the selected items to a CSV file.
+
+#### [findBogusUris.py](findBogusUris.py)
+Extracts the item ID and the value of the key 'dc.identifier.uri' to a CSV file when the value does not begin with the handlePrefix specified in the secrets.py file.
+
+#### [findDuplicateKeys.py](findDuplicateKeys.py)
+Based on user input, extracts item IDs to a CSV file where there are multiple instances of the specified key in the item metadata.
+
 #### [generateCollectionLevelAbstract.py](generateCollectionLevelAbstract.py)
 Based on user input, creates an HTML collection-level abstract that contains hyperlinks to all of the items in each series, as found in the metadata CSV. This assumes that the series title is recorded in 'dc.relation.ispartof' or a similar property in the DSpace item records. The abstract is then posted to the collection in DSpace.
 
+#### [getCollectionMetadataJson.py](getCollectionMetadataJson.py)
+Based on user input, extracts all of the item metadata from the specified collection to a JSON file.
+
+#### [getCompleteAndUniqueValuesForAllKeys.py](getCompleteAndUniqueValuesForAllKeys.py)
+Creates a 'completeValueLists' folder and, for all keys used in the repository, extracts all values for a particular key to a CSV file with item IDs. It also creates a 'uniqueValueLists' folder and writes a CSV file for each key with all unique values and a count of how many times each value appears.
+
+#### [getCompleteAndUniqueValuesForAllKeysInCommunity.py](getCompleteAndUniqueValuesForAllKeysInCommunity.py)
+Creates a 'completeValueLists' folder and, for all keys used in the specified community, extracts all values for a particular key to a CSV file with item IDs. It also creates a 'uniqueValueLists' folder and writes a CSV file for each key with all unique values and a count of how many times each value appears.
+
+#### [getFacultyNamesFromETDs.py](getFacultyNamesFromETDs.py)
+Based on user input, extracts all values from 'dc.contributor.advisor' and 'dc.contributor.committeeMember' fields from items in collections in the specified community.
+
+#### [getGlobalLanguageValues.py](getGlobalLanguageValues.py)
+Extracts all unique language values used by metadata entries in the repository to a CSV file.
+
+#### [getHandlesAndBitstreamsFromCollection.py](getHandlesAndBitstreamsFromCollection.py)
+Based on user input, extracts all the handles and bitstreams associated with the items in the specified collection to a CSV file.
+
+#### [getLanguageValuesForKeys.py](getLanguageValuesForKeys.py)
+Extracts all unique pairs of keys and language values used by metadata entries in the repository to a CSV file.
+
+#### [getRecordsAndValuesForKey.py](getRecordsAndValuesForKey.py)
+Based on user input, extracts the ID and URI for all items in the repository with the specified key, as well as the value of the specified key, to a CSV file.
+
+#### [getRecordsAndValuesForKeyInCollection.py](getRecordsAndValuesForKeyInCollection.py)
+Based on user input, extracts the ID and URI for all items in the specified collection with the specified key, as well as the value of the specified key, to a CSV file.
+
+#### [getRecordsWithKeyAndValue.py](getRecordsWithKeyAndValue.py)
+Based on user input, extracts the ID and URI for all items in the repository with the specified key-value pair to a CSV file.
+
+#### [identifyItemsMissingKeyInCommunity.py](identifyItemsMissingKeyInCommunity.py)
+Based on user input, extracts the IDs of items from a specified community that do not have the specified key.
+
+#### [metadataCollectionsKeysMatrix.py](metadataCollectionsKeysMatrix.py)
+Creates a matrix containing a count of each time a key appears in each collection in the repository.
+
+#### [metadataOverview.py](metadataOverview.py)
+Produces several CSV files containing different information about the structure and metadata of the repository:
+
+|File Name |Description|
+|--------------------------|--------------------------------------------------------------------------|
+|collectionMetadataKeys.csv |A list of all keys used in each collection with collection name, ID, and handle.|
+|dspaceIDs.csv |A list of every item ID along with the IDs of the collection and community that contain that item.|
+|dspaceTypes.csv |A list of all unique values for the key 'dc.type'.|
+|keyCount.csv |A list of all unique keys used in the repository, with a count of how many times each appears.|
+|collectionStats.csv |A list of all collections in the repository with the collection name, ID, handle, and number of items.|
+
 #### [overwriteExistingMetadata.py](overwriteExistingMetadata.py)
 Based on a specified CSV file of DSpace item handles and file identifiers, replaces the metadata of the items with the specified handles with the set of metadata elements associated with the corresponding file identifier in a JSON file of metadata entries named 'metadataOverwrite.json'.
```
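The scripts in this commit read their connection settings from a local `secrets.py` module (the README mentions it for findBogusUris.py, and the scripts below import `baseURL`, `email`, `password`, `filePath`, `verify`, and `skippedCollections` from it). A minimal sketch of such a file might look like the following; the variable names match what the scripts import, but every value here is a placeholder assumption:

```python
# secrets.py -- placeholder values only, not real credentials or hostnames
baseURL = 'https://dspace-stage.example.edu'    # no trailing slash; '/rest/...' is appended
email = 'admin@example.edu'                     # DSpace account used for /rest/login
password = 'changeme'
filePath = '/tmp/dspace-output/'                # directory where CSV/JSON output is written
handlePrefix = 'http://hdl.handle.net/1234.5/'  # expected prefix for dc.identifier.uri values
verify = False                                  # passed to requests; False skips TLS verification
skippedCollections = ['uuid-of-collection-to-skip']  # collection UUIDs to exclude
```

A production and a stage variant of this file can coexist; the scripts prompt for the module name and fall back to the default `secrets` import when none is given.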

checkInventory.py

Lines changed: 52 additions & 0 deletions
```python
import argparse
import os

import pandas as pd


def main():
    # begin: argument parsing
    parser = argparse.ArgumentParser()

    parser.add_argument('-i', '--inventory', required=True,
                        help='csv file containing the inventory; the path, if given, can be absolute or relative to this script')

    parser.add_argument('-d', '--dataDir',
                        help='directory containing the data; if omitted, data will be read from the directory containing the inventory file')

    parser.add_argument('-f', '--field', default='name',
                        help='field in the csv containing the fileNames (default: name)')

    parser.add_argument('-v', '--verbose', action='store_true',
                        help='increase output verbosity')

    args = parser.parse_args()

    if not args.dataDir:
        # fall back to the directory containing the inventory file
        args.dataDir, _ = os.path.split(args.inventory)

    if args.verbose:
        print('verbosity turned on')
        print('reading inventory from {}'.format(args.inventory))
        print('fileNames read from field named {}'.format(args.field))
        print('searching for files in {}'.format(args.dataDir))
    # end: argument parsing

    inventory = pd.read_csv(args.inventory, usecols=[args.field])
    fileNames = inventory[args.field]
    foundFiles = 0
    missingFiles = 0
    for fileName in fileNames:
        # os.path.join handles an empty dataDir (inventory in the current
        # directory) correctly, unlike manual '/' concatenation
        if os.path.isfile(os.path.join(args.dataDir, fileName)):
            if args.verbose:
                print('{} is not missing'.format(fileName))
            foundFiles += 1
        else:
            print('{} is missing'.format(fileName))
            missingFiles += 1

    print('{} files found and {} files missing'.format(foundFiles, missingFiles))


if __name__ == "__main__":
    main()
```
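A quick way to see the script's core check in action, using only the standard library in place of pandas (the file names here are made up for the demonstration):

```python
import csv
import os
import tempfile

# Build a throwaway inventory with two entries, create only one of the
# listed files on disk, and count found vs. missing -- the same check
# checkInventory.py performs over its CSV.
dataDir = tempfile.mkdtemp()
inventoryPath = os.path.join(dataDir, 'inventory.csv')
with open(inventoryPath, 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerows([['name'], ['present.pdf'], ['absent.pdf']])
open(os.path.join(dataDir, 'present.pdf'), 'w').close()

found = missing = 0
with open(inventoryPath, newline='') as inv:
    for row in csv.DictReader(inv):
        if os.path.isfile(os.path.join(dataDir, row['name'])):
            found += 1
        else:
            missing += 1
print('{} files found and {} files missing'.format(found, missing))
# prints "1 files found and 1 files missing"
```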

compareTwoKeysInCommunity.py

Lines changed: 121 additions & 0 deletions
```python
import argparse
import csv
import time

import requests
import urllib3

import secrets  # local secrets.py with connection settings, not the stdlib module

secretsVersion = input('To edit production server, enter the name of the secrets file: ')
if secretsVersion != '':
    try:
        secrets = __import__(secretsVersion)
        print('Editing Production')
    except ImportError:
        print('Editing Stage')
else:
    print('Editing Stage')

parser = argparse.ArgumentParser()
parser.add_argument('-1', '--key', help='the first key to be output. optional - if not provided, the script will ask for input')
parser.add_argument('-2', '--key2', help='the second key to be output. optional - if not provided, the script will ask for input')
parser.add_argument('-i', '--handle', help='handle of the community to retrieve. optional - if not provided, the script will ask for input')
args = parser.parse_args()

if args.key:
    key = args.key
else:
    key = input('Enter first key: ')
if args.key2:
    key2 = args.key2
else:
    key2 = input('Enter second key: ')
if args.handle:
    handle = args.handle
else:
    handle = input('Enter community handle: ')

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

baseURL = secrets.baseURL
email = secrets.email
password = secrets.password
filePath = secrets.filePath
verify = secrets.verify
skippedCollections = secrets.skippedCollections

startTime = time.time()
data = {'email': email, 'password': password}
header = {'content-type': 'application/json', 'accept': 'application/json'}
session = requests.post(baseURL+'/rest/login', headers=header, verify=verify, params=data).cookies['JSESSIONID']
cookies = {'JSESSIONID': session}

status = requests.get(baseURL+'/rest/status', headers=header, cookies=cookies, verify=verify).json()
print('authenticated')

# look up the community's UUID from its handle
endpoint = baseURL+'/rest/handle/'+handle
community = requests.get(endpoint, headers=header, cookies=cookies, verify=verify).json()
communityID = community['uuid']

# build a list of all item UUIDs in the community's collections,
# paging through each collection 200 items at a time
itemList = []
collections = requests.get(baseURL+'/rest/communities/'+str(communityID)+'/collections', headers=header, cookies=cookies, verify=verify).json()
for collection in collections:
    collectionID = collection['uuid']
    print(collectionID)
    if collectionID not in skippedCollections:
        offset = 0
        items = ''
        while items != []:
            items = requests.get(baseURL+'/rest/collections/'+str(collectionID)+'/items?limit=200&offset='+str(offset), headers=header, cookies=cookies, verify=verify)
            while items.status_code != 200:
                # back off and retry failed requests
                time.sleep(5)
                items = requests.get(baseURL+'/rest/collections/'+str(collectionID)+'/items?limit=200&offset='+str(offset), headers=header, cookies=cookies, verify=verify)
            items = items.json()
            for item in items:
                itemList.append(item['uuid'])
            offset = offset + 200
            print(offset)
elapsedTime = time.time() - startTime
m, s = divmod(elapsedTime, 60)
h, m = divmod(m, 60)
print('Item list creation time: ', '%d:%02d:%02d' % (h, m, s))

# extract the two keys' values for each item
valueList = []
for number, itemID in enumerate(itemList):
    itemsRemaining = len(itemList) - number
    print('Items remaining: ', itemsRemaining, 'ItemID: ', itemID)
    metadata = requests.get(baseURL+'/rest/items/'+str(itemID)+'/metadata', headers=header, cookies=cookies, verify=verify).json()
    tupleValue1 = ''
    tupleValue2 = ''
    for element in metadata:
        if element['key'] == key:
            tupleValue1 = element['value']
        if element['key'] == key2:
            tupleValue2 = element['value']
    itemTuple = (itemID, tupleValue1, tupleValue2)
    valueList.append(itemTuple)
    print(itemTuple)
print(valueList)

elapsedTime = time.time() - startTime
m, s = divmod(elapsedTime, 60)
h, m = divmod(m, 60)
print('Value list creation time: ', '%d:%02d:%02d' % (h, m, s))

f = csv.writer(open(filePath+key+'-'+key2+'Values.csv', 'w'))
f.writerow(['itemID', key, key2])
for row in valueList:
    f.writerow(row)

logout = requests.post(baseURL+'/rest/logout', headers=header, cookies=cookies, verify=verify)

elapsedTime = time.time() - startTime
m, s = divmod(elapsedTime, 60)
h, m = divmod(m, 60)
print('Total script run time: ', '%d:%02d:%02d' % (h, m, s))
```
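The collection-walking loop above fetches items 200 at a time until the endpoint returns an empty page. That pattern can be isolated into a small helper; `fake_page` below is a hypothetical in-memory stand-in for the `/rest/collections/{id}/items` endpoint, used only to exercise the loop:

```python
def fetch_all_items(fetch_page, limit=200):
    """Page through results until an empty page is returned, as the
    offset loop in compareTwoKeysInCommunity.py does."""
    itemIds = []
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        if not page:
            break
        itemIds.extend(item['uuid'] for item in page)
        offset += limit
    return itemIds

# Hypothetical stand-in for the REST endpoint: 450 records means two
# full pages and one partial page before the terminating empty page.
records = [{'uuid': 'item-{}'.format(n)} for n in range(450)]

def fake_page(limit, offset):
    return records[offset:offset + limit]

print(len(fetch_all_items(fake_page)))  # prints 450
```

Advancing the offset by exactly the request limit is what keeps the pages contiguous; any mismatch between the two skips or duplicates records.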

countInitialedNamesByCollection.py

Lines changed: 103 additions & 0 deletions
```python
import csv
import re
import time

import requests
import urllib3

import secrets  # local secrets.py with connection settings, not the stdlib module

secretsVersion = input('To edit production server, enter the name of the secrets file: ')
if secretsVersion != '':
    try:
        secrets = __import__(secretsVersion)
        print('Editing Production')
    except ImportError:
        print('Editing Stage')
else:
    print('Editing Stage')

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

baseURL = secrets.baseURL
email = secrets.email
password = secrets.password
filePath = secrets.filePath
verify = secrets.verify
skippedCollections = secrets.skippedCollections

startTime = time.time()
data = {'email': email, 'password': password}
header = {'content-type': 'application/json', 'accept': 'application/json'}
session = requests.post(baseURL+'/rest/login', headers=header, verify=verify, params=data).cookies['JSESSIONID']
cookies = {'JSESSIONID': session}

status = requests.get(baseURL+'/rest/status', headers=header, cookies=cookies, verify=verify).json()
userFullName = status['fullname']
print('authenticated')

# gather the UUIDs of all collections that are not skipped
collectionIds = []
endpoint = baseURL+'/rest/communities'
communities = requests.get(endpoint, headers=header, cookies=cookies, verify=verify).json()
for community in communities:
    communityID = community['uuid']
    collections = requests.get(baseURL+'/rest/communities/'+str(communityID)+'/collections', headers=header, cookies=cookies, verify=verify).json()
    for collection in collections:
        collectionID = collection['uuid']
        if collectionID not in skippedCollections:
            collectionIds.append(collectionID)

keys = ['dc.contributor.advisor', 'dc.contributor.author', 'dc.contributor.committeeMember', 'dc.contributor.editor', 'dc.contributor.illustrator', 'dc.contributor.other', 'dc.creator']

f = csv.writer(open('initialCountInCollection.csv', 'w'))
f.writerow(['collectionName']+['handle']+['initialCount'])

for number, collectionID in enumerate(collectionIds):
    initialCount = 0
    collectionsRemaining = len(collectionIds) - number
    print(collectionID, 'Collections remaining: ', collectionsRemaining)
    collection = requests.get(baseURL+'/rest/collections/'+str(collectionID), headers=header, cookies=cookies, verify=verify).json()
    collectionName = collection['name']
    collectionHandle = collection['handle']
    collSels = '&collSel[]=' + collectionID
    offset = 0
    items = ''
    while items != []:
        for key in keys:
            endpoint = baseURL+'/rest/filtered-items?query_field[]='+key+'&query_op[]=exists&query_val[]='+collSels+'&limit=100&offset='+str(offset)
            print(endpoint)
            response = requests.get(endpoint, headers=header, cookies=cookies, verify=verify).json()
            items = response['items']
            for item in items:
                itemLink = item['link']
                metadata = requests.get(baseURL + itemLink + '/metadata', headers=header, cookies=cookies, verify=verify).json()
                individual_name = ''
                for metadata_element in metadata:
                    if metadata_element['key'] == key:
                        individual_name = metadata_element['value']
                contains_initials = re.search(r'(\s|,|[A-Z]|([A-Z]\.))[A-Z](\s|$|\.|,)', individual_name)
                contains_middleinitial = re.search(r'((\w{2,},\s)|(\w{2,},))\w[a-z]+', individual_name)
                contains_parentheses = re.search(r'\(|\)', individual_name)
                if contains_middleinitial or contains_parentheses:
                    continue
                elif contains_initials:
                    initialCount += 1
        # advance by the request limit (the original advanced the offset
        # by 200 against a limit of 100, skipping half of each result set)
        offset = offset + 100
        print(offset)
    if initialCount > 0:
        f.writerow([collectionName]+[baseURL+'/'+collectionHandle]+[str(initialCount).zfill(6)])

logout = requests.post(baseURL+'/rest/logout', headers=header, cookies=cookies, verify=verify)

elapsedTime = time.time() - startTime
m, s = divmod(elapsedTime, 60)
h, m = divmod(m, 60)
print('Total script run time: ', '%d:%02d:%02d' % (h, m, s))
```
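The three regular expressions above decide whether a name value is counted. Pulling them into a standalone function makes the behavior easy to check against sample names (the samples are illustrative, not drawn from any repository):

```python
import re

def looks_initialed(name):
    """Apply the same three patterns countInitialedNamesByCollection.py
    uses: count a name only when it appears to contain a bare initial
    and has neither a spelled-out first name nor parentheses."""
    contains_initials = re.search(r'(\s|,|[A-Z]|([A-Z]\.))[A-Z](\s|$|\.|,)', name)
    contains_middleinitial = re.search(r'((\w{2,},\s)|(\w{2,},))\w[a-z]+', name)
    contains_parentheses = re.search(r'\(|\)', name)
    return bool(contains_initials) and not contains_middleinitial and not contains_parentheses

print(looks_initialed('Smith, J.'))    # prints True  (bare initial, could be expanded)
print(looks_initialed('Smith, John'))  # prints False (first name already spelled out)
```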
