
Guidelines for the Assignment:

The Spatial Data Assignment is an assignment in one step. It builds upon the work we did in class on October 17th, with a refresher on November 16th.

This assignment can be done alone or in pairs.

This exercise has five main elements: (1) choosing a source, (2) identifying the spatial elements of the humanities source material, (3) modeling a spatial dataset based on those elements, (4) enriching the spatial data with any other relevant metadata, and (5) visualizing the data on a map, with discussion and interpretation.

For this exercise, if you want to semi-automate the process, you will need an account at openai.com.

Step 1: Choice of source. You should identify the sourcebook you would like to work with. Below is a list of possible sources from archive.org, but you can also find your own. Please check with the instructor if you use one which is not included below.

Source #1 Fielding’s Travel Guide to Europe

Source #2 Standard Guide to Cuba, 1905

Source #3 Promenade dans Paris (in French). This source would require OCR.

Source #4 An Atlas of Ancient Egypt

Source #5 Smolensk Telephone Book from 1910 (in Russian): Список абонентов Смоленской телефонной сети, устроенной и эксплуатируемой Правительством (List of Subscribers of the Smolensk Telephone Network, Built and Operated by the Government).

Source #6 The Times of India, Directory of Bombay City and Province, 1940.

Source #7 Directory of Museums of India, 1959

Source #8 Directory of Protestant Missions in China, 1936

Source #9 Hong Kong Telegraph. There is a whole run of this newspaper. An interesting project could be locating advertisers across several issues.

Source #10 Kazakhstan Defense Enterprise Directory, 1995

If your source has not been processed with optical character recognition (OCR), you need to do this first. You can reach out to the instructor for help with this; it takes just a few minutes.

Step 2: Modeling the data. You should identify what data is available from the source.

For some sources, the number of pages may make it very difficult to extract all the information using ChatGPT 3.5. Choose which part of the source you will work on: decide how many pages to use, and be prepared to work in batches. Aim for a dataset of at least 50 rows or so, which can be compiled manually. If you use ChatGPT to extract the data, please make sure that the number of rows is significantly larger than that.
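Working in batches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the page texts are placeholders, the column names are hypothetical, and the prompt wording is one possible phrasing rather than a required one. The sketch only prepares the prompts; you would still paste each one into ChatGPT yourself.

```python
from typing import Iterator


def batch_pages(pages: list[str], batch_size: int = 5) -> Iterator[list[str]]:
    """Yield consecutive groups of at most `batch_size` pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(pages), batch_size):
        yield pages[start:start + batch_size]


def build_prompt(batch: list[str], columns: list[str]) -> str:
    """Assemble one extraction prompt for a batch of OCR'd pages."""
    header = ",".join(columns)
    text = "\n\n".join(batch)
    return (
        "Extract every entry that mentions a place from the text below.\n"
        f"Return CSV with exactly these columns: {header}\n"
        "Leave a cell empty if the text does not state the value.\n\n"
        f"TEXT:\n{text}"
    )


# Placeholder strings standing in for the OCR'd text of twelve pages.
pages = [f"(OCR text of page {n})" for n in range(1, 13)]
columns = ["name", "address", "city", "profession"]  # hypothetical column names
prompts = [build_prompt(batch, columns) for batch in batch_pages(pages, batch_size=5)]
print(f"{len(prompts)} prompts prepared")
```

Keeping the column list identical across batches makes it easier to paste the results together into one spreadsheet afterwards.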

Step 3: Enriching the data. Geocoding. Adding any other data.

As you work with the data, examine what other kinds of information can be associated with the main points, and add columns to record that information. These columns will differ from source to source.

Examples of other columns might be:

  • professions of people
  • numbers of people
  • gender of person
  • neighborhoods, etc.
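One way to picture such a dataset is as a fixed set of columns that every row must follow. The sketch below is an assumption-laden illustration: the column names are hypothetical choices based on the examples above, and the sample row is invented, not taken from any of the listed sources.

```python
import csv
import io

# Hypothetical column names; adapt them to what your source actually provides.
BASE_COLUMNS = ["name", "address", "city"]
ENRICHMENT_COLUMNS = ["profession", "people_count", "gender", "neighborhood"]
GEO_COLUMNS = ["latitude", "longitude"]
ALL_COLUMNS = BASE_COLUMNS + ENRICHMENT_COLUMNS + GEO_COLUMNS


def make_row(**values: str) -> dict[str, str]:
    """Build one row, leaving any column the source does not state empty."""
    unknown = set(values) - set(ALL_COLUMNS)
    if unknown:
        raise KeyError(f"unknown columns: {sorted(unknown)}")
    return {column: values.get(column, "") for column in ALL_COLUMNS}


def to_csv(rows: list[dict[str, str]]) -> str:
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=ALL_COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented sample entry, for illustration only.
rows = [make_row(name="Example Pharmacy", address="1 Example Street",
                 city="Example City", profession="pharmacist")]
print(to_csv(rows))
```

Leaving a cell empty when the source is silent, rather than guessing, keeps it clear later which values were directly communicated and which were inferred.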

Finally, you can geocode by hand or automate that process using Geocode by Awesome Table so that you have latitude and longitude columns.
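Whichever way you geocode, it is worth checking the resulting coordinates before mapping them. The following is a minimal sketch, assuming the rows have been exported from your spreadsheet with columns named `latitude` and `longitude` (hypothetical names); it separates usable rows from rows that need a second look. The sample rows are invented.

```python
from typing import Optional


def parse_coordinate(value: object, limit: float) -> Optional[float]:
    """Return the value as a float if it lies within [-limit, limit], else None."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    return number if -limit <= number <= limit else None


def split_geocoded(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate rows with valid coordinates from rows that need manual review."""
    usable, needs_review = [], []
    for row in rows:
        latitude = parse_coordinate(row.get("latitude"), 90.0)
        longitude = parse_coordinate(row.get("longitude"), 180.0)
        if latitude is None or longitude is None:
            needs_review.append(row)
        else:
            usable.append({**row, "latitude": latitude, "longitude": longitude})
    return usable, needs_review


# Invented sample rows, for illustration only.
sample = [
    {"name": "Entry A", "latitude": "12.5", "longitude": "45.25"},
    {"name": "Entry B", "latitude": "", "longitude": "45.25"},     # geocoder found nothing
    {"name": "Entry C", "latitude": "45.25", "longitude": "200"},  # out of range
]
usable, needs_review = split_geocoded(sample)
print(len(usable), "usable rows;", len(needs_review), "rows to check by hand")
```

Rows in the review list are good candidates for geocoding by hand.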

Step 4: Visualizing the data. Discussion and Interpretation.

You have many options for visualizing the spatial data. Perhaps the easiest are kepler.gl and Google My Maps. uMap is fine, as are other packages.

Use color categories for the points to categorize types of points. When you have a map that shows the extent of your data, you can screenshot it and include it as an image.
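If your mapping tool accepts GeoJSON (kepler.gl does, alongside CSV), the category used for coloring can travel with each point as a property. The sketch below is one possible approach, not a required step: the `category` property name and the input column names are assumptions, and the sample rows are invented.

```python
import json


def rows_to_geojson(rows: list[dict], category_column: str) -> dict:
    """Convert geocoded rows into a GeoJSON FeatureCollection of points."""
    features = []
    for row in rows:
        # Keep every non-coordinate column as a property of the point.
        properties = {key: value for key, value in row.items()
                      if key not in ("latitude", "longitude")}
        # Rows without a category value are grouped under "unknown".
        properties["category"] = row.get(category_column) or "unknown"
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "Point",
                # GeoJSON orders coordinates as [longitude, latitude].
                "coordinates": [float(row["longitude"]), float(row["latitude"])],
            },
            "properties": properties,
        })
    return {"type": "FeatureCollection", "features": features}


# Invented sample rows, for illustration only.
sample_rows = [
    {"name": "Entry A", "latitude": "1.5", "longitude": "2.5", "profession": "baker"},
    {"name": "Entry B", "latitude": "3", "longitude": "4", "profession": ""},
]
collection = rows_to_geojson(sample_rows, category_column="profession")
print(json.dumps(collection, indent=2))
```

In the mapping tool you can then choose the `category` field as the basis for the point colors.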

Insert the following line where you want the image to show up in the body of the .md file of your assignment:

<img src="/assets/{filename}.png" style="zoom:50%"/>

Use the zoom percentage to size the image. An example of the code of one of your instructor’s research projects can be found here.

Guiding questions for this assignment:

  • What kind of source is it?
  • What kinds of data does that source provide?
  • Was the data directly or indirectly communicated? In other words, did you have to assume anything to create new columns?
  • If you used ChatGPT, how did you create a prompt to automate the extraction of the information?
  • When you mapped the data, did you see any interesting patterns? Were there areas of concentration of points? Did mapping the additional columns (profession, gender, etc.) lead to any visible clusters?
  • If you were to do this over many more pages than you did for this assignment, what kind of results do you think the different scale would give?
  • If you were to do this assignment with a source not included here, what would your ideal source be?
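For the question about areas of concentration, a rough count can complement what you see on the map. The sketch below is a simple illustration, not a prescribed method: it bins points into a grid of square cells and counts the points per cell. The cell size of 0.01 degrees is an arbitrary assumption, and the sample points are invented.

```python
import math
from collections import Counter


def grid_counts(points: list[tuple[float, float]],
                cell_size_degrees: float = 0.01) -> Counter:
    """Count how many (latitude, longitude) points fall into each grid cell."""
    if cell_size_degrees <= 0:
        raise ValueError("cell_size_degrees must be positive")
    counts: Counter = Counter()
    for latitude, longitude in points:
        cell = (math.floor(latitude / cell_size_degrees),
                math.floor(longitude / cell_size_degrees))
        counts[cell] += 1
    return counts


# Invented sample points, for illustration only.
points = [(10.001, 20.001), (10.002, 20.003), (10.5, 20.5)]
counts = grid_counts(points)
for cell, count in counts.most_common(3):
    print(f"grid cell {cell}: {count} point(s)")
```

Cells with noticeably higher counts than the rest suggest concentrations worth discussing, although the result depends on the cell size you choose.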

Your assignment should be about 1250 words long, with several images and embedded links. Use a markdown cheatsheet such as this one to style your post, adding different layout features and embedded links if needed. You can refer to the readings if you want to, but this is not necessary for this assignment. Due date: 10 May 2024.