Assignment 2 Spatial Data S25
Guidelines for the Assignment (FINAL):
The Spatial Data Assignment is an assignment in one step. It builds upon the work we have done in class since 25 February.
This assignment must be done in pairs or groups of three. It can be written together, but should be posted on each of the team member’s course site.
Imagine that we will work with the Zanzibar Gazette pdfs (in Drive) on a project entitled “Thick Mapping Zanzibar (1892-1919): Visualizing the Zanzibar Gazette.” Thick mapping, as we learned in class, is a process through which a number of geocoded datasets can be superimposed, creating a comparison between different kinds of geographically-specific phenomena. Each team will work on a dataset from the Zanzibar Gazette using an LLM to generate data from the OCR’d digital copies we possess. Imagine that the datasets that you generate could be all put together by the instructor after we are done to attempt some thick mapping of Zanzibar from the perspective of the British gazette.
This assignment has five main elements: (1) choosing a portion of the source material and identifying the spatial elements of this humanities source material (2) modeling (forming) a spatial dataset based on those elements, (3) correcting the data manually and enriching with any other relevant metadata, (4) visualizing the data on a map, and (5) discussion and interpretation of both process and product using an article on the humanities and genAI.
Step 1: Choice of topic that you would like to work on from within the Zanzibar Gazette pdfs. You should aim to create a dataset which has about 100 rows of data per person on your team. Please check with the instructor if you use one which is not included below.
- Imports into Zanzibar
- Exports from Zanzibar
- Shipping routes
- Licenses Issued (for commercial purposes)
- African Produce imported
- other
Since we are aiming for 200 rows for a group of two or 300 rows for a group of three, you may need to choose multiple pages from the Gazette, and this might mean choosing information in a similar format longitudinally in the Gazette. Your choice of topic may also need to be made based on some research about that topic across the newspaper. Remember that after 1913, the Gazette is more formalized.
Step 2: Modeling the data. You will need to choose what data you are going to extract from the source and imagine a structured data table into which it is set. It could be helpful to review some of the principles from an article we read earlier this term: Data Organization in Spreadsheets.
For step 2, you will semi-automate the process of data collection by using the pdf, or the text layer of the pdf or both. You will provide an LLM with a copy of this data and prompt it to extract the data for you. For this, you can use any tools that you feel comfortable with: chatGPT, Gemini (you can use your NYU netID for this one), Le Chat-Mistral, Claude Sonnets, etc. None of the available tools will do a sufficient job with historical materials. Remember that with Gemini, you can “show thinking” to help you in the process.
You are encouraged, therefore, to speak about how close to perfect you can get and how different prompts made a difference in the quality of the data you got. The best assignments will compare the performance of the different AI tools and will give insight into the iterative way that you went about creating the data in collaboration with the AI tool.
One of my former students, R.K., who works for an AI company in Canada said to me “I find the new models are the best and then they tend to deteriorate or hallucinate more often as they are trained on new noisy data or the engineers stop paying attention to them while they focus on newer versions.” Feel free to comment on this point if it is relevant. In other words, you can use the age of model as a point of comparison.
Please provide the best prompts you use to extract the data. A callout in markdown would be an excellent way to do this.
There will be some cleaning that is required to bring your dataset into shape so that it can be visualized. Please comment on what was required.
Step 3: Possible enriching the dataset with metadata. Sorting and combining the data: choosing the most important elements. Geocoding.
There are many ways that you can arrive at a final dataset. In particular, it may be interesting to combine some of the elements with bespoke tags (e.g. if you are working with lists of export items, you may add on a column to the data that combines all foodstuffs together, all animal products together, or all minerals together). This possible enriching of the data will be different according to the source you have chosen. You could also use the technique we explored in Overpass Turbo to add a data layer from OSM that is indirectly related to the locations of your data (e.g. historical monuments in the region).
This might be the case if you have data over time. You can bring together different tables, by the addition of a date column.
Finally, you have two options for geocoding: by hand or automating that process using Geocode by Awesome Table so that you have latitude and longitude columns.
Step 4: Visualizing the data.
Your choices for visualizing the spatial data are numerous. Perhaps the easiest is kepler.gl or UMAP. You may also choose to use some non-geographical visualization from within Google Sheets itself.
Insert the following line where you want the image to show up in the body of the .md file of your assignment:
<img src="/assets/{filename}.png" style="zoom:50%"/>
Use the zoom percentage to size the file. An example of the code of one of your instructor’s research projects can be found here
Step5 Discussion and Interpretation.
Read the draft article which has recently appeared entitled Provocations from the Humanities for Generative AI Research to help you in your discussion and interpretation of both the process and the product using genAI. You can do any extra research you want about Zanzibar and its historical context. Be sure to link to these materials or cite them at the end of the assignment.
Guiding questions for this assignment:
- What kind of source is the ZG?
- What kinds of data does that source provide?
- Was the data directly or indirectly communicated? In other words, did you have to assume anything to create new columns?
- How did you create prompt for ChatGPT, Gemini, Mistral or other tool to automate thse extraction of the information? What kind of “engineering” did you have to do to get better results with the prompts?
- How good was the automation? What did you have to change in order to clean the data?
- When you mapped the data, did you see any interesting patterns? Were there areas of concentration of points? Did mapping the additional columns (profession, gender, nationality, type of product, etc) lead to any visible clusters?
- If you were to do this over many more pages then you did for this assignment (say, look at exports over the years), what kind of results do you think the different scale would give?
- If you were to do this assignment with a source not included here, what would your ideal source be?
Your assignment should be about 1500 words long, with relevant several images, embedded links. Please place your data table into your assets folder (or other location in the repository where you host your GitHub site) and link to it in your assignment. Use a markdown cheatsheet such as this one to stylize your post, adding different layout features and embedded links if needed. You can refer to other readings if you want to, but this is not necessary for this assignment. Be sure to refer to the “Provocations” article mentioned above. Due date: 15 April.