Note: The Python scripts mentioned in this tutorial are available as GitHub gists.
JSTOR has recently made freely available an “Early Journal Content Data Bundle,” which contains every article from every journal in JSTOR’s massive database that was published prior to 1923 (for US publications; prior to 1870 for other publications). This amounts to more than 450,000 articles from more than 200 journals. Some of the journals in this data bundle that would most interest historians of psychology include American Journal of Psychology, Philosophical Review, and Journal of Philosophy, Psychology, and Scientific Methods.1 There are a number of other prominent social science journals in the bundle that might be mined. These include American Anthropologist, American Journal of Sociology, and American Journal of Philology.
(Note, the examples below were done on a Mac. PC users will go through the same process, but the screen will look somewhat different.)
Go to the Data For Research beta website in order to get the JSTOR data bundle. Fill in the brief registration form if you do not already have a JSTOR “Data for Research” account. Then return to this page and download the file called ejc.tar.bz2 from the “here” link, about four lines from the bottom of the page:
(You can also download it from Archive.org.)
Note: This file is large (1.7GB), but when you expand it, it will become even larger: 450,000+ different files totaling about 7.5GB of disk space. So, make sure you place the file in a folder where you want all of these many thousands of files to reside before you expand it. Moving them afterward would take a great deal of time.
To expand a compressed file of this kind, most computer systems have a resident file-archiving program that allows you to do it just by double-clicking on the file name once you have placed it in the folder where the articles will be housed. My own version of this program was a little too old to handle this file, so I had to download a program called “The Unarchiver” from the web. It is free, and did the job easily. Because the JSTOR data bundle is so large, it will probably take a long time to expand on your computer. Be patient.
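If your archiving program cannot handle the file, and you already have Python installed (see below), the archive can probably also be expanded from Python itself. The following is only a minimal sketch using the standard `tarfile` module; the function name `expand_bundle` and the example path are my own placeholders, not part of the tutorial's scripts.

```python
import tarfile
from pathlib import Path

def expand_bundle(archive_path, target_dir):
    """Extract a .tar.bz2 archive into target_dir and return that folder."""
    target = Path(target_dir)
    # Create the destination folder if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    # "r:bz2" tells tarfile to read a bzip2-compressed tar archive.
    with tarfile.open(archive_path, "r:bz2") as archive:
        archive.extractall(path=target)
    return target

# Example call, commented out because it needs the real download
# and a folder path of your own choosing (placeholder shown):
# expand_bundle("ejc.tar.bz2", "/path/to/JSTOR-journals")
```

As with the double-click method, expect this to run for a long time on the full bundle.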
Once the file has expanded, go into the folder called “bundle” and have a look at the file names. As you can see below, the names that JSTOR has chosen unfortunately look like this:
10.2307_4572210. They are completely useless for figuring out where the journal articles you are interested in are located. Not exactly user-friendly!
In order to find out which of the numerically-named files correspond to which journals, you are going to need a program that searches through each file for you, finds the journal title, and then outputs a list of the journals, along with the numbered file names that correspond to each journal.2
This process is precisely what is done by the first program here,
journal_list.py. If you look at the program (just using a basic text editor like Apple’s “Text Edit” or Windows’ “Notepad” to open it) you can begin to see how it works:
The first several lines all have the
# character in front of them. Any line that begins with this character is a “comment” for the reader; it is not processed by the program as a computer command. At the start of this program these early comment lines give you a bit of general information about the program: its name, a brief description of its function, who wrote it, and when. After the line that reads
#START PROGRAM, however, the comments tell you instead what each line of the program does.3
You can read through the program and its comments on your own, if you would like. After that, you should be able to run the program yourself, but first you need to install the programming language Python on your computer. You can download it from Python.org. Scroll down the page to the section titled “Download” and click on the version that is compatible with your operating system (OS X, Windows, Linux, etc.).4
If you are having difficulties with this, there is a short video that guides you through the process in the online course called “Learn to Program: The Fundamentals.”5 The video you want is the second one in Week 1 of the course, “Getting Started: Installing Python.” If you are interested in learning more about Python programming, this is a good place to start. The front page looks like this:
Once you have downloaded and opened Python, and have it in a window on your computer’s desktop, click on the IDLE icon and a Python programming window will appear (see below). IDLE is a programming environment for Python that has a program debugger and other features.
(I have shown icons for my files here. Your computer may be set to simply display a list of files.)
The window will be mostly blank but for a couple of lines at the top that start with “Python 3.X.X…” We call this window the Python Shell (remember this term, it comes up again later):
(You may need to press your Enter or Return key a couple of times in order to get the Python prompt, which looks like this: >>>)
Now you are ready to go. You could just start writing Python commands here. Try entering
2+2, and then hitting the
Enter key. Did you get
4? Now try entering:
turkey='gobble'. When you hit Enter, you should get nothing but the Python prompt:
>>>. Now enter the word:
turkey. When you hit Enter, you should get:
'gobble'. You just created a variable called turkey and gave it as content the character string
'gobble'. Then, when you called the variable name
turkey, the program gave you back its content:
'gobble'. This can be fun, but normally we write programs in advance in a different window, and then run them so that their output appears in the Python shell. That is what we will do here.
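To make the difference concrete, here is the same little experiment written as a program rather than typed at the prompt. It is only an illustration; the variable name `total` is my own addition. Note that a program has to use `print` to make anything appear in the Python Shell.

```python
# A tiny script version of the experiment above.
total = 2 + 2        # the same arithmetic you typed at the prompt
turkey = 'gobble'    # store a character string in the variable turkey

print(total)         # prints 4
print(turkey)        # prints gobble (print shows the string without quotes)
```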
Before you can run the Python programs I have written to handle the JSTOR Early Journal Data Bundle, you will need to create a new folder for the programs on your computer. Then copy the two programs
journal_list.py and move_psychphil.py from this webpage into that folder. Then, click back to the IDLE window.
Once you are there, click on the menu bar at the top:
File > Open and find the file
journal_list.py from the folder where you stored it. Double click on it. This will open a new window that contains the program, next to (or sometimes on top of) the Python Shell window. You will notice that the text of the program now appears in a variety of colors (see below): The red words are the comments that tell you what the program is doing at each stage along the way. The orange and purple words are reserved programming terms (e.g., for, in, if, and, else, print, open). The green words are mostly the contents that we are putting into string variables (such as
'gobble' was for the variable
turkey, above). The black words are various other things.
Let’s run the program! You should have two windows on your screen: the one you just opened that contains the Python program
journal_list.py, and the mostly blank Python Shell in which we played with
turkey above. The first thing you have to do is change the
dirname (directory name) that appears in line 18 of the program. (Note: when you are in the program window, the line that the cursor is currently on is given at the lower right-hand corner of the window.) The line now reads:
dirname='/Users/chriso/Desktop/JSTOR-journals/bundle/'. You need to change it so the portion between the single quotation marks reflects the full directory path to the folder in which you expanded the JSTOR Early Journal Content Data Bundle on your computer. Change it now. Make sure you leave the single quotation marks intact around your pathname, and that you leave dirname= just as it was. Then go to the menu bar at the top of the screen and click
File > Save.
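If you are unsure whether you typed the path correctly, you can check it in the Python Shell before running the whole program. This is a minimal sketch, not part of journal_list.py; the helper name `check_dirname` is hypothetical, and the path shown is the example path from the text, which you should replace with your own.

```python
import os

def check_dirname(dirname):
    """Return a short message saying whether dirname is an existing folder."""
    if os.path.isdir(dirname):
        count = len(os.listdir(dirname))
        return f"OK: {dirname} exists and holds {count} entries"
    return f"Problem: {dirname} is not a folder on this computer"

# Example path from the tutorial; replace it with your own.
dirname = '/Users/chriso/Desktop/JSTOR-journals/bundle/'
print(check_dirname(dirname))
```

If the message begins with "Problem", look for a misspelled folder name or a missing slash before you run the program.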
Now we are ready to run the program. Click on the window containing the program. Go to the menu bar at the top of the screen, and click on
Run > Run Module. (If you didn’t save the program before, you will be prompted to save the file before you can run it. If this happens, just click OK.) Now look at the other window – the Python Shell. First you will see a line appear in the Shell window that says:
===== RESTART =====. Then there may be a long wait while the program imports the 450,000+ files in the JSTOR data bundle. Then it will start to list the names of journals it finds in those files, like this:
Some of the journal names will have three asterisks in front of them and be separated by blank lines above and below. These are journals the titles of which contain “PSYCH” or “PHILOS,” or are the journals Science or Scientific Monthly. These are the journals I was most interested in (though some, like the Philosophical Transactions (of the Royal Society of London), are not related to our current work).
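The highlighting rule just described can be sketched roughly as follows. This is an illustration under my own assumptions, not the code in journal_list.py: the function names are hypothetical, and I am assuming that the comparison ignores upper and lower case and that Science and Scientific Monthly are matched as whole titles.

```python
def is_highlighted(title):
    """True if the title contains PSYCH or PHILOS, or is one of two named journals."""
    upper = title.upper()
    return (
        "PSYCH" in upper
        or "PHILOS" in upper
        or upper in ("SCIENCE", "SCIENTIFIC MONTHLY")
    )

def format_title(title, number):
    """Format one output line; highlighted titles get asterisks and blank lines."""
    if is_highlighted(title):
        return f"\n*** {title} {number}\n"
    return f"{title} {number}"

print(format_title("American Journal of Sociology", "0000000"))  # plain line
print(format_title("Philosophical Review", "0000000"))           # starred line
```

The numbers in the two example calls are dummy values. To highlight other journals, you would change the strings tested in `is_highlighted`.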
After each journal title is a number. This number corresponds to the last several digits in the JSTOR filename where the run of that particular journal’s articles begins. For instance, the articles from Annals of the American Academy of Political and Social Science start at the file that JSTOR named
10.2307_1008595 (where the portion after the underscore is the part that appears after the journal’s name in the program output). Articles from that journal continue on from that point until you reach the next journal title in the output (though be careful to note that there are large gaps in JSTOR’s file-numbering scheme).
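Once you have the list of starting numbers, working out which journal a given file belongs to is a simple lookup: find the last starting number that is not larger than the file's own number. The sketch below assumes, as described above, that each journal's files run on from its starting number until the next journal begins. Only the first entry in the table comes from the text; the second is an invented placeholder, and the function name is hypothetical.

```python
from bisect import bisect_right

# (starting number, journal title) pairs, sorted by starting number.
# The second entry is a made-up placeholder for illustration only.
journal_starts = [
    (1008595, "Annals of the American Academy of Political and Social Science"),
    (2000000, "Hypothetical Journal B"),
]

def journal_for(filename, starts=journal_starts):
    """Guess the journal for a file name such as '10.2307_1008600'."""
    # Take the part after the underscore, dropping any file extension.
    number = int(filename.split("_")[1].split(".")[0])
    numbers = [start for start, _ in starts]
    index = bisect_right(numbers, number) - 1
    if index < 0:
        return None  # the file comes before the first journal in the table
    return starts[index][1]

print(journal_for("10.2307_1008595"))
```

Because of the gaps in the numbering, a number that falls inside a journal's range is not guaranteed to exist as a file.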
How does the program find the journal titles? If you look in any of the article files (using “Text Edit” or “Notepad”) you will find that, near the end of the file, there is a markup tag that reads: . The words immediately after that tag are the title of the journal. Immediately after the title is a second markup tag that reads: . The program just looks for those two tags and prints to the Python Shell whatever happens to be between them. Without those markup tags, it would be impossible for the program to tell the difference between the journal title and any other text in the file.
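The idea of pulling out whatever lies between two tags can be sketched in a few lines. This is not the code from journal_list.py. The tag strings below are hypothetical stand-ins, so substitute the tags you actually see in the article files. The function name `text_between` is also my own.

```python
def text_between(text, open_tag, close_tag):
    """Return the text between the first open_tag and the next close_tag, or None."""
    start = text.find(open_tag)
    if start == -1:
        return None              # opening tag not found
    start += len(open_tag)       # skip past the opening tag itself
    end = text.find(close_tag, start)
    if end == -1:
        return None              # closing tag not found
    return text[start:end].strip()

# Placeholder tags, NOT the real JSTOR markup.
OPEN_TAG = "<title-tag>"
CLOSE_TAG = "</title-tag>"

sample = "...article text... <title-tag>Philosophical Review</title-tag>"
print(text_between(sample, OPEN_TAG, CLOSE_TAG))  # prints Philosophical Review
```

Returning `None` when a tag is missing lets the calling code skip files that do not contain a journal title.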
The program will run for some time. It will appear to stop periodically, but it is just reading through the thousands of files that are not relevant to our search. It is not finished executing until you see the Python prompt:
>>> (or, sometimes, until you get a red error message because the program ran longer than the number of files that were available). As the program runs, you will see why I decided to highlight the Psych, Philos, and Science journals with the stars: there are so many journals in the bundle that the ones we want are quite hard to spot by eye. (You can change line 40 in the program if you would like the program to highlight other journals instead.)
After the program has finished running, the list of journals will be in your Python Shell window only. (That is, it has not been saved to a file on your computer.) If you would like to keep the list so that you can peruse it later, simply select it and copy it, just like you would from any text document. Then use your favorite word processing program to open a new document and paste the list there. Then you can save it, alter it, search it, etc. in whatever way you would like.
The second program, called
move_psychphil.py, is longer and more complicated. First, if you didn’t before, copy it now into the same directory where you stored that last program,
journal_list.py. Then, from the IDLE window, use
File > Open to display the program in its own window. It looks like this:
(You may want to close the
journal_list.py window at this point to prevent confusion between the two.)
The new program,
move_psychphil.py, does a number of different things. First, it copies all of the article files from The Monist, Philosophical Review, and Journal of Philosophy, Psychology and Scientific Methods to a new folder. It also renames these files so that you can tell where they came from just by looking at their file names. Second, it creates a list of every article in each of these three journals and saves it to a
.csv file that can be easily imported into a spreadsheet program like Excel.6
The list includes the authors’ names, the article titles, and the citation information for each article. Also, the author’s name has been manipulated so that the last name is first, and the first name and middle initial(s) come after, with no spaces (to make alphabetization easier). One problem with this program is that, when it goes looking for the author’s last name, it occasionally picks up a “Jr.” instead. You will either have to alter the program to fix this, or just search through the list after it has been created, looking for and correcting names with “Jr.” in them. Also, you may find that non-English characters (e.g., anything with a diacritical mark, like an accent or an umlaut) are not rendered properly.
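If you decide to alter the program to deal with the “Jr.” problem, the general approach might look something like the sketch below. It is not the code in move_psychphil.py. The function name, the list of suffixes, the sample names, and the exact output format (last name, a comma, then the remaining parts run together without spaces) are all my assumptions.

```python
# Name endings that should not be mistaken for a last name (assumed list).
SUFFIXES = {"Jr.", "Jr", "Sr.", "Sr"}

def last_name_first(full_name):
    """Turn 'First M. Last' into 'Last,FirstM.', keeping a suffix such as Jr."""
    # Split on spaces and drop any commas attached to the pieces.
    parts = [piece.rstrip(",") for piece in full_name.split()]
    if not parts:
        return ""
    suffix = []
    if len(parts) > 1 and parts[-1] in SUFFIXES:
        suffix = [parts.pop()]   # set the suffix aside before taking the last name
    last = parts.pop()
    rest = "".join(parts + suffix)
    return f"{last},{rest}" if rest else last

# Sample names for illustration only.
print(last_name_first("G. Stanley Hall"))   # prints Hall,G.Stanley
print(last_name_first("John Smith, Jr."))   # prints Smith,JohnJr.
```

The key step is checking the final word against the suffix list before treating it as the last name.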
If you look at lines 35-37 of the program, you will find the place where you can input the titles of journals other than the three listed above, if you would like. There are a few suggestions in the comment lines 38-39. If you try to alter these lines, be sure not to disrupt the square brackets. First, there is a list of the full titles of the journals, followed by a list of the journal title abbreviations that I wanted to use in the names of the corresponding article files. (You will also have to adjust the code in lines 121-132, if you want .csv summary lists of the journals you have added.)
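The two lists work as a matched pair: the first title goes with the first abbreviation, the second with the second, and so on. Here is a sketch of that arrangement. The variable names are hypothetical, and apart from JPPSM (which appears in the file names shown later) the abbreviations are invented for illustration.

```python
# Full journal titles, in square brackets, separated by commas.
journal_titles = [
    "The Monist",
    "Philosophical Review",
    "Journal of Philosophy, Psychology and Scientific Methods",
]

# Abbreviations in the SAME order as the titles above.
# "Monist" and "PhilRev" are made-up examples; "JPPSM" comes from the tutorial.
journal_abbrevs = ["Monist", "PhilRev", "JPPSM"]

# The lists must be the same length, or titles and abbreviations will not line up.
assert len(journal_titles) == len(journal_abbrevs)

# Pair each title with its abbreviation for easy lookup.
abbrev_for = dict(zip(journal_titles, journal_abbrevs))
print(abbrev_for["Philosophical Review"])  # prints PhilRev
```

If you add a title to one list, add its abbreviation at the same position in the other.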
As above, before you run the program, you will have to change the directory name to suit your computer’s directory tree. But this time there are two of them, in lines 19 and 20. The variable
fromdirname contains the path to the JSTOR article files (you used this in the last program, where it was called just
dirname). The variable
todirname contains the path to the folder where you want the newly re-named article files to be stored, along with the
.csv list for each journal. Note: you will have to create the folder in the right place ahead of time. If the program cannot find the folder you have named in the
todirname line, it will crash.
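If you would rather have Python create the destination folder for you, something like the following could be run in the Python Shell beforehand. This is a sketch using the standard `os` module, not part of move_psychphil.py; the function name and the example path are placeholders.

```python
import os

def ensure_folder(path):
    """Create the folder (and any missing parent folders); return True if it exists."""
    # exist_ok=True means no error is raised if the folder is already there.
    os.makedirs(path, exist_ok=True)
    return os.path.isdir(path)

# Example call, commented out; replace the placeholder path with your own todirname.
# ensure_folder('/path/to/psychphil-articles/')
```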
To start the program, as before, go to the menu bar and click
Run > Run Module. There will be a long wait while the program gathers up all the hundreds of thousands of articles and starts to work through them, looking for ones that correspond to the journals listed in lines 35-37. Once it finds them, it will start to print to the Python Shell the journal title, article title, and author name for each relevant file it finds (see below). This will assure you that the program is running correctly.
However, this process of printing to the screen slows down considerably what is already a long-running program. After you have assured yourself that the program is running correctly, you may wish to shut off this printing to the screen by placing a
# at the start of each of four lines: 57, 76, 108, 114. This turns them into “comments” so that the computer does not execute them as commands. Then save the program and run it again. If you ever want the printing back, simply remove the four
#s again, save the program, and run it.7
If you look at the folder of your
todirname while the program is running, at various points in time you will see files being added rapidly. They will have names like
JPPSM.11.393.txt (the article published in Journal of Philosophy, Psychology and Scientific Methods, volume 11, starting on p. 393). There will also be three
.csv files created, each containing the summary of one of the three journals. They look like this:
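The file-naming pattern and the .csv output can both be sketched briefly. This is an illustration rather than the code in move_psychphil.py. The function names, the column headings, and the sample row are my own placeholders, and the real files may arrange their columns differently.

```python
import csv
import io

def article_filename(abbrev, volume, first_page):
    """Build a name like JPPSM.11.393.txt from abbreviation, volume, and first page."""
    return f"{abbrev}.{volume}.{first_page}.txt"

def rows_to_csv(rows):
    """Return CSV text with a header line followed by one line per article."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    # Assumed column headings; the real files may use others.
    writer.writerow(["author", "title", "journal", "volume", "first_page"])
    writer.writerows(rows)
    return buffer.getvalue()

print(article_filename("JPPSM", 11, 393))  # prints JPPSM.11.393.txt

# A made-up row; csv.writer adds quotation marks around fields that contain commas.
print(rows_to_csv([["Doe,Jane", "A Sample Title, With a Comma", "JPPSM", 11, 393]]))
```

Using the `csv` module rather than joining strings by hand keeps titles that contain commas from being split into extra columns when the file is opened in a spreadsheet.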
That should give you a start into how to parse and gather the relevant parts of the massive JSTOR Early Journal Content Data Bundle. Once you have the articles you want gathered together in one place, named so that you can easily retrieve them, you can process them in any of a wide variety of ways.
JSTOR holds Mind as well, but because it was published in the UK only after 1870, it is not included in the bundle. Also, journals now owned by the American Psychological Association, such as Psychological Review and Psychological Bulletin, are not included in JSTOR’s database. ↩
The process of picking out desired parts of files is often called “parsing,” like parsing a sentence to find the nouns, verbs, etc. ↩
Some comments are placed above the line they comment on; some are placed immediately to the right of the line they comment on. You may have to stretch your window horizontally so that the longer lines don’t wrap around, which can confuse matters somewhat. ↩
If you use a Mac, you will also need to install an extra program called Tcl/Tk, which is linked from footnote 2 at the bottom of the page. Install the version corresponding to your version of the Mac operating system, OS X. ↩
You will have to “register” for the Coursera course to get access to video lectures. It takes just a minute and doesn’t seem to have any ramifications beyond your being able to watch the course’s videos. ↩
.csv stands for “comma-separated values.” It is just an ordinary text file, but it contains data points (e.g., author names, article titles, volume and page numbers) that have commas between them. This is a common format for storing data. ↩
If you examine the programming code closely, you will find other places where I have “commented out” lines of code that I used while debugging the program, but that I did not want to include in the final version. ↩