License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Protocol status: Working
We use this protocol and it's working
Created: May 04, 2022
Last Modified: October 11, 2022
Protocol Integer ID: 61978
Keywords: data cleaning, gender name analysis, namsor, dimensions, excel
Abstract
This protocol provides step-by-step instructions for downloading publication data from Dimensions, cleaning the data to identify first and last authors, and uploading the data into the Namsor name analysis tool to help determine the gender of authors. In these directions, publications from the journal Critical Care Medicine are used as an example.
The data cleaning instructions are based on the use of Microsoft Excel for Microsoft 365 MSO (Version 2202 Build 16.0.14931.20704) 64-bit on a Windows PC.
A personal account is needed to export articles from Dimensions. Log in or click Register in the upper right to create an account.
To limit the results to a certain journal on the Free Web App landing page, expand the Source Title filter in the left column. Click More at the bottom, type in the full name of the journal in the search box, and click Limit to. Title abbreviations should not be used (for example, use "Critical Care Medicine" and not "Crit Care Med").
The export limit in Dimensions varies depending on your subscription. If you are using the free web app, the export limit is 2500 records at a time. Considering this, you may have to export results in smaller sections if you are working with many results. The following step demonstrates how to do this by Publication Year.
From the results page, use the Publication Year filter on the left and select a year. Click Limit to at the bottom of the filters column.
To export records, click Save / Export to the right of the search box. Then choose Export results.
Use the export option for bibliometric mapping and click Export. This option allows you to export up to 2500 results in the free web app. If you need to export more records at once, try using other filters to reduce results. Otherwise, the subscription-based Dimensions allows for exports up to 50,000.
To download the CSV, click on your account profile name in the upper right corner and select Export center. Depending on the file size, it may take a few minutes for it to appear in the Export center.
Find your export request and click Download to the right. This will save a ZIP file to your computer that contains a CSV file with the publication data.
Preparing for Data Cleaning
Preparing for Data Cleaning
Open the CSV file in Excel and delete the first row that starts with "about this data". Save the original spreadsheet with a descriptive name (e.g., dimensions_original_year) and refer to later as needed.
Review your file for data that appears erroneous. Due to an Excel cell character limit of 32,767 characters, data could potentially be reformatted and displayed incorrectly. This might happen in instances when a publication has a very long author list in a single cell.
For example, row 2 in the screenshot below does not contain the correct values for Publication ID, DOI, Title, etc.
If this occurs, you can either delete these rows, understanding that this data loss is due to a limitation of using Excel, or use a different program that does not have Excel's cell character limits, such as a text editor or Google Sheets.
Copy and paste the Authors column into a new spreadsheet and give it a descriptive name (e.g., authors_year).
Removing Blank Rows
Removing Blank Rows
In the newly created spreadsheet with only author data, remove the blank rows, as these were from article records that did not list any authors. The following steps will demonstrate how to remove these blank rows.
Select the Authors column.
Go to the Data tab and select Text to Columns in the Data Tools section.
Select Delimited and click Finish.
Select the Authors column again.
Go to the Home tab and select Find & Select in the Editing section.
Select Go To Special...
Select Blanks and click OK. All blank cells from the Authors column should be highlighted.
Go to the Home tab. In the Cells section, click on the Delete drop down menu, select Delete Cells, then select Entire row.
Separating Authors
Separating Authors
If a publication has multiple authors, they will be listed in a single cell, separated by semicolons. Separate the authors so that they are each in their own column.
Select the Authors column.
Go to the Data tab and select Text to Columns in the Data Tools section.
Select Delimited and then click Next.
Uncheck the box for Tab and select Semicolon instead. Click Next.
Leave General selected on the next screen and click Finish.
Creating First and Last Authors Column
Creating First and Last Authors Column
Rename the first author column to First Authors. These are the first authors.
Create an empty column called Last Authors at the end of your spreadsheet. Depending on your data, you may have to scroll very far to the right in order to locate the first empty column in your spreadsheet. It should be adjacent to a column that has an author.
The following Excel formula identifies the last authors. It works by finding the last non-blank cell in the row and puts that value into the Last Authors column. When the second column of your spreadsheet is blank, that indicates that the article only has one author, so the formula will not put anything into the Last Authors column.
Copy and paste this formula into row 2 of the Last Authors column, inserting the relevant cell location for your spreadsheet (the column to the left of your empty Last Authors column). The B2 and A2 cell locations should be the same for your data.
For example, if your Last Authors column is BU, you would edit the YOURCOLUMNHERE placeholder to BT, so that the formula analyzes all available authors in that row. That would look like: =IF(B2<>"",LOOKUP(2,1/(A2:BT2<>""),A2:BT2),"")
To copy the formula into the remaining rows, use the Fill command.
Select the cell with the formula and the remaining cells in the Last Authors column.
Go to the Home tab and click Fill in the Editing section. Select the Down option.
The Last Authors column should now list all the last authors. Before deleting the middle authors, replace the formula with its values in the Last Authors column, so that formula’s output doesn’t change.
Select the Last Authors column.
Copy the cells: Control + C
Go to the Home tab. From the Paste menu select Paste values.
Now you can delete the middle author columns without affecting the formula’s output.
There should now be two columns.
Copy and paste the First Authors column, including the header row, into its own spreadsheet and save it as first_authors_year.
Copy and paste the Last Authors column, including the header row, into its own spreadsheet and save it as last_authors_year.
Cleaning Author Names
Cleaning Author Names
Separate first and last names.
Select the column containing author names.
Go to the Data tab and select Text to Columns in the Data Tools section.
Select Delimited then click Next.
Change the default settings and select Comma. Click Next.
Leave General selected on the next screen and click Finish.
Rename columns to Last Name and First Name.
You may wish to remove middle initials from names, as this may decrease the name-gender probability match.
Select the First Name column on the right.
Go to the Data tab and select Text to Columns in the Data Tools section.
Select Delimited then click Next.
Change the default settings and select Space. Click Next.
Leave General selected on the next screen and click Finish.
This removes leading whitespace and separates out the middle initials into their own column. Remove the empty whitespace column and rename the First Name and Last Name columns if necessary.
Use the Replace option in the Find & Select dropdown menu to make corrections based on the special characters and incorrect punctuation marks that you identified.
Remove any blank rows from your last_authors_year file. Publications that did not have a last author account for these blank rows. If there are any blank rows in your file, Namsor will include them when processing the credit charge.
Refer to Step 11 of the Removing Blank Rows section for instructions.
Uploading Data to Namsor
Uploading Data to Namsor
Register for a free account or log in to an existing one. Then click Excel/CSV at the top of Namsor's homepage to upload your file. Either drag and drop your file or select a local file on your computer.
Once your file is uploaded, click Validate files.
Choose the feature. Click the select a feature dropdown menu and select Genderize Name: gender from first name, last name (optional). Once you’ve selected a feature you can view the feature info and feature parameters.
The cost for this analysis is 1 credit per name.
Click Choose this feature below the Feature info and Feature parameters sections to continue.
Choose your settings in these next steps.
Select Keep existing columns under Global settings.
Select The file has a header and confirm that First row = 2.
Identify the first name column and last name column under Column settings. This is based upon the data in your spreadsheet.
Click Validate settings.
Review your data before processing your file.
Check that all the settings made in the previous steps are correct.
Review the summary chart and click The data looks good.
Review the summary and credit cost on the next screen.
Check your credit balance.
Click Process file.
Download the generated file. Look for the column likelyGender to review the name-gender matches. To learn more about the data in the remaining columns, such as genderScale, score, and probabilityCalibrated, see Namsor's API Documentation.