Data File Basics

Introduction

To use data is to work at the level of the unit of analysis or observation. Presumably, you want to work at the level of the unit of analysis/observation because you wish to create, verify or test a statistic.

Your efforts will be concentrated on finding data, interpreting its structure with a codebook, converting it to an appropriate format for analysis and performing the desired analysis.

For most, the "Introduction to Data" occurs in the context of a course in statistical analysis where the data you have used has been preformatted for use with a statistical package such as SPSS, SAS or STATA. Once you begin your own research, you will discover that few datasets come prepared for you in the same way. In fact, you will find that getting your data in an appropriate form for analysis is often more involved than the analysis itself!
Source: Data Library-- Introduction To Data Handling

Why? Because a given file has two "formats" in play: the data format and the file format. The first refers to the logical structure of the data inside the file (e.g. a "rectangular" file versus a "hierarchical" file). The second refers to the type of file independent of its contents (e.g. an ASCII file versus an MS Excel file).
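
To make the idea of a data format concrete, here is a minimal sketch in Python. It assumes one common meaning of "hierarchical": a household record followed by the person records that belong to it. The record layout, the field names and the flatten function are invented for illustration and do not come from any real study. The sketch turns the hierarchical lines into rectangular rows, one per person, with the same fields in every row.

```python
# Hypothetical hierarchical file: an "H" (household) record is followed by
# the "P" (person) records that belong to it. The layout is invented for
# illustration and does not come from any real study.
HIERARCHICAL_LINES = [
    "H,001,urban",
    "P,001,1,34",
    "P,001,2,36",
    "H,002,rural",
    "P,002,1,71",
]


def flatten(lines):
    """Return one rectangular row per person, repeating household fields."""
    rows = []
    household = None
    for line in lines:
        fields = line.split(",")
        if fields[0] == "H":
            # Remember the most recent household record.
            household = {"household_id": fields[1], "area": fields[2]}
        elif fields[0] == "P":
            if household is None:
                raise ValueError("person record appears before any household record")
            rows.append({
                "household_id": household["household_id"],
                "area": household["area"],
                "person_no": int(fields[2]),
                "age": int(fields[3]),
            })
    return rows


rectangular = flatten(HIERARCHICAL_LINES)
for row in rectangular:
    print(row)
```

Note that the file format is the same before and after (plain text in both cases); only the logical structure of the data has changed.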

Find Data with our list of Data Sources

Our list of data sources covers the major data sources on the net. These sources typically provide users with some combination of data discovery, data ordering/downloading, lists of links, documentation and online analysis.

Examining Files

Before you do any work with a file, it's worthwhile to verify its contents and their integrity. Utilities that can help you do this include word/line/character counters, multi-format viewers and file type guessers. See File Examination & Viewing Utilities for a list of basic programs. Some are free, some are shareware and some can be used in a UNIX environment only. Which programs are available to you will vary from department to department, so if you don't know what your options are, contact your department about available computing resources.
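
If none of those utilities is at hand, the same basic checks can be approximated in a few lines of Python. The sketch below is only an illustration: the function names guess_type and count_file are made up here, the signature table covers just three well-known formats, and the text-versus-binary test is a rough heuristic rather than a reliable detector.

```python
import os
import tempfile

# Leading "magic" bytes for a few common formats (well-known signatures).
SIGNATURES = [
    (b"\x1f\x8b", "gzip"),
    (b"PK\x03\x04", "zip"),
    (b"%PDF", "pdf"),
]


def guess_type(path):
    """Guess a file's type from its first bytes rather than its extension."""
    with open(path, "rb") as handle:
        head = handle.read(512)
    for magic, name in SIGNATURES:
        if head.startswith(magic):
            return name
    # The absence of NUL bytes is a rough hint of plain text, not a guarantee.
    return "binary" if b"\x00" in head else "text"


def count_file(path):
    """Return (lines, words, bytes), comparable to the Unix wc utility."""
    lines = words = size = 0
    with open(path, "rb") as handle:
        for raw in handle:
            lines += raw.count(b"\n")
            words += len(raw.split())
            size += len(raw)
    return lines, words, size


# Small demonstration on a throwaway file with a hypothetical name.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.dat")
    with open(sample, "wb") as handle:
        handle.write(b"id age\n1 34\n2 36\n")
    print(guess_type(sample), count_file(sample))
```

Comparing the line count against the number of cases given in the documentation is a quick way to spot a truncated download.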

Use the Codebook

You might be wondering how you would know what to look for in a file when you've never used it before and you're trying to verify its contents and integrity. That and everything else you need to know should be in the codebook. Typical elements include:

Work With the Files

As noted in the introduction, getting a data file isn't merely a matter of finding what content you want. That's only step one. Step two is getting that file into a usable form. You may have to negotiate the file format, the file size, the file transfer and the file conversion.

Compression

Compression is the coding of data to save storage space or transmission time. Although data is already coded in digital form for computer processing, it can often be coded more efficiently (using fewer bits). There are many compression algorithms and utilities. Compressed data must be decompressed before it can be used.

The standard Unix compression utility is called compress, though GNU's superior gzip has largely replaced it. Other compression utilities include zip, PKZIP, Stuffit and WinZip.

ICPSR generally uses gzip, which has the file extension ".gz". For additional compression software and notes on file extensions see the Compression FAQ, particularly "What is this .xxx file type? Where can I find the corresponding compression program?"
Source: Free Online Dictionary of Computing. Some content added by GPL.
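
As an illustration of the compress-then-decompress cycle, the sketch below uses the gzip module from Python's standard library. The file names and the repetitive sample data are invented for the example; real data will usually compress less dramatically.

```python
import gzip
import os
import shutil
import tempfile


def gzip_file(src, dst):
    """Compress src into dst using the gzip format."""
    with open(src, "rb") as plain, gzip.open(dst, "wb") as packed:
        shutil.copyfileobj(plain, packed)


def gunzip_file(src, dst):
    """Decompress a gzip file so its contents can be used again."""
    with gzip.open(src, "rb") as packed, open(dst, "wb") as plain:
        shutil.copyfileobj(packed, plain)


with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "survey.dat")   # hypothetical file name
    compressed = original + ".gz"
    restored = os.path.join(tmp, "restored.dat")

    with open(original, "wb") as handle:
        handle.write(b"001 34 2\n" * 5000)       # repetitive sample data

    gzip_file(original, compressed)
    gunzip_file(compressed, restored)

    print("original bytes:  ", os.path.getsize(original))
    print("compressed bytes:", os.path.getsize(compressed))
    with open(original, "rb") as a, open(restored, "rb") as b:
        print("round trip intact:", a.read() == b.read())
```

The round-trip check at the end reflects the point made above: a compressed file is only useful once it has been decompressed back to its original contents.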

Conversion

There are a number of software applications that can be used to analyze a data file. Often, user A will create a data file in a format specific to software application 1 and user B needs it formatted for application 2. As a result, a market has appeared for still other software applications that can do the conversion from 1 to 2. The link above goes to a list of conversion tools.
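
For simple cases a short script can stand in for a dedicated conversion tool. The sketch below converts fixed-width ASCII records to CSV, which most analysis packages can import. The column layout, variable names and sample records are hypothetical; in practice the positions would come from the codebook for the file in question.

```python
import csv
import io

# Hypothetical layout of the kind a codebook would document:
# (variable name, start column, end column), 1-based and inclusive.
LAYOUT = [("id", 1, 3), ("age", 4, 5), ("region", 6, 6)]


def fixed_width_to_csv(lines, layout):
    """Convert fixed-width ASCII records to CSV text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([name for name, _, _ in layout])
    for line in lines:
        record = line.rstrip("\n")
        if not record:
            continue  # skip blank lines
        # Slice each field by its column positions and trim padding spaces.
        writer.writerow([record[start - 1:end].strip() for _, start, end in layout])
    return out.getvalue()


raw = ["001342\n", "002 71\n", "003361\n"]
print(fixed_width_to_csv(raw, LAYOUT))
```

Values are kept as text so that leading zeros in identifiers survive the conversion; any recoding is better left to the analysis package.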

Transferring

File transfer is the movement of one or more files from one location to another. A collection of electronically stored files can be moved by physically moving the electronic storage medium, such as a computer diskette, hard disk, or compact disk, from one place to another or by sending the files over a telecommunications medium. On the Internet, the File Transfer Protocol (FTP) is a common way to transfer a single file or a relatively small number of files from one computer to another. For larger file transfers (a single large file or a large collection of files), file compression and aggregation into a single archive is commonly used. (A zip file is a popular implementation.)
Source: searchNetworking.com
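
The aggregation step mentioned above can be sketched with the zipfile module from Python's standard library. The function names bundle and unbundle and the file names are invented for this example, and the transfer itself (by FTP or any other means) is deliberately left out.

```python
import os
import tempfile
import zipfile


def bundle(paths, archive_path):
    """Aggregate several files into one compressed zip archive."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in paths:
            # Store each file under its base name only, not its full path.
            archive.write(path, arcname=os.path.basename(path))


def unbundle(archive_path, target_dir):
    """Extract the archive and return the names of the files it contained."""
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical data file and codebook that belong together.
    names = ["study.dat", "codebook.txt"]
    paths = []
    for name in names:
        path = os.path.join(tmp, name)
        with open(path, "w") as handle:
            handle.write("contents of " + name + "\n")
        paths.append(path)

    archive_path = os.path.join(tmp, "study.zip")
    bundle(paths, archive_path)

    received = os.path.join(tmp, "received")
    os.mkdir(received)
    print(unbundle(archive_path, received))
```

Bundling a data file with its codebook means one transfer instead of several and keeps the documentation with the data.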

"Download", "upload" and "ftp" (used as both a noun and a verb) are all commonly used to denote file transfer, even though these terms are technically not interchangeable.

Do the Analysis

Resources are available to help you learn how to do analysis, and some analysis software is free online. See the links below for specifics:

About Data Analysis

About Specific Programs and Languages: