Python Application Notes: pathname manipulations

I often process simulation results in batches, which usually involves manipulating pathnames. This article collects my notes on pathname manipulation, drawn from my programming experience.

1. Overview

1.1 Terms related to pathname

The terms related to pathnames are described below, excerpted from Wikipedia.

pathname

A path, the general form of the name of a file or directory, specifies a unique location in a file system. A path points to a file system location by following the directory tree hierarchy expressed in a string of characters in which path components, separated by a delimiting character, represent each directory.

dirname

dirname is a standard UNIX computer program. dirname will retrieve the directory-path name from a pathname ignoring any trailing slashes.

basename

basename is a standard UNIX computer program. basename will retrieve the last name from a pathname ignoring any trailing slashes.

Note that os.path.basename(path) behaves differently from the Unix basename program: for '/foo/bar/', the basename program returns 'bar', whereas os.path.basename returns an empty string ''.
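
The difference can be checked directly. The snippet below is a minimal sketch that assumes POSIX-style paths with / as the separator; stripping the trailing slash first is just one possible way to get the Unix-like result.

```python
import os.path

path = '/foo/bar/'

# os.path.basename is the second element of os.path.split(path),
# so a trailing slash yields an empty string.
py_basename = os.path.basename(path)
print(repr(py_basename))                 # ''

# Stripping trailing slashes first mimics the Unix basename program.
unix_like = os.path.basename(path.rstrip('/'))
print(unix_like)                         # bar
```

Note that this simple stripping turns the root path '/' into an empty string, so it may need special handling.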

filename

Sometimes “filename” is used to mean the entire name, such as the Windows name c:\directory\myfile.txt. Sometimes, it will be used to refer to the components, so the filename in this case would be myfile.txt. Sometimes, it is a reference that excludes an extension, so the filename would be just myfile.

filename extension

A filename extension (such as txt) is an identifier specified as a suffix to the name of a computer file. The extension indicates a characteristic of the file contents or its intended use. A file extension is typically delimited from the filename with a full stop (period).

1.2 Python modules

Three of the most commonly used Python modules for pathname manipulations are listed below.

os.path

Common pathname manipulations. This module implements some useful functions on pathnames. To read or write files, see open(); for accessing the filesystem, see the os module. The path parameters can be passed as either strings or bytes.

glob

The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell, although results are returned in arbitrary order.

pathlib

Object-oriented filesystem paths. This module offers classes representing filesystem paths with semantics appropriate for different operating systems.

Path classes are divided between pure paths, which provide path-handling operations that don’t actually access a filesystem, and concrete paths, which inherit from pure paths but also provide methods to do system calls on path objects.

2. os.path

2.1 split and join

A full file path (e.g., /Users/sparkandshine/Documents/main.py) is composed of two components:

  • directory name (/Users/sparkandshine/Documents in this case). This is the first element of the pair returned by os.path.split(path).
  • base name (main.py in this case). This is the second element of the pair returned by os.path.split(path).

os.path.split(path) splits the pathname path into a pair, (head, tail) where tail is the last pathname component and head is everything leading up to that.

If path ends in a slash, tail will be empty. If there is no slash in path, head will be empty. If path is empty, both head and tail are empty. The tail part will never contain a slash. Trailing slashes are stripped from head unless it is the root (one or more slashes only).

import os

>>> os.path.split('/Users/sparkandshine/Documents/main.py')
('/Users/sparkandshine/Documents', 'main.py')
>>> os.path.split('/Users/sparkandshine/Documents')
('/Users/sparkandshine', 'Documents')

>>> os.path.split('/Users/sparkandshine/Documents/')    # path ends in a slash
('/Users/sparkandshine/Documents', '')
>>> os.path.split('main.py')                            # no slash in path
('', 'main.py')
>>> os.path.split('')                                   # path is empty
('', '')
>>> os.path.split('/')                                  # root
('/', '')


os.path.splitext(path)  # Split the pathname path into a pair (root, ext) such that `root + ext == path`
>>> os.path.splitext('/Users/sparkandshine/Documents/main.py')
('/Users/sparkandshine/Documents/main', '.py')
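
A few edge cases of os.path.splitext are worth knowing. The file names below are made up for illustration; the sketch only shows how the last extension and leading dots are handled.

```python
import os.path

# Only the last extension is split off.
double_ext = os.path.splitext('archive.tar.gz')
print(double_ext)      # ('archive.tar', '.gz')

# A leading dot belongs to the root; it is not treated as an extension.
dotfile = os.path.splitext('.bashrc')
print(dotfile)         # ('.bashrc', '')

# A name without any dot has an empty extension.
no_ext = os.path.splitext('Makefile')
print(no_ext)          # ('Makefile', '')
```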

os.path.join(path, *paths) joins one or more path components intelligently.

The return value is the concatenation of path and any members of *paths with exactly one directory separator (os.sep) following each non-empty part except the last, meaning that the result will only end in a separator if the last part is empty. If a component is an absolute path, all previous components are thrown away and joining continues from the absolute path component.

>>> os.path.join('/Users/sparkandshine', 'Documents', 'main.py')
'/Users/sparkandshine/Documents/main.py'

# the last part is empty
>>> os.path.join('/Users/sparkandshine', 'Documents', '')           
'/Users/sparkandshine/Documents/'

# a component is an absolute path
>>> os.path.join('/Users/sparkandshine', '/home/sparkandshine', 'Documents', 'main.py')
'/home/sparkandshine/Documents/main.py'

os.path.join(head, tail) can be regarded as the inverse of os.path.split(path). In all cases, join(head, tail) returns a path to the same location as path (but the strings may differ).
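
The round trip through split and join can be tried on a few inputs. This is a small sketch assuming a POSIX system; the path 'a//b' is a made-up example chosen to show a case where the strings differ.

```python
import os.path

paths = ['/Users/sparkandshine/Documents/main.py', 'main.py', '/', 'a//b']
rejoined = {}
for path in paths:
    head, tail = os.path.split(path)
    rejoined[path] = os.path.join(head, tail)
    print(repr(path), '->', (head, tail), '->', repr(rejoined[path]))

# On POSIX, 'a//b' comes back as 'a/b': the same location, but a different string.
```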

2.2 Basic use

Some of the most commonly used functions are listed below.

os.path.basename(path)  # Return the base name of pathname path, the second element returned by `split(path)`. 
os.path.dirname(path)   # Return the directory name of pathname path, the first element returned by `split(path)`. 

os.path.exists(path)    # Return True if path refers to an existing path or an open file descriptor.

os.path.abspath(path)   # Return a normalized absolutized version of the pathname path. 
os.path.isabs(path)     # Return True if path is an absolute pathname.
os.path.normpath(path)  # Normalize a pathname by collapsing redundant separators and up-level references.

os.path.getatime(path)  # Return the time of last access of path.
os.path.getmtime(path)  # Return the time of last modification of path. 
os.path.getctime(path)  # Return the system’s ctime which, on some systems (like Unix) is the time of the last metadata change, and, on others (like Windows), is the creation time for path. 

os.path.getsize(path)   # Return the size, in bytes, of path. 


os.path.isfile(path)    # Return True if path is an existing regular file.
os.path.isdir(path)     # Return True if path is an existing directory. 
os.path.islink(path)    # Return True if path refers to a directory entry that is a symbolic link.  
os.path.ismount(path)   # Return True if pathname path is a mount point.

os.path.samefile(path1, path2)  # Return True if both pathname arguments refer to the same file or directory. 
os.path.sameopenfile(fp1, fp2)  # Return True if the file descriptors fp1 and fp2 refer to the same file.
os.path.samestat(stat1, stat2)  # Return True if the stat tuples stat1 and stat2 refer to the same file.
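
Several of these functions can be exercised together on a throwaway file. The sketch below uses a temporary directory and a made-up file name (results.txt), so it does not depend on any existing files.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # 'results.txt' is a made-up file name for this illustration.
    path = os.path.join(tmpdir, 'results.txt')
    with open(path, 'wb') as f:
        f.write(b'hello\n')              # 6 bytes

    name = os.path.basename(path)        # 'results.txt'
    parent = os.path.dirname(path)       # same string as tmpdir
    size = os.path.getsize(path)         # 6
    kinds = (os.path.isfile(path), os.path.isdir(path), os.path.isdir(tmpdir))
    missing = os.path.exists(os.path.join(tmpdir, 'no_such_file'))

print(name, size, kinds, missing)
# results.txt 6 (True, False, True) False

# normpath works on the string only; it does not touch the filesystem.
print(os.path.normpath('a//b/../c'))     # a/c on POSIX
```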

2.3 Create a directory if it doesn’t exist

Check whether the directory exists, and create it if it does not.

subdir = 'msg_events_arbitrary'
if not os.path.exists(subdir):
    os.makedirs(subdir)

os.mkdir creates a single directory with a numeric mode. os.makedirs (https://docs.python.org/3/library/os.html#os.makedirs) is a recursive directory creation function: it works like os.mkdir(), but also makes all intermediate-level directories needed to contain the leaf directory.

# Python2
os.mkdir(path[, mode])
os.makedirs(path[, mode])

# Python3
os.mkdir(path, mode=0o777, *, dir_fd=None)  
os.makedirs(name, mode=0o777, exist_ok=False)   # If exist_ok is False (the default), an OSError is raised if the target directory already exists.
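
In Python 3, the exist_ok parameter makes a separate existence check unnecessary. The following is a minimal sketch, not the only way to do it; it works inside a temporary directory, and the nested name run_01 is made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    # 'run_01' is a hypothetical nested directory name.
    subdir = os.path.join(base, 'msg_events_arbitrary', 'run_01')

    os.makedirs(subdir, exist_ok=True)   # creates both levels
    os.makedirs(subdir, exist_ok=True)   # already exists: no error raised
    created = os.path.isdir(subdir)

print(created)   # True
```

Compared with checking os.path.exists first, this avoids an error if another process creates the directory between the check and the call.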

3. glob

The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell, although results are returned in arbitrary order. *, ?, and character ranges expressed with [] will be correctly matched. Filenames beginning with a dot (.) are special cases: by default, they are matched only by patterns that also begin with a dot.

If recursive is true, the pattern “**” will match any files and zero or more directories and subdirectories. If the pattern is followed by an os.sep (such as /), only directories and subdirectories match. The recursive parameter is new in Python 3.5.

glob.glob(pathname, *, recursive=False) # Return a possibly-empty list of path names that match pathname, which must be a string containing a path specification. 

glob.iglob(pathname, *, recursive=False) # Return an iterator which yields the same values as glob() without actually storing them all simultaneously.

glob.escape(pathname)                   # Escape all special characters ('?', '*' and '['). This is useful to match a string containing special characters.
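
The matching rules above can be illustrated on a temporary directory. The file names in this sketch are made up; it shows that dotfiles are skipped by *, and that glob.escape is needed to match a name containing [ literally.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Made-up file names for this illustration.
    for name in ['a.py', 'b.py', '.hidden.py', 'data[1].txt']:
        open(os.path.join(tmpdir, name), 'w').close()

    # '*' does not match names starting with a dot; sort because order is arbitrary.
    py_files = sorted(os.path.basename(p)
                      for p in glob.glob(os.path.join(tmpdir, '*.py')))

    # Unescaped, '[1]' is a character range that would match 'data1.txt'.
    unescaped = glob.glob(os.path.join(tmpdir, 'data[1].txt'))

    # Escaped, the brackets are matched literally.
    escaped = glob.glob(glob.escape(os.path.join(tmpdir, 'data[1].txt')))
    escaped_names = [os.path.basename(p) for p in escaped]

print(py_files)        # ['a.py', 'b.py']
print(unescaped)       # []
print(escaped_names)   # ['data[1].txt']
```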

4. pathlib

pathlib offers classes representing object-oriented filesystem paths. It was added in Python 3.4.

4.1 Pure paths

Pure path objects provide path-handling operations which don’t actually access a filesystem.

import pathlib

class pathlib.PurePath(*pathsegments)
class pathlib.PurePosixPath(*pathsegments)
class pathlib.PureWindowsPath(*pathsegments)

# Examples
>>> p = pathlib.PurePath('subdir', 'subdir_main.py')
>>> p
PurePosixPath('subdir/subdir_main.py')

>>> p.parts
('subdir', 'subdir_main.py')
>>> p.parent
PurePosixPath('subdir')
>>> p.suffix
'.py'
>>> p.stem
'subdir_main'
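
Pure paths can also be combined and modified without touching the filesystem. The sketch below uses PurePosixPath explicitly so the results are the same on any platform; the replacement names are made up for illustration.

```python
from pathlib import PurePosixPath

# The / operator joins path segments.
p = PurePosixPath('/Users/sparkandshine') / 'Documents' / 'main.py'
print(p)                        # /Users/sparkandshine/Documents/main.py
print(p.name)                   # main.py
print(p.is_absolute())          # True

# with_suffix and with_name return new path objects; p itself is unchanged.
as_txt = p.with_suffix('.txt')
renamed = p.with_name('test.py')
print(as_txt)                   # /Users/sparkandshine/Documents/main.txt
print(renamed)                  # /Users/sparkandshine/Documents/test.py
```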

4.2 Concrete paths

Concrete paths are subclasses of the pure path classes. In addition to operations provided by the latter, they also provide methods to do system calls on path objects.

import pathlib

class pathlib.Path(*pathsegments)
class pathlib.PosixPath(*pathsegments)
class pathlib.WindowsPath(*pathsegments)

# Examples
from pathlib import Path    # Import the main class
p = Path('.')               # Create an instance
subdirectories = [x for x in p.iterdir() if x.is_dir()]  # List subdirectories

q = Path('stackoverflow.py')
>>> q.exists()
True
>>> q.cwd()
PosixPath('/Users/sparkandshine/git/tmp')
>>> q.home()
PosixPath('/Users/sparkandshine')
>>> q.stat()
os.stat_result(st_mode=33252, st_ino=4554446, st_dev=16777220, st_nlink=1, st_uid=501, st_gid=20, st_size=160, st_atime=1485886672, st_mtime=1485886672, st_ctime=1485886672)

with q.open() as f:         # open a file
    first_line = f.readline()   # readline() returns only the first line
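
Concrete paths can also create directories and read or write small files directly. This is a minimal sketch that assumes Python 3.5 or later (for exist_ok, write_text and read_text); the directory and file names are made up, and everything happens inside a temporary directory.

```python
from pathlib import Path
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # 'results/run_01' and 'summary.txt' are hypothetical names.
    out_dir = Path(tmpdir) / 'results' / 'run_01'
    out_dir.mkdir(parents=True, exist_ok=True)   # similar in spirit to os.makedirs

    report = out_dir / 'summary.txt'
    report.write_text('ok\n')                    # create the file and write to it
    content = report.read_text()
    is_file = report.is_file()

print(content.strip(), is_file)   # ok True
```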

5. Find all files ending with an extension

Use glob

import glob

files = glob.glob('*.py')   # glob.glob() already returns a list
# ['main.py', 'stackoverflow.py']

files = glob.glob('/Users/sparkandshine/git/tmp/*.py')
# ['/Users/sparkandshine/git/tmp/main.py', '/Users/sparkandshine/git/tmp/stackoverflow.py']

files = glob.glob('**/*.py', recursive=True)  # Python 3.5+
# ['main.py', 'stackoverflow.py', 'subdir/subdir_main.py']

Use os.walk

os.walk generates the file names in a directory tree by walking the tree. For each directory in the tree, it yields a 3-tuple (dirpath, dirnames, filenames).

import os

os.walk(top, topdown=True, onerror=None, followlinks=False)

for root, dirs, files in os.walk("."):
    for file in files:
        if file.endswith(".py"):
            print(os.path.join(root, file))

# Output
# ./main.py
# ./stackoverflow.py
# ./subdir/subdir_main.py

Use pathlib

>>> from pathlib import Path
>>> p = Path('.')
>>> list(p.glob('**/*.py'))
[PosixPath('main.py'), PosixPath('stackoverflow.py'), PosixPath('subdir/subdir_main.py')]
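
Path.rglob(pattern) is a shorthand for glob with '**/' placed in front of the pattern. The sketch below builds a small made-up tree in a temporary directory so that it can be run anywhere; the results are sorted because the order of matches is not guaranteed.

```python
from pathlib import Path
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    root = Path(tmpdir)
    (root / 'subdir').mkdir()
    # Made-up files mirroring the layout used in this article.
    for rel in ['main.py', 'notes.txt', 'subdir/subdir_main.py']:
        (root / rel).touch()

    # Collect matches as paths relative to the root, in POSIX form.
    found = sorted(p.relative_to(root).as_posix() for p in root.rglob('*.py'))

print(found)   # ['main.py', 'subdir/subdir_main.py']
```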

Use os.listdir

os.listdir(path='.') returns a list containing the names of the entries in the directory given by path.

import os

files = [file for file in os.listdir('.') if file.endswith(".py")]
# ['main.py', 'stackoverflow.py']
